2001-07-26 07:27:50

by Andrew Morton

Subject: ext3-2.4-0.9.4

An update to the ext3 filesystem for 2.4 kernels is available at

http://www.uow.edu.au/~andrewm/linux/ext3/

The diffs are against linux-2.4.7 and linux-2.4.6-ac5.

The changelog is there. One rarely-occurring but oopsable bug
was fixed and several quite significant performance enhancements
have been made. These are in addition to the performance fixes
which went into 0.9.3.

Ted has put out a prerelease of e2fsprogs-1.23 which supports
filesystem type `auto' in /etc/fstab, so it is now possible to
switch between ext3- and non-ext3-kernels without changing
any configuration.

It is recommended that users of earlier ext3 releases upgrade
to 0.9.4.

For people who are undertaking performance testing, it is perhaps
useful to point out that ext3 operates in one of three different
journalling modes, and that these modes have very different
functionality and very different performance characteristics.
Really, you need to test all three and balance the functionality
which each mode offers against the throughput which you obtain
in your application.


The modes are:

data=writeback

This is classic metadata-only journalling. File data is written
back to the main fs lazily. After a crash+recovery the fs's
structural integrity is preserved, but the *contents* of files
can and will contain old, stale data. Potentially hundreds of
megabytes of it.

This is the fastest mode for normal filesystem applications.

data=ordered

The fs ensures that file data is written into the main fs prior
to committing its metadata. Hence after a crash+recovery, your
files will contain the correct data.

This is the default operating mode and throughput is good. It
adds about one second to a four minute kernel compile when
compared with ext2. Under heavier loads the difference
becomes larger.

data=journal

All data (as well as metadata) is written to the journal
before it is released to the main fs for writeback.

This is a specialised mode - for normal fs usage you're better
off using ordered data, which has the same benefits of not corrupting
data after crash+recovery. However for applications which require
synchronous operation such as mail spools and synchronously exported
NFS servers, this can be a performance win. I have seen dbench
figures in this mode (where the files were opened O_SYNC) running
at ten times the throughput of ext2. Not that this is the expected
benefit for other applications!
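
When comparing the three modes in a benchmark, it helps to record which one each run actually used. A minimal sketch, assuming the mode appears as a `data=` entry in a comma-separated mount option string (the helper name `journal_mode` is hypothetical, and whether a given kernel reports the option this way is an assumption):

```python
def journal_mode(mount_options: str) -> str:
    """Return the data= journalling mode named in a comma-separated option string."""
    for opt in mount_options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "data" and value:
            return value
    # No explicit data= option: ordered is the default mode described above.
    return "ordered"
```

For example, `journal_mode("rw,data=journal")` yields the journalled-data mode, while an option string with no `data=` entry falls back to the default.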


Looking at the above issues, one may initially think that the
post-recovery data corruption is a serious issue with writeback mode,
and that there are big advantages to using journalled or ordered data.

However, even in these modes the affected files may be shorter-than-expected
after recovery, because the app hadn't finished writing them yet. And
usually, a truncated file is just as useless as one which contains
garbage - it needs to be deleted.

It's not really as simple as that - for small (< a few hundred k) files,
it tends to be the case that either the whole file is intact after a crash,
or none of it is. This is because the journalling mechanism starts a
new transaction every five seconds, and a typical open/write/close operation
usually fits entirely inside this window.

There is also a security issue to be considered: a recovered writeback-mode
filesystem will expose other people's old data to unintended recipients.


Hopefully this description will help people make their deployment choices.
If not, assistance is available on the ext3-users mailing list.



2001-07-26 11:08:33

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Andrew Morton wrote:

> data=journal
>
> All data (as well as metadata) is written to the journal
> before it is released to the main fs for writeback.
>
> This is a specialised mode - for normal fs usage you're better
> off using ordered data, which has the same benefits of not corrupting
> data after crash+recovery. However for applications which require
> synchronous operation such as mail spools and synchronously exported
> NFS servers, this can be a performance win. I have seen dbench

In ordered and journal mode, are meta data operations, namely creating a
file, rename(), link(), unlink() "synchronous" in the sense that after
the call has returned, the effect of this call is never lost, i. e., if
link(2) has returned and the machine crashes immediately, will the next
recovery ALWAYS recover the link?

Or will ext3 still need chattr +S?

Does it still support chattr +S at all?

Synchronous meta data operations are crucial for mail transfer agents
such as Postfix or qmail. Postfix has up until now been setting
chattr +S /var/spool/postfix, making original (esp. soft-updating) BSD
file systems significantly faster for data (payload) writes in this
directory than ext2.

Note: I'm not on the ext3-users list. Please Cc: back replies.

--
Matthias Andree

2001-07-26 11:36:13

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Matthias Andree wrote:
>
> On Thu, 26 Jul 2001, Andrew Morton wrote:
>
> > data=journal
> >
> > All data (as well as metadata) is written to the journal
> > before it is released to the main fs for writeback.
> >
> > This is a specialised mode - for normal fs usage you're better
> > off using ordered data, which has the same benefits of not corrupting
> > data after crash+recovery. However for applications which require
> > synchronous operation such as mail spools and synchronously exported
> > NFS servers, this can be a performance win. I have seen dbench
>
> In ordered and journal mode, are meta data operations, namely creating a
> file, rename(), link(), unlink() "synchronous" in the sense that after
> the call has returned, the effect of this call is never lost, i. e., if
> link(2) has returned and the machine crashes immediately, will the next
> recovery ALWAYS recover the link?

No, they're not synchronous by default. After recovery they
will either be wholly intact, or wholly absent.

> Or will ext3 still need chattr +S?

Yes, if the app doesn't support O_SYNC or fsync(). I believe
that MTA's *do* support those things.

> Does it still support chattr +S at all?

Yes.

> Synchronous meta data operations are crucial for mail transfer agents
> such as Postfix or qmail. Postfix has up until now been setting
> chattr +S /var/spool/postfix, making original (esp. soft-updating) BSD
> file systems significantly faster for data (payload) writes in this
> directory than ext2.

If postfix is capable of opening the files O_SYNC or of doing
fsync() on them then the `chattr +S' is no longer necessary - unlike
ext2, when the O_SYNC write() or the fsync() return, the directory
contents (as well as the inode, bitmaps, data, etc) will all be tight on
disk and will be restored after a crash.
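
A minimal sketch of the application side of this, assuming the application is free to call fsync() itself. The helper name `write_durably` is hypothetical; what a successful fsync() actually guarantees about directory contents depends on the filesystem, as described above for ext3 versus ext2.

```python
import os

def write_durably(path: str, payload: bytes) -> None:
    """Write payload to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel reports the file's data as flushed.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Adding `os.O_SYNC` to the open flags is the other variant mentioned above; it makes every write() synchronous instead of flushing once at the end.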

This should speed things up considerably, especially with journalled-data
mode. I need to test and characterise this some more to come up with some
quantitative results and configuration recommendations.


BTW, if you have more-than-modest throughput requirements, don't
even *think* of mounting the fs with `mount -o sync'. Our performance
in this mode is terrible :(

I have a hack somewhere which fixes this as much as it can be fixed, but
I didn't even bother committing it. It's feasible, but tiresome.

A better solution is to fix some lock inversion problems in the core
kernel which prevent optimal implementation of data-journalling
filesystems. I don't really expect this to occur medium-term or ever.

A middle-ground solution may be to add an fs-private `osync' mount
option, so all files are treated similarly to O_SYNC, which would
work well.


2001-07-26 12:30:17

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Andrew Morton wrote:

> > In ordered and journal mode, are meta data operations, namely creating a
> > file, rename(), link(), unlink() "synchronous" in the sense that after
> > the call has returned, the effect of this call is never lost, i. e., if
> > link(2) has returned and the machine crashes immediately, will the next
> > recovery ALWAYS recover the link?
>
> No, they're not synchronous by default. After recovery they
> will either be wholly intact, or wholly absent.
>
> > Or will ext3 still need chattr +S?
>
> Yes, if the app doesn't support O_SYNC or fsync(). I believe
> that MTA's *do* support those things.
>
> > Does it still support chattr +S at all?
>
> Yes.
>
> > Synchronous meta data operations are crucial for mail transfer agents
> > such as Postfix or qmail. Postfix has up until now been setting
...
> A middle-ground solution may be to add an fs-private `osync' mount
> option, so all files are treated similarly to O_SYNC, which would
> work well.

You seem to be missing the point, because I wasn't verbose enough, so I
will try to rephrase this and explain. This may turn out to be a feature
request. :-}

Before going into detail, MTAs do know about fsync(). ext3 synching
relevant directory parts as part of fsync() is a great achievement.
Finally, more than five years after initial complaints, Linux is SLOWLY
getting somewhere for speeding up reliable MTA operation.

But that's the smaller piece. Common MTAs such as Postfix or qmail
rename or link files into place (their queues, the mail spool). With the
advent of journalling came the atomicity of rename operations. That's
also a great achievement.

However, the remaining problem is being synchronous with respect to open
(fixed for ext3 with your fsync() as I understand it), rename, link and
unlink. With ext2, and as you write it, with ext3 as well, there is
currently no way to tell when the link/rename has been committed to
disk, unless you set mount -o sync or chattr +S or call sync() (the
former is not an option because it's far too expensive).


The official statement by Dr. Wietse Venema (who wrote Postfix) is,
Postfix REQUIRES synchronous directory updates (open, rename, link,
unlink, in order of decreasing importance). Wietse refuses to wrap all
these calls for Linux.

Similar assumptions hold for qmail.


So, what would help the common MTA? osync wouldn't, MTAs know how to use
fsync(). dirsync or bsdstyle or however it's called, as chattr and
mount options, would help. This option should make all directory
operations (open/creat/fsync, rename, link, unlink, symlink, possibly
close) synchronous with respect to the affected directory and meta data while
leaving application data (payload) operations asynchronous (applications
can then choose when to call fsync() to flush the data to disk).

A much better file system for an MTA might be ext3fs with
data=journal and dirsync mount/chattr option. Would you deem it
possible to get such an option done before ext3fs 1.0.0?

I hope this makes the requirements of this particular group of
applications clear.

Thanks again to everyone involved with the ext3fs development.

--
Matthias Andree

2001-07-26 12:31:57

by Chris Wedgwood

Subject: Re: ext3-2.4-0.9.4

On Thu, Jul 26, 2001 at 09:42:37PM +1000, Andrew Morton wrote:

> If postfix is capable of opening the files O_SYNC or of doing
> fsync() on them then the `chattr +S' is no longer necessary -
> unlike ext2, when the O_SYNC write() or the fsync() return, the
> directory contents (as well as the inode, bitmaps, data, etc) will
> all be tight on disk and will be restored after a crash.
>
> This should speed things up considerably, especially with
> journalled-data mode. I need to test and characterise this some
> more to come up with some quantitative results and configuration
> recommendations.

Postfix does an fsync on files before closing them; it then does a
rename and expects that once rename has returned, the rename has actually
occurred, even if the system crashes. It also expects that if you fsync a
file, it will appear in the parent directory with certainty and
not in, say, /lost+found after fsck on reboot.

Without +S under ext2, you can lose file(s) in /lost+found because
open+write+fsync+close works and ensures the data is on disk, but the
parent directory doesn't get synced to disk, so it might get lost.
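
The sequence described here can be sketched as follows. This is an illustration, not Postfix's actual code: the names `fsync_directory` and `deliver` are hypothetical, both paths are assumed to live in the same directory, and it is an assumption that fsync() on a directory descriptor flushes that directory's entries on the filesystem in use.

```python
import os

def fsync_directory(dirpath: str) -> None:
    """Open a directory read-only and fsync() it."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def deliver(tmp_path: str, final_path: str, payload: bytes) -> None:
    """Write under a temporary name, flush, rename into place, flush the directory."""
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()                 # push Python's buffer to the kernel
        os.fsync(f.fileno())      # flush the file's data before renaming
    os.rename(tmp_path, final_path)
    # Without this step the new name may exist only in memory after a crash.
    fsync_directory(os.path.dirname(final_path) or ".")
```

The explicit directory flush is exactly the extra call that an MTA relying on synchronous directory updates does not make, which is why such an MTA needs `chattr +S` on ext2.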




--cw

2001-07-26 12:59:31

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Matthias Andree wrote:

> So, what would help the common MTA?

Not relying on non-supported semantics to save your ass.

Rename() is atomic in the sense that you either see the
old name or the new name, but I don't know of systems
which guarantee atomicity across a system crash.

In fact, knowing how hard disks work mechanically, only
journaling filesystems could have an extension to make
this work. I.e. this is NOT something you can rely on ;)

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-26 13:18:05

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Rik van Riel wrote:

> In fact, knowing how hard disks work mechanically, only
> journaling filesystems could have an extension to make
> this work. I.e. this is NOT something you can rely on ;)

This is not about failing hard disks. It is about premature
acknowledgment of something which has not happened at that time.

Linux cannot possibly fix all incomplete protocols, specifications and
implementation, but it can fix its own behaviour.

Everything is about speed, and allowing the MTA to use a (weaker)
dirsync rather than allsync option would speed things up without
sacrificing reliability.

MTA reliability is NOT about failing disk drives. If it falls over, you
notice that. If files are in the wrong directory or not there at all,
you don't necessarily notice until someone complains.

Please don't get in the way of finally fixing things just because
someone might have a broken item that could endanger your data. I have a
huge magnet here...

The competition is there and it has names: BSD + ufs + softupdates,
Solaris + logging ufs. Read MTA mailing lists before obstructing.

Thanks.

--
Matthias Andree

2001-07-26 13:23:45

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Rik van Riel wrote:
>
> > In fact, knowing how hard disks work mechanically, only
> > journaling filesystems could have an extension to make
> > this work. I.e. this is NOT something you can rely on ;)
>
> This is not about failing hard disks. It is about premature
> acknowledgment of something which has not happened at that time.

So you didn't read what I was writing.

Let me explain it to you slowly:

Disks. Write. One. Write. At. A. Time.

A rename often needs as many as 4 or 5 writes,
ergo, you CANNOT do a rename atomically without
journaling and transactions.

> The competition is there and it has names: BSD + ufs + softupdates,
> Solaris + logging ufs. Read MTA mailing lists before obstructing.

BSD + softupdates is physically incapable of doing what
you suggest it does. This can be proven from the lack
of transactions and the way hard disks work physically.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-26 13:52:33

by Alan

Subject: Re: ext3-2.4-0.9.4

> On Thu, 26 Jul 2001, Rik van Riel wrote:
> > In fact, knowing how hard disks work mechanically, only
> > journaling filesystems could have an extension to make
> > this work. I.e. this is NOT something you can rely on ;)
>
> This is not about failing hard disks. It is about premature
> acknowledgment of something which has not happened at that time.

Rik is right. It isn't just about premature notification - it's about
atomicity. At the point you are notified the data has been queued for disk
I/O. Even on traditional BSD ufs with synchronous metadata you still had
points where a crash left the rename partially complete and nothing but a
log or an atomic update system is going to fix that.

> The competition is there and it has names: BSD + ufs + softupdates,
> Solaris + logging ufs. Read MTA mailing lists before obstructing.

All of which are - not unsurprisingly - using a log. In fact Solaris logging
ufs and ext3 are very similar ideas - adding a log to an existing fs.

Alan

2001-07-26 13:58:53

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Rik van Riel wrote:

> On Thu, 26 Jul 2001, Matthias Andree wrote:
> > On Thu, 26 Jul 2001, Rik van Riel wrote:
> >
> > > In fact, knowing how hard disks work mechanically, only
> > > journaling filesystems could have an extension to make
> > > this work. I.e. this is NOT something you can rely on ;)
> >
> > This is not about failing hard disks. It is about premature
> > acknowledgment of something which has not happened at that time.
>
> So you didn't read what I was writing.

Sorry.

> Let me explain it to you slowly:
>
> Disks. Write. One. Write. At. A. Time.
>
> A rename often needs as many as 4 or 5 writes,
> ergo, you CANNOT do a rename atomically without
> journaling and transactions.

You're missing the point, with this as with the previous mail. The MTA is not
going to change from one unsupported/incompatible interface (that only
Linux suffers from) and if it did, it would still do the wrong thing.

MTAs often run multiple processes, and if these all open the same
directory and sync it while others have changes open that don't need a
sync at that time and will sync later, you're getting no further than
with chattr +S or mount -o sync.

It's not about atomicity itself, but about
first. write. all. required. blocks. for. a. certain. change.
physically. to. disc. and. only. after. this. do. return. from.
rename, link, unlink. function. calls.

I'm aware of phase-tree concepts ("single block write switches from one
consistent state to another") and I'm aware that ext3fs and reiserfs do
feature atomic renames (after crash recovery).

That a drive might fall over or the power might fail before all writes
of a certain rename operation have completed is harmless UNLESS you lied
to someone that the operation was already complete (when it wasn't).

> > The competition is there and it has names: BSD + ufs + softupdates,
> > Solaris + logging ufs. Read MTA mailing lists before obstructing.
>
> BSD + softupdates is physically incapable of doing what
> you suggest it does. This can be proven from the lack
> of transactions and the way hard disks work physically.

You misunderstood me. I'm not talking about atomicity.

--
Matthias Andree

2001-07-26 13:55:43

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Alan Cox wrote:

> > The competition is there and it has names: BSD + ufs + softupdates,
> > Solaris + logging ufs. Read MTA mailing lists before obstructing.
>
> All of which are - not unsurprisingly - using a log. In fact
> Solaris logging ufs and ext3 are very similar ideas - adding a
> log to an existing fs.

Softupdates isn't using logging. Furthermore, even
the journaling filesystems won't all guarantee that
the various parts of a rename() operation will all
be in the same transaction.

An MTA which relies on this is therefore Broken(tm).

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-26 14:02:53

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Matthias Andree wrote:
>
> A much better file system for an MTA might be ext3fs with
> data=journal and dirsync mount/chattr option.

OK, I've taken a closer look at this. ext3 has picked up some
cruft from ext2's sync handling which it does not need in the
least.

It will be fairly straightforward and a useful cleanup to
provide the following semantics for either synchronous
mounts or `chattr +S' directories:

* All metadata operations (rename, unlink, link, symlink, etc)
will be synchronous. So when the system call returns, the data
is crash-proofed.

* All write()s will be synchronous. So when the write() system
call returns, the data written and all associated metadata
will be crash-proofed.

O_SYNC and fsync() will not be necessary - in fact they'll
slow things down slightly by forcing an unnecessary and
probably empty commit.

If you crash in the middle of a write, you may end up with a truncated
file on recovery.

This is in fact the behaviour right now, but the performance is
not good.

The performance problem at present is that large write()s have unnecessary
commits in the middle of them. This is due to the abovementioned
cruft in ext3_get_block() and the things it calls.

> Would you deem it
> possible to get such an option done before ext3fs 1.0.0?

We'd prefer not - we're trying to stabilise things quite
sternly at present. However that doesn't prevent work
on 1.1.0 :)

Seems like a worthwhile thing to do - I'll cut a branch
and do this. It'll take a couple of weeks - as usual, most
of the work is in development and use of test tools...
But I can't predict at this time when we'll merge it into
the mainline fs.

> I hope this makes the requirements of this particular group of
> applications clear.

Yes, it was useful - thanks.


2001-07-26 14:06:03

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Rik van Riel wrote:
>
>
> Furthermore, even the journaling filesystems won't all guarantee that
> the various parts of a rename() operation will all be in the same
> transaction.

umm.. I'd certainly hope that they do guarantee this.

The only operations which can't trivially fit into a single
transaction are write() and truncate().


2001-07-26 14:32:39

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Alan Cox wrote:

> Rik is right. It isn't just about premature notification - it's about
> atomicity. At the point you are notified the data has been queued for disk
> I/O. Even on traditional BSD ufs with synchronous metadata you still had
> points where a crash left the rename partially complete and nothing but a
> log or an atomic update system is going to fix that.

No. Atomic update systems and logs can by no means fix premature
acknowledgements:

Proof:

Assume the OS has a phase tree kind of thing or log that requires
just a single-block write for an atomic rename.

Assume an MTA calls rename(), and the OS by whatever means notifies it of
completion, but actually, the data is only queued, not written.

Assume the MTA receives the acknowledgement (e. g. rename call
returned), sends a "250 mail action complete" packet across the network.

Assume the machine sends the network packet, but not the queued disk
block and then crashes.

--> The single block is lost, the rename operation is lost, but the
operation had been acknowledged. Consequence: the mail is lost. q. e. d.
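
The argument can be replayed as a toy model. This is only a sketch of the reasoning: the function `mail_survives_crash` and its two in-memory lists are hypothetical stand-ins for the write queue and the disk, not a model of any real kernel.

```python
def mail_survives_crash(ack_after_flush: bool) -> bool:
    """Model one delivery; True if the acknowledged mail is on disk after the crash."""
    queued = []    # writes accepted by the OS but still in volatile memory
    on_disk = []   # writes that have reached non-volatile storage

    def rename(entry: str) -> None:
        queued.append(entry)
        if ack_after_flush:
            # Synchronous behaviour: flush before returning to the caller.
            on_disk.extend(queued)
            queued.clear()
        # Returning from here is the acknowledgement the MTA sees.

    rename("queue/msg1")
    # The MTA now sends "250 mail action complete"; then the machine crashes.
    queued.clear()  # everything that was only queued is lost
    return "queue/msg1" in on_disk
```

With `ack_after_flush=False` the rename is acknowledged and then lost, however atomic the single-block update is; with `ack_after_flush=True` it survives.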

All this boils down to:

1. The OS _MUST_ know when a write operation has been physically
committed to non-volatile storage.

2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous operation)
any earlier. (This may well include switching off drive write
buffering.)

If the OS cannot fulfill these two basic requirements, I can save all
the log or FS atomicity efforts because they don't get me anywhere.

The problem is not that the operation can fail, the problem IS premature
acknowledgement. Even with atomic updates, as shown above.

Note, of course there is no premature acknowledgement for the
Linux-default asynchronous directory update. There IS for -o sync or
chattr +S -- and that's what MTAs do to guarantee data integrity, and
that's why I'm still suggesting dirsync or something to remedy the
negative data write performance of full-sync.

If the OS tells me "write completed" when it means "I queued your data
for writing", it is BROKEN.

That's my point.

And since the common POSIX OS lacks a dedicated notification feature for
e. g. rename, MTAs have no other choice than to rely on "has completed
when the syscall returns".

BTW, my Linux rename(2) man page doesn't document EIO condition, FreeBSD
4.3-STABLE and SUS v2 do.

--
Matthias Andree

2001-07-26 14:45:29

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Rik van Riel wrote:

> An MTA which relies on this is therefore Broken(tm).

MTAs rely on TRULY, ULTIMATELY AND DEFINITELY SYNCHRONOUS directory
updates, nothing else. And because they do so, and most systems have
them, and MTAs are portable, they choose chattr +S on Linux. And that's
a performance problem because it doesn't come for free, but also with
synchronous data updates, which are unnecessary because there is
fsync().

That's already the complete story about MTAs on Linux.

If Linux HAD a mode (it doesn't) to have just synchronous directory
updates, MTAs could stop using chattr +S and be faster.


MTAs do NOT care how the file system is internally managed, they only
rely on the rename operation having completed physically on disk before
the "my rename call has returned 0" event. They expect that with the
call returning the rename operation has completed ultimately, finally,
for good, definitely and the old file will not reappear after a crash.

(Note that the atomicity addressed in the man pages and Unix
specifications is a different one: it deals with the visibility of the
changes in the system, not with the functioning of the file system.)

That's why *BSD + softupdates is still recommended over Linux for pure
mail transfer agents by people.

This still implies the drive doesn't lie to the OS about the completion
of write requests: write cache == off.

--
Matthias Andree

2001-07-26 15:03:00

by Christoph Hellwig

Subject: Re: ext3-2.4-0.9.4

In article <[email protected]> you wrote:
> On Thu, 26 Jul 2001, Rik van Riel wrote:
>
>> An MTA which relies on this is therefore Broken(tm).

> MTAs rely on TRULY, ULTIMATELY AND DEFINITELY SYNCHRONOUS directory
> updates, nothing else.

And thus they are broken, all caps don't make that less true.

> And because they do so, and most systems have them,

"and most systems have them"...

> MTAs do NOT care how the file system is internally managed, they only
> rely on the rename operation having completed physically on disk before
> the "my rename call has returned 0" event. They expect that with the
> call returning the rename operation has completed ultimately, finally,
> for good, definitely and the old file will not reappear after a crash.

So they rely on undocumented and non-standardized semantics of some
implementations. I'd call this buggy.

Christoph

--
Whip me. Beat me. Make me maintain AIX.

2001-07-26 15:08:00

by Matthias Andree

Subject: RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4

On Fri, 27 Jul 2001, Andrew Morton wrote:

> > Would you deem it
> > possible to get such an option done before ext3fs 1.0.0?
>
> We'd prefer not - we're trying to stabilise things quite
> sternly at present. However that doesn't prevent work
> on 1.1.0 :)
>
> Seems like a worthwhile thing to do - I'll cut a branch
> and do this. It'll take a couple of weeks - as usual, most
> of the work is in development and use of test tools...
> But I can't predict at this time when we'll merge it into
> the mainline fs.

So the summary of all this is, as I understand it: for ext3fs 1.0, treat
it with chattr +S and the like as if it was ext2fs, it may or may not be
faster with "mount -o data=journal" and is well worthwhile for an MTA
to try, a weaker sync option may be introduced after ext3fs 1.0.

Sounds good.

I'm dropping the ext3-users mailing list for now since this is getting
more general.


However, since the ReiserFS team also showed interest in a similar
functionality, and they don't yet support chattr, would it be useful to
specify a "D" option for chattr already?

I have a suggestion: if D is set, but S isn't, no effect. If S is set,
but D is unset, treat S as in the past. If S is set, and D is set,
directory updates are synchronous like with S, but data updates are
asynchronous in spite of S.

This way, booting a kernel without chattr "D" flag support or mounting
the file system as ext2 would have it default to the safer
everything-synchronously mode.
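
The proposed flag combination can be written out as a small decision function. This is a sketch of the suggestion only: the name `sync_policy` is hypothetical and no such "D" flag exists at this point.

```python
def sync_policy(s_flag: bool, d_flag: bool) -> tuple:
    """Return (directory_updates_sync, data_updates_sync) for the proposed flags."""
    if not s_flag:
        return (False, False)   # D without S has no effect
    if d_flag:
        return (True, False)    # S+D: directory updates sync, data async
    return (True, True)         # S alone: everything synchronous, as in the past
```

A kernel that does not know about "D" effectively always sees `d_flag=False`, so an S+D directory degrades to the fully synchronous case, which is the safe fallback described above.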

--
Matthias Andree

2001-07-26 15:28:35

by Alan

Subject: Re: ext3-2.4-0.9.4

> them, and MTAs are portable, they choose chattr +S on Linux. And that's
> a performance problem because it doesn't come for free, but also with
> synchronous data updates, which are unnecessary because there is
> fsync().

chattr +S and atomic updates hitting disk then returning to the app will
give the same performance. You can also fsync() the directory.

> the "my rename call has returned 0" event. They expect that with the
> call returning the rename operation has completed ultimately, finally,
> for good, definitely and the old file will not reappear after a crash.

Actually the old file re-appearing after the crash is irrelevant. It will
have a previously logged message id. And if you are not doing message id
histories then you have replay races at the SMTP level anyway

> This still implies the drive doesn't lie to the OS about the completion
> of write requests: write cache == off.

Write cache off is not a feature available on many modern disks. You
already lost the battle before you started.

Alan

2001-07-26 15:27:34

by Daniel Phillips

Subject: Re: ext3-2.4-0.9.4

On Thursday 26 July 2001 16:32, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Alan Cox wrote:
> > Rik is right. It isn't just about premature notification - it's about
> > atomicity. At the point you are notified the data has been queued
> > for disk I/O. Even on traditional BSD ufs with synchronous metadata
> > you still had points where a crash left the rename partially
> > complete and nothing but a log or an atomic update system is going
> > to fix that.
>
> No. Atomic update systems and logs can by no means fix premature
> acknowledgements:
>
> Proof:
>
> Assume the OS has a phase tree kind of thing or log that requires
> just a single-block write for an atomic rename.
>
> Assume an MTA calls rename(), and the OS by whatever means notifies
> it of completion, but actually, the data is only queued, not written.
>
> Assume the MTA receives the acknowledgement (e. g. rename call
> returned), sends a "250 mail action complete" packet across the
> network.
>
> Assume the machine sends the network packet, but not the queued disk
> block and then crashes.
>
> --> The single block is lost, the rename operation is lost, but the
> operation had been acknowledged. Consequence: the mail is lost. q. e.
> d.
>
> All this boils down to:
>
> 1. The OS _MUST_ know when a write operation has been physically
> committed to non-volatile storage.

We're working on that, see the "[PATCH] 64 bit scsi read/write" thread
on linux-fsdevel. About half of it is devoted to investigating the
detailed semantics of physical write completion.

> 2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous
> operation) any earlier. (This may well include switching off drive
> write buffering.)

Yes, for now that's how you have to do it.

> If the OS cannot fulfill these two basic requirements, I can save all
> the log or FS atomicity efforts because they don't get me anywhere.
>
> The problem is not that the operation can fail, the problem IS
> premature acknowledgement. Even with atomic updates, as shown above.

Right now the interface for determining that the operation has actually
completed is "sync". Yes, that sucks but with journalling or atomic
commit it's not nearly as expensive as you might think. My early flush
patch does nearly the equivalent of sync, 10 times a second and it
actually improves performance (it does not attempt to do this under
high load of course).

We *should* have something like sys_sync_dev(majorminor) or
sys_sync_fs(mountpoint) (whatever that would look like). For
phase-tree the semantics are that the call doesn't return until the
metaroot of the then-current "branching" tree is known to be safely on
disk. (Side note: it's ok to allow subsequent updates on the same
filesystem to proceed while an outstanding sync_dev is waiting for
confirmation from the block layer, because these don't affect the
filesystem state the sync_fs is waiting on.)

As I understand it, Ext2 allows much the same semantics. While we do
need to do something about exposing a more elegant interface, with Ext3
you should be ok with +S and a "sync" just before you report to the
world that the mail transaction is complete. Ext3 does *not* leave a
lot of dirty blocks hanging around in normal operation, so sync is not
nearly as slow as it is with good old Ext2.

> Note, of course there is no premature acknowledgement for the
> Linux-default asynchronous directory update. There IS for -o sync or
> chattr +S -- and that's what MTAs do to guarantee data integrity, and
> that's why I'm still suggesting dirsync or something to remedy the
> negative data write performance of full-sync.
>
> If the OS tells me "write completed" when it means "I queued your data
> for writing", it is BROKEN.
>
> That's my point.
>
> And since the common POSIX OS lacks a dedicated notification feature
> for e. g. rename, MTAs have no other choice than to rely on "has
> completed when the syscall returns".
>
> BTW, my Linux rename(2) man page doesn't document EIO condition,
> FreeBSD 4.3-STABLE and SUS v2 do.

Sounds like a man page bug.

--
Daniel

2001-07-26 15:34:15

by Andrew Morton

Subject: Re: RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4

Matthias Andree wrote:
>
> On Fri, 27 Jul 2001, Andrew Morton wrote:
>
> > > Would you deem it
> > > possible to get such an option done before ext3fs 1.0.0?
> >
> > We'd prefer not - we're trying to stabilise things quite
> > sternly at present. However that doesn't prevent work
> > on 1.1.0 :)
> >
> > Seems like a worthwhile thing to do - I'll cut a branch
> > and do this. It'll take a couple of weeks - as usual, most
> > of the work is in development and use of test tools...
> > But I can't predict at this time when we'll merge it into
> > the mainline fs.
>
> So the summary of all this is, as I understand it: for ext3fs 1.0, treat
> it with chattr +S and the like as if it was ext2fs, it may or may not be
> faster with "mount -o data=journal" and is well worthwhile for an MTA
> to try, a weaker sync option may be introduced after ext3fs 1.0.
>
> Sounds good.
>
> I'm dropping the ext3-users mailing list for now since this is getting
> more general.
>
> However, since the ReiserFS team also showed interest in a similar
> functionality, and they don't yet support chattr, would it be useful to
> specify a "D" option for chattr already?

chattr is an ext[23]-specific thing. reiserfs could certainly
support a similar thing if they have a few bits spare in the
inode.

> I have a suggestion: if D is set, but S isn't, no effect. If S is set,
> but D is unset, treat S as in the past. If S is set, and D is set,
> directory updates are synchronous like with S, but data updates are
> asynchronous in spite of S.

I don't think this would be needed until really proven necessary - for
data, fsync() should work for all filesystems.

There would be one benefit in splitting sync from datasync,
and that is for applications which do not write() their
data in large enough chunks.

When I fix the get_block thing, O_SYNC, `chattr +S' and `mount
-o sync' will provide good, fast synchronous write()s - the
fs will run a commit at the end of the write(). That's just fine as long
as the app is writing its data in goodly chunks. If it is using 4k
or 8k chunks (eg: default stdio) then throughput will suffer. That
would be rather silly of it though.
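
To illustrate the chunk-size point, here is a minimal sketch. It assumes, as described above, that every write() on an O_SYNC descriptor is its own synchronous commit point. The helper name write_osync and its parameters are made up for illustration; it simply counts how many write() calls a given chunk size costs.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to path through an O_SYNC descriptor, issuing at most
 * `chunk` bytes per write() call. Returns the number of write() calls
 * made, or -1 on error. Under the assumption stated above, each call is
 * one synchronous commit.
 */
long write_osync(const char *path, const char *buf, size_t len, size_t chunk)
{
    if (chunk == 0)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    long calls = 0;
    size_t off = 0;
    while (off < len) {
        size_t want = (len - off < chunk) ? len - off : chunk;
        ssize_t n = write(fd, buf + off, want);
        if (n <= 0) {               /* error, or no progress at all */
            close(fd);
            return -1;
        }
        off += (size_t)n;           /* short writes simply loop again */
        calls++;
    }

    if (close(fd) != 0)
        return -1;
    return calls;
}
```

For a 64k message, a 4k chunk size means at least 16 synchronous write()s, while a 64k chunk normally needs only one.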

-

2001-07-26 15:42:57

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Daniel Phillips wrote:
>
> Ext3 does *not* leave a
> lot of dirty blocks hanging around in normal operation, so sync is not
> nearly as slow as it is with good old Ext2.

eek.

In fully-journalled data mode, we write everything to the journal
in a linear chunk, wait on it, write a commit block, wait on that
and then release all the just-journalled data into the main
filesystem for conventional bdflush/kupdate writeback in twenty
seconds time.

Calling anything which uses fsync_dev() would cause all that writeback
data to be written out and waited on, with the consequential seeking
storm. Disastrous.

Note that fsync() is OK - in full data journalling mode nothing
is ever attached to i_dirty_buffers.

-

2001-07-26 15:48:55

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

Christoph Hellwig wrote on Thursday, 26 July 2001:

> > MTAs do NOT care how the file system is internally managed, they only
> > rely on the rename operation having completed physically on disk before
> > the "my rename call has returned 0" event. They expect that with the
> > call returning the rename operation has completed ultimately, finally,
> > for good, definitely and the old file will not reappear after a crash.
>
> So they rely on undocumented and non-standardized semantics of some
> implementations. I'd call this buggy.

If each of the "supported systems" documents this behaviour for
itself, there is no bug. However, I didn't check systems other
than FreeBSD 4.x and Linux. And "Linux support" forces these semantics
with chattr +S, at a high price.

Go tell your opinion to those people that refuse to wrap their
rename/link calls with open()/fsync() calls to the respective parents,
particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
don't possibly know all MTAs.

You will encounter these or similar questions/objections:

1. what systems apart from Linux need this kind of Pampers?

2. manual lookups of parent directories cause additional overhead better
avoided in performance critical systems.

You would not be the first one to tell them...

2001-07-26 15:53:45

by Linus Torvalds

Subject: Re: ext3-2.4-0.9.4

In article <[email protected]>,
Matthias Andree <[email protected]> wrote:
>
>However, the remaining problem is being synchronous with respect to open
>(fixed for ext3 with your fsync() as I understand it), rename, link and
>unlink. With ext2, and as you write it, with ext3 as well, there is
>currently no way to tell when the link/rename has been committed to
>disk, unless you set mount -o sync or chattr +S or call sync() (the
>former is not an option because it's far too expensive).

Congratulations. You have been brainwashed by Dan Bernstein.

Use fsync() on the directory.

Logical, isn't it?

Linus

2001-07-26 15:53:55

by Alan

Subject: Re: ext3-2.4-0.9.4

> Go tell your opinion to those people that refuse to wrap their
> rename/link calls with open()/fsync() calls to the respective parents,
> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> don't possibly know all MTAs.

I've pointed things out to Mr Bernstein before. His normal replies are not
helpful and generally vary between random ravings and threatening to sue
people who publish things on web pages he disagrees with.

Alan

2001-07-26 15:58:55

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Daniel Phillips wrote:

> As I understand it, Ext2 allows much the same semantics. While we do
> need to do something about exposing a more elegant interface, with Ext3
> you should be ok with +S and a "sync" just before you report to the
> world that the mail transaction is complete. Ext3 does *not* leave a
> lot of dirty blocks hanging around in normal operation, so sync is not
> nearly as slow as it is with good old Ext2.

That wasn't my impression, particularly not with data=journal, which
can drop data into the log. It's just: why sync the world if synching
directories does the job and relevant data is synched manually with
fsync()?

However, how big are chances that these interfaces will spread outside
of Linux? That's the crucial point for portable applications. If it's a
kernel <-> libc interface, OK, no problem, but if it's a user-space
interface, it might easily become a useless invention because no-one
uses it in real life. You don't support multiple interfaces in a
portable application because that's a maintenance disaster and often
causes reliability problems because on different platforms, code takes
different paths, so applications won't usually choose limited-use
interfaces (such as sendfile).

BTW, your Message-ID is unqualified == on a collision course in mail
duplicate killers.

2001-07-26 16:13:55

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Matthias Andree wrote:
> Christoph Hellwig wrote on Thursday, 26 July 2001:
>
> > So they rely on undocumented and non-standardized semantics of some
> > implementations. I'd call this buggy.
>
> If each of the "supported systems" documents this
> behaviour for itself, there is no bug.

The MTA depends on behaviour which is undefined. Now you
want to go blame the OS ?

> Go tell your opinion to those people that refuse to wrap their
> rename/link calls with open()/fsync() calls to the respective parents,
> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> don't possibly know all MTAs.

If you care about your email, probably you should either
teach these people about standards like POSIX or SuS
(and tell them to not rely on undefined behaviour) or
switch to an MTA which isn't broken in various ways ;)

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-26 16:21:15

by Linus Torvalds

Subject: Re: ext3-2.4-0.9.4

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>> Go tell your opinion to those people that refuse to wrap their
>> rename/link calls with open()/fsync() calls to the respective parents,
>> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
>> don't possibly know all MTAs.
>
>I've pointed things out to Mr Bernstein before. His normal replies are not
>helpful and generally vary between random ravings and threatening to sue
>people who publish things on web pages he disagrees with.

Now, now, Alan. He has strong opinions, I'll agree, but I've never seen
him threaten to _sue_.

Also, I think he eventually agreed on the logic of fsync() on the
directory, and we even had a bug report (quickly fixed) for reiserfs
because it got confused by it.

Of course, knowing Dan, I suspect the fsync() is accompanied by several
lines of derogatory comments about the need for it (not that I've
checked).

Everybody tends to agree that synchronous IO is stupid and slow, but
some people are just so fixated with "That is how it has been done for
20 years..".

Logging filesystems together with explicit logging points (namely,
"fsync()") are very obviously a superior answer from a technical
standpoint, but that doesn't impact the emotional arguments ("but I want
things to stay the same!").

Linus

2001-07-26 16:45:40

by Alan

Subject: Re: ext3-2.4-0.9.4

> If you care about your email, probably you should either
> teach these people about standards like POSIX or SuS
> (and tell them to not rely on undefined behaviour) or
> switch to an MTA which isn't broken in various ways ;)

POSIX and SuS are actually not helpful here. They don't cover how to force
namespace to disk, only data and metadata for the file. So you can portably
stick your data onto disk, portably be sure its on disk, but not portably be
sure the directory entries are on disk.

Alan

2001-07-26 16:44:00

by Alan

Subject: Re: ext3-2.4-0.9.4

> >I've pointed things out to Mr Bernstein before. His normal replies are not
> >helpful and generally vary between random ravings and threatening to sue
> >people who publish things on web pages he disagrees with.
>
> Now, now, Alan. He has strong opinions, I'll agree, but I've never seen
> him threaten to _sue_.

Ask Alexey about the end of the syncookie "debate"

2001-07-26 16:55:10

by Larry McVoy

Subject: Re: ext3-2.4-0.9.4

On Thu, Jul 26, 2001 at 04:18:59PM +0000, Linus Torvalds wrote:
> In article <[email protected]>,
> Alan Cox <[email protected]> wrote:
> >> Go tell your opinion to those people that refuse to wrap their
> >> rename/link calls with open()/fsync() calls to the respective parents,
> >> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> >> don't possibly know all MTAs.
> >
> >I've pointed things out to Mr Bernstein before. His normal replies are not
> >helpful and generally vary between random ravings and threatening to sue
> >people who publish things on web pages he disagrees with.
>
> Now, now, Alan. He has strong opinions, I'll agree, but I've never seen
> him threaten to _sue_.

In the for what it is worth department, I spent the day with Daniel after
the kernel summit meeting a while back, we talked file systems for about
6 or 7 hours. While I'll plead guilty to getting mad at him (his ego
is up there with mine :-), I came away impressed with his knowledge.
I get the feeling that he thinks deeply about the problems he works on,
he's probably right a lot of the time, *and* as with many deep thinkers,
he has a problem communicating his ideas.

This is a common problem, and I'm not sure Daniel is fully aware of it.
One cannot expect other people to have done the same thinking and have
the same context, and when they do not, it is easy to get frustrated.
I think that some of Daniel's "ravings" are probably just frustration
that the other person "doesn't get it".

That doesn't mean that Daniel is the right hand of God or anything, I've
seen him do some stupid things but I've seen all of us do some stupid
things, so that doesn't mean much. I think Daniel does way more smart
things than stupid things, and not all of us can claim that (sort of
like half of the drivers are below average, no one likes that idea either).

What I'm trying to say is that I think Daniel is one of the good guys,
even though his user interface could stand improvement (a common thing
amongst smart people) and it looks like it would be smart to figure out
how to work with him.

Just my opinion...
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2001-07-26 17:16:31

by Andre Pang

Subject: Re: ext3-2.4-0.9.4

On Thu, Jul 26, 2001 at 09:54:52AM -0700, Larry McVoy wrote:

> What I'm trying to say is that I think Daniel is one of the good guys,
> even though his user interface could stand improvement (a common thing
> amongst smart people) and it looks like it would be smart to figure out
> how to work with him.

there's a work-in-progress called ReiserSMTP[1] which rewrites
some bits of qmail so it works better with ReiserFS, although i
can imagine that it would improve things on Linux as a whole.

this is getting off-topic, but since the various parties involved
(Linux vs djb/Wietse/etc[2]) are probably never going to agree
on semantics, i'm wondering if it's possible to ask them to
write the software in such a way that it's possible to 'drop in'
your own functions relevant for sync'ing. then the MTA writers
can go and use their traditional filesystem assumptions and
Linux users can produce very small patches to support the
correct behaviour under Linux.
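
One way the "drop in your own sync function" idea could be sketched: the MTA routes every rename through a single wrapper that takes a commit hook. All names here (commit_dir_fn, commit_dir_noop, commit_dir_fsync, queue_rename) are invented for illustration and do not come from qmail, Postfix or any other MTA.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hook called after a successful rename(); dir is the affected directory. */
typedef int (*commit_dir_fn)(const char *dir);

/* Traditional assumption: rename() is already on disk when it returns. */
int commit_dir_noop(const char *dir)
{
    (void)dir;
    return 0;
}

/* Linux-style hook: make the new directory entry durable explicitly. */
int commit_dir_fsync(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* The MTA would call this everywhere instead of rename() directly. */
int queue_rename(const char *dir, const char *from, const char *to,
                 commit_dir_fn commit)
{
    if (rename(from, to) != 0)
        return -1;
    return commit(dir);
}
```

With this shape, the per-platform patch shrinks to choosing which hook to pass in, which is roughly the "very small patch" being asked for.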

it would be _nice_ if the ext3 guys would be more willing to
implement directory-syncing on link/rename/etc, though, even as
an option. a 'mount -o dirsync' would be enough; no need for
chattr +D stuff. Linux tends to have a bad name as a platform
as an MTA just because of all this, which is a shame. it would be
nice if a fix is possible. *nudge nudge Mr. Morton* :)

[1] http://www.jedi.claranet.fr/reisersmtp.html

[2] hey, this might be the first time they agree on
anything!


--
#ozone/algorithm <[email protected]> - trust.in.love.to.save

2001-07-26 17:26:41

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Rik van Riel wrote:

> The MTA depends on behaviour which is undefined. Now you
> want to go blame the OS ?

No, the behaviour is defined on certain systems. Not sure if that
comprises all supported systems.

I'm not blaming anybody besides Linux which does not offer the "noasync"
(FreeBSD) compromise between sync and async. I don't see any reason why
this option cannot be there. Is it too expensive to implement? No-one
said so.

I cannot tell whether, or how, the MTA authors checked how each of
their supported OSs handles metadata updates.

> If you care about your email, probably you should either
> teach these people about standards like POSIX or SuS
> (and tell them to not rely on undefined behaviour) or
> switch to an MTA which isn't broken in various ways ;)

Wee. And then, I tell the system to comply with that as well, don't I?
;)

--
Matthias Andree

2001-07-26 17:42:08

by Larry McVoy

Subject: Re: ext3-2.4-0.9.4

Arrg, I take it all back, I'm talking about Daniel Phillips, not Daniel
Bernstein. I tend to agree with Alan about Mr Bernstein.

Thanks to Richard Gooch for pointing out that I'm asleep at the switch.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2001-07-26 18:01:37

by Hans Reiser

Subject: Re: ext3-2.4-0.9.4

Andre Pang wrote:
>
> On Thu, Jul 26, 2001 at 09:54:52AM -0700, Larry McVoy wrote:
>
> > What I'm trying to say is that I think Daniel is one of the good guys,
> > even though his user interface could stand improvement (a common thing
> > amongst smart people) and it looks like it would be smart to figure out
> > how to work with him.
>
> there's a work-in-progress called ReiserSMTP[1] which rewrites
> some bits of qmail so it works better with ReiserFS, although i
> can imagine that it would improve things on Linux as a whole.

It stopped due to flakiness on the part of all parties including myself, the programmer, and the
sponsor, but it would be nice if a sponsor and programmer came along to restart it.

>
> this is getting off-topic, but since the various parties involved
> (Linux vs djb/Wietse/etc[2]) are probably never going to agree
> on semantics, i'm wondering if it's possible to ask them to
> write the software in such a way that it's possible to 'drop in'
> your own functions relevant for sync'ing. then the MTA writers
> can go and use their traditional filesystem assumptions and
> Linux users can produce very small patches to support the
> correct behaviour under Linux.
>
> it would be _nice_ if the ext3 guys would be more willing to
> implement directory-syncing on link/rename/etc, though, even as
> an option. a 'mount -o dirsync' would be enough; no need for
> chattr +D stuff. Linux tends to have a bad name as a platform
> as an MTA just because of all this, which is a shame. it would be
> nice if a fix is possible. *nudge nudge Mr. Morton* :)
>
> [1] http://www.jedi.claranet.fr/reisersmtp.html
>
> [2] hey, this might be the first time they agree on
> anything!
>
> --
> #ozone/algorithm <[email protected]> - trust.in.love.to.save


No, Linus is right and the MTA guys are just wrong. The mailers are the place to fix things, not
the kernel. If the mailer guys want to depend on the kernel being stupidly designed, tough.
Someone should fix their mailer code and then it would run faster on Linux than on any other
platform.

Hans

2001-07-26 18:33:11

by Richard A Nelson

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> In article <[email protected]>,
> Alan Cox <[email protected]> wrote:
> >> Go tell your opinion to those people that refuse to wrap their
> >> rename/link calls with open()/fsync() calls to the respective parents,
> >> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> >> don't possibly know all MTAs.

[snip]
> Also, I think he eventually agreed on the logic of fsync() on the
> directory, and we even had a bug report (quickly fixed) for reiserfs
> because it got confused by it.

In looking at the synchronous directory options, I'm unsure as to
the 'real' status wrt fsync() on a directory:
1) Does fsync() of a directory work on most/all current FS?
2) Does it work on 2.2.x as well as 2.4.x?
--
Rick Nelson
"... being a Linux user is sort of like living in a house inhabited
by a large family of carpenters and architects. Every morning when
you wake up, the house is a little different. Maybe there is a new
turret, or some walls have moved. Or perhaps someone has temporarily
removed the floor under your bed." - Unix for Dummies, 2nd Edition
-- found in the .sig of Rob Riggs, [email protected]

2001-07-26 19:39:17

by Linus Torvalds

Subject: Re: ext3-2.4-0.9.4


On Thu, 26 Jul 2001, Richard A Nelson wrote:
>
> In looking at the synchronous directory options, I'm unsure as to
> the 'real' status wrt fsync() on a directory:
> 1) Does fsync() of a directory work on most/all current FS?

Modulo bugs, yes.

Now, there's another issue, of course: if you have an important mail-spool
on some of the less tested filesystems, I would consider you crazy
regardless of fsync() working ;). I don't think anybody has ever verified
that fsync() (or much anything else wrt writing) does the right thing on
NTFS, for example.

> 2) Does it work on 2.2.x as well as 2.4.x?

Yes. However, there may be performance issues. As with just about
anything, we didn't start optimizing things until it became a real issue,
and in some cases at least historically the filesystems fell back on just
doing a whole "fsync_dev()" if they had nothing better to do.

I think later 2.2.x kernels (ie the ones past the point where Alan took
over) probably have the fsync() optimizations at least for ext2.

Linus

2001-07-26 20:05:17

by Richard A Nelson

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> > 1) Does fsync() of a directory work on most/all current FS?
>
> Modulo bugs, yes.

Great, that was a big concern

> Now, there's another issue, of course: if you have an important mail-spool
> on some of the less tested filesystems, I would consider you crazy
> regardless of fsync() working ;). I don't think anybody has ever verified
> that fsync() (or much anything else wrt writing) does the right thing on
> NTFS, for example.

Caveat Emptor ;-)

> > 2) Does it work on 2.2.x as well as 2.4.x?
>
> Yes. However, there may be performance issues. As with just about
> anything, we didn't start optimizing things until it became a real issue,
> and in some cases at least historically the filesystems fell back on just
> doing a whole "fsync_dev()" if they had nothing better to do.
>
> I think later 2.2.x kernels (ie the ones past the point where Alan took
> over) probably have the fsync() optimizations at least for ext2.

That should be recent enough - I push 2.2.19 for shm support and security
reasons anyway - though I see a lot of folk on 2.2.16/17.

Are the optimizations more than writing out only changed blocks?
Has anyone any information on the performance differences between
optimized vs non-optimized?

Thanks, I'm feeling much better about getting this support added
--
Rick Nelson
Life'll kill ya -- Warren Zevon
Then you'll be dead -- Life'll kill ya

2001-07-26 20:26:18

by Gérard Roudier

Subject: Re: ext3-2.4-0.9.4



On Thu, 26 Jul 2001, Alan Cox wrote:

> > them, and MTAs are portable, they choose chattr +S on Linux. And that's
> > a performance problem because it doesn't come for free, but also with
> > synchronous data updates, which are unnecessary because there is
> > fsync().
>
> chattr +S and atomic updates hitting disk then returning to the app will
> give the same performance. You can also fsync() the directory.
>
> > the "my rename call has returned 0" event. They expect that with the
> > call returning the rename operation has completed ultimately, finally,
> > for good, definitely and the old file will not reappear after a crash.
>
> Actually the old file re-appearing after the crash is irrelevant. It will
> have a previously logged message id. And if you are not doing message id
> histories then you have replay races at the SMTP level anyway
>
> > This still implies the drive doesn't lie to the OS about the completion
> > of write requests: write cache == off.
>
> Write cache off is not a feature available on many modern disks. You
> already lost the battle before you started.

Losing the battle of brain-dead hardware is not a problem... :-)

SCSI hard disks are expected to follow the specifications. But maybe
you are referring to IDE disks only ...

With SCSI, you can enable write caching and also ask the device to signal
completion of actual write to the media by setting the FUA bit in the SCSI
command block (not available in WRITE(6), but available in WRITE(10)).
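
For reference, a sketch of where the FUA bit sits in a WRITE(10) command descriptor block. build_write10 is a hypothetical helper that only fills in the 10 bytes; actually sending the command to a device (for example through a pass-through interface) is outside the scope of this sketch.

```c
#include <stdint.h>
#include <string.h>

#define SCSI_WRITE_10  0x2A   /* WRITE(10) operation code */
#define WRITE10_FUA    0x08   /* Force Unit Access: bit 3 of CDB byte 1 */

/* Fill a 10-byte WRITE(10) command descriptor block. */
void build_write10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks, int fua)
{
    memset(cdb, 0, 10);
    cdb[0] = SCSI_WRITE_10;
    cdb[1] = fua ? WRITE10_FUA : 0;
    cdb[2] = (uint8_t)(lba >> 24);      /* logical block address, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);   /* transfer length in blocks, big-endian */
    cdb[8] = (uint8_t)nblocks;
    /* cdb[6] (group number) and cdb[9] (control) are left at zero. */
}
```

WRITE(6) has no room for such a flag in its 6-byte layout, which is why the bit is only available from WRITE(10) upwards.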

Gérard.

2001-07-26 20:41:00

by Daniel Phillips

Subject: Re: ext3-2.4-0.9.4

On Thursday 26 July 2001 17:49, Andrew Morton wrote:
> Daniel Phillips wrote:
> > Ext3 does *not* leave a
> > lot of dirty blocks hanging around in normal operation, so sync is
> > not nearly as slow as it is with good old Ext2.
>
> eek.
>
> In fully-journalled data mode, we write everything to the journal
> in a linear chunk, wait on it, write a commit block, wait on that
> and then release all the just-journalled data into the main
> filesystem for conventional bdflush/kupdate writeback in twenty
> seconds time.
>
> Calling anything which uses fsync_dev() would cause all that
> writeback data to be written out and waited on, with the
> consequential seeking storm. Disastrous.

Whoops, ok, no, this is not particularly sync-friendly. On the other
hand, I don't think your seek storm would be as bad as all that. You
can still feed enough blocks to the elevator to give it something to
chew on. On the third hand, since you are still using the generic
flushing machinery I can see you'd have quite a lot of work to do to
control the flushing accurately in this way.

> Note that fsync() is OK - in full data journalling mode nothing
> is ever attached to i_dirty_buffers.

Somewhere in there is a beautiful optimization trying to get out...

--
Daniel

2001-07-26 20:56:11

by Anton Altaparmakov

Subject: Re: ext3-2.4-0.9.4

At 20:37 26/07/2001, Linus Torvalds wrote:
>On Thu, 26 Jul 2001, Richard A Nelson wrote:
> > In looking at the synchronous directory options, I'm unsure as to
> > the 'real' status wrt fsync() on a directory:
> > 1) Does fsync() of a directory work on most/all current FS?
>
>Modulo bugs, yes.
>
>Now, there's another issue, of course: if you have an important mail-spool
>on some of the less tested filesystems, I would consider you crazy
>regardless of fsync() working ;). I don't think anybody has ever verified
>that fsync() (or much anything else wrt writing) does the right thing on
>NTFS, for example.

NTFS doesn't even have an fsync() operation defined so calling the fsync()
system call won't do anything at all. A quick look at
fs/buffer.c::sys_fsync() shows it will return -EINVAL straight away.
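
From user space, the case described here shows up as fsync() failing with errno set to EINVAL. A hedged sketch of how an application might tell "not supported on this descriptor" apart from a real failure; checked_fsync and the enum are illustrative names, not an existing API.

```c
#include <errno.h>
#include <unistd.h>

enum sync_result {
    SYNC_OK          = 0,   /* fsync() succeeded */
    SYNC_UNSUPPORTED = 1,   /* the object behind fd has no fsync support */
    SYNC_FAILED      = -1   /* any other error (EIO, EBADF, ...) */
};

/* Call fsync() and classify the outcome. */
enum sync_result checked_fsync(int fd)
{
    if (fsync(fd) == 0)
        return SYNC_OK;
    if (errno == EINVAL)
        return SYNC_UNSUPPORTED;
    return SYNC_FAILED;
}
```

An application that cares about its data would presumably refuse to use a spool directory for which this returns SYNC_UNSUPPORTED rather than carry on silently.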

But considering that the fsync, even if present, may well trash the file or
the whole partition's data, it's just as well it doesn't happen...

Anton


--
"Nothing succeeds like success." - Alexandre Dumas
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-07-26 22:13:18

by Daniel Phillips

Subject: Re: ext3-2.4-0.9.4

On Thursday 26 July 2001 18:54, Larry McVoy wrote:
> On Thu, Jul 26, 2001 at 04:18:59PM +0000, Linus Torvalds wrote:
> > In article <[email protected]>,
> >
> > Alan Cox <[email protected]> wrote:
> > >> Go tell your opinion to those people that refuse to wrap their
> > >> rename/link calls with open()/fsync() calls to the respective
> > >> parents, particularly Daniel J. Bernstein, Wietse Z. Venema,
> > >> among others. I don't possibly know all MTAs.
> > >
> > >I've pointed things out to Mr Bernstein before. His normal replies
> > > are not helpful and generally vary between random ravings and
> > > threatening to sue people who publish things on web pages he
> > > disagrees with.
> >
> > Now, now, Alan. He has strong opinions, I'll agree, but I've never
> > seen him threaten to _sue_.
>
> In the for what it is worth department, I spent the day with Daniel
> after the kernel summit meeting a while back, we talked file systems
> for about 6 or 7 hours. While I'll plead guilty to getting mad at
> him (his ego is up there with mine :-), I came away impressed with
> his knowledge. I get the feeling that he thinks deeply about the
> problems he works on, he's probably right a lot of the time, *and* as
> with many deep thinkers, he has a problem communicating his ideas.
>
> This is a common problem, and I'm not sure Daniel is fully aware of
> it. One cannot expect other people to have done the same thinking and
> have the same context, and when they do not, it is easy to get
> frustrated. I think that some of Daniel's "ravings" are probably just
> frustration that the other person "doesn't get it".
>
> That doesn't mean that Daniel is the right hand of God or anything,
> I've seen him do some stupid things but I've seen all of us do some
> stupid things, so that doesn't mean much. I think Daniel does way
> more smart things than stupid things, and not all of us can claim
> that (sort of like half of the drivers are below average, no one likes
> that idea either).
>
> What I'm trying to say is that I think Daniel is one of the good
> guys, even though his user interface could stand improvement (a
> common thing amongst smart people) and it looks like it would be
> smart to figure out how to work with him.
>
> Just my opinion...

Heh, very interesting, but you seem to have created a collage of two
different Daniels ;-)

--
Daniel

2001-07-27 04:21:42

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Andre Pang wrote:
>
> it would be _nice_ if the ext3 guys would be more willing to
> implement directory-syncing on link/rename/etc, though, even as
> an option. a 'mount -o dirsync' would be enough; no need for
> chattr +D stuff. Linux tends to have a bad name as a platform
> as an MTA just because of all this, which is a shame. it would be
> nice if a fix is possible. *nudge nudge Mr. Morton* :)

Perhaps I didn't understand the requirement.

I believe that `dirsync' would provide synchronous metadata
operations (ie: the metadata is crashproofed on-disk when
the syscall returns), but non-sync data. Correct?

Whereas `mount -o sync' or `chattr +S' would provide synchronous
metadata operations PLUS synchronous data, so when write()
returns, the data which was written is crashproofed.

Is that your understanding of the difference?

If so, then with `dirsync', the application would have to
open the file O_SYNC (which would make the whole thing pointless!)
or it would run fsync() when it had finished writing the file.

So what it boils down to is that dirsync will improve the
efficiency of applications which do a bunch of small writes
and then an fsync.

If, however, the application is capable of doing a nice big
write() (setvbuf!) then really, the two things will be pretty
much the same.
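
As a rough sketch of the "one big write() followed by fsync()" pattern
described above: the fragment below gives stdio a large buffer with
setvbuf() so that many small records reach the kernel in a few large
write()s, then syncs once. The helper name write_records() and the 64k
buffer size are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `count` copies of `record` to `path`
 * through a large stdio buffer, then force the result to disk with a
 * single fsync(). Returns 0 on success, -1 on any error.
 * Not reentrant: the buffer is static so that it outlives the stream.
 */
int write_records(const char *path, const char *record, int count)
{
	static char buf[64 * 1024];	/* assumed size, tune as needed */
	FILE *fp;
	int i;

	assert(path && record);
	fp = fopen(path, "w");
	if (!fp)
		return -1;

	/* Fully buffered: small writes are coalesced into big write()s. */
	if (setvbuf(fp, buf, _IOFBF, sizeof(buf)) != 0)
		goto fail;

	for (i = 0; i < count; i++)
		if (fputs(record, fp) == EOF)
			goto fail;

	/* Push stdio's buffer to the kernel, then the kernel's to disk. */
	if (fflush(fp) != 0 || fsync(fileno(fp)) != 0)
		goto fail;

	return fclose(fp) == 0 ? 0 : -1;
fail:
	fclose(fp);
	return -1;
}
```

With a buffer this size, a thousand five-byte records should go out in
a single write(), so the cost is dominated by the one fsync().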

Wait and see how the benchmarks turn out, yes?


One problem at present is that an application could be in the
middle of a nice big write(), but another thread comes up and
does a synchronous creat(). That will force a commit right in the middle
of the write(). It would be better (I think) if the write's transaction
were allowed to run to completion and the creat() caller blocks until
the write() finishes - this way the write(), the creat() and anything
else which happened during the write() would all be written out in a
single compound transaction.

Alas, we cannot run a transaction handle for more than a single
page in write() because of locking inversion problems with i_sem
and the lock_page outside ->writepage(). i_sem is trivial to fix,
but writepage is not. It has not really proven to be a problem
yet, but it would be nice to be able to _guarantee_ that writes
up to a particular size (100k, say) were 100% atomic.

-

2001-07-27 09:41:10

by Sean Hunter

Subject: Strange remount behaviour with ext3-2.4-0.9.4

Following the announcement on lkml, I have started using ext3 on one of my
servers. Since the server in question is a fairly security-sensitive box, my
/usr partition is mounted read only except when I remount rw to install
packages.

I converted this partition to run ext3 with the mount options
"nodev,ro,data=writeback,defaults" figuring that when I need to install new
packages etc, that I could just mount rw as before and that metadata-only
journalling would be ok for this partition as it really sees very little write
activity.

When I try to remount it r/w I get a log message saying:
Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount

...even if I give the full mount option list with the remount instruction.

I can, however, remount it as ext2 read-write, but when I try to remount as
ext3 (even read only) I get the same problem.

Weirdly, "mount" lists it as being still an ext3 partition even though it has
been remounted as ext2. I can't umount /usr because kjournald is currently
listed as using the partition.

The box in question is more-or-less RedHat 7.1, with ext3-2.4-0.9.4, kernel
2.4.7 and with the following relevant package versions:

mount-2.11g-4
util-linux-2.11f-3
e2fsprogs-1.22-2

...all from rawhide rpms.

Sean

2001-07-27 10:18:01

by Andrew Morton

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

Sean Hunter wrote:
>
> Following the announcement on lkml, I have started using ext3 on one of my
> servers. Since the server in question is a fairly security-sensitive box, my
> /usr partition is mounted read only except when I remount rw to install
> packages.
>
> I converted this partition to run ext3 with the mount options
> "nodev,ro,data=writeback,defaults" figuring that when I need to install new
> packages etc, that I could just mount rw as before and that metadata-only
> journalling would be ok for this partition as it really sees very little write
> activity.
>
> When I try to remount it r/w I get a log message saying:
> Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount
>
> ...even if I give the full mount option list with the remount instruction.

hmm.. The mount option handling there is a bit bogus.

What we *should* do on remount is check that the requested
journalling mode is equal to the current mode. ext3 won't
allow you to change the journalling mode on-the-fly.

So... you will have to omit the `data=xxx' portion of the
mount options when remounting. It's being invisibly added
by /bin/mount.

/bin/mount tries to be smart. If, for example you have

/dev/hdf12 /mnt/hdf12 ext3 noauto,ro,data=writeback 1

in /etc/fstab and then type

mount /dev/hdf12 -o remount,rw

then /bin/mount runs off and looks up the fstab entry and
inserts the mount options. However if you instead type

mount /dev/hdf12 /mnt/hdf12 -o remount,rw (1)

then /bin/mount does *not* look up the fstab entry, and
the remount succeeds.

ho-hum. For the while you'll have to fiddle with the mount
usage to get things working right. Equation (1) above will
work fine. Or apply the appended patch.

> I can, however, remount it as ext2 read-write, but when I try to remount as
> ext3 (even read only) I get the same problem.

You can't switch between ext2 and ext3 with a remount - unmount
is needed.

> Weirdly, "mount" lists it as being still an ext3 partition even though it has
> been remounted as ext2. I can't umount /usr because kjournald is currently
> listed as using the partition.

That sounds very weird. Could you please describe the steps
you took to create this state?

Sometimes /etc/mtab gets out of sync - especially for the
root fs. It's more reliable to look in /proc/mounts
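
For anyone who wants to check this from a program rather than by eye,
here is a minimal sketch that uses glibc's getmntent() interface to look
up a mount point's filesystem type in a mount table such as
/proc/mounts. The helper name mounted_fstype() is made up for this
example.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up the filesystem type for mount point `dir` in the mount table
 * at `table` (for example "/proc/mounts") and copy it into `type`.
 * Returns 0 if found, -1 otherwise. The last matching entry wins,
 * since a later mount on the same directory shadows an earlier one.
 */
int mounted_fstype(const char *table, const char *dir, char *type, size_t len)
{
	FILE *fp;
	struct mntent *ent;
	int found = -1;

	assert(len > 0);
	fp = setmntent(table, "r");
	if (!fp)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		if (strcmp(ent->mnt_dir, dir) == 0) {
			strncpy(type, ent->mnt_type, len - 1);
			type[len - 1] = '\0';
			found = 0;
		}
	}
	endmntent(fp);
	return found;
}
```

Calling it once with "/proc/mounts" and once with "/etc/mtab" for the
same directory would show whether the two tables disagree.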



Here's the fix for the data= handling on remount:



Index: fs/ext3/super.c
===================================================================
RCS file: /cvsroot/gkernel/ext3/fs/ext3/super.c,v
retrieving revision 1.31
diff -u -r1.31 super.c
--- fs/ext3/super.c 2001/07/19 14:43:08 1.31
+++ fs/ext3/super.c 2001/07/27 10:14:48
@@ -513,12 +513,6 @@
 
 			if (want_value(value, "data"))
 				return 0;
-			if (is_remount) {
-				printk ("EXT3-fs: cannot change data mode "
-					"on remount\n");
-				return 0;
-			}
-
 			if (!strcmp (value, "journal"))
 				data_opt = EXT3_MOUNT_JOURNAL_DATA;
 			else if (!strcmp (value, "ordered"))
@@ -529,9 +523,18 @@
 				printk ("EXT3-fs: Invalid data option: %s\n",
 					value);
 				return 0;
+			}
+			if (is_remount) {
+				if ((*mount_options & EXT3_MOUNT_DATA_FLAGS) !=
+							data_opt) {
+					printk("EXT3-fs: cannot change data "
+						"mode on remount\n");
+					return 0;
+				}
+			} else {
+				*mount_options &= ~EXT3_MOUNT_DATA_FLAGS;
+				*mount_options |= data_opt;
 			}
-			*mount_options &= ~EXT3_MOUNT_DATA_FLAGS;
-			*mount_options |= data_opt;
 		} else {
 			printk ("EXT3-fs: Unrecognized mount option %s\n",
 				this_char);
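
The rule the patch implements can be sketched outside the kernel as
follows. This is only an illustration: the function name
apply_data_option() and the DATA_MODE_* values are invented here and
are not the kernel's definitions. Only the logic (accept data= on a
remount when it matches the current mode, otherwise reject it) follows
the diff above.

```c
#include <assert.h>
#include <string.h>

/* Illustrative flag values only; the kernel's own definitions may differ. */
#define DATA_MODE_JOURNAL   0x1u
#define DATA_MODE_ORDERED   0x2u
#define DATA_MODE_WRITEBACK 0x3u
#define DATA_MODE_MASK      0x3u

/*
 * Apply a "data=<value>" option to *mount_options.
 * On a fresh mount the mode is stored. On a remount it is accepted
 * only when it matches the mode already in force, and the flags are
 * left untouched either way.
 * Returns 1 on success and 0 on rejection, like the patched code.
 */
int apply_data_option(unsigned int *mount_options, const char *value,
		      int is_remount)
{
	unsigned int data_opt;

	assert(mount_options && value);
	if (!strcmp(value, "journal"))
		data_opt = DATA_MODE_JOURNAL;
	else if (!strcmp(value, "ordered"))
		data_opt = DATA_MODE_ORDERED;
	else if (!strcmp(value, "writeback"))
		data_opt = DATA_MODE_WRITEBACK;
	else
		return 0;	/* invalid data option */

	if (is_remount)
		return (*mount_options & DATA_MODE_MASK) == data_opt;

	*mount_options &= ~DATA_MODE_MASK;
	*mount_options |= data_opt;
	return 1;
}
```

Under this rule a remount that repeats data=writeback on a writeback
mount goes through, while one that asks for data=ordered is refused.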

2001-07-27 12:24:26

by Sean Hunter

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

On Fri, Jul 27, 2001 at 08:24:14PM +1000, Andrew Morton wrote:
> Sean Hunter wrote:
> >
> > Following the announcement on lkml, I have started using ext3 on one of my
> > servers. Since the server in question is a fairly security-sensitive box, my
> > /usr partition is mounted read only except when I remount rw to install
> > packages.
> >
> > I converted this partition to run ext3 with the mount options
> > "nodev,ro,data=writeback,defaults" figuring that when I need to install new
> > packages etc, that I could just mount rw as before and that metadata-only
> > journalling would be ok for this partition as it really sees very little write
> > activity.
> >
> > When I try to remount it r/w I get a log message saying:
> > Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount
> >
> > ...even if I give the full mount option list with the remount instruction.
>
> hmm.. The mount option handling there is a bit bogus.
>
> What we *should* do on remount is check that the requested
> journalling mode is equal to the current mode. ext3 won't
> allow you to change the journalling mode on-the-fly.

Indeed.

>
> So... you will have to omit the `data=xxx' portion of the
> mount options when remounting. It's being invisibly added
> by /bin/mount.

Thought so. I tried both ways just to be sure.

> /bin/mount tries to be smart. If, for example you have
>
> /dev/hdf12 /mnt/hdf12 ext3 noauto,ro,data=writeback 1
>
> in /etc/fstab and then type
>
> mount /dev/hdf12 -o remount,rw
>
> then /bin/mount runs off and looks up the fstab entry and
> inserts the mount options. However if you instead type
>
> mount /dev/hdf12 /mnt/hdf12 -o remount,rw (1)
>
> then /bin/mount does *not* look up the fstab entry, and
> the remount succeeds.

Interesting, and (almost) 100% true

sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,data=writeback,remount
mount: you must specify the filesystem type
sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,data=writeback,remount -text3
mount: /usr not mounted already, or bad option
sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,remount -text3
sean@henry:~$ mount
/dev/sdb6 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/sda1 on /boot type ext2 (ro,nosuid,nodev)
/dev/sdc6 on /home type ext3 (rw,nosuid,nodev,data=ordered)
/dev/sda8 on /usr type ext3 (ro,nodev)
/dev/sda5 on /var type ext3 (rw,nosuid,nodev,sync,data=journal)
none on /dev/pts type devpts (rw,gid=5,mode=620)

It succeeds as long as I don't specify the journal type.

>
> ho-hum. For the while you'll have to fiddle with the mount
> usage to get things working right. Equation (1) above will
> work fine. Or apply the appended patch.
>
> > I can, however, remount it as ext2 read-write, but when I try to remount as
> > ext3 (even read only) I get the same problem.
>
> You can't switch between ext2 and ext3 with a remount - unmount
> is needed.

Weird. This certainly looked to all the world as though it worked for me. Thus:

sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,remount -text2

...doesn't give me an error, but:

sean@henry:~$ mount
/dev/sdb6 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/sda1 on /boot type ext2 (ro,nosuid,nodev)
/dev/sdc6 on /home type ext3 (rw,nosuid,nodev,data=ordered)
/dev/sda8 on /usr type ext3 (ro,nodev)
^^^^
/dev/sda5 on /var type ext3 (rw,nosuid,nodev,sync,data=journal)
none on /dev/pts type devpts (rw,gid=5,mode=620)


> > Weirdly, "mount" lists it as being still an ext3 partition even though it has
> > been remounted as ext2. I can't umount /usr because kjournald is currently
> > listed as using the partition.
>
> That sounds very weird. Could you please describe the steps
> you took to create this state?

See above.

> Sometimes /etc/mtab gets out of sync - especially for the
> root fs. It's more reliable to look in /proc/mounts

sean@henry:~$ cat /proc/mounts
/dev/root / ext3 rw 0 0
/proc /proc proc rw 0 0
/dev/sda1 /boot ext2 ro,nosuid,nodev 0 0
/dev/sdc6 /home ext3 rw,nosuid,nodev 0 0
/dev/sda8 /usr ext3 ro,nodev 0 0
/dev/sda5 /var ext3 rw,nosuid,nodev,sync 0 0
none /dev/pts devpts rw 0 0

sean@henry:~$ cat /etc/mtab
/dev/sdb6 / ext3 rw 0 0
none /proc proc rw 0 0
/dev/sda1 /boot ext2 ro,nosuid,nodev 0 0
/dev/sdc6 /home ext3 rw,nosuid,nodev,data=ordered 0 0
/dev/sda8 /usr ext3 ro,nodev 0 0
/dev/sda5 /var ext3 rw,nosuid,nodev,sync,data=journal 0 0
none /dev/pts devpts rw,gid=5,mode=620 0 0

It's almost as if mount is just silently ignoring the "-t" option when I specify
ext2.

>
>
> Here's the fix for the data= handling on remount:

I'll try this when it's safe to reboot the box.

Thanks very much for your help.

Sean

2001-07-27 16:25:08

by Lawrence Greenfield

Subject: Re: ext3-2.4-0.9.4

Hi,

I'm one of those icky application programmers attempting to make
reliable software across different versions of Unix.

We need to get data to disk portably, quickly, and reliably.

I love it when I see things like: "No, Linus is right and the MTA
guys are just wrong."

This sort of attitude is just ridiculous. Unix had a defined set of
semantics. This might have been stupid semantics, but it had them.
Then journalling filesystems, softupdates, and Linux async updates
came along and destroyed those semantics, preventing those of us who
want to write reliable applications using the filesystem from doing
so. At least Oracle doesn't change the definition of COMMIT.

When I contacted the Linux JFS team about the semantics of link(), I
was told that there is _no way_ of forcing a link() to disk. Not an
fsync() on the file, not an fsync() on the directory, just _not
possible_.

Great.

Then we come to ext2. "Oh, just call fsync() on the directory and
you'll be fine." Well, wait a second, if ext2 isn't ordering the
metadata writes, a crash at the wrong time (whether or not I've called
fsync()) may lose directory entries---even directory entries unrelated
to the files I'm doing operations on! Greeeeat.

Thus why all reasonably paranoid MTAs and other mail programs say "use
chattr +S on ext2"---we need ordered metadata writes.

Ok, journalled filesystems are better. At least crashes aren't going
to affect random files on disk. But since link() and the like don't
force a commit, we need some way---some reasonably portable way---of
getting that on disk. On softupdates, calling fsync() on a file
forces all directory entries pointing to that file to disk. This is
pretty reasonable. 1 fsync() call.

Why do we all cringe when we're told to call fsync() on the directory?
Several reasons:
 . not needed on any other variety of Unix
 . two fsync() calls force two different synchronization points: the
   application is forcing ordering on the OS that may not be needed.
   (Thus performance doesn't "fly" when you need multiple fsyncs.)
 . directory may have other modifications going on that we're not
   interested in
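
For concreteness, the two-fsync() pattern being objected to looks
roughly like the sketch below. The helper name durable_link() is
invented for this example, and whether the second fsync() really
commits the new directory entry depends on the filesystem, which is the
point in dispute.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create hard link `newpath` to `oldpath`, then
 * try to make both the file and the new directory entry durable.
 * `dirpath` is the directory that contains `newpath`.
 * Returns 0 on success, -1 on error.
 */
int durable_link(const char *oldpath, const char *newpath, const char *dirpath)
{
	int fd, dirfd, ret = -1;

	assert(oldpath && newpath && dirpath);
	fd = open(oldpath, O_RDONLY);
	if (fd < 0)
		return -1;

	/* First synchronization point: the file's data and inode. */
	if (fsync(fd) != 0)
		goto out;

	if (link(oldpath, newpath) != 0)
		goto out;

	dirfd = open(dirpath, O_RDONLY);
	if (dirfd < 0)
		goto out;

	/* Second synchronization point: the directory holding the link. */
	if (fsync(dirfd) == 0)
		ret = 0;
	close(dirfd);
out:
	close(fd);
	return ret;
}
```

Note that the directory fsync() also waits for whatever else happens to
be dirty in that directory, which is the third objection in the list.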

You want to help performance? Give us an fsync() that works on
multiple file descriptors at once, or an async fsync() call. Don't
make us fight the OS on getting data to disk.

Larry

2001-07-27 16:50:19

by Alan

Subject: Re: ext3-2.4-0.9.4

> This sort of attitude is just ridiculous. Unix had a defined set of
> semantics. This might have been stupid semantics, but it had them.

The unix defined semantics are very simple and very clear. They btw
don't contain the guarantees that certain email system authors think they do
and they never have.

rename() itself is new as of 4BSD, rather than ever being in true unix.
True unix did the right thing. It said 'this problem is hard, this problem
is application specific, do it at application level'.

> When I contacted the Linux JFS team about the semantics of link(), I
> was told that there is _no way_ of forcing a link() to disk. Not an
> fsync() on the file, not an fsync() on the directory, just _not
> possible_.

I would expect an fsync of the directory to do that. It does on other
Linux file systems so it violates the least surprise bit. Right now JFS
isn't a standard file system on Linux however, and they have much left to do.
I suspect it's something to ask them about.

> Thus why all reasonably paranoid MTAs and other mail programs say "use
> chattr +S on ext2"---we need ordered metadata writes.

And then your IDE disk gets you anyway. Also if you write metadata first
then you risk delivering email to the wrong person instead.

> You want to help performance? Give us an fsync() that works on
> multiple file descriptors at once, or an async fsync() call. Don't
> make us fight the OS on getting data to disk.

And what pray does an asynchronous fsync do? It seems to be a nop to me.

Doing reliable transactions on disk is a hard problem. That is why oracle
and friends have spent many man years of research on this kind of problem.
Current unix mailers do the smoke mirrors and prayer bit to reduce the
probability a little that is all, regardless of fs and os.

Alan

2001-07-27 16:57:39

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Fri, 27 Jul 2001, Lawrence Greenfield wrote:

> I'm one of those icky application programmers attempting to make
> reliable software across different versions of Unix.
>
> We need to get data to disk portably, quickly, and reliably.
>
> I love it when I see things like: "No, Linus is right and the MTA
> guys are just wrong."
>
> This sort of attitude is just ridiculous. Unix had a defined set of
> semantics. This might have been stupid semantics, but it had them.

The stuff you people seem to insist on, however, most
definitely isn't part of the defined set of semantics.

If you believe otherwise, feel free to point out the
relevant sections in POSIX / SuS / ...

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-27 17:17:29

by Bill Rugolsky Jr.

Subject: Re: ext3-2.4-0.9.4

On Fri, Jul 27, 2001 at 12:24:56PM -0400, Lawrence Greenfield wrote:
> I love it when I see things like: "No, Linus is right and the MTA
> guys are just wrong."
>
> This sort of attitude is just ridiculous. Unix had a defined set of
> semantics. This might have been stupid semantics, but it had them.
> Then journalling filesystems, softupdates, and Linux async updates
> came along and destroyed those semantics, preventing those of us who
> want to write reliable applications using the filesystem from doing
> so. At least Oracle doesn't change the definition of COMMIT.

First off, would you care to quote chapter and verse of these
"defined semantics" ? Do you mean the BSD source?

Traditional FFS/UFS achieves "safety" at a terrible cost to
performance. I can barely stand the wait to untar XFree86 on Solaris8
on a PII-333, even with UFS logging -- I'd rather use my Pentium 166
laptop running Linux! ext2 solved this performance issue many years
ago by recognizing that the FFS metadata scheme was not really safe
either; instead the intelligence was put into e2fsck, and where
necessary, the applications. (Do I hear faint echoes of the
"lint" v. "cc" design criterion ... ?)

The infrastructure is now in place to solve these problems in ext3,
without imposing a least-common-denominator approach that degrades
overall system performance. In these instances "Linus is right" when
he notes that (1) the proposed immediate solution does not really solve
the problem, and (2) once in there, developers will rely on its precise
semantics, making them difficult to get right later on, and providing
no incentive to do so. In many such instances "undefined" behavior is
the best intermediate solution.

As one can see from the "gkernel-commit" traffic, Andrew Morton has
not only taken away useful information from this thread, he's already
halfway to a solution, in just a day, because Matthias Andree took
the time to describe the functional requirements instead of just
whining that "it's not like BSD."

> Thus why all reasonably paranoid MTAs and other mail programs say "use
> chattr +S on ext2"---we need ordered metadata writes.

And that's precisely the type of thing we want -- unused features should
not impact the rest of the system.

Regards,

Bill Rugolsky

2001-07-27 17:44:29

by Alan

Subject: Re: ext3-2.4-0.9.4

>
> "Paul G. Allen" <[email protected]> writes:
>
> > Do the newer kernel releases support the 760 MP chipset? Will they
> > anytime soon? (If not I will see if I can put it in myself.)
>
> There is better support in 2.4.7 (especially IDE) but there is not complete
> support.
>
> I don't know of anyone planning on finishing up any of the pieces so feel free.
>
> Eric

2001-07-27 17:51:09

by Alan

Subject: Re: ext3-2.4-0.9.4

> These are tangential issues. Not everybody uses IDE disks. I'm not
> asking for things that are impossible. Just because sometimes the

Actually if I remember rightly the problem is mathematically insoluble.

> The application can avoid the wrong file problem by zeroing out data
> before releasing it to the OS to reallocate.

When you zero out the data what order do you want those writes in relative
to the rename?

> An async fsync allows me to issue multiple fsyncs and then wait for
> all of them to complete, hopefully in the same framework that I would
> do async I/O (but that's an argument for another day).

Ok.. right that makes more sense. So you actually want 'begin_fsync' and
'wait_fsync_all' type stuff.

> Doing reliable transactions on disk is a hard problem. That is why oracle
> and friends have spent many man years of research on this kind of problem.
> Current unix mailers do the smoke mirrors and prayer bit to reduce the
> probability a little that is all, regardless of fs and os.
>
> Isn't the point of the operating system to try to make it as easy as
> possible to do these things correctly?

The OS doesn't have enough information. To do transactions you must know the
entire material that corresponds to the transaction and bound it. That isn't
something the kernel has the knowledge about.

The job of the OS is to make the simple things easy, and the hard possible.
Not to burden the simple with the cost of the hard. That's why the chattr +S
is such a nice solution in many ways

Alan

2001-07-27 20:40:16

by Michal Jaegermann

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

On Fri, Jul 27, 2001 at 10:32:21AM +0100, Sean Hunter wrote:
> Following the announcement on lkml, I have started using ext3 on one of my
> servers. Since the server in question is a fairly security-sensitive box, my
> /usr partition is mounted read only except when I remount rw to install
> packages.

Regardless of possible weirdness in a "smart" behaviour of 'mount' what
one exactly buys running a journaling file system on a _read only_
partition? fsck times will be the same (unless you crashed when
installing new software :-).

Michal

2001-07-27 20:46:06

by Alan

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

> Regardless of possible weirdness in a "smart" behaviour of 'mount' what
> one exactly buys running a journaling file system on a _read only_
> partition? fsck times will be the same (unless you crashed when
> installing new software :-).

Several things:

1. The simple case of remounting an fs read-only is easy, since no
writes means no journal

2. The software suspend case is horrible. Right now mixing a
journalling fs and swsuspend tends to cause disk corruption because
journalling fs's write to disk when told to mount read only

3. Failed drives. Here the journalling mount overrides the read only
request and the machine locks up preventing data recovery except
by copying the whole 80Gb disk image to another disk

Been there, it sucks

4. Snapshots. Making read only snapshots can be very useful, and there
you want the replay of the log to be into the page cache but not
written back to physical media until it's marked read-write

Alan

2001-07-27 21:08:41

by Chris Wedgwood

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

On Fri, Jul 27, 2001 at 09:46:57PM +0100, Alan Cox wrote:

> 2. The software suspend case is horrible. Right now mixing a
> journalling fs and swsuspend tends to cause disk corruption because
> journalling fs's write to disk when told to mount read only

this is hard to fix... the fs needs to replay things to make things
consistent, and in many cases doing an 'in-memory' replay isn't an
option (ie. remember which stuff needs to be replayed and read from the
journal instead of disk when required to do so)

> 4. Snapshots. Making read only snapshots can be very useful, and there
> you want the replay of the log to be into the page cache but not
> written back to physical media until it's marked read-write

R/O snapshots are best done in the fs if possible, à la
WAFL. Something like that for reiserfs or TUX2 would rule so much (you
more-or-less need a tree-based fs and reference counting for all
the magic bits). In fact, doing it at the fs layer means you could
have r/w snapshots with COW semantics.



--cw

2001-07-27 21:12:11

by Daniel Phillips

Subject: Re: ext3-2.4-0.9.4

On Friday 27 July 2001 19:41, Lawrence Greenfield wrote:
> From: Alan Cox <[email protected]>
> > Lawrence Greenfield wrote:
> > > You want to help performance? Give us an fsync() that works on
> > > multiple file descriptors at once, or an async fsync() call.
> > > Don't make us fight the OS on getting data to disk.
> >
> > And what pray does an asynchronous fsync do? It seems to be a nop
> > to me.
>
> An async fsync allows me to issue multiple fsyncs and then wait for
> all of them to complete, hopefully in the same framework that I would
> do async I/O (but that's an argument for another day).

I'll say. While it's truly desirable, all known filesystems are *far*
from being able to do that. An efficient, reliable fsync would do the
trick for you, or even an efficient sync. And somewhere in Andrew
Morton's bag of tricks is something to fix you up too, read his
comments carefully.

Looking forward, a sanely defined filesystem transaction interface
from userland would give the best possible combination of performance
and reliability.[1] Since we now have four filesystems (five if you
count JFFS) that could implement such a transaction interface, now is
the time to figure out what it would look like. That would include
accommodating the needs of MTA developers. It would be Linux-specific
for sure. It would also be progress. If it turned out to be the
fastest way to run a mailer we'd see it migrate to other nixes soon
enough.

> Doing reliable transactions on disk is a hard problem. That is
> why oracle and friends have spent many man years of research on this
> kind of problem.

Tell me about it ;-)

> Current unix mailers do the smoke mirrors and prayer
> bit to reduce the probability a little that is all, regardless of fs
> and os.
>
> Isn't the point of the operating system to try to make it as easy as
> possible to do these things correctly?

begin_transaction (filesystem_handle);
<send the mail>;
if (!end_transaction (filesystem_handle))
<confirm sent>;

Something like that.[2] Caveat: this is blue-sky stuff, it is not
going to solve your problem today. Andrew Morton and Hans Reiser are
working on solving the problem today by giving you at least one mode
where rename is synchronous, or at least giving you a fast fsync.

I'm with those who think that a little short-term pain is worth it if
the final result is superior.

> Otherwise you force anyone who wants to write a reliable application
> (be it e-mail or not) to go to Oracle and one wonders why fsync() is
> even implemented.

[1] Al Viro pointed out that such a transaction interface could open up
new possibilities for DOS attacks, something that has to be anticipated
in the design.

[2] I see Alan suggested essentially the same thing in another branch
of this thread. Then by the "million flies" theorem...

--
Daniel

2001-07-27 21:25:06

by Alan

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

> more-or-less need a tree-based fs and reference counting for all
> the magic bits). In fact, doing it at the fs layer means you could
> have r/w snapshots with COW semantics.

You don't want r/w snapshots for archiving. An archive of a previous date is
worthless if it can't be absolutely utterly and definitively read only.

It is hard to do well, but it's an important item. One possibility is to do
it by replaying the log to a r/w snapshot (in ram) over a r/o snapshot (on
stable media)

2001-07-27 21:27:46

by Chris Wedgwood

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

On Fri, Jul 27, 2001 at 10:23:44PM +0100, Alan Cox wrote:

> You don't want r/w snapshots for archiving. An archive of a
> previous date is worthless if it can't be absolutely utterly and
> definitively read only.

sure, for archiving you don't, but for other purposes you might

RO is easier and what most people want, this is all WAFL gives right now

RW has its uses too, especially if you can clone /foo/bar to
/foo/blem and such like, a cheaper more elegant way of cp -Rupdl I
guess

> It is hard to do well, but it's an important item. One possibility
> is to do it by replaying the log to a r/w snapshot (in ram) over a
> r/o snapshot (on stable media)

you can probably get away without the need for replay... just build
an in-memory extent list of blocks that would otherwise have been
rewritten and the journal offsets, before you read a block, you check
to see if you need to get it from the journal first

obviously you need to make sure you get the last instance of each block
in the journal should there be more than one
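
A toy version of that lookup, to make the idea concrete. It is
simplified to a flat per-block table rather than a real extent list,
and every name in it (struct jmap, jmap_record(), jmap_lookup(),
MAP_SLOTS) is invented for the sketch rather than taken from any
filesystem's code.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_SLOTS 64	/* assumed capacity, enough for a sketch */

struct jmap_entry {
	unsigned long block;	/* block number in the main fs */
	unsigned long joffset;	/* where its newest copy sits in the journal */
};

struct jmap {
	struct jmap_entry slot[MAP_SLOTS];
	size_t used;
};

/*
 * Record that `block` has a copy at `joffset`. Meant to be called while
 * scanning the journal in order, so a later call for the same block
 * overwrites the earlier one: the last instance wins.
 * Returns 0 on success, -1 if the map is full.
 */
int jmap_record(struct jmap *m, unsigned long block, unsigned long joffset)
{
	size_t i;

	assert(m);
	for (i = 0; i < m->used; i++) {
		if (m->slot[i].block == block) {
			m->slot[i].joffset = joffset;
			return 0;
		}
	}
	if (m->used == MAP_SLOTS)
		return -1;
	m->slot[m->used].block = block;
	m->slot[m->used].joffset = joffset;
	m->used++;
	return 0;
}

/*
 * Before reading `block`, ask whether it must come from the journal.
 * Returns 1 and sets *joffset if so, 0 if the on-disk copy is current.
 */
int jmap_lookup(const struct jmap *m, unsigned long block,
		unsigned long *joffset)
{
	size_t i;

	assert(m && joffset);
	for (i = 0; i < m->used; i++) {
		if (m->slot[i].block == block) {
			*joffset = m->slot[i].joffset;
			return 1;
		}
	}
	return 0;
}
```

A real implementation would want something better than a linear scan,
but the read path is the same: look the block up first, and only fall
through to the main fs when the map has no entry for it.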



--cw

2001-07-27 17:41:59

by Lawrence Greenfield

Subject: Re: ext3-2.4-0.9.4

> Date: Fri, 27 Jul 2001 17:50:29 +0100 (BST)
> Cc: [email protected]
> From: Alan Cox <[email protected]>
>
> [...]
> > Thus why all reasonably paranoid MTAs and other mail programs say "use
> > chattr +S on ext2"---we need ordered metadata writes.
>
> And then your IDE disk gets you anyway. Also if you write metadata first
> then you risk delivering email to the wrong person instead.

These are tangential issues. Not everybody uses IDE disks. I'm not
asking for things that are impossible. Just because sometimes the
hardware screws you isn't a good reason for not trying to do the right
thing.

The application can avoid the wrong file problem by zeroing out data
before releasing it to the OS to reallocate.

> > You want to help performance? Give us an fsync() that works on
> > multiple file descriptors at once, or an async fsync() call. Don't
> > make us fight the OS on getting data to disk.
>
> And what pray does an asynchronous fsync do? It seems to be a nop to me.

An async fsync allows me to issue multiple fsyncs and then wait for
all of them to complete, hopefully in the same framework that I would
do async I/O (but that's an argument for another day).

> Doing reliable transactions on disk is a hard problem. That is why oracle
> and friends have spent many man years of research on this kind of problem.
> Current unix mailers do the smoke mirrors and prayer bit to reduce the
> probability a little that is all, regardless of fs and os.

Isn't the point of the operating system to try to make it as easy as
possible to do these things correctly?

Otherwise you force anyone who wants to write a reliable application
(be it e-mail or not) to go to Oracle and one wonders why fsync() is
even implemented.

Larry

2001-07-28 14:58:10

by kaih

Subject: Re: Strange remount behaviour with ext3-2.4-0.9.4

[email protected] (Alan Cox) wrote on 27.07.01 in <[email protected]>:

> > more-or-less need a tree-based fs and reference counting for all
> > the magic bits). In fact, doing it at the fs layer means you could
> > have r/w snapshots with COW semantics.
>
> You don't want r/w snapshots for archiving.

Not for archiving, but when you want to run something and then throw it
away again, for example. You could do that by just holding onto a ro
snapshot and then replacing the rw tree with it later, but by having two
rw trees you don't need to stop your regular operations.

For this to really be useful, you'd want it as an inheritable per-process
thing, similar to aviro's namespace thing.

MfG Kai

2001-07-28 16:52:55

by Patrick J. LoPresti

Subject: Re: ext3-2.4-0.9.4

Alan Cox <[email protected]> writes:

> Also if you write metadata first then you risk delivering email to
> the wrong person instead.

The MTAs do this:

Open temp file
Write to temp file
fsync() temp file
rename() temp file into mail spool
indicate success to remote MTA

As long as rename() does not return until the metadata are committed,
this should be a reliable delivery mechanism. After a crash, you
might end up with the temp file still there, or with the file having a
link count of two (temp file and spool file). But you can clean up
all of this at boot time; if the temp file is gone and the spool file
is present, then the transaction was completed.
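
In C, the sequence above might look like the following sketch. The
helper name deliver() is hypothetical, and the code makes no claim about
whether rename() is committed when it returns; it only shows where in
the sequence that guarantee would be needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper following the steps listed above: write the
 * message to `tmppath`, fsync() it, then rename() it to `spoolpath`.
 * Returns 0 once rename() has returned successfully, -1 on error.
 * Only after a 0 return would the MTA indicate success to its peer.
 */
int deliver(const char *tmppath, const char *spoolpath,
	    const char *msg, size_t len)
{
	size_t done = 0;
	int fd;

	assert(tmppath && spoolpath && msg);

	/* Open temp file; O_EXCL so a stale temp file is never reused. */
	fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;

	/* Write to temp file, allowing for short writes. */
	while (done < len) {
		ssize_t n = write(fd, msg + done, len - done);
		if (n < 0)
			goto fail;
		done += (size_t)n;
	}

	/* fsync() temp file: the data must be on disk before the rename. */
	if (fsync(fd) != 0)
		goto fail;
	if (close(fd) != 0) {
		unlink(tmppath);
		return -1;
	}

	/* rename() temp file into the mail spool. */
	if (rename(tmppath, spoolpath) != 0) {
		unlink(tmppath);
		return -1;
	}
	return 0;
fail:
	close(fd);
	unlink(tmppath);
	return -1;
}
```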

(Yes, you might not have returned the success code to the remote MTA,
but that just means you might do a double delivery. That is an
acceptable failure mode; corrupting, losing, or misdirecting mail is
not.)

How does this scheme "risk delivering mail to the wrong person
instead"?

If you have metadata journalling, all you need for this algorithm to
work is to have rename() write to the journal before returning. Is
this true for any of the current journalling file systems on Linux?

- Pat

2001-07-28 19:03:10

by Alan

Subject: Re: ext3-2.4-0.9.4

> How does this scheme "risk delivering mail to the wrong person
> instead"?

With the fsync it looks ok for most cases. It depends on the actions of
a rename touching only one disk block - which of course it doesn't do. Even
so, with the fsync on a sane fs I can't see that problem occurring.

> If you have metadata journalling, all you need for this algorithm to
> work is to have rename() write to the journal before returning. Is
> this true for any of the current journalling file systems on Linux?

Ext3 I believe so, Reiserfs I would assume so but Hans can answer
definitively

2001-07-28 22:46:05

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Hans Reiser wrote:

> No, Linus is right and the MTA guys are just wrong. The mailers are
> the place to fix things, not the kernel. If the mailer guys want to
> depend on the kernel being stupidly designed, tough. Someone should
> fix their mailer code and then it would run faster on Linux than on
> any other platform.

Well, some systems are even documented that way, so this is not a case
of "depend on the kernel being stupidly designed", but of "depend on
what mount(8) says".

MTA authors don't play games, they also write that their software relies
on this behaviour, as laid out.

--
Matthias Andree

2001-07-28 23:16:02

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Fri, 27 Jul 2001, Rik van Riel wrote:

> The stuff you people seem to insist on, however, most
> definitely isn't part of the defined set of semantics.

And even if it's "inherited wisdom", you cannot simply tell those people
"don't rely on that" if - as claimed - you can't even force a link() to
disk.

> If you believe otherwise, feel free to point out the
> relevant sections in POSIX / SuS / ...

The standard is only useful if it specifies how to get data safely on
disk - it is quite explicit for fsync(), but you evidently cannot
fsync() a link().

--
Matthias Andree

2001-07-28 23:50:29

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Sun, 29 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Hans Reiser wrote:
>
> > No, Linus is right and the MTA guys are just wrong. The mailers are
> > the place to fix things, not the kernel. If the mailer guys want to
> > depend on the kernel being stupidly designed, tough. Someone should
> > fix their mailer code and then it would run faster on Linux than on
> > any other platform.
>
> Well, some systems are even documented that way, so there's nothing
> with "depend on the kernel being stupidly designed", but "depend on
> what mount(8) says".

The key word here is "some systems".

> MTA authors don't play games, they also write that their software
> relies on this behaviour, as laid out.

"MTA authors don't play games" ?!?!

I wonder how that explains things like QMQP or the
next-to-useless bounce messages generated by Notes ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-28 23:47:39

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Sun, 29 Jul 2001, Matthias Andree wrote:

> The standard is only useful if it specifies how to get data safely on
> disk - it is quite explicit for fsync(), but you evidently cannot
> fsync() a link().

As Linus said, fsync() on the directory.

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-29 00:08:20

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Sat, 28 Jul 2001, Rik van Riel wrote:

> > The standard is only useful if it specifies how to get data safely on
> > disk - it is quite explicit for fsync(), but you evidently cannot
> > fsync() a link().
>
> As Linus said, fsync() on the directory.

Relying on that to work on other operating systems is no better than
demanding synchronous meta data writes: relying on undocumented
behaviour.

If we were talking about Linux-specific applications, that'd be okay,
but we are talking about portable applications, and the diversity is
bigger than useful. Speed is not the only problem the OS has to solve.

--
Matthias Andree

2001-07-29 01:53:32

by Chris Wedgwood

Subject: Re: ext3-2.4-0.9.4

On Sat, Jul 28, 2001 at 08:03:37PM +0100, Alan Cox wrote:

> Ext3 I believe so, Reiserfs I would assume so but Hans can answer
> definitively

Reiserfs does not, nor are creates or unlink operations synchronous.

For MTAs it just happens to work: if you fsync, the way transactions
are written means the metadata for the directories is written as part
of the transaction --- but I think this is a quirk and not by design?

Chris?




--cw

2001-07-29 01:53:42

by Andrew Morton

Subject: Re: ext3-2.4-0.9.4

Alan Cox wrote:
>
>...
> > If you have metadata journalling, all you need for this algorithm to
> > work is to have rename() write to the journal before returning. Is
> > this true for any of the current journalling file systems on Linux?
>
> Ext3 I believe so, Reiserfs I would assume so but Hans can answer
> definitively

For ext3: this is true if something forces a commit. Apart from data in
`-o data=writeback' mode, a commit syncs the entire filesystem.
Things which force a commit include:

- Completing a write() on an O_SYNC file.
- Performing any metadata operation on a `chattr +S' object.
- Performing any metadata operation on an object on a `mount -o sync'
  filesystem.

In `data=journal' or `data=ordered' mode, any of these things will
commit everything to non-volatile storage.
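
As an illustration of the first item in that list, here is a small sketch of a write() through an O_SYNC descriptor. The helper name `write_sync` is hypothetical, and the code only shows the user-space call; that completing it forces an ext3 commit is taken from the description above, not demonstrated by the code.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path through an O_SYNC
 * descriptor, so write() returns only after the data has been handed
 * to stable storage. Returns 0 on success, -1 on any failure.
 */
int write_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;   /* treat a short write as failure */

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```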


2001-07-29 02:51:36

by Mike Touloumtzis

Subject: Re: ext3-2.4-0.9.4

On Sun, Jul 29, 2001 at 02:08:12AM +0200, Matthias Andree wrote:
> On Sat, 28 Jul 2001, Rik van Riel wrote:
>
> > As Linus said, fsync() on the directory.
>
> Relying on that to work on other operating systems is no better than
> demanding synchronous meta data writes: relying on undocumented
> behaviour.

You are blurring the boundaries between "undocumented behavior" and
"OS-specific behavior". fsync() on a directory to sync metadata is a
defined (according to my copy of fsync(2)), Linux-specific behavior.
It is also very reasonable IMHO and in keeping with the traditional
Unix notion of directories as lists of files.
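
A minimal sketch of what "fsync() on a directory" looks like from user space, assuming a hypothetical helper name `fsync_dir`. It relies only on open(), fsync() and close(); whether the call actually commits pending directory entries is the Linux-specific behaviour described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a directory read-only and fsync() the
 * descriptor. Returns 0 on success, -1 with errno set on failure.
 */
int fsync_dir(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;   /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return rc;
}
```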

I argue that using defined Linux behavior to implement what you want
on Linux systems _is_ better than relying on undocumented behavior,
and I think most people would agree. If you don't do this you have
not really ported the software to Linux; you instead have some
standards compliant software that "kinda usually works on Linux".
You could argue that no one should localize their software to
different versions of Unix, but you would be by far in the minority.

http://www.google.com/search?q=autoconf

Writing portable Unix software has always meant some degree
of system-specific accommodation. It's a bummer but it's life;
otherwise Unix wouldn't evolve.

miket

2001-07-29 09:28:18

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Sat, 28 Jul 2001, Mike Touloumtzis wrote:

> You are blurring the boundaries between "undocumented behavior" and
> "OS-specific behavior". fsync() on a directory to sync metadata is a
> defined (according to my copy of fsync(2)), Linux-specific behavior.
> It is also very reasonable IMHO and in keeping with the traditional
> Unix notion of directories as lists of files.

No-one claims that fsync() the directory is a bad interface - it's
non-portable however. Actually, chattr +S is well-documented - it just
doesn't work on ReiserFS or Minix for now, and it may be unnecessarily
slow on ext2.

As pointed out more than once, "synchronous meta data" is documented,
e.g. for FreeBSD, so in at least these two cases, the box relies on
documented behaviour.

> http://www.google.com/search?q=autoconf
>
> Writing portable Unix software has always meant some degree
> of system-specific accommodation. It's a bummer but it's life;
> otherwise Unix wouldn't evolve.

How can autoconf figure if you need to fsync() the directory? Apart from
that, which Unix MTA uses autoconf?

Remember, the whole discussion is about getting rid of the need for
chattr +S and offering the admin the chance to mount or flag a directory
for synchronous meta data updates.

--
Matthias Andree

2001-07-29 13:44:58

by Hans Reiser

Subject: Re: ext3-2.4-0.9.4

Matthias Andree wrote:
>
> On Thu, 26 Jul 2001, Hans Reiser wrote:
>
> > No, Linus is right and the MTA guys are just wrong. The mailers are
> > the place to fix things, not the kernel. If the mailer guys want to
> > depend on the kernel being stupidly designed, tough. Someone should
> > fix their mailer code and then it would run faster on Linux than on
> > any other platform.
>
> Well, some systems are even documented that way, so there's nothing with
> "depend on the kernel being stupidly designed", but "depend on what
> mount(8) says".
>
> MTA authors don't play games, they also write that their software relies
> on this behaviour, as laid out.
>
> --
> Matthias Andree
Documenting their code won't make it fast or well designed.

Hans

2001-07-29 14:00:58

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Sun, 29 Jul 2001, Matthias Andree wrote:
> On Sat, 28 Jul 2001, Rik van Riel wrote:
>
> > > The standard is only useful if it specifies how to get data safely on
> > > disk - it is quite explicit for fsync(), but you evidently cannot
> > > fsync() a link().
> >
> > As Linus said, fsync() on the directory.
>
> Relying on that to work on other operating systems is no better than
> demanding synchronous meta data writes: relying on undocumented
> behaviour.
>
> If we were talking about Linux-specific applications, that'd be okay,
> but we are talking about portable applications, and the diversity is
> bigger than useful. Speed is not the only problem the OS has to solve.

I guess many MTAs have a small libc inside of them exactly
in order to handle things like this without fouling up the
core code too much.

Time to make your favorite MTA use link_slowly() ;)

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-29 14:16:30

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Sun, 29 Jul 2001, Matthias Andree wrote:

> How can autoconf figure if you need to fsync() the directory? Apart
> from that, which Unix MTA uses autoconf?

Zmailer uses autoconf, Exim also has some nice
tool to make itself build for the right OS using
the right interfaces.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-30 00:34:01

by Chris Mason

Subject: Re: ext3-2.4-0.9.4



On Sunday, July 29, 2001 01:53:48 PM +1200 Chris Wedgwood <[email protected]>
wrote:

> On Sat, Jul 28, 2001 at 08:03:37PM +0100, Alan Cox wrote:
>
> Ext3 I believe so, Reiserfs I would assume so but Hans can answer
> definitively
>
> Reiserfs does not, nor are creates or unlink operations synchronous.
>
> For MTAs it just happens to work: if you fsync the way transactions
> are written means the metadata for the dirtectories is written as part
> of the transaction --- but I think this is a quirk and not by design?
>
> Chris?

Correct, in the current 2.4.x code, it's a quirk. fsync(any object) ==
fsync(all pending metadata, including renames).

There is a transaction tracking patch floating around out there that makes
reiserfs fsync/O_SYNC much faster by only committing the last transaction a
given file/dir was involved in. I had sent this to Alan just after 2.4.7
came out, but it looks like I need to resend.

Anyway, during a rename, this patch updates the inode transaction tracking
stuff so an fsync on the file should also commit the directory changes.
But, that isn't something I really intend to advertise much, since the
accepted Linux way is fsync(dir).
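
The idea of per-inode transaction tracking can be illustrated with a toy model. Everything in it (`struct journal`, `struct toy_inode`, the `toy_*` functions) is hypothetical and invented for the illustration; it is not reiserfs code and makes no claim about how the patch is implemented.

```c
/* Toy model only: none of these names come from reiserfs or ext3. */
struct journal {
    unsigned long running_tid;    /* id of the transaction currently open */
    unsigned long committed_tid;  /* highest transaction id already on disk */
};

struct toy_inode {
    unsigned long last_tid;       /* last transaction that touched this inode */
};

/* Record that the running transaction modified this inode. */
void toy_touch(struct journal *j, struct toy_inode *ino)
{
    ino->last_tid = j->running_tid;
}

/* A rename dirties the file and both directories in one transaction. */
void toy_rename(struct journal *j, struct toy_inode *file,
                struct toy_inode *old_dir, struct toy_inode *new_dir)
{
    toy_touch(j, file);
    toy_touch(j, old_dir);
    toy_touch(j, new_dir);
}

/* Close the running transaction and open a new one. */
void toy_close_transaction(struct journal *j)
{
    j->running_tid++;
}

/*
 * fsync: commit only as far as the last transaction this inode was
 * involved in. Returns the number of transactions that had to be
 * committed (0 if the inode's changes were already on disk).
 */
unsigned long toy_fsync(struct journal *j, struct toy_inode *ino)
{
    unsigned long n = 0;

    if (ino->last_tid > j->committed_tid) {
        n = ino->last_tid - j->committed_tid;
        j->committed_tid = ino->last_tid;
        /* Committing the running transaction forces a new one to open. */
        if (j->running_tid <= j->committed_tid)
            j->running_tid = j->committed_tid + 1;
    }
    return n;
}
```

In this model an fsync on the renamed file commits the transaction that also holds the directory changes, while later, unrelated transactions stay pending.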

-chris

2001-07-29 23:19:28

by Mike Touloumtzis

Subject: Re: ext3-2.4-0.9.4

On Sun, Jul 29, 2001 at 11:28:10AM +0200, Matthias Andree wrote:
>
> How can autoconf figure if you need to fsync() the directory? Apart from
> that, which Unix MTA uses autoconf?

My point was not that they should be using autoconf;
I don't know if they are or not. My point was that
they should use existing published interfaces that are
reasonable, rather than push for guarantees that impose
new requirements on filesystems. And even without
autoconf it's not hard to figure out what system you're
running on.

rename(tmpfile, spoolfile);
#ifdef __linux__
/* tmpdir and spooldir: descriptors opened on the two directories */
fsync(tmpdir);
fsync(spooldir);
#endif
/* transaction is complete */

>
> Remember, the whole discussion is about getting rid of the need for
> chattr +S and offering the admin the chance to mount or flag a directory
> for synchronous meta data updates.

Right; and I'm arguing that the way to get rid of the need
for chattr +S is to incorporate directory fsync() in the
MTAs, not to cram more features into the filesystems.

Problem: MTA needs to know when rename() has been forced
to disk.

Solution 1: MTA authors use fsync(dirfd) on Linux.

Analysis: This is not the most portable solution, but it
should work on any FS that supports Linux semantics. You
can't expect such semantics on FAT and other filesystems
that are just supported for compatibility reasons. But you
could, say, switch filesystems for performance reasons, and
not have your MTA start mysteriously failing, because you
are using the official, documented API to do what you want
to do (at the very least you would be in a much stronger
position when pushing a bug fix :-).

Solution 2: Linux semantics are changed so that rename()
returns only when the data hits the disk. All filesystems
are expected to implement this change.

Analysis: This sucks. It precludes some filesystem design
choices, prevents users from making a speed/reliability
tradeoff, and makes each filesystem more complex.

Solution 3: Some filesystems implement synchronous
directory updates for renames, using filesystem-specific
feature flags, chattr, etc.

Analysis: I wouldn't want to try to dictate anything to
the FS authors, but this solution seems inferior to me.
Each filesystem would have to implement such a flag to
become "MTA compatible". Why add a complex feature to the
filesystem when it can already be accessed via a userspace
API? It will be more complex for administrators too --
they will have to know which filesystems implement the
synchronous directory metadata.

There are lots of filesystems out there. Why not use
an interface they should all support rather than ask for
per-filesystem, filesystem-specific improvements?

miket

2001-07-30 12:48:15

by Ketil Froyn

Subject: Re: ext3-2.4-0.9.4

On Sun, 29 Jul 2001, Matthias Andree wrote:

> On Sat, 28 Jul 2001, Mike Touloumtzis wrote:
>
> > You are blurring the boundaries between "undocumented behavior" and
> > "OS-specific behavior". fsync() on a directory to sync metadata is a
> > defined (according to my copy of fsync(2)), Linux-specific behavior.
> > It is also very reasonable IMHO and in keeping with the traditional
> > Unix notion of directories as lists of files.

> > http://www.google.com/search?q=autoconf
> >
> > Writing portable Unix software has always meant some degree
> > of system-specific accommodation. It's a bummer but it's life;
> > otherwise Unix wouldn't evolve.
>
> How can autoconf figure if you need to fsync() the directory?

Simple! Grep the fsync(2) manpage ;)

Ketil the joker

2001-07-30 13:49:31

by Patrick J. LoPresti

Subject: Re: ext3-2.4-0.9.4

Chris Mason <[email protected]> writes:

> Correct, in the current 2.4.x code, it's a quirk. fsync(any object) ==
> fsync(all pending metadata, including renames).

This does not help. The MTAs are doing fsync() on the temporary file
and then using the *subsequent* rename() as the committing operation.

> Anyway, during a rename, this patch updates the inode transaction
> tracking stuff so an fsync on the file should also commit the
> directory changes. But, that isn't something I really intend to
> advertise much, since the accepted linux way is fsync(dir).

It would be nice to have an option (on either the directory or the
mountpoint) to cause all metadata updates to commit to the journal
without causing all operations to be fully synchronous. This would
provide compatibility with BSD-centric code without taking the
performance hit of synchronous data. Heck, just having link() and
rename() perform a commit would be good enough for almost all
applications.

- Pat

2001-07-30 13:54:41

by Alan

Subject: Re: ext3-2.4-0.9.4

> Chris Mason <[email protected]> writes:
>
> > Correct, in the current 2.4.x code, it's a quirk. fsync(any object) ==
> > fsync(all pending metadata, including renames).
>
> This does not help. The MTAs are doing fsync() on the temporary file
> and then using the *subsequent* rename() as the committing operation.

Which is quaint, because as we've pointed out repeatedly to you rename
is not an atomic operation. Even on a simple BSD or ext2 style fs it can
be two directory block writes, metadata block writes, a bitmap write
and a cylinder group write.

> It would be nice to have an option (on either the directory or the
> mountpoint) to cause all metadata updates to commit to the journal
> without causing all operations to be fully synchronous. This would

You mean fsync() on the directory.

Alan

2001-07-30 14:39:05

by Patrick J. LoPresti

Subject: Re: ext3-2.4-0.9.4

Alan Cox <[email protected]> writes:

> > Chris Mason <[email protected]> writes:
> >
> > > Correct, in the current 2.4.x code, it's a quirk. fsync(any object) ==
> > > fsync(all pending metadata, including renames).
> >
> > This does not help. The MTAs are doing fsync() on the temporary file
> > and then using the *subsequent* rename() as the committing operation.
>
> Which is quaint, because as we've pointed out repeatedly to you rename
> is not an atomic operation. Even on a simple BSD or ext2 style fs it can
> be two directory block writes, metadata block writes, a bitmap write
> and a cylinder group write.

But not on a journalling filesystem. I assume that a journal "commit"
is atomic. If it is not, then fsync() on the directory does not solve
the problem either.

Put another way, I am suggesting a mount-time or directory option to
effectively cause rename() and link() to automatically be followed by
an fsync() of the containing directory. (Actually, from this
perspective, maybe you could fix the MTA in user space with LD_PRELOAD
hackery or somesuch. Hm...)

> > It would be nice to have an option (on either the directory or the
> > mountpoint) to cause all metadata updates to commit to the journal
> > without causing all operations to be fully synchronous. This would
>
> You mean fsync() on the directory.

In other words, "Get the MTA authors to change their code." That is a
nice little war, but it is fought at the expense of users who just
want to use the code provided by their vendor and have it work.

The situation is this:

The relevant standards (POSIX, SuS, etc.) provide no way to perform
reliable transactions on a file system.

BSD provides one solution, which is synchronous metadata. (I am
assuming modern BSDs already deal with the multiple-disk-block
problem to make these transactions properly atomic. Is this
assumption false?)

Linux provides a different solution, which is fsync() on the
directory.

All MTAs, and other apps besides, currently use the BSD solution for
reliable transactions.

Is it really so absurd to ask Linux to provide efficient support of
the BSD semantics as an option?

- Pat

2001-07-30 16:22:52

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On 30 Jul 2001, Patrick J. LoPresti wrote:

> performance hit of synchronous data. Heck, just having link() and
> rename() perform a commit would be good enough for almost all
> applications.

It would be "good enough" for some applications,
but it would be absolutely disastrous for most
applications I run (ie. moving source code around).

Exactly what is wrong with doing fsync() on the
directory ?

Why do you want us to turn link() and rename()
into link_slowly() and rename_slowly() ?

Why can't you use a simple wrapper function to
do this for you ?

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-30 16:28:22

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On 30 Jul 2001, Patrick J. LoPresti wrote:

> The relevant standards (POSIX, SuS, etc.) provide no way to perform
> reliable transactions on a file system.
>
> BSD provides one solution, which is synchronous metadata. (I am
> assuming modern BSDs already deal with the multiple-disk-block
> problem to make these transactions properly atomic. Is this
> assumption false?)
>
> Linux provides a different solution, which is fsync() on the
> directory.
>
> All MTAs, and other apps besides, currently use the BSD solution for
> reliable transactions.
>
> Is it really so absurd to ask Linux to provide efficient support of
> the BSD semantics as an option?

Yes. You could fix this issue in userland very easily,
it might even work with an LD_PRELOAD ...

Besides, BSD softupdates and the various journaling
filesystems which are in use on other Unixen also
don't provide the 4.3BSD solution any more ...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-30 16:46:23

by Patrick J. LoPresti

Subject: Re: ext3-2.4-0.9.4

Rik van Riel <[email protected]> writes:

> Exactly what is wrong with doing fsync() on the
> directory ?

Nothing, except that it requires source code changes to every
application which expects BSD semantics for these operations.
Anecdotal evidence suggests at least the MTA authors are resistant to
making such changes.

> Why do you want us to turn link() and rename()
> into link_slowly() and rename_slowly() ?

I don't by default, only as an option. You know, just like "chattr
+S" or "mount -o sync" means do_everything_slowly().

> Why can't you use a simple wrapper function to
> do this for you ?

It would not be all that simple; it would have to parse the arguments
to figure out the containing directories, open() a file descriptor on
each, and fsync() them. Not impossible, but it does introduce several
additional system calls as performance hits and points of failure,
not to mention possible race conditions.
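
A rough sketch of such a wrapper, to make the cost concrete. The names `sync_parent` and `rename_and_sync` are hypothetical. It uses dirname() to find the containing directories and adds an open(), an fsync() and a close() per directory on top of the rename() itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fsync() the directory that contains path. Returns 0 or -1. */
static int sync_parent(const char *path)
{
    char *copy = strdup(path);      /* dirname() may modify its argument */
    if (copy == NULL)
        return -1;

    int fd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical wrapper: rename(), then fsync() both containing
 * directories. Returns 0 only if all steps succeeded.
 */
int rename_and_sync(const char *oldpath, const char *newpath)
{
    if (rename(oldpath, newpath) < 0)
        return -1;

    /* The destination directory holds the new name, so sync it first. */
    if (sync_parent(newpath) < 0)
        return -1;
    return sync_parent(oldpath);
}
```

The race mentioned above is visible here: nothing stops another process from moving either directory between the rename() and the later open() calls, so the wrapper may end up syncing a different directory than the one the rename touched.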

Still, I suppose you could do this well enough in the C library. You
might even want it to be the default when "__USE_BSD" is defined or
something.

But it still seems simpler to me just to make it an option in the file
system.

In your next message, you say:

> Besides BSD softupdates and the various journaling
> filesystems which are in use on other Unixen also
> don't provide the 4.3BSD solution any more ...

This surprises me if it is true; do you have a reference? And what
mechanism *do* the modern BSDs provide to commit metadata changes to
disk?

- Pat

2001-07-30 17:03:35

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On 30 Jul 2001, Patrick J. LoPresti wrote:
> Rik van Riel <[email protected]> writes:
>
> > Exactly what is wrong with doing fsync() on the
> > directory ?
>
> Nothing, except that it requires source code changes to every
> application which expects BSD semantics for these operations.
> Anecdotal evidence suggests at least the MTA authors are resistant to
> making such changes.

You may need to make them anyway for Digital's AdvFS,
IRIX XFS, IBM JFS, Veritas' VXFS and BSD softupdates.

Let's face it, FFS is no longer the only available
filesystem. Don't expect FFS semantics from other
filesystems.

> > Why can't you use a simple wrapper function to
> > do this for you ?
>
> It would not be all that simple; it would have to parse the
> arguments to figure out the containing directories, open() a
> file descriptor on each, and fsync() them.

Hmmm, then maybe we'd just want some flag to fsync()
telling the kernel to also sync the parent directory
of the file and do whatever it needs to do to get the
rename() or link() committed ?

> But it still seems simpler to me just to make it an option in
> the file system.

It's always simpler when it's not YOU who has to
implement it ;)

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-30 17:12:57

by Lawrence Greenfield

Subject: Re: ext3-2.4-0.9.4

> From: "Patrick J. LoPresti" <[email protected]>
> Date: 30 Jul 2001 12:46:13 -0400
>
> > Besides BSD softupdates and the various journaling
> > filesystems which are in use on other Unixen also
> > don't provide the 4.3BSD solution any more ...
>
> This surprises me if it is true; do you have a reference? And what
> mechanism *do* the modern BSDs provide to commit metadata changes to
> disk?

BSD softupdates allows you to call fsync() on the file, and this will
sync the directories all the way up to the root if necessary.

Thus BSD fsync() actually guarantees that when it returns, the file
(and all of its filenames) will survive a reboot.

Sendmail does:
fd = open(tmp)
write(fd)
fsync(fd)
rename(tmp, final)
fsync(fd)

Cyrus IMAP does:
fd = open(tmp)
write(fd)
fsync(fd)
link(tmp, final1)
link(tmp, final2)
link(tmp, final3)
fsync(fd)
close(fd)
unlink(tmp)
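
The second sequence might look roughly like this in C. This is a sketch with hypothetical names (`deliver_to_mailboxes` and its parameters), not Cyrus source code, and the rollback of partially created links on failure is an assumption added for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the message once, link it into each
 * destination, fsync() again, then drop the temporary name.
 * Returns 0 on success, -1 on failure.
 */
int deliver_to_mailboxes(const char *tmp_path,
                         const char *const *finals, size_t nfinals,
                         const char *msg, size_t len)
{
    int rc = -1;
    size_t made = 0;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* A short write is treated as a failure in this sketch. */
    if (write(fd, msg, len) != (ssize_t)len || fsync(fd) < 0)
        goto out;

    for (; made < nfinals; made++)
        if (link(tmp_path, finals[made]) < 0)
            goto out;

    /* With the fsync() semantics described above, this second call
     * would also make the new names durable; elsewhere it may not. */
    if (fsync(fd) < 0)
        goto out;
    rc = 0;

out:
    close(fd);
    if (rc < 0)
        while (made > 0)            /* roll back links already created */
            unlink(finals[--made]);
    unlink(tmp_path);
    return rc;
}
```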

The idea that Linux fsync() doesn't actually make the file survive
reboots is pretty ridiculous.

Larry


2001-07-30 17:26:07

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Lawrence Greenfield wrote:
> From: "Patrick J. LoPresti" <[email protected]>
> Date: 30 Jul 2001 12:46:13 -0400
>
> > Besides BSD softupdates and the various journaling
> > filesystems which are in use on other Unixen also
> > don't provide the 4.3BSD solution any more ...
>
> This surprises me if it is true; do you have a reference? And what
> mechanism *do* the modern BSDs provide to commit metadata changes to
> disk?
>
> BSD softupdates allows you to call fsync() on the file, and this will
> sync the directories all the way up to the root if necessary.
>
> Thus BSD fsync() actually guarantees that when it returns, the file
> (and all of its filenames) will survive a reboot.

Note that this is very different from the "link() should be
synchronous()" mantra we've been hearing over the last days.

These fsync() semantics make lots of sense to me, I'm all
for it.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-30 17:37:57

by Chris Wedgwood

Subject: Re: ext3-2.4-0.9.4

On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:

> Note that this is very different from the "link() should be
> synchronous()" mantra we've been hearing over the last days.
>
> These fsync() semantics make lots of sense to me, I'm all
> for it.

And what if the file has hundreds or thousands of links? How do we
cleanly keep track of all those?



--cw

2001-07-30 17:49:48

by Lawrence Greenfield

Subject: Re: ext3-2.4-0.9.4

> Date: Tue, 31 Jul 2001 05:38:13 +1200
> From: Chris Wedgwood <[email protected]>
>
> On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:
>
> > Note that this is very different from the "link() should be
> > synchronous()" mantra we've been hearing over the last days.
> >
> > These fsync() semantics make lots of sense to me, I'm all
> > for it.
>
> And what if the file has hundreds or thousands of links? How do we
> cleanly keep track of all those?

You don't have to keep track of all of them, just the uncommitted
ones. I could imagine the filesystem forcing periodic commits on
pathological files (those with thousands of links) to limit the number
of pending directory operations per file.

While the softupdates paper doesn't appear to directly address this
concern, clearly their implementation has to deal with it in some way.

Larry

2001-07-30 18:00:48

by Chris Mason

Subject: Re: ext3-2.4-0.9.4



On Monday, July 30, 2001 01:49:12 PM -0400 Lawrence Greenfield
<[email protected]> wrote:

> Date: Tue, 31 Jul 2001 05:38:13 +1200
> From: Chris Wedgwood <[email protected]>
>
> On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:
>
> Note that this is very different from the "link() should be
> synchronous()" mantra we've been hearing over the last days.
>
> These fsync() semantics make lots of sense to me, I'm all
> for it.
>
> And what if the file has hundreds or thousands of links? How do we
> cleanly keep track of all those?
>
> You don't have to keep track of all of them, just the uncommitted
> ones.

Well, the idea is to get it done in the VFS layer. reiserfs, ext3, and
probably the other journaled filesystems could keep track of the last
transaction an inode was involved with, making the softupdate style
fsync(file) to commit a rename easy.

But, ext2 and the normal filesystems don't have it quite so good.

-chris


2001-07-30 21:04:17

by Anthony DeBoer

Subject: rename() (was Re: ext3-2.4-0.9.4)

Patrick J. LoPresti <[email protected]> wrote:
>The MTAs do this:
>
> Open temp file
> Write to temp file
> fsync() temp file
> rename() temp file into mail spool
> indicate success to remote MTA

Don't forget the unlink() temp file just before or after that last step.

>As long as rename() does not return until the metadata are committed,
>this should be a reliable delivery mechanism. ...

As I understand it, rename() was originally invented for tasks like
installing a new /bin/sh with guarantees that another process running
at the same time would not fail to find a shell, and that if the system
fell over during the install you'd still have a shell on reboot.

See http://www.qef.com/ftp/rename.ps for an interesting history from
someone who was there at the time. It's undated, but probably a decade
old.

It's my considered opinion that rename() _should_ fsync the target
directory before returning, and between that and the fsync() call on
the file itself (an install program should do the same call sequence as
above) you get the guarantee that the file is intact before you unlink
the temp version and return success. OTOH, link() and unlink() are not
in the business of providing guarantees like that, and should not sync.

--
Anthony de Boer, curator, Anthony's Home for Aged Computing Machinery
<[email protected]>

2001-07-30 21:39:57

by Chris Wedgwood

Subject: Re: ext3-2.4-0.9.4

On Mon, Jul 30, 2001 at 01:59:04PM -0400, Chris Mason wrote:

> Well, the idea is to get it done in the VFS layer. reiserfs, ext3, and
> probably the other journaled filesystems could keep track of the last
> transaction an inode was involved with, making the softupdate style
> fsync(file) to commit a rename easy.

But, right now, the VFS layer doesn't know about magic attributes
(such as ext2/3 +S). The VFS would have to be taught about these and
some other things to support both asynchronous and synchronous
metadata updates (and presumably other smarts too). The trouble is
that these attributes, and how they are stored, are fs specific. We
could always mandate that as of 2.5.x all filesystems _can_ support
some kind of extended API, define a minimalist set of attributes
for all filesystems, and then allow specific filesystems to have their
own. Arguably if people are going to force ACLs upon the world, then
a common API would be nice across XFS, reiserfs4, JFFS, etc. (NTFS
can use an API specific to the FS itself as NTFS ACLs are much more
complex and different looking beasts than those from early POSIX
drafts).

For journalling filesystems, it would be really nice if setting an
attribute was all that was required to make rename(2) atomic (or at
the very least to make sure that if the rename system call returns,
the data has been written to non-volatile storage).



--cw


2001-07-31 00:21:38

by Matti Aarnio

Subject: Re: ext3-2.4-0.9.4

On Thu, Jul 26, 2001 at 03:51:35PM +0000, Linus Torvalds wrote:
> To: [email protected]
> From: [email protected] (Linus Torvalds)
> Subject: Re: ext3-2.4-0.9.4
> Date: Thu, 26 Jul 2001 15:51:35 +0000 (UTC)
....
> Use fsync() on the directory.
>
> Logical, isn't it?

No. I don't see why I should opendir() a directory, fsync()
that handle, and closedir() the handle. I would definitely prefer:

lsync(dirpath)

This could even behave like lstat() with the path: if the last name
segment is a symlink, the sync is done on the i-node data of the symlink,
not on what it (possibly) points to.

I didn't check if POSIX folks have thought of that.

> Linus

/Matti Aarnio

2001-07-31 00:22:48

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Lawrence Greenfield wrote:

> The idea that Linux fsync() doesn't actually make the file survive
> reboots is pretty ridiculous.

That doesn't apply to ReiserFS or ext3fs, it does apply to ext2fs and
possibly others.

--
Matthias Andree

2001-07-31 00:16:38

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Rik van Riel wrote:

> Exactly what is wrong with doing fsync() on the
> directory ?

It's non-portable and a kludge.

> Why do you want us to turn link() and rename()
> into link_slowly() and rename_slowly() ?

Opening up the directory requires lots of inode lookups which are
unnecessary.

> Why can't you use a simple wrapper function to
> do this for you ?

Because it's more inefficient than necessary and it bloats the
application.

--
Matthias Andree

2001-07-31 00:25:18

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Rik van Riel wrote:

> > Thus BSD fsync() actually guarantees that when it returns, the file
> > (and all of its filenames) will survive a reboot.
>
> Note that this is very different from the "link() should be
> synchronous()" mantra we've been hearing over the last days.

Indeed, but this might still require MTA fixing probably, and opening a
file you just want to rename is quite expensive an operation.

--
Matthias Andree

2001-07-31 00:28:48

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Rik van Riel wrote:

> Hmmm, then maybe we'd just want some flag to fsync()
> telling the kernel to also sync the parent directory
> of the file and do whatever it needs to do to get the
> rename() or link() committed ?

Heck, you can't tell the kernel to do rename/link/open/unlink
synchronously in-band. This list doesn't care for other OS's. The
semantics FreeBSD (e. g.) offers ARE indeed documented.

This won't work out without kernel support. Portable reliability doesn't
come for free.

chattr +S is bad (slow). Bloating all applications to include every
possible brain fart that the random FS inventor let go is even worse.

2001-07-31 00:34:08

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Tue, 31 Jul 2001, Matthias Andree wrote:
> On Mon, 30 Jul 2001, Rik van Riel wrote:
>
> > Hmmm, then maybe we'd just want some flag to fsync()
> > telling the kernel to also sync the parent directory
> > of the file and do whatever it needs to do to get the
> > rename() or link() committed ?
>
> Heck, you can't tell the kernel to do rename/link/open/unlink
> synchronously in-band. This list doesn't care for other OS's.
> The semantics FreeBSD (e. g.) offers ARE indeed documented.

Go back a few posts and read about the semantics
FreeBSD has when the filesystem is mounted with
softupdates.

Then take a deep breath.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-31 00:57:14

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> In article <[email protected]>,
> Matthias Andree <[email protected]> wrote:
> >
> >However, the remaining problem is being synchronous with respect to open
> >(fixed for ext3 with your fsync() as I understand it), rename, link and
> >unlink. With ext2, and as you write it, with ext3 as well, there is
> >currently no way to tell when the link/rename has been committed to
> >disk, unless you set mount -o sync or chattr +S or call sync() (the
> >former is not an option because it's far too expensive).
>
> Congratulations. You have been brainwashed by Dan Bernstein.

No, I asked Wietse Venema what assumptions Postfix makes. Since he
refuses to fsync() directories, he has Postfix set chattr +S to enforce
the semantics he expects. No problem here.
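
For reference, setting the attribute from a program looks roughly like the sketch below. It is Linux-specific and assumes the FS_IOC_GETFLAGS / FS_IOC_SETFLAGS ioctls and the FS_SYNC_FL flag from <linux/fs.h>; set_sync_flag() is a hypothetical name and this is not Postfix code. The call fails on filesystems that do not implement these flags.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: the programmatic counterpart of "chattr +S path".
 * Returns 0 on success, -1 if the path cannot be opened or the
 * filesystem rejects the flag ioctls.
 */
int set_sync_flag(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int flags = 0;
    int err = -1;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
        flags |= FS_SYNC_FL;    /* request synchronous updates */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0)
            err = 0;
    }
    close(fd);
    return err;
}
```

Applied to a queue directory, this makes the filesystem write changes to that directory synchronously, which is the cost the thread calls slow.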

> Use fsync() on the directory.
>
> Logical, isn't it?

Why go all the lengths to look up each single directory path component
again just to fsync() stuff that doesn't belong to you and that you
don't want synched, possibly the entire device?

Chase up to the root manually, because Linux' ext2 violates SUS v2
fsync() (which requires meta data synched BTW), as has been pointed out
(and fixed in ReiserFS and ext3)?

Admittedly, MTAs are (supposed to be) (as RFC-1123 commands) more
paranoid than the average application - and for lack of a standard on
whether rename/link & Co. need to be synchronous or asynchronous, this
is a problem for the MTA.

So, please tell me why Single Unix Specification v2 specifies EIO for
rename. Asynchronous I/O cannot possibly trigger immediate EIO.

--
Matthias Andree

2001-07-31 01:17:16

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Tue, 31 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Linus Torvalds wrote:
>
> > Congratulations. You have been brainwashed by Dan Bernstein.

[snip fsync() on directory ... on second thought this isn't enough]

> Chase up to the root manually, because Linux' ext2 violates SUS
> v2 fsync() (which requires meta data synched BTW), as has been
> pointed out (and fixed in ReiserFS and ext3)?

Agreed. fsync() on the file needs to write the meta
data, this includes the directory and (if needed)
the parent directories all the way up to the root.
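
Until fsync() on the file does this itself, the "chase up to the root manually" approach can be sketched in user space as below. This is a minimal sketch under stated assumptions: fsync_to_root() and fsync_path() are hypothetical names, the path must be absolute, and every ancestor must be readable by the caller. It is racy against concurrent renames and is not a substitute for kernel support.

```c
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/* Open a file or directory read-only and fsync() it. Returns 0 or -1. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int err = fsync(fd);
    if (close(fd) < 0)
        err = -1;
    return err;
}

/*
 * Hypothetical helper: fsync abs_path, then each ancestor directory
 * up to and including "/". Returns 0 on success, -1 on failure.
 */
int fsync_to_root(const char *abs_path)
{
    char buf[4096];

    if (abs_path[0] != '/' || strlen(abs_path) >= sizeof buf)
        return -1;
    strcpy(buf, abs_path);

    for (;;) {
        if (fsync_path(buf) < 0)
            return -1;
        if (strcmp(buf, "/") == 0)
            return 0;
        /* dirname() may modify buf or return static storage;
         * memmove() copes with both cases. */
        char *parent = dirname(buf);
        memmove(buf, parent, strlen(parent) + 1);
    }
}
```

Each level costs a path lookup, an open() and an fsync(), which is the overhead the application authors in this thread object to.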

> So, please tell me why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

Crap. An asynchronous rename() can hit the situation
where it cannot read the disk when searching for the
directory it wants to move the file to.

rename(/from/a/b/file, /to/d/f/file) can fail when
the system gets an I/O error on reading "d".

regards,

Rik

2001-07-31 01:23:44

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Tue, 31 Jul 2001, Matti Aarnio wrote:
> On Thu, Jul 26, 2001 at 03:51:35PM +0000, Linus Torvalds wrote:

> > Use fsync() on the directory.
> >
> > Logical, isn't it?
>
> No. I don't see why I should opendir() a directory, fsync()
> that handle, and closedir() the handle.

And it wouldn't even be enough. Who guarantees you that
the parent directory of this directory has been written
to disk and we won't lose the entry pointing to this
directory on a crash ?

> I would definitely prefer:
>
> lsync(dirpath)

Nice idea. Of course, fsync(file) also has the obligation
to make sure all the metadata of the file is written to
disk. Lots of people seem to be convinced this also includes
the metadata needed to _reach_ the file all the way from the
root of the filesystem...

> I didn't check if POSIX folks have thought of that.

Nice addition. Easier to use than fsync() - no need to
open the file - and probably easier to implement in the
kernel because this way we'll be handing the whole path
to the kernel, whereas fsync() would have the dubious
task of finding out how this file can be traced all the
way down from the root of the filesystem.

regards,

Rik

2001-07-31 01:35:34

by Mike Castle

Subject: Re: ext3-2.4-0.9.4

On Tue, Jul 31, 2001 at 02:57:00AM +0200, Matthias Andree wrote:
> So, please tell me why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

It also specifies EIO as possible for write().

Are you saying that, since SUS2 specifies that write() is capable of
returning EIO, and asynchronous I/O cannot possibly trigger immediate EIO,
that all calls to write() should be synchronous?

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-07-31 01:29:24

by Andrew McNamara

Subject: Re: ext3-2.4-0.9.4

>> This does not help. The MTAs are doing fsync() on the temporary file
>> and then using the *subsequent* rename() as the committing operation.
>
>Which is quaint, because as we've pointed out repeatedly to you rename
>is not an atomic operation. Even on a simple BSD or ext2 style fs it can
>be two directory block writes, metadata block writes, a bitmap write
>and a cylinder group write.

This is almost (but not quite) irrelevant. The receiving MTA simply
wants the fsync()/rename() system call to not return until everything
(including directory blocks) has been written to disk, at which point,
it says to the remote end "250 OK". If the receiving machine goes down
at any point up until this one, the sending system will resend the
message. (Yes, the receiving system may have a corrupt directory, and
this is a problem).

---
Andrew McNamara (System Architect)

connect.com.au Pty Ltd
Lvl 3, 213 Miller St, North Sydney, NSW 2060, Australia
Phone: +61 2 9409 2117, Fax: +61 2 9409 2111

2001-07-31 05:25:39

by Lawrence Greenfield

Subject: Re: ext3-2.4-0.9.4

Date: Mon, 30 Jul 2001 22:23:29 -0300 (BRST)
From: Rik van Riel <[email protected]>
[...]
> I would definitely prefer:
>
> lsync(dirpath)
[...]
Nice addition. Easier to use than fsync() - no need to
open the file - and probably easier to implement in the
kernel because this way we'll be handing the whole path
to the kernel, whereas fsync() would have the dubious
task of finding out how this file can be traced all the
way down from the root of the filesystem.

It's not as good as fsync() just doing what it's supposed to do.
You'll force applications that want to issue multiple link()s to issue
multiple lsync()s, forcing the kernel to serialize all of the disk
writes when the application just wants one file (and all of its
associated filenames) on disk.

Yes, I understand that implementing fsync() so that it syncs all names
to reach the file is difficult. But if you want the best performance,
you don't want to make applications issue multiple calls each of which
forces its own synchronous writes.

Not to mention us whiny application writers won't be happy throwing
lsync()s all over the place.

Larry


2001-07-31 15:40:30

by Matti Aarnio

Subject: Re: ext3-2.4-0.9.4

How filesystems behave, and how dimly MTAs (should) view some
performance tweaks, is something I have tried to describe in
ZMailer's manual, in the part about its queue:

http://www.zmailer.org/zman/zadm-queues.html

On Tue, Jul 31, 2001 at 01:25:06AM -0400, Lawrence Greenfield wrote:
...
> It's not as good as fsync() just doing what it's supposed to do.
> You'll force applications that want to issue multiple link()s to issue
> multiple lsync()s, forcing the kernel to serialize all of the disk
> writes when the application just wants one file (and all of its
> associated filenames) on disk.
>
> Yes, I understand that implementing fsync() so that it syncs all names
> to reach the file is difficult. But if you want the best performance,
> you don't want to make applications issue multiple calls each of which
> forces its own synchronous writes.
>
> Not to mention us whiny application writers won't be happy throwing
> lsync()s all over the place.
>
> Larry

I quite agree.

Filesystems are not, unfortunately, rollbackfull logged and committable
databases, even if we like to use them often in that way.

An MTA with a fundamental design point of not using any privileged
programs (no suid anything!) and least esoteric technology possible
(for wide portability) can only use message submission means available
to it everywhere -- implementing the queue inside a database system
is definitely a possibility. Possibly yielding higher performance
than one using filesystem for it, but at what cost ??
(I am thinking of SleepyCat DB multiaccess transaction supported
version.)

/Matti Aarnio

2001-07-31 16:36:14

by Anton Altaparmakov

Subject: Re: ext3-2.4-0.9.4

On Tue, 31 Jul 2001, Matti Aarnio wrote:

> How filesystems behave, and how dimly MTAs (should) view some
> performance tweaks, is something I have tried to describe in
> ZMailer's manual, in the part about its queue:
>
> http://www.zmailer.org/zman/zadm-queues.html
>
> On Tue, Jul 31, 2001 at 01:25:06AM -0400, Lawrence Greenfield wrote:
> ...
> > It's not as good as fsync() just doing what it's supposed to do.
> > You'll force applications that want to issue multiple link()s to issue
> > multiple lsync()s, forcing the kernel to serialize all of the disk
> > writes when the application just wants one file (and all of its
> > associated filenames) on disk.
> >
> > Yes, I understand that implementing fsync() so that it syncs all names
> > to reach the file is difficult. But if you want the best performance,
> > you don't want to make applications issue multiple calls each of which
> > forces its own synchronous writes.
> >
> > Not to mention us whiny application writers won't be happy throwing
> > lsync()s all over the place.
> >
> > Larry
>
> I quite agree.
>
> Filesystems are not, unfortunately, rollbackfull logged and committable
> databases, even if we like to use them often in that way.

Well it depends on which file system you are talking about. NTFS is for
all intents and purposes a rollbackfull logged and committable
(relational) database and a file system at the same time. It's a shame M$
don't release the specs for it, otherwise it would be just what you are
looking for. - It will take us forever to reverse engineer the
journalling part of NTFS. You can see how long it is taking us just to
get the actual file system part.. and journalling on top of that is going
to be even worse. (Of course once we have the file system part there is
nothing to stop us doing our own thing with respect to journalling but
that's a different discussion.)

Anton

>
> An MTA with a fundamental design point of not using any privileged
> programs (no suid anything!) and least esoteric technology possible
> (for wide portability) can only use message submission means available
> to it everywhere -- implementing the queue inside a database system
> is definitely a possibility. Possibly yielding higher performance
> than one using filesystem for it, but at what cost ??
> (I am thinking of SleepyCat DB multiaccess transaction supported
> version.)
>
> /Matti Aarnio

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-07-31 18:38:32

by Linus Torvalds

Subject: Re: ext3-2.4-0.9.4


On Tue, 31 Jul 2001, Matti Aarnio wrote:
> >
> > Logical, isn't it?
>
> No. I don't see why I should opendir() a directory, fsync()
> that handle, and closedir() the handle. I would definitely prefer:
>
> lsync(dirpath)

Btw, you don't have to do opendir() - that just wastes time. Just do
something like

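	#include <fcntl.h>
	#include <unistd.h>

	int lsync(const char *path)
	{
		int err = -1;	/* stays -1 if open() fails */
		int fd;

		fd = open(path, O_RDONLY);
		if (fd >= 0) {
			err = fsync(fd);	/* works on directories too */
			close(fd);
		}
		return err;
	}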

and you're done. But it won't do the symlink thing...

Linus

2001-07-31 21:28:03

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Mike Castle wrote:

> On Tue, Jul 31, 2001 at 02:57:00AM +0200, Matthias Andree wrote:
> > So, please tell me why Single Unix Specification v2 specifies EIO for
> > rename. Asynchronous I/O cannot possibly trigger immediate EIO.
>
> It also specifies EIO as possible for write().
>
> Are you saying that, since SUS2 specifies that write() is capable of
> returning EIO, and asynchronous I/O cannot possibly trigger immediate EIO,
> that all calls to write() should be synchronous?

No, I'm wondering about the semantics. Of course, write() can be
synchronous (O_SYNC or fs mounted sync e. g.).

2001-07-31 21:30:55

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Tue, 31 Jul 2001, Lawrence Greenfield wrote:

> Not to mention us whiny application writers won't be happy throwing
> lsync()s all over the place.

Not portable -> won't happen usually.

--
Matthias Andree

2001-07-31 21:30:03

by Matthias Andree

Subject: Re: ext3-2.4-0.9.4

On Mon, 30 Jul 2001, Rik van Riel wrote:

> > I didn't check if POSIX folks have thought of that.
>
> Nice addition. Easier to use than fsync() - no need to
> open the file - and probably easier to implement in the
> kernel because this way we'll be handing the whole path
> to the kernel, whereas fsync() would have the dubious
> task of finding out how this file can be traced all the
> way down from the root of the filesystem.

If I understand SUS v2 correctly, fsync() must sync meta data
corresponding to the file.

If Linux ext2 doesn't do that, it might be a good idea to change that so
it does.

--
Matthias Andree

2001-07-31 21:54:35

by Mike Castle

Subject: Re: ext3-2.4-0.9.4

On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:
> If I understand SUS v2 correctly, fsync() must sync meta data
> corresponding to the file.


Where can I find a common definition for "meta data."

For example, I consider meta data to be things kept in the inode only
(size, timestamps, permissions). Indirect blocks, maybe. But, considering
how, in the unix world, file names are NOT associated with files, I have
never considered file names to be meta data. Instead, file names are a set
of data associated with special files known as "directories." So, it is
obvious, to me, that expecting fsync to sync changes to directory entries
is silly.

Obviously, however, you have a different definition of what meta data is.

Does SUS2 provide a definition for meta data?

A quick glance at the website didn't turn anything up for me, but I would
not be surprised if I missed it.

mrc

2001-07-31 23:46:19

by Chris Wedgwood

Subject: Re: ext3-2.4-0.9.4

On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:

If I understand SUS v2 correctly, fsync() must sync meta data
corresponding to the file.

If Linux ext2 doesn't do that, it might be a good idea to change
that so it does.

Define 'meta-data' --- Linux syncs any inode and/or bitmap changes;
fsync on a file will ensure it is intact but not that it can't get
lost.



--cw

2001-07-31 23:53:29

by Rik van Riel

Subject: Re: ext3-2.4-0.9.4

On Wed, 1 Aug 2001, Chris Wedgwood wrote:
> On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:
>
> If I understand SUS v2 correctly, fsync() must sync meta data
> corresponding to the file.
>
> If Linux ext2 doesn't do that, it might be a good idea to change
> that so it does.
>
> Define 'meta-data' --- Linux syncs any inode and/or bitmap
> changes; fsync on a file will ensure it is intact but not that it
> can't get lost.

Syntactically correct, but quite useless IMHO ;)

Rik