2002-07-12 16:18:19

by Dax Kelson

Subject: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Tested:

ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail

http://www.gurulabs.com/ext3-reiserfs.html

Any suggestions or comments appreciated.

Dax Kelson
Guru Labs


2002-07-12 17:04:24

by Andreas Dilger

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 12, 2002 10:21 -0600, Dax Kelson wrote:
> ext3 data=ordered
> ext3 data=writeback
> reiserfs
> reiserfs notail
>
> http://www.gurulabs.com/ext3-reiserfs.html
>
> Any suggestions or comments appreciated.

Did you try data=journal mode on ext3? For real-life sync-IO
workloads like mail (i.e. not benchmarks where the system is 100% busy)
you can get considerable performance benefits from doing the sync IO
directly to the journal instead of partly to the journal and partly to
the rest of the filesystem.

The reason why "real life" is important here is because the data=journal
mode writes all the files to disk twice - once to the journal and again
to the filesystem, so you must have some "slack" in your disk bandwidth
in order to benefit from this increased throughput on the part of the
mail transport.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-12 17:17:41

by Kwijibo

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

I compared reiserfs with notails and with tails to
ext3 in journaled mode about a month ago.
Strangely enough the machine that was being
built is eventually slated for a mail machine. I used
postmark to simulate the mail environment.

Benchmarks are available here:
http://labs.zianet.com

Let me know if I am missing any info on there.

Steven

Andreas Dilger wrote:

>On Jul 12, 2002 10:21 -0600, Dax Kelson wrote:
>
>
>>ext3 data=ordered
>>ext3 data=writeback
>>reiserfs
>>reiserfs notail
>>
>>http://www.gurulabs.com/ext3-reiserfs.html
>>
>>Any suggestions or comments appreciated.
>>
>>
>
>Did you try data=journal mode on ext3? For real-life sync-IO
>workloads like mail (i.e. not benchmarks where the system is 100% busy)
>you can get considerable performance benefits from doing the sync IO
>directly to the journal instead of partly to the journal and partly to
>the rest of the filesystem.
>
>The reason why "real life" is important here is because the data=journal
>mode writes all the files to disk twice - once to the journal and again
>to the filesystem, so you must have some "slack" in your disk bandwidth
>in order to benefit from this increased throughput on the part of the
>mail transport.
>
>Cheers, Andreas
>--
>Andreas Dilger
>http://www-mddsp.enel.ucalgary.ca/People/adilger/
>http://sourceforge.net/projects/ext2resize/

2002-07-12 17:35:45

by Andreas Dilger

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 12, 2002 11:26 -0600, [email protected] wrote:
> I compared reiserfs with notails and with tails to
> ext3 in journaled mode about a month ago.
> Strangely enough the machine that was being
> built is eventually slated for a mail machine. I used
> postmark to simulate the mail environment.
>
> Benchmarks are available here:
> http://labs.zianet.com
>
> Let me know if I am missing any info on there.

Yes, I saw this benchmark when it was first posted. It isn't clear
from the web pages that you are using data=journal for ext3. Note
that this is only a benefit for sync I/O workloads like mail and
NFS, but not other types of usage. Also, for sync I/O workloads
you can get a big boost by using an external journal device.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-12 20:31:41

by Chris Mason

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Fri, 2002-07-12 at 12:21, Dax Kelson wrote:
> Tested:
>
> ext3 data=ordered
> ext3 data=writeback
> reiserfs
> reiserfs notail
>
> http://www.gurulabs.com/ext3-reiserfs.html
>
> Any suggestions or comments appreciated.

postmark is an interesting workload, but it does not do fsync or renames
on the working set, and postfix does lots of both while delivering.
postmark does do a good job of showing the difference between lots of
files in one directory (great for reiserfs) and lots of directories with
fewer files in each (better for ext3).

Andreas Dilger already mentioned -o data=journal on ext3, you can try
the beta reiserfs patches that add support for data=journal and
data=ordered at:

ftp.suse.com/pub/people/mason/patches/data-logging

They improve reiserfs performance for just about everything, but
data=journal is especially good for fsync/O_SYNC heavy workloads.

Andrew Morton sent me a benchmark of his that tries to simulate
postfix. He has posted it to l-k before but a quick google search found
dead links only, so I'm attaching it. What I like about his synctest is
the results are consistent and you can play with various
fsync/rename/unlink options.

-chris


Attachments:
synctest.c (7.13 kB)

2002-07-13 04:40:13

by Daniel Phillips

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Friday 12 July 2002 18:21, Dax Kelson wrote:
> Any suggestions or comments appreciated.

"it is clear that IF your server is stable and not prone to crashing, and/or
you have the write cache on your hard drives battery backed, you should
strongly consider using the writeback journaling mode of Ext3 versus ordered."

You probably want to suggest UPS there rather than battery backed disk
cache, since the writeback caching is predominantly on the cpu side.

--
Daniel

2002-07-14 20:37:55

by Dax Kelson

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Fri, 2002-07-12 at 10:21, Dax Kelson wrote:
>
> Any suggestions or comments appreciated.
>

Thanks for the feedback. Look for more testing from us soon addressing
the suggestions brought up.

Dax

2002-07-15 08:23:28

by Sam Vilain

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Dax Kelson <[email protected]> wrote:

> > Any suggestions or comments appreciated.
> Thanks for the feedback. Look for more testing from us soon addressing
> the suggestions brought up.

One more thing - can I just make the comment that testing freshly formatted filesystems is not going to show up ext2's real weaknesses, which appear on old filesystems - particularly those that have been allowed to fill up.

I timed *15 minutes* for a system I admin to unlink a single 1G file on a fairly old ext2 filesystem the other day (perhaps ext3 would have improved this, I'm not sure). It took 30 minutes to scan a snort log directory on ext2, but less than 2 minutes on reiser - and only 3 seconds once it was in the buffercache.

You are testing for a mail server - how many mailboxes are in your spool directory for the tests? Try it with about five to ten thousand mailboxes and see how your results vary.
--
Sam Vilain, [email protected] WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

"Although Mr Chavez 'was democratically elected,' one had to bear in
mind that 'Legitimacy is something that is conferred not just by a
majority of the voters.'"
- The office of George "Dubya" Bush commenting on the Venezuelan
election

2002-07-15 11:18:12

by Alan

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 09:26, Sam Vilain wrote:
> You are testing for a mail server - how many mailboxes are in your spool
> directory for the tests? Try it with about five to ten thousand
> mailboxes and see how your results vary.

If your mail server can't get hierarchical mail spools right, get one
that can.

Alan

2002-07-15 11:59:23

by Sam Vilain

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Alan Cox <[email protected]> wrote:

> > You are testing for a mail server - how many mailboxes are in your spool
> > directory for the tests? Try it with about five to ten thousand
> > mailboxes and see how your results vary.
> If your mail server can't get hierarchical mail spools right, get one
> that can.

Translation

"Yes, we know that there is no directory hashing in ext2/3. You'll have to find another solution to the problem, I'm afraid. Why not ease the burden on the filesystem by breaking up the task for it, and giving it to it in small pieces. That way it's much less likely to choke."

:-)

Sure, you could set up hierarchical mail spools. But it sure stinks of a temporary solution for a long-term problem. What about the next application that grows to massive proportions?

Hey, while I've got your attention, how do you go about debugging your kernel? I'm trying to add fair scheduling to the new O(1) scheduler, something of a token bucket filter counting jiffies used by a process/user/s_context (in scheduler_tick()) and tweaking their priority accordingly (in effective_prio()). It'd be really nice if I could run it under UML or something like that so I can trace through it with gdb, but I couldn't get the UML patch to apply to your tree. Any hints?
--
Sam Vilain, [email protected] WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

2002-07-15 12:06:12

by Matti Aarnio

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, Jul 15, 2002 at 01:30:51PM +0100, Alan Cox wrote:
> On Mon, 2002-07-15 at 09:26, Sam Vilain wrote:
> > You are testing for a mail server - how many mailboxes are in your spool
> > directory for the tests? Try it with about five to ten thousand
> > mailboxes and see how your results vary.
>
> If your mail server can't get hierarchical mail spools right, get one
> that can.

Long ago (10-15 internet-years ago..) I followed testing of the
FFS family of filesystems in Squid cache.

We noticed on Solaris machines using UFS that when the directory
data size grew above the number of blocks directly addressable by
the direct-index pointers in the i-node, system speed plummeted.
(Or perhaps it was something a bit smaller, like 32 kB)

Consider: 4 kB block size, 12 direct indexes: 48 kB directory size.

Spend 16 bytes for each file name + auxiliary data: 3000 files/subdirs

Optimal would be to store the files inside only the first block,
i.e. the directory should not grow over 4k (or 1k, or ..)

Name subdirs as: 00 thru 7F (128+2, 12 bytes ?)
Possibly do that in 2 layers: 128^2 = 16384 subdirs, each
with 50 long named users (even more files?): 820 000 users.

Tune the subdir hashing function to suit your application, and
you should be happy.
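
A rough C sketch of that two-level layout (the hash function and the
"spool/" prefix are illustrative assumptions, not a definitive
implementation; tune the hash to your application, as said above):

    #include <stdio.h>

    /* Arbitrary string hash; any well-mixing function will do. */
    static unsigned hash_name(const char *name)
    {
        unsigned h = 5381;
        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h;
    }

    /* Two layers of 00..7f subdirs: builds e.g. "spool/13/5a/user". */
    static void spool_path(char *buf, size_t len, const char *user)
    {
        unsigned h = hash_name(user);
        snprintf(buf, len, "spool/%02x/%02x/%s",
                 (h >> 7) & 0x7f, h & 0x7f, user);
    }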


Putting all your eggs in one basket (files in one directory)
is not a smart thing.


> Alan

/Matti Aarnio

2002-07-15 12:10:23

by Alan

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 13:02, Sam Vilain wrote:
> Alan Cox <[email protected]> wrote:
> "Yes, we know that there is no directory hashing in ext2/3. You'll have
> to find another solution to the problem, I'm afraid. Why not ease the
> burden on the filesystem by breaking up the task for it, and giving it
> to it in small pieces. That way it's much less likely to choke."

Actually there are several other reasons for it. It sucks a lot less
when you need to use ls and friends to inspect part of the spool. It
also makes it much easier to split the mail spool over multiple disks as
it grows without having to backup/restore the spool area

> Sure, you could set up hierarchical mail spools. But it sure stinks of a
> temporary solution for a long-term problem. What about the next
> application that grows to massive proportions?

JFS ?

> Hey, while I've got your attention, how do you go about debugging your
> kernel? I'm trying to add fair scheduling to the new O(1) scheduler,
> something of a token bucket filter counting jiffies used by a
> process/user/s_context (in scheduler_tick()) and tweaking their
> priority accordingly (in effective_prio()). It'd be really nice if I
> could run it under UML or something like that so I can trace through
> it with gdb, but I couldn't get the UML patch to apply to your tree.
> Any hints?

The UML tree and my tree don't quite merge easily. Your best bet is to
grab the Red Hat Limbo beta packages for the kernel source, which if I
remember rightly are both -ac based and include the option to build UML.

Alan

2002-07-15 13:38:23

by Chris Mason

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 09:23, Alan Cox wrote:
> On Mon, 2002-07-15 at 13:02, Sam Vilain wrote:
> > Alan Cox <[email protected]> wrote:
> > "Yes, we know that there is no directory hashing in ext2/3. You'll have
> > to find another solution to the problem, I'm afraid. Why not ease the
> > burden on the filesystem by breaking up the task for it, and giving it
> > to it in small pieces. That way it's much less likely to choke."
>
> Actually there are several other reasons for it. It sucks a lot less
> when you need to use ls and friends to inspect part of the spool. It
> also makes it much easier to split the mail spool over multiple disks as
> it grows without having to backup/restore the spool area

Another good reason is i_sem. If you've got more than one process doing
something to that directory, you spend lots of time waiting for the
semaphore. I think it was Andrew that reminded me i_sem is held on
fsync, so fsync(dir) to make things safe after a rename can slow things
down.

reiserfs only needs fsync(file), ext3 needs fsync(anything on the fs).
If ext3 would promise to make fsync(file) sufficient forever, it might
help the mta authors tune.

-chris


2002-07-15 15:09:35

by Andrea Arcangeli

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, Jul 15, 2002 at 01:02:01PM +0100, Sam Vilain wrote:
> Hey, while I've got your attention, how do you go about debugging your
> kernel? I'm trying to add fair scheduling to the new O(1) scheduler,
> something of a token bucket filter counting jiffies used by a
> process/user/s_context (in scheduler_tick()) and tweaking their
> priority accordingly (in effective_prio()). It'd be really nice if I
> could run it under UML or something like that so I can trace through
> it with gdb, but I couldn't get the UML patch to apply to your tree.
> Any hints?

-aa ships with both uml and the o1 scheduler. I need uml for everything
non-hardware-related, so expect it to be always up to date there. However,
since I merged the O(1) scheduler there is the annoyance that sometimes
wakeup events don't arrive at least until kupdate reschedules or something
like that (of course only with uml, not with real hardware). Also, pressing
keys is enough to unblock it. I haven't debugged it hard yet.
According to Jeff it's a problem with cli that masks signals.

Andrea

2002-07-15 15:19:31

by Patrick J. LoPresti

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Consider this argument:

Given: On ext3, fsync() of any file on a partition commits all
outstanding transactions on that partition to the log.

Given: data=ordered forces pending data writes for a file to happen
before related transactions are committed to the log.

Therefore: With data=ordered, fsync() of any file on a partition
syncs the outstanding writes of EVERY file on that
partition.

Is this argument correct? If so, it suggests that data=ordered is
actually the *worst* possible journalling mode for a mail spool.

One other thing. I think this statement is misleading:

IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.

This makes it sound like data=writeback is somehow unsafe when
machines crash. I do not think this is true. If your application
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode. "Battery backed caches" and the
like have nothing to do with it. And if your application is written
incorrectly, then other journalling modes will reduce but not
eliminate the chances for things to break catastrophically on a crash.

So if the partition is dedicated to correct applications, like a mail
spool is, then data=writeback is perfectly safe. If it is faster,
too, then it really is a no-brainer.

- Pat

2002-07-15 16:03:26

by Andreas Dilger

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 15, 2002 13:02 +0100, Sam Vilain wrote:
> "Yes, we know that there is no directory hashing in ext2/3. You'll
> have to find another solution to the problem, I'm afraid. Why not ease
> the burden on the filesystem by breaking up the task for it, and giving
> it to it in small pieces. That way it's much less likely to choke."

Amusingly, there IS directory hashing available for ext2 and ext3, and
it is just as fast as reiserfs hashed directories. See:

http://people.nl.linux.org/~phillips/htree/paper/htree.html

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-15 16:09:24

by Daniel Phillips

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Monday 15 July 2002 18:03, Andreas Dilger wrote:
> On Jul 15, 2002 13:02 +0100, Sam Vilain wrote:
> > "Yes, we know that there is no directory hashing in ext2/3. You'll
> > have to find another solution to the problem, I'm afraid. Why not ease
> > the burden on the filesystem by breaking up the task for it, and giving
> > it to it in small pieces. That way it's much less likely to choke."
>
> Amusingly, there IS directory hashing available for ext2 and ext3, and
> it is just as fast as reiserfs hashed directories. See:
>
> http://people.nl.linux.org/~phillips/htree/paper/htree.html

Faster, last time I checked. I really must test against XFS and JFS at
some point.

--
Daniel

2002-07-15 17:28:36

by Chris Mason

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 11:22, Patrick J. LoPresti wrote:
> Consider this argument:
>
> Given: On ext3, fsync() of any file on a partition commits all
> outstanding transactions on that partition to the log.
>
> Given: data=ordered forces pending data writes for a file to happen
> before related transactions are committed to the log.
>
> Therefore: With data=ordered, fsync() of any file on a partition
> syncs the outstanding writes of EVERY file on that
> partition.
>
> Is this argument correct? If so, it suggests that data=ordered is
> actually the *worst* possible journalling mode for a mail spool.
>

Yes. In practice this doesn't hurt as much as it could, because ext3
does a good job of letting more writers come in before forcing the
commit. What hurts you is when a forced commit comes in the middle of
creating the file. A data write that could have been contiguous gets
broken into two or more writes instead.

> One other thing. I think this statement is misleading:
>
> IF your server is stable and not prone to crashing, and/or you
> have the write cache on your hard drives battery backed, you
> should strongly consider using the writeback journaling mode of
> Ext3 versus ordered.
>
> This makes it sound like data=writeback is somehow unsafe when
> machines crash. I do not think this is true. If your application
> (e.g., Postfix) is written correctly (which it is), so it calls
> fsync() when it is supposed to, then data=writeback is *exactly* as
> safe as any other journalling mode.

Almost. data=writeback makes it possible for the old contents of a
block to end up in a newly grown file. There are a few ways this can
screw you up:

1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.

2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).

> "Battery backed caches" and the
> like have nothing to do with it.

Nope, battery backed caches don't make data=writeback more or less safe
(with respect to the data anyway). They do make data=ordered and
data=journal more safe.

> And if your application is written
> incorrectly, then other journalling modes will reduce but not
> eliminate the chances for things to break catastrophically on a crash.
>
> So if the partition is dedicated to correct applications, like a mail
> spool is, then data=writeback is perfectly safe. If it is faster,
> too, then it really is a no-brainer.

For mail servers, data=journal is your friend. ext3 sometimes needs a
bigger log for it (reiserfs data=journal patches don't), but the
performance increase can be significant.

-chris


2002-07-15 17:45:29

by Sam Vilain

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Andreas Dilger <[email protected]> wrote:

> Amusingly, there IS directory hashing available for ext2 and ext3, and
> it is just as fast as reiserfs hashed directories. See:
> http://people.nl.linux.org/~phillips/htree/paper/htree.html

You learn something new every day. So, with that in mind - what has reiserfs got that ext2 doesn't?

- tail merging, giving much more efficient space usage for lots of small
files.
- B*Tree allocation offering ``a 1/3rd reduction in internal
fragmentation in return for slightly more complicated insertions and
deletion algorithms'' (from the htree paper).
- online resizing in the main kernel (ext2 needs a patch -
http://ext2resize.sourceforge.net/).
- Resizing does not require the use of `ext2prepare' run on the
filesystem while unmounted to resize over arbitrary boundaries.
- directory hashing in the main kernel

On the flipside, ext2 over reiserfs:

- support for attributes without a patch or 2.4.19-pre4+ kernel
- support for filesystem quotas without a patch
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)

I'd be very interested in seeing postmark results without the hierarchical directory structure (which an unpatched postfix doesn't support), with about 5000 mailboxes with and without the htree patch (or with the htree patch but without that directory indexed, if that is possible).
--
Sam Vilain, [email protected] WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

Try to be the best of what you are, even if what you are is no good.
ASHLEIGH BRILLIANT

2002-07-15 18:30:49

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> One other thing. I think this statement is misleading:
>
> IF your server is stable and not prone to crashing, and/or you
> have the write cache on your hard drives battery backed, you
> should strongly consider using the writeback journaling mode of
> Ext3 versus ordered.
>
> This makes it sound like data=writeback is somehow unsafe when
> machines crash. I do not think this is true. If your application

Well, if your fsync() completes...

> (e.g., Postfix) is written correctly (which it is), so it calls
> fsync() when it is supposed to, then data=writeback is *exactly* as
> safe as any other journalling mode. "Battery backed caches" and the
> like have nothing to do with it. And if your application is written
> incorrectly, then other journalling modes will reduce but not
> eliminate the chances for things to break catastrophically on a crash.

...then you're right. If the machine crashes amidst the fsync()
operation, but has scheduled meta data before file contents, then
journal recovery can present you with a file that contains bogus data,
which will confuse some applications. I believe Postfix will recover
from this condition either way, see that its file is hosed and ignore
or discard it (depending on what it is), but software that blindly
relies on a special format without checking will barf.

All of this assumes two things:

1. the application actually calls fsync()

2. the application can detect if fsync() succeeded before the crash
(like fsync -> fchmod -> fsync, structured file contents, whatever).
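
Point 2's fsync -> fchmod -> fsync idea could look roughly like this
in C (a sketch only; the 0600/0644 convention for "incomplete"/"valid"
is an assumption the reading side would have to share):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Mode 0644 only ever appears once the data is known to be on disk. */
    int write_with_marker(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) != 0 ||            /* 1. data is on disk */
            fchmod(fd, 0644) != 0 ||     /* 2. set the validity marker */
            fsync(fd) != 0) {            /* 3. flush the marker, too */
                close(fd);
                return -1;
        }
        return close(fd);
    }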

> So if the partition is dedicated to correct applications, like a mail
> spool is, then data=writeback is perfectly safe. If it is faster,
> too, then it really is a no-brainer.

These ordering promises also apply to applications that do not call
fsync() or that cannot detect hosed files. Been there, seen that, with
CVS on unpatched ReiserFS as of Linux-2.4.19-presomething: suddenly one
,v file contained NUL blocks. The server barfed, the (remote!) client
segfaulted... yes, it's almost as bad as it can get.

Not catastrophic, since a tape backup was available, but it took some
time to restore the file and investigate this issue nonetheless. It
boiled down to "nobody's fault, but a missing feature". With
data=ordered or data=journal, I would have either had my old ,v file
around or a proper new one.

I'm now using Chris Mason's data-logging patches to try and see how
things work out, I had one crash with an old version, then updated to
the -11 version and have yet to see something break again.

I'd certainly appreciate it if these patches were merged early in
2.4.20-pre so they get some testing, can be in 2.4.20, and Linux would
have two file systems with data=ordered to choose from.

Disclaimer: I don't know anything about XFS or JFS except their bare
existence. Feel free to add comments.

--
Matthias Andree

by Mathieu Chouquet-Stringer

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

[email protected] (Sam Vilain) writes:
> - there is a `dump' command (but it's useless, because it hangs when you
> run it on mounted filesystems - come on, who REALLY unmounts their
> filesystems for a nightly dump? You need a 3 way mirror to do it
> while guaranteeing filesystem availability...)

According to everybody, dump is deprecated (and it shouldn't work reliably
with 2.4, in two words: "forget it")...

--
Mathieu Chouquet-Stringer E-Mail : [email protected]
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde

2002-07-15 19:10:31

by Patrick J. LoPresti

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Chris Mason <[email protected]> writes:

> > One other thing. I think this statement is misleading:
> >
> > IF your server is stable and not prone to crashing, and/or you
> > have the write cache on your hard drives battery backed, you
> > should strongly consider using the writeback journaling mode of
> > Ext3 versus ordered.
> >
> > This makes it sound like data=writeback is somehow unsafe when
> > machines crash. I do not think this is true. If your application
> > (e.g., Postfix) is written correctly (which it is), so it calls
> > fsync() when it is supposed to, then data=writeback is *exactly* as
> > safe as any other journalling mode.
>
> Almost. data=writeback makes it possible for the old contents of a
> block to end up in a newly grown file.

Only if the application is already broken.

> There are a few ways this can screw you up:
>
> 1) that newly grown file is someone's inbox, and the old contents of the
> new block include someone else's private message.
>
> 2) That newly grown file is a control file for the application, and the
> application expects it to contain valid data within (think sendmail).

In a correctly-written application, neither of these things can
happen. (See my earlier message today on fsync() and MTAs.) To get a
file onto disk reliably, the application must 1) flush the data, and
then 2) flush a "validity" indicator. This could be a sequence like:

create temp file
flush data to temp file
rename temp file
flush rename operation

In this sequence, the file's existence under a particular name is the
indicator of its validity.

If you skip either of these flush operations, you are not behaving
reliably. Skipping the first flush means the validity indicator might
hit the disk before the data; so after a crash, you might see invalid
data in an allegedly valid file. Skipping the second flush means you
do not know that the validity indicator has been set, so you cannot
report success to whoever is waiting for this "reliable write" to
happen.

It is possible to make an application which relies on data=ordered
semantics; for example, skipping the "flush data to temp file" step
above. But such an application would be broken for every version of
Unix *except* Linux in data=ordered mode. I would call that an
incorrect application.

> Nope, battery backed caches don't make data=writeback more or less safe
> (with respect to the data anyway). They do make data=ordered and
> data=journal more safe.

A theorist would say that "more safe" is a sloppy concept. Either an
operation is safe or it is not. As I said in my last message,
data=ordered (and data=journal) can reduce the risk for poorly written
apps. But they cannot eliminate that risk, and for a correctly
written app, data=writeback is 100% as safe.

- Pat

2002-07-15 19:23:51

by Sam Vilain

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Mathieu Chouquet-Stringer <[email protected]> wrote:

> > - there is a `dump' command (but it's useless, because it hangs when you
> > run it on mounted filesystems - come on, who REALLY unmounts their
> > filesystems for a nightly dump? You need a 3 way mirror to do it
> > while guaranteeing filesystem availability...)
> According to everybody, dump is deprecated (and it shouldn't work reliably
> with 2.4, in two words: "forget it")...

It's a shame, because `tar' doesn't save things like inode attributes and
places unnecessary load on the VFS layer. It also takes considerably
longer than dump did on one backup server I admin - like ~12 hours to back
up ~26G in ~414k inodes to a tape capable of about 1MB/sec. But that's
probably the old directory hashing thing again, there are some
reeeeaaallllllly large directories on that machine...

Ah, the joys of legacy.
--
Sam Vilain, [email protected] WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

If you think the United States has stood still, who built the
largest shopping center in the world?
RICHARD M NIXON

2002-07-15 19:40:17

by Andrew Morton

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Chris Mason wrote:
>
> ...
> If ext3 would promise to make fsync(file) sufficient forever, it might
> help the mta authors tune.

ext3 promises. This side-effect is bolted firmly into the design
of ext3 and it's hard to see any way in which it will go away.


2002-07-15 20:52:15

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> In a correctly-written application, neither of these things can
> happen. (See my earlier message today on fsync() and MTAs.) To get a
> file onto disk reliably, the application must 1) flush the data, and
> then 2) flush a "validity" indicator. This could be a sequence like:
>
> create temp file
> flush data to temp file
> rename temp file
> flush rename operation
>
> In this sequence, the file's existence under a particular name is the
> indicator of its validity.

Assume that most applications are broken then.

I assume that most will just call close() or fclose() and exit() right
away. Does fclose() imply fsync()?

Some applications will not even check the [f]close() return value...

> It is possible to make an application which relies on data=ordered
> semantics; for example, skipping the "flush data to temp file" step
> above. But such an application would be broken for every version of
> Unix *except* Linux in data=ordered mode. I would call that an
> incorrect application.

Or very specific, at least.

> > Nope, battery backed caches don't make data=writeback more or less safe
> > (with respect to the data anyway). They do make data=ordered and
> > data=journal more safe.
>
> A theorist would say that "more safe" is a sloppy concept. Either an
> operation is safe or it is not. As I said in my last message,
> data=ordered (and data=journal) can reduce the risk for poorly written
> apps. But they cannot eliminate that risk, and for a correctly
> written app, data=writeback is 100% as safe.

IF that application uses a marker to mark completion. If it does not,
data=ordered will be the safe bet, regardless of fsync() or not. The
machine can crash BEFORE the fsync() is called.

--
Matthias Andree

2002-07-15 21:11:16

by Chris Mason

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote:

> > 1) that newly grown file is someone's inbox, and the old contents of the
> > new block include someone else's private message.
> >
> > 2) That newly grown file is a control file for the application, and the
> > application expects it to contain valid data within (think sendmail).
>
> In a correctly-written application, neither of these things can
> happen. (See my earlier message today on fsync() and MTAs.) To get a
> file onto disk reliably, the application must 1) flush the data, and
> then 2) flush a "validity" indicator. This could be a sequence like:
>
> create temp file
> flush data to temp file
> rename temp file
> flush rename operation

Yes, most mtas do this for queue files, I'm not sure how many do it for
the actual spool file. mail server authors are more than welcome to
recommend the best safety/performance combo for their product, and to
ask the FS guys which combinations are safe.

-chris


2002-07-15 21:13:35

by Andreas Dilger

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 15, 2002 18:48 +0100, Sam Vilain wrote:
> Andreas Dilger <[email protected]> wrote:
>
> > Amusingly, there IS directory hashing available for ext2 and ext3, and
> > it is just as fast as reiserfs hashed directories. See:
> > http://people.nl.linux.org/~phillips/htree/paper/htree.html
>
> You learn something new every day. So, with that in mind - what has
> reiserfs got that ext2 doesn't?
>
> - tail merging, giving much more efficient space usage for lots of small
> files.

Well, there was a tail merging patch for ext2, but it has been dropped
for now. In reality, any benchmarks with reiserfs (except the
very-small-files case) will run with tail packing disabled because it
kills performance.

> - B*Tree allocation offering ``a 1/3rd reduction in internal
> fragmentation in return for slightly more complicated insertions and
> deletion algorithms'' (from the htree paper).
> - online resizing in the main kernel (ext2 needs a patch -
> http://ext2resize.sourceforge.net/).

Yes, I wrote it...

> - Resizing does not require the use of `ext2prepare' run on the
> filesystem while unmounted to resize over arbitrary boundaries.

That is coming this summer. It will be part of some changes to support
"meta blockgroups", and the resizing comes for free at the same time.

> - directory hashing in the main kernel

Probably will happen in 2.5, as Andrew is already testing htree support
for ext3. It is also in the ext3 CVS tree for 2.4, so I wouldn't be
surprised if it shows up in 2.4 also.

> On the flipside, ext2 over reiserfs:
>
> - support for attributes without a patch or 2.4.19-pre4+ kernel
> - support for filesystem quotas without a patch
> - there is a `dump' command (but it's useless, because it hangs when you
> run it on mounted filesystems - come on, who REALLY unmounts their
> filesystems for a nightly dump? You need a 3 way mirror to do it
> while guaranteeing filesystem availability...)

Well, the dump can only be inconsistent for files that are being changed
during the dump itself. As for hanging the system, that would be a bug
regardless of whether it was dump or "dd" reading from the block device.
A bug related to this was fixed, probably in 2.4.19-preX somewhere.

> I'd be very interested in seeing postmark results without the
> hierarchical directory structure (which an unpatched postfix doesn't
> support), with about 5000 mailboxes with and without the htree patch
> (or with the htree patch but without that directory indexed, if that
> is possible).

Let me know what you find. It is possible to use an htree-patched
kernel and not have indexed directories - just don't mount with
"-o index". Note that there is a data-corrupting bug somewhere in
the ext3 htree code, so I wouldn't suggest using indexed directories
except for testing.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-15 21:22:21

by Patrick J. LoPresti

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Matthias Andree <[email protected]> writes:

> I assume that most will just call close() or fclose() and exit() right
> away. Does fclose() imply fsync()?

Not according to my close(2) man page:

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2). (It will
depend on the disk hardware at this point.)

Note that this means writing a truly reliable shell or Perl script is
tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what
do you do for a shell script? /bin/sync :-) ?

> Some applications will not even check the [f]close() return value...

Such applications are broken, of course.

> > It is possible to make an application which relies on data=ordered
> > semantics; for example, skipping the "flush data to temp file" step
> > above. But such an application would be broken for every version of
> > Unix *except* Linux in data=ordered mode. I would call that an
> > incorrect application.
>
> Or very specific, at least.

Hm. Does BSD with soft updates guarantee anything about write
ordering on fsync()? In particular, does it promise to commit the
data before the metadata?

> > A theorist would say that "more safe" is a sloppy concept. Either an
> > operation is safe or it is not. As I said in my last message,
> > data=ordered (and data=journal) can reduce the risk for poorly written
> > apps. But they cannot eliminate that risk, and for a correctly
> > written app, data=writeback is 100% as safe.
>
> IF that application uses a marker to mark completion. If it does not,
> data=ordered will be the safe bet, regardless of fsync() or not. The
> machine can crash BEFORE the fsync() is called.

Without marking completion, there is no safe bet. Without calling
fsync(), you *never* know when the data will hit the disk. It is very
hard to build a reliable system that way... For an MTA, for example,
you can never safely inform the remote mailer that you have accepted
the message. But this problem goes beyond MTAs; very few applications
live in a vacuum.

Reliable systems are tricky. I guess this is why Oracle and Sybase
make all that money.

- Pat

2002-07-15 21:28:13

by Patrick J. LoPresti

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Chris Mason <[email protected]> writes:

> Yes, most mtas do this for queue files, I'm not sure how many do it for
> the actual spool file.

Maybe the control files are small enough to fit in one disk block,
making the operations atomic in practice. Or something.

> mail server authors are more than welcome to recommend the best
> safety/performance combo for their product, and to ask the FS guys
> which combinations are safe.

Yeah, but it's a shame if those combinations require performance hits
like "synchronous directory updates" or, worse, "fsync() == sync()".

I really wish MTA authors would just support Linux's "fsync the
directory" approach. It is simple, reliable, and fast. Yes, it does
require Linux-specific support in the application, but that's what
application authors should expect when there is a gap in the
standards.

- Pat

2002-07-15 21:36:06

by Thunder from the hill

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On 15 Jul 2002, Patrick J. LoPresti wrote:
> Note that this means writing a truly reliable shell or Perl script is
> tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do
> you do for a shell script? /bin/sync :-) ?

Write a binary (/usr/bin/fsync) which opens an fd, fsyncs it, closes it,
and is done with it.
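
Something like this minimal sketch would do (error handling via
perror is an arbitrary choice):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int i, fd, rc = 0;

        for (i = 1; i < argc; i++) {
            /* O_RDONLY is enough; fsync() flushes the whole file
               regardless of the open mode. */
            if ((fd = open(argv[i], O_RDONLY)) < 0) {
                perror(argv[i]);
                rc = 1;
                continue;
            }
            if (fsync(fd) != 0) {
                perror(argv[i]);
                rc = 1;
            }
            close(fd);
        }
        return rc;
    }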

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-15 21:42:24

by Alan

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 21:55, Matthias Andree wrote:
> I assume that most will just call close() or fclose() and exit() right
> away. Does fclose() imply fsync()?

It doesn't.

> Some applications will not even check the [f]close() return value...

We are only interested in reliable code. Anything else is already
fatally broken.

-- quote --
Not checking the return value of close is a common but
nevertheless serious programming error. File system
implementations which use techniques as ``write-behind''
to increase performance may lead to write(2) succeeding,
although the data has not been written yet. The error
status may be reported at a later write operation, but it
is guaranteed to be reported on closing the file. Not
checking the return value when closing the file may lead
to silent loss of data. This can especially be observed
with NFS and disk quotas.

2002-07-15 21:56:56

by Ketil Froyn

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On 15 Jul 2002, Patrick J. LoPresti wrote:

> Without calling fsync(), you *never* know when the data will hit the
> disk.

Doesn't bdflush ensure that data is written to disk within 30 seconds or
some tunable number of seconds?

Ketil

2002-07-15 21:55:23

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Alan Cox wrote:

> We are only interested in reliable code. Anything else is already
> fatally broken.
>
> -- quote --
> Not checking the return value of close is a common but
> nevertheless serious programming error. File system

As in 6. on http://www.apocalypse.org/pub/u/paul/docs/commandments.html
(The Ten Commandments for C Programmers, by Henry Spencer).

2002-07-15 22:11:58

by Richard A Nelson

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On 15 Jul 2002, Patrick J. LoPresti wrote:

> I really wish MTA authors would just support Linux's "fsync the
> directory" approach. It is simple, reliable, and fast. Yes, it does
> require Linux-specific support in the application, but that's what
> application authors should expect when there is a gap in the
> standards.

This is exactly what sendmail did in its 8.12.0 release (2001/09/08)

--
Rick Nelson
"...very few phenomena can pull someone out of Deep Hack Mode, with two
noted exceptions: being struck by lightning, or worse, your *computer*
being struck by lightning."
(By Matt Welsh)

2002-07-15 23:05:53

by Matti Aarnio

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, Jul 15, 2002 at 11:59:48PM +0200, Ketil Froyn wrote:
> On 15 Jul 2002, Patrick J. LoPresti wrote:
> > Without calling fsync(), you *never* know when the data will hit the
> > disk.
>
> Doesn't bdflush ensure that data is written to disk within 30 seconds or
> some tunable number of seconds?

It TRIES TO, it does not guarantee anything.

The MTA systems are an example of software suites which have
transaction requirements. The goal has usually been stated
as: must not fail to deliver.

Practical implementations without full-blown all encompassing
transactions will usually mean that the message "will be delivered
at least once", e.g. double-delivery can happen.

One view of MTA behaviour is that it moves the message from one
substate to another during its processing.

These days, usually, the transaction database for MTAs is the UNIX
filesystem. For ZMailer I have considered (although not actually
done - yet) using SleepyCat DB files for the transaction subsystem.
There are great challenges in failure compartmentalisation and
integrity when using that kind of integrated database mechanism.
Getting a SEGV is potentially a _very_ bad thing!

> Ketil

/Matti Aarnio

2002-07-16 00:59:38

by Lawrence Greenfield

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

From: "Patrick J. LoPresti" <[email protected]>
Date: 15 Jul 2002 17:31:07 -0400
[...]
I really wish MTA authors would just support Linux's "fsync the
directory" approach. It is simple, reliable, and fast. Yes, it does
require Linux-specific support in the application, but that's what
application authors should expect when there is a gap in the
standards.

Actually, it's not all that simple (you have to find the enclosing
directories of any files you're modifying, which might require string
manipulation) or necessarily all that fast (you're doubling the number
of system calls and now the application is imposing an ordering on the
filesystem that didn't exist before).

It's only necessary for ext2. Modern Linux filesystems (such as ext3
or reiserfs) don't require it.

Finally: ext2 isn't safe even if you do call fsync() on the directory!

Let's consider: some filesystem operation modifies two different
blocks. This operation is safe if block A is written before block
B.

. FFS guarantees this by performing the writes synchronously: block A
is written when it is changed, followed by block B when it is changed.

. Journalling filesystems (ext3, reiserfs) guarantee this by
journalling the operation and forcing that journal entry to disk
before either A or B can be modified.

. What does ext2 do (in the default mode)? It modifies A, it modifies
B, and then leaves it up to the buffer cache to write them back---and
the buffer cache might decide to write B before A.

We're finally getting to some decent shared semantics on
filesystems. Reiserfs, ext3, FFS w/ softupdates, vxfs, etc., all work
with just fsync()ing the file (though an fsync() is required after a
link() or rename() operation). Let's encourage all filesystems to
provide these semantics and make it slightly easier on us stupid
application programmers.

Larry




2002-07-16 01:53:22

by Thunder from the hill

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On 15 Jul 2002, Patrick J. LoPresti wrote:
> Doing that instead of fsync'ing the
> file adds at most two system calls (to open and close the directory),

Keep the directory fd open all the time, and flush it when needed. This
gets rid of repeating dd = open(dir, ...); fsync(dd); close(dd); for
every update: you do one dd = open(dir, ...);, then fsync(dd); whenever
needed, and finally a single close(dd);.

Not too much of an overhead, is it?
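
In C, the amortized pattern is just this (a sketch; the spool path is
an assumption and error checking is omitted for brevity):

    #include <fcntl.h>
    #include <unistd.h>

    /* The directory fd stays open for the life of the process; each
       fsync(dd) commits whatever renames have happened so far. */
    static int dd = -1;

    void spool_init(void)   { dd = open("/var/spool/mail", O_RDONLY); }
    void spool_commit(void) { fsync(dd); }  /* after each rename batch */
    void spool_done(void)   { close(dd); }  /* once, at shutdown */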

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-16 01:40:42

by Patrick J. LoPresti

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Lawrence Greenfield <[email protected]> writes:

> Actually, it's not all that simple (you have to find the enclosing
> directories of any files you're modifying, which might require string
> manipulation)

No, you have to find the directories you are modifying. And the
application knows darn well which directories it is modifying.

Don't speculate. Show some sample code, and let's see how hard it
would be to use the "Linux way". I am betting on "not hard at all".

> or necessarily all that fast (you're doubling the number of system
> calls and now the application is imposing an ordering on the
> filesystem that didn't exist before).

No, you are not doubling the number of system calls. As I have tried
to point out repeatedly, doing this stuff reliably and portably
already requires a sequence like this:

write data
flush data
write "validity" indicator (e.g., rename() or fchmod())
flush validity indicator

On Linux, flushing a rename() means calling fsync() on the directory
instead of the file. That's it. Doing that instead of fsync'ing the
file adds at most two system calls (to open and close the directory),
and those can be amortized over many operations on that directory
(think "mail spool"). So the system call overhead is non-existent.

As for "imposing an ordering on the filesystem that didn't exist
before", that is complete nonsense. This is imposing *precisely* the
ordering required for reliable operation; no more, no less. Relying
on mount options, "chattr +S", or journaling artifacts for your
ordering is the inefficient approach; since they impose extra
ordering, they can never be faster and will usually be slower.

> It's only necessary for ext2. Modern Linux filesystems (such as ext3
> or reiserfs) don't require it.

Only because they take the performance hit of flushing the whole log
to disk on every fsync(). Combine that with "data=ordered" and see
what happens to your performance. (Perhaps "data=ordered" should be
called "fsync=sync".) I would rather get back the performance and
convince application authors to understand what they are doing.

> Finally: ext2 isn't safe even if you do call fsync() on the directory!

Wrong.

write temp file
fsync() temp file
rename() temp file to actual file
fsync() directory

No matter where this crashes, it is perfectly safe on ext2. (If not,
ext2 is badly broken.) The worst that can happen after a crash is
that the file might exist with both the old name and the new name.
But an application can detect this case on startup and clean it up.
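
For reference, those four steps in C (a sketch; the names, buffer
sizes, and the omitted startup cleanup are illustrative assumptions):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Atomically and durably replace dir/file with the given contents. */
    int reliable_write(const char *dir, const char *tmp, const char *file,
                       const char *buf, size_t len)
    {
        char tpath[1024], fpath[1024];
        int fd, dd;

        snprintf(tpath, sizeof(tpath), "%s/%s", dir, tmp);
        snprintf(fpath, sizeof(fpath), "%s/%s", dir, file);

        /* 1+2: write temp file, fsync() temp file */
        if ((fd = open(tpath, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* 3: rename() temp file to actual file */
        if (rename(tpath, fpath) != 0)
            return -1;

        /* 4: fsync() the directory to commit the rename */
        if ((dd = open(dir, O_RDONLY)) < 0)
            return -1;
        if (fsync(dd) != 0) {
            close(dd);
            return -1;
        }
        return close(dd);
    }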

- Pat

2002-07-16 07:04:29

by Dax Kelson

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 2002-07-15 at 09:22, Patrick J. LoPresti wrote:

> One other thing. I think this statement is misleading:
>
> IF your server is stable and not prone to crashing, and/or you
> have the write cache on your hard drives battery backed, you
> should strongly consider using the writeback journaling mode of
> Ext3 versus ordered.

I rewrote that statement on the website.

Dax Kelson
Guru Labs

2002-07-16 08:12:51

by Stelian Pop

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, Jul 15, 2002 at 06:48:05PM +0100, Sam Vilain wrote:

> On the flipside, ext2 over reiserfs:
[...]
> - there is a `dump' command (but it's useless, because it hangs when you
> run it on mounted filesystems - come on, who REALLY unmounts their
> filesystems for a nightly dump? You need a 3 way mirror to do it
> while guaranteeing filesystem availability...)

dump(8) doesn't hang when dumping mounted filesystems. You are referring
to a genuine bug which was fixed some time ago.

However, on some rare occasions, dump can save corrupted data when
saving a mounted and generally highly active filesystem. Even then,
in 99% of the cases it doesn't really matter because the corrupted
files will get saved by the next incremental dump.

Come on, who REALLY expects to have consistent backups without
either unmounting the filesystem or using some snapshot techniques?

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-07-16 08:15:19

by Stelian Pop

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, Jul 15, 2002 at 02:47:04PM -0400, Mathieu Chouquet-Stringer wrote:

> According to everybody, dump is deprecated (and it shouldn't work reliably
> with 2.4, in two words: "forget it")...

This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has shown in backups, I wouldn't really rely on
his statement :-)

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-07-16 12:19:58

by Gerhard Mack

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Stelian Pop wrote:

> On Mon, Jul 15, 2002 at 02:47:04PM -0400, Mathieu Chouquet-Stringer wrote:
>
> > According to everybody, dump is deprecated (and it shouldn't work reliably
> > with 2.4, in two words: "forget it")...
>
> This needs to be "according to Linus, dump is deprecated". Given the
> interest Linus has shown in backups, I wouldn't really rely on
> his statement :-)

Either way dump is not likely to give you a reliable backup when used
with a 2.4.x kernel.

Gerhard


--
Gerhard Mack

[email protected]

<>< As a computer I find your faith in technology amusing.

2002-07-16 12:28:30

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Thunder from the hill wrote:

> Hi,
>
> On 15 Jul 2002, Patrick J. LoPresti wrote:
> > Note that this means writing a truly reliable shell or Perl script is
> > tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do
> > you do for a shell script? /bin/sync :-) ?
>
> Write a binary (/usr/bin/fsync) which opens an fd, fsyncs it, closes
> it, and is done with it.

Or steal one from FreeBSD (written by Paul Saab), fix the err() function
and be done with it.

.../usr.bin/fsync/fsync.{1,c}

Interesting side note -- mind the O_RDONLY:

    for (i = 1; i < argc; ++i) {
            if ((fd = open(argv[i], O_RDONLY)) < 0)
                    err(1, "open %s", argv[i]);

            if (fsync(fd) != 0)
                    err(1, "fsync %s", argv[i]);
            close(fd);
    }

2002-07-16 12:25:07

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Stelian Pop wrote:

> Come on, who REALLY expects to have consistent backups without
> either unmounting the filesystem or using some snapshot techniques?

Those who use [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage
manager), ...

Low-level snapshots don't do any good, they just freeze the "halfway
there" on-disk structure.

2002-07-16 12:32:46

by Matthias Andree

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Chris Mason wrote:

> On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote:
>
> > > 1) that newly grown file is someone's inbox, and the old contents of the
> > > new block include someone else's private message.
> > >
> > > 2) That newly grown file is a control file for the application, and the
> > > application expects it to contain valid data within (think sendmail).
> >
> > In a correctly-written application, neither of these things can
> > happen. (See my earlier message today on fsync() and MTAs.) To get a
> > file onto disk reliably, the application must 1) flush the data, and
> > then 2) flush a "validity" indicator. This could be a sequence like:
> >
> > create temp file
> > flush data to temp file
> > rename temp file
> > flush rename operation
>
> Yes, most mtas do this for queue files, I'm not sure how many do it for
> the actual spool file. mail server authors are more than welcome to

Fewer. For one, Postfix's local(8) daemon relies on synchronous directory
updates for Maildir spools. For mbox spools, the problem is less
prevalent, because spool files usually exist already and fsync() is
sufficient (and fsync() is done before local(8) reports success to the
queue manager).

--
Matthias Andree

2002-07-16 12:30:32

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Matti Aarnio wrote:

> These days, usually, the transaction database for MTAs is the UNIX
> filesystem. For ZMailer I have considered (although not actually
> done - yet) using SleepyCat DB files for the transaction subsystem.
> There are great challenges in failure compartmentalisation and
> integrity when using that kind of integrated database mechanism.
> Getting a SEGV is potentially a _very_ bad thing!

Read: lethal to the spool. Has SleepyCat DB learned to recover from
ENOSPC in the meantime? I had a db1.85 file corrupted after ENOSPC once...

--
Matthias Andree

2002-07-16 12:40:41

by Stelian Pop

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 02:27:56PM +0200, Matthias Andree wrote:

> > Come on, who REALLY expects to have consistent backups without
> > either unmounting the filesystem or using some snapshot techniques ?
>
> Those who use [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage
> manager), ...
>
> Low-level snapshots don't do any good, they just freeze the "halfway
> there" on-disk structure.

But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the
"halfway there" data, but at the file level instead (application
instead of filesystem)...

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-07-16 12:45:10

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> On Linux, flushing a rename() means calling fsync() on the directory
> instead of the file. That's it. Doing that instead of fsync'ing the
> file adds at most two system calls (to open and close the directory),
> and those can be amortized over many operations on that directory
> (think "mail spool"). So the system call overhead is non-existent.

Indeed, but I can also leave the file descriptor open on any file system
on any system except SOME of Linux's. (OK, this precludes systems that
don't offer POSIX synchronous completion semantics, but those systems
don't necessarily have fsync() either.)

> ordering required for reliable operation; no more, no less. Relying
> on mount options, "chattr +S", or journaling artifacts for your
> ordering is the inefficient approach; since they impose extra
> ordering, they can never be faster and will usually be slower.

It is sometimes the only way, if the application is unaware. I hope I'm
not starting a flame war if I mention qmail now, which is not even
softupdates aware. Without chattr +S or mount -o sync, nothing is to be
gained. OTOH, where mount -o sync only makes directory updates
synchronous, it's not too expensive, which is why the +D approach is
still useful there.

> > It's only necessary for ext2. Modern Linux filesystems (such as ext3
> > or reiserfs) don't require it.
>
> Only because they take the performance hit of flushing the whole log
> to disk on every fsync(). Combine that with "data=ordered" and see
> what happens to your performance. (Perhaps "data=ordered" should be
> called "fsync=sync".) I would rather get back the performance and
> convince application authors to understand what they are doing.

1. data=ordered is more than fsync=sync. It guarantees that data blocks
are flushed before flushing the meta data blocks that reference the data
blocks. Try this on ext2fs and lose.

2. sync() is unreliable: it can return control to the caller earlier
than is sound. It can "complete" at any time it desires without actually
having completed.
(Probably so that it can return at all while new blocks are being
written by another process, but at least SUSv2 did not go into detail on
this.)

3. Application authors do not desire fsync=sync semantics, but they want
to rely on "fsync(fd) also syncs recent renames". It comes as a
now-guaranteed side effect of how ext3fs works, so I am told.

I'm not sure how the ext3fs journal works internally, but it'd be fine
with all applications if only the part of a file system that is really
relevant to the current fsync(fd) were synced. No more. It seems as
though fsync==sync is an artifact that ext2 also suffers from.

--
Matthias Andree

2002-07-16 12:47:15

by Stelian Pop

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote:

> > This needs to be "according to Linus, dump is deprecated". Given the
> > interest Linus has manifested for backups, I wouldn't really rely on
> > his statement :-)
>
> Either way dump is not likely to give you a reliable backup when used
> with a 2.4.x kernel.

Since you are so well informed, maybe you could share your knowledge
with us.

I'm the dump maintainer, so I'll be very interested in knowing how it
is that dump works for me and many other users... :-)

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-07-16 12:50:11

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Stelian Pop wrote:

> > Low-level snapshots don't do any good, they just freeze the "halfway
> > there" on-disk structure.
>
> But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the
> "halfway there" data, but at the file level instead (application
> instead of filesystem)...

Not if some day somebody implements file system level snapshots for
Linux. Until then, better have garbled file contents constrained to a
file than random data as on-disk layout changes with hefty directory
updates.

dsmc regularly fstat()s the file it is currently reading, retries the
dump as the file changes, and gives up if it is updated too often. Not
sure about the server side, and it is certainly not a useful option for
sequential devices that you write to directly. Looks like a cache for
the biggest file is necessary.

2002-07-16 13:02:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote:
> Not if some day somebody implements file system level snapshots for
> Linux. Until then, better have garbled file contents constrained to a
> file than random data as on-disk layout changes with hefty directory
> updates.

or the blockdevice-level snapshots already implemented in Linux..

2002-07-16 15:08:25

by Gerhard Mack

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Stelian Pop wrote:

> Date: Tue, 16 Jul 2002 14:49:56 +0200
> From: Stelian Pop <[email protected]>
> To: Gerhard Mack <[email protected]>
> Cc: Mathieu Chouquet-Stringer <[email protected]>,
> [email protected]
> Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
>
> On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote:
>
> > > This needs to be "according to Linus, dump is deprecated". Given the
> > > interest Linus has manifested for backups, I wouldn't really rely on
> > > his statement :-)
> >
> > Either way dump is not likely to give you a reliable backup when used
> > with a 2.4.x kernel.
>
> Since you are so well informed, maybe you could share your knowledge
> with us.
>
> I'm the dump maintainer, so I'll be very interested in knowing how it
> is that dump works for me and many other users... :-)
>

I'll save myself the trouble, since Linus said it better than I could:

Note that dump simply won't work reliably at all even in
2.4.x: the buffer cache and the page cache (where all the
actual data is) are not coherent. This is only going to
get even worse in 2.5.x, when the directories are moved
into the page cache as well.

So anybody who depends on "dump" getting backups right is
already playing russian roulette with their backups. It's
not at all guaranteed to get the right results - you may
end up having stale data in the buffer cache that ends up
being "backed up".


In other words you have a backup system that works some of the time or
even most of the time... brilliant!

Gerhard

--
Gerhard Mack

[email protected]

<>< As a computer I find your faith in technology amusing.

2002-07-16 15:19:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote:
> On Tue, 16 Jul 2002, Stelian Pop wrote:
>
> > Date: Tue, 16 Jul 2002 14:49:56 +0200
> > From: Stelian Pop <[email protected]>
> > To: Gerhard Mack <[email protected]>
> > Cc: Mathieu Chouquet-Stringer <[email protected]>,
> > [email protected]
> > Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
> >
> > On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote:
> >
> > > > This needs to be "according to Linus, dump is deprecated". Given the
> > > > interest Linus has manifested for backups, I wouldn't really rely on
> > > > his statement :-)
> > >
> > > Either way dump is not likely to give you a reliable backup when used
> > > with a 2.4.x kernel.
> >
> > Since you are so well informed, maybe you could share your knowledge
> > with us.
> >
> > I'm the dump maintainer, so I'll be very interested in knowing how it
> > is that dump works for me and many other users... :-)
> >
>
> I'll save myself the trouble, since Linus said it better than I could:
>
> Note that dump simply won't work reliably at all even in
> 2.4.x: the buffer cache and the page cache (where all the
> actual data is) are not coherent. This is only going to
> get even worse in 2.5.x, when the directories are moved
> into the page cache as well.
>
> So anybody who depends on "dump" getting backups right is
> already playing russian roulette with their backups. It's
> not at all guaranteed to get the right results - you may
> end up having stale data in the buffer cache that ends up
> being "backed up".
>
>
> In other words you have a backup system that works some of the time or
> even most of the time... brilliant!

Just to clarify, the above implicitly assumes the fs is mounted
read-write while you're dumping it. If the fs is mounted readonly or if
it's unmounted, there is no problem with dumping it. Also note that dump
has the same problem with a read-write mounted fs in 2.2, and I guess
in 2.0 too; it's nothing new in 2.4, it just gets more visible the more
logical dirty caches we have.

Andrea

2002-07-16 15:36:44

by Stelian Pop

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote:

> In other words you have a backup system that works some of the time or
> even most of the time... brilliant!

Dump is a backup system that works 100% of the time when used as
it was designed to: on unmounted filesystems (or mounted R/O).

It is indeed brilliant to have it work, even most of the time,
in conditions it wasn't designed for.

Stelian.
--
Stelian Pop <[email protected]>
Alcove - http://www.alcove.com

2002-07-16 15:51:06

by Thunder from the hill

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On Tue, 16 Jul 2002, Matthias Andree wrote:
> > Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be
> > done with it.
>
> Or steal one from FreeBSD (written by Paul Saab), fix the err() function
> and be done with it.
>
> .../usr.bin/fsync/fsync.{1,c}
>
> Interesting side note -- mind the O_RDONLY:
>
> 	for (i = 1; i < argc; ++i) {
> 		if ((fd = open(argv[i], O_RDONLY)) < 0)
> 			err(1, "open %s", argv[i]);
>
> 		if (fsync(fd) != 0)
> 			err(1, "fsync %s", argv[i]);
> 		close(fd);
> 	}

Pretty much the thing I had in mind, except that the close return code is
disregarded here...

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-16 19:24:00

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Thunder from the hill wrote:

> > if (fsync(fd) != 0)
> > err(1, "fsync %s", argv[1]);
> > close(fd);
> > }
>
> Pretty much the thing I had in mind, except that the close return code is
> disregarded here...

Indeed, but OTOH, what error is close to report when the file is opened
read-only?

2002-07-16 19:35:45

by Thunder from the hill

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On Tue, 16 Jul 2002, Matthias Andree wrote:
> Indeed, but OTOH, what error is close to report when the file is opened
> read-only?

Well, you can still get EIO, EINTR, EBADF. Whatever you say, disregarding
the close return code is never any good.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-16 19:35:45

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Christoph Hellwig wrote:

> On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote:
> > Not if some day somebody implements file system level snapshots for
> > Linux. Until then, better have garbled file contents constrained to a
> > file than random data as on-disk layout changes with hefty directory
> > updates.
>
> or the blockdevice-level snapshots already implemented in Linux..

That would require three atomic steps:

1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write

and then backup the snapshot. A snapshot of a live file system won't
do, it can be as inconsistent as it desires -- whether your corrupt
target is moving or not, dumping it is not of much use.

--
Matthias Andree

2002-07-16 19:42:50

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Stelian Pop wrote:

> On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote:
>
> > In other words you have a backup system that works some of the time or
> > even most of the time... brilliant!
>
> Dump is a backup system that works 100% of the time when used as
> it was designed to: on unmounted filesystems (or mounted R/O).

Practical question: how do I get a file system mounted R/O for backup
with dump without putting that system into single-user mode?
Particularly when running automated backups, this is an issue. I cannot
kill all writers (syslog, Postfix, INN, CVS server, ...) on my
production machines just for the sake of taking a backup.

2002-07-16 19:49:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 16, 2002 21:38 +0200, Matthias Andree wrote:
> On Tue, 16 Jul 2002, Christoph Hellwig wrote:
> > On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote:
> > > Not if some day somebody implements file system level snapshots for
> > > Linux. Until then, better have garbled file contents constrained to a
> > > file than random data as on-disk layout changes with hefty directory
> > > updates.
> >
> > or the blockdevice-level snapshots already implemented in Linux..
>
> That would require three atomic steps:
>
> 1. mount read-only, flushing all pending updates
> 2. take snapshot
> 3. mount read-write
>
> and then backup the snapshot. A snapshot of a live file system won't
> do, it can be as inconsistent as it desires -- whether your corrupt
> target is moving or not, dumping it is not of much use.

Luckily, there is already an interface which does this -
sync_supers_lockfs(), which the LVM code will use if it is patched in.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-16 20:01:32

by Shawn

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

You don't.

This is where you have a filesystem where syslog, xinetd, blogd,
bloatd-config-d2, raffle-ticketd DO NOT LIVE.

People forget so easily the wonders of multiple partitions.

On 07/16, Matthias Andree said something like:
> On Tue, 16 Jul 2002, Stelian Pop wrote:
>
> > On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote:
> >
> > > In other words you have a backup system that works some of the time or
> > > even most of the time... brilliant!
> >
> > Dump is a backup system that works 100% of the time when used as
> > it was designed to: on unmounted filesystems (or mounted R/O).
>
> Practical question: how do I get a file system mounted R/O for backup
> with dump without putting that system into single-user mode?
> Particularly when running automated backups, this is an issue. I cannot
> kill all writers (syslog, Postfix, INN, CVS server, ...) on my
> production machines just for the sake of taking a backup.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Shawn Leas
[email protected]

So, do you live around here often?
-- Stephen Wright

2002-07-16 20:10:00

by Thunder from the hill

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On Tue, 16 Jul 2002, Matthias Andree wrote:
> > or the blockdevice-level snapshots already implemented in Linux..
>
> That would require three atomic steps:
>
> 1. mount read-only, flushing all pending updates
> 2. take snapshot
> 3. mount read-write
>
> and then backup the snapshot. A snapshot of a live file system won't
> do, it can be as inconsistent as it desires -- whether your corrupt
> target is moving or not, dumping it is not of much use.

Well, couldn't we just kind of lock the file system so that while
backing up, no writes get through to the real filesystem? This would
possibly require a lot of memory (or other space to write to), but it
might be done?

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

by Mathieu Chouquet-Stringer

[permalink] [raw]

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 03:04:22PM -0500, Shawn wrote:
> You don't.
>
> This is where you have a filesystem where syslog, xinetd, blogd,
> bloatd-config-d2, raffle-ticketd DO NOT LIVE.
>
> People forget so easily the wonders of multiple partitions.

I'm sorry, but I don't understand how it's going to change anything. For
sure, it makes your life easier because you don't have to shutdown all your
programs that have files opened in R/W mode. But in the end, you will have
to shutdown something to remount the partition in R/O mode and usually you
don't want or can't afford to do that.

--
Mathieu Chouquet-Stringer E-Mail : [email protected]
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde

2002-07-16 20:19:41

by Shawn

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

In this case, can you use a RAID mirror or something, then break it?

Also, there's the LVM snapshot at the block layer someone already
mentioned, which, when used with smaller partitions, has less overhead
(less FS delta).

This problem isn't that complex.

On 07/16, Mathieu Chouquet-Stringer said something like:
> On Tue, Jul 16, 2002 at 03:04:22PM -0500, Shawn wrote:
> > You don't.
> >
> > This is where you have a filesystem where syslog, xinetd, blogd,
> > bloatd-config-d2, raffle-ticketd DO NOT LIVE.
> >
> > People forget so easily the wonders of multiple partitions.
>
> I'm sorry, but I don't understand how it's going to change anything. For
> sure, it makes your life easier because you don't have to shutdown all your
> programs that have files opened in R/W mode. But in the end, you will have
> to shutdown something to remount the partition in R/O mode and usually you
> don't want or can't afford to do that.
>
> --
> Mathieu Chouquet-Stringer E-Mail : [email protected]
> It is exactly because a man cannot do a thing that he is a
> proper judge of it.
> -- Oscar Wilde
--
Shawn Leas
[email protected]

I bought my brother some gift-wrap for Christmas. I took it to the Gift
Wrap department and told them to wrap it, but in a different print so he
would know when to stop unwrapping.
-- Stephen Wright

by Mathieu Chouquet-Stringer

[permalink] [raw]

Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, Jul 16, 2002 at 03:22:31PM -0500, Shawn wrote:
> In this case, can you use a RAID mirror or something, then break it?
>
> Also, there's the LVM snapshot at the block layer someone already
> mentioned, which when used with smaller partions is less overhead.
> (less FS delta)
>
> This problem isn't that complex.

I agree but I guess that if Matthias asked the question that way, he
probably meant he doesn't have a raid mirror or "something" (as you
say)... If you didn't plan your install (meaning you don't have the nice
raid or anything else), you're basically screwed...

--
Mathieu Chouquet-Stringer E-Mail : [email protected]
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde

2002-07-16 21:08:55

by James Antill

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

"Patrick J. LoPresti" <[email protected]> writes:

> Lawrence Greenfield <[email protected]> writes:
>
> > Actually, it's not all that simple (you have to find the enclosing
> > directories of any files you're modifying, which might require string
> > manipulation)
>
> No, you have to find the directories you are modifying. And the
> application knows darn well which directories it is modifying.
>
> Don't speculate. Show some sample code, and let's see how hard it
> would be to use the "Linux way". I am betting on "not hard at all".

I added fsync() on directories to exim-3.31; it took about 2 hours of
coding and another hour of testing (with strace) to make sure it was
doing the right thing. That was from almost never having seen the
source before.
The only reason it took that long was because that version of exim
altered the spool in a couple of different places. Forward porting to
3.951 took about 20 minutes IIRC (that version only plays with the
spool in one place).

--
# James Antill -- [email protected]
:0:
* ^From: .*james@and\.org
/dev/null

2002-07-16 21:22:06

by Andreas Dilger

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 16, 2002 23:06 +0200, Matthias Andree wrote:
> On Tue, 16 Jul 2002, Thunder from the hill wrote:
> > On Tue, 16 Jul 2002, Matthias Andree wrote:
> > > That would require three atomic steps:
> > >
> > > 1. mount read-only, flushing all pending updates
> > > 2. take snapshot
> > > 3. mount read-write
> > >
> > > and then backup the snapshot. A snapshot of a live file system won't
> > > do, it can be as inconsistent as it desires -- whether your corrupt
> > > target is moving or not, dumping it is not of much use.
> >
> > Well, couldn't we just kindof lock the file system so that while backing
> > up no writes get through to the real filesystem? This will possibly
> > require a lot of memory (or another space to write to), but it might be
> > done?
>
But you would want to back up a consistent file system, so when
entering the freeze or snapshot mode, you must flush all pending data in
such a way that the snapshot is consistent (i.e. needs no fsck action
whatsoever).

This is all done already for both LVM and EVMS snapshots. The filesystem
(ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
frozen, the snapshot is created, and the filesystem becomes active again.
It takes a second or less. Then dump will guarantee 100% correct backups
of the snapshot filesystem. You would have to do a backup on the snapshot
to guarantee 100% correctness even with tar.

Most people don't care, because they don't even do backups in the first
place, until they have lost a lot of their data and they learn. Even
without snapshots, while dump isn't guaranteed to be 100% correct for
rapidly changing filesystems, I have been using it for years on both
2.2 and 2.4 without any problems on my home systems. I have even
restored data from those same backups...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-16 21:36:15

by Thunder from the hill

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

Hi,

On Tue, 16 Jul 2002, Andreas Dilger wrote:
> This is all done already for both LVM and EVMS snapshots. The filesystem
> (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
> frozen, the snapshot is created, and the filesystem becomes active again.
> It takes a second or less.

Anyway, we could do that in parallel if we did it like that:

sync -> significant data is being written
lock -> data writes stay cached, but aren't written
snapshot
unlock -> data is getting written
now unmount the snapshot (clean it)
write the modified snapshot to disk...

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-16 22:16:46

by John Stoffel

[permalink] [raw]
Subject: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)


It's really quite simple in theory to do proper backups. But you need
to have application support to make it work in most cases. It would
flow like this:

1. lock application(s), flush any outstanding transactions.
2. lock filesystems, flush any outstanding transactions.

3a. lock mirrored volume, flush any outstanding transactions, break
mirror.
--or--
3b. snapshot filesystem to another volume.

4. unlock volume

5. unlock filesystem

6. unlock application(s).

7. do backup against quiescent volume/filesystem.

In reality, people didn't lock filesystems (remount R/O) unless they
had to (ClearCase, Oracle, any DBMS, etc. are the exceptions), since
the time hit was too much. The chance of getting a bad backup of user
home directories or mail spools wasn't worth the extra cost of making
sure to get a clean backup. For the exceptions, that's why god made
backup windows and such. These days, those windows are minuscule, so
the seven steps outlined above are what needs to happen for a truly
reliable backup of important data.

John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[email protected] - http://www.lucent.com - 978-399-0479



2002-07-16 22:30:40

by Thunder from the hill

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Hi,

I do it like this:

-> Reconfigure port switch to use B server
-> Backup A server
-> Replay B server journals on A server
-> Switch to A server
-> Backup B server
-> Replay A server journals on B server
-> Reconfigure port switch to dynamic mode

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-16 23:19:29

by Zack Weinberg

[permalink] [raw]
Subject: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Thunder wrote:
> On Tue, 16 Jul 2002, Matthias Andree wrote:
> > Indeed, but OTOH, what error is close to report when the file is
> > opened read-only?
>
> Well, you can still get EIO, EINTR, EBADF. Whatever you say,
> disregarding the close return code is never any good.

Making use of the close return value is also never any good.

Consider: There is no guarantee that close will detect errors. Only
NFS and Coda implement f_op->flush methods. For files on all other
file systems, sys_close will always return success (assuming the file
descriptor was open in the first place); the data may still be sitting
in the page cache. If you need the data pushed to the physical disk,
you have to call fsync.

Consider: If you have called fsync, and it returned successfully, an
immediate call to close is guaranteed to return successfully. (Any
hypothetical f_op->flush method would have nothing to do; if not, that
filesystem does not correctly implement fsync.)

Therefore, I would argue that it is wrong for any application ever to
inspect close's return value. Either the program does not need data
integrity guarantees, or it should be using fsync and paying attention
to that instead.

There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed? The standards do not specify. If it has not been
closed, you have a descriptor leak. But if it has been closed, it is
too late to recover from the error. [As far as I know, Unix
implementations generally do close the descriptor.]

The manpage that was quoted earlier in this thread is incorrect in
claiming that errors will be detected by close; it should be fixed.

zw
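
The policy argued for above reduces to a pattern like the following
minimal sketch (a hypothetical helper, not code from any of the posts;
the opposite view, that close() should be checked regardless, is argued
in the replies that follow):

	#include <unistd.h>

	/* Hypothetical helper: fsync() carries the data-integrity check,
	 * and close() is treated as mere resource release. */
	int
	finish_write(int fd)
	{
		if (fsync(fd) != 0) {
			close(fd);	/* fsync() already reported the error */
			return -1;
		}
		close(fd);	/* per the argument above, nothing left to check */
		return 0;
	}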

2002-07-16 23:50:13

by Alan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
> Making use of the close return value is also never any good.

This is untrue

> Consider: There is no guarantee that close will detect errors. Only
> NFS and Coda implement f_op->flush methods. For files on all other
> file systems, sys_close will always return success (assuming the file
> descriptor was open in the first place); the data may still be sitting
> in the page cache. If you need the data pushed to the physical disk,
> you have to call fsync.

close() checking is not about physical disk guarantees. It's about the
more basic "I/O completed". In some future Linux, only close() might
tell you about some kinds of I/O error. The fact that it doesn't do so
now is no excuse for sloppy programming.

> There's also an ugly semantic bind if you make close detect errors.
> If close returns an error other than EBADF, has that file descriptor
> been closed? The standards do not specify. If it has not been
> closed, you have a descriptor leak. But if it has been closed, it is
> too late to recover from the error. [As far as I know, Unix
> implementations generally do close the descriptor.]

If it bothers you close it again 8)

> The manpage that was quoted earlier in this thread is incorrect in
> claiming that errors will be detected by close; it should be fixed.

The man page matches the standard. The implementation may be a subset
of the allowed standard right now, but don't program to implementation
assumptions; it leads to nasty accidents.

2002-07-16 23:59:42

by David Miller

[permalink] [raw]
Subject: Re: close return value

From: Alan Cox <[email protected]>
Date: 17 Jul 2002 02:03:02 +0100

close() checking is not about physical disk guarantees. It's about the
more basic "I/O completed". In some future Linux, only close() might
tell you about some kinds of I/O error. The fact that it doesn't do so
now is no excuse for sloppy programming.

Practice dictates that if you make close() return error values
your whole system will blow up. Try it out for yourself.
I can tell you of at least 1 app that is going to explode :-)

I believe Linus mentioned way back when that this is a "shall not"
when we had similar problems with NFS returning errors from close().

2002-07-17 00:07:57

by Zack Weinberg

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 02:03:02AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
> > Making use of the close return value is also never any good.
>
> This is untrue

I beg to differ.

> > Consider: There is no guarantee that close will detect errors. Only
> > NFS and Coda implement f_op->flush methods. For files on all other
> > file systems, sys_close will always return success (assuming the file
> > descriptor was open in the first place); the data may still be sitting
> > in the page cache. If you need the data pushed to the physical disk,
> > you have to call fsync.
>
> close() checking is not about physical disk guarantees. It's about more
> basic "I/O completed". In some future Linux only close() might tell you
> about some kinds of I/O error.

I think we're talking past each other.

My first point is that a portable application cannot rely on close to
detect any error. Only fsync guarantees to detect any errors at all
(except ENOSPC/EDQUOT, which should come back on write; yes, I know
about the buggy NFS implementations that report them only on close).

My second point, which you deleted, is that if some hypothetical close
implementation reports an error under some circumstances, an
immediately preceding fsync call MUST also report the same error under
the same circumstances.

Therefore, if you've checked the return value of fsync, there's no
point in checking the subsequent close; and if you don't care to call
fsync, the close return value is useless since it isn't guaranteed to
detect anything.

> > There's also an ugly semantic bind if you make close detect errors.
> > If close returns an error other than EBADF, has that file descriptor
> > been closed? The standards do not specify. If it has not been
> > closed, you have a descriptor leak. But if it has been closed, it is
> > too late to recover from the error. [As far as I know, Unix
> > implementations generally do close the descriptor.]
>
> If it bothers you close it again 8)

And watch it come back with an error again, repeat ad infinitum?

> > The manpage that was quoted earlier in this thread is incorrect in
> > claiming that errors will be detected by close; it should be fixed.
>
> The man page matches the stsndard. Implementation may be a subset of the
> allowed standard right now, but don't program to implementation
> assumptions, it leads to nasty accidents

You missed the point. The manpage asserts that I/O errors are
guaranteed to be detected by close; there is no such guarantee.

zw

2002-07-17 00:22:53

by Alan

[permalink] [raw]
Subject: Re: close return value

On Wed, 2002-07-17 at 00:52, David S. Miller wrote:
> From: Alan Cox <[email protected]>
> Date: 17 Jul 2002 02:03:02 +0100
>
> close() checking is not about physical disk guarantees. It's about the
> more basic "I/O completed". In some future Linux, only close() might
> tell you about some kinds of I/O error. The fact that it doesn't do so
> now is no excuse for sloppy programming.
>
> Practice dictates that if you make close() return error values
> your whole system will blow up. Try it out for yourself.
> I can tell you of at least 1 app that is going to explode :-)
>
> I believe Linus mentioned way back when that this is a "shall not"
> when we had similar problems with NFS returning errors from close().

Our NFS can return errors from close(). So I'd get fixing the
applications.

2002-07-17 00:32:50

by Alan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, 2002-07-17 at 01:10, Zack Weinberg wrote:
> My first point is that a portable application cannot rely on close to
> detect any error. Only fsync guarantees to detect any errors at all
> (except ENOSPC/EDQUOT, which should come back on write; yes, I know
> about the buggy NFS implementations that report them only on close).

They are not buggy, merely inconvenient. The reality of the NFS
protocol makes it the only viable way to do it.

> My second point, which you deleted, is that if some hypothetical close
> implementation reports an error under some circumstances, an
> immediately preceding fsync call MUST also report the same error under
> the same circumstances.

I can't think of a case where I'd disagree.

> Therefore, if you've checked the return value of fsync, there's no
> point in checking the subsequent close; and if you don't care to call
> fsync, the close return value is useless since it isn't guaranteed to
> detect anything.

If you don't check the return code it might not detect anything. If you
do check the return code it might detect something. In fact you
contradict yourself IMHO by giving the NFS example.

> > If it bothers you close it again 8)
>
> And watch it come back with an error again, repeat ad infinitum?

The use of a little intelligence does help. Come on, I know you aren't
a COBOL programmer. Check for -EBADF ...

> You missed the point. The manpage asserts that I/O errors are
> guaranteed to be detected by close; there is no such guarantee.

Disagree. It says

It is quite possible that errors on a previous write(2) operation
are first reported at the final close

Not checking the return value when closing the file may lead to silent
loss of data.

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2). (It will
depend on the disk hardware at this point.)

None of which guarantee what you say, and which agree about the use of
fsync being appropriate now and then

2002-07-17 00:27:16

by David Miller

[permalink] [raw]
Subject: Re: close return value

From: Alan Cox <[email protected]>
Date: 17 Jul 2002 02:35:41 +0100

Our NFS can return errors from close().

Better tell Linus.

So I'd get fixing the applications.

I wish you luck; it is quite a daunting task and nothing I would
sanely sign up for :-)

2002-07-17 02:20:43

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 02:03:02AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 00:22, Zack Weinberg wrote:
>
> > There's also an ugly semantic bind if you make close detect errors.
> > If close returns an error other than EBADF, has that file descriptor
> > been closed? The standards do not specify. If it has not been
> > closed, you have a descriptor leak. But if it has been closed, it is
> > too late to recover from the error. [As far as I know, Unix
> > implementations generally do close the descriptor.]
>
> If it bothers you close it again 8)

Consider:

Two threads share the file descriptor table.

1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.


open() may return any file descriptor not currently in use.

Is step 3 necessary? Is it dangerous? The question is, is close
guaranteed to work, or isn't it?


Case 1: Close is guaranteed to close the file.

Thread 2 may have just re-used the file descriptor. Thus, Thread 1
closes a different file in step 3. Thread 2 is now using a bad file
descriptor, and becomes very angry because the kernel just said all was
right with the world, and then claims there was a mistake. Thread 2
leaves in a huff.


Case 2: Close is guaranteed to leave the file open on error.

Thread 2 can't have just re-used the descriptor, so the world is ok in
that sense. However, Thread 1 *must* perform step 3, or it leaks a
descriptor, the tables fill, and the world becomes a frozen wasteland.


Case 3: Close may or may not leave it open due to random chance or
filesystem peculiarities.

Thread 1 may be required to close it twice, or it may be required not to
close it twice. It doesn't know! Night is falling! The world is in
flames! Aaaaaaugh!


I believe this demonstrates the need for a standard, one way, or the
other. :-)

-J

2002-07-17 02:53:58

by Thunder from the hill

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Hi,

On Tue, 16 Jul 2002, Elladan wrote:
> Two threads share the file descriptor table.
>
> 1. Thread 1 performs close() on a file descriptor. close fails.
> 2. Thread 2 performs open().
> * 3. Thread 1 performs close() again, just to make sure.

Thread 2 shouldn't be able to reuse a currently open fd. This application
design is seriously broken.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-17 02:58:25

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Tue, Jul 16, 2002 at 08:54:54PM -0600, Thunder from the hill wrote:
> Hi,
>
> On Tue, 16 Jul 2002, Elladan wrote:
> > Two threads share the file descriptor table.
> >
> > 1. Thread 1 performs close() on a file descriptor. close fails.
> > 2. Thread 2 performs open().
> > * 3. Thread 1 performs close() again, just to make sure.
>
> Thread 2 shouldn't be able to reuse a currently open fd. This application
> design is seriously broken.

No.

Thread 2 doesn't manage the file descriptor table, the kernel does.
Whether the kernel may re-use the descriptor or not depends on whether
the descriptor is closed or not. The kernel knows, but unless close()
behaves in a defined way, the application does not at this point. Thus,
step 3 may either be required, forbidden, or undefined.

-J

2002-07-17 03:09:51

by Thunder from the hill

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Hi,

On Tue, 16 Jul 2002, Elladan wrote:
> > Thread 2 shouldn't be able to reuse a currently open fd. This application
> > design is seriously broken.

Okay, again. It's about doing a second close() in case the first one
fails with EINTR. If we have to do it again, the filehandle is not
closed, and if the filehandle is not closed, the kernel knows that, and
if the kernel knows that the filehandle is still open, it won't get
reassigned. Problem gone.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-17 03:28:52

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Tue, Jul 16, 2002 at 09:10:49PM -0600, Thunder from the hill wrote:
> Hi,
>
> On Tue, 16 Jul 2002, Elladan wrote:
> > > Thread 2 shouldn't be able to reuse a currently open fd. This application
> > > design is seriously broken.
>
> Okay, again. It's about doing a second close() in case the first one fails
> with EAGAIN. If we have to do it again, the filehandle is not closed, and
> if the filehandle is not closed, the kernel knows that, and if the kernel
> knows that the filehandle is still open, it won't get reassigned. Problem
> gone.

This is case 2, "Close is guaranteed to leave the file open on error."

In this case, all applications are required to reissue close commands
upon certain errors, or leak a file descriptor. This would be a well
defined behavior, though perhaps error prone.

However, note that this is manifestly different from case 1, "Close is
guaranteed to close the file the first time." If the system behaves via
case 1, closing the handle again is broken as the example illustrated.

The worst, of course, would be undefined behavior for close. In this
case, the application effectively can't do the right thing without
extreme measures.

-J

2002-07-17 03:43:30

by Russ Allbery

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Zack Weinberg <[email protected]> writes:

> Consider: There is no guarantee that close will detect errors. Only
> NFS and Coda implement f_op->flush methods.

And AFS, I believe. (Not in the standard kernel, of course.)

--
Russ Allbery ([email protected]) <http://www.eyrie.org/~eagle/>

2002-07-17 04:14:49

by Stephen Oberholtzer

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

At 07:22 PM 7/16/2002 -0700, Elladan wrote:
> 1. Thread 1 performs close() on a file descriptor. close fails.
> 2. Thread 2 performs open().
>* 3. Thread 1 performs close() again, just to make sure.
>
>
>open() may return any file descriptor not currently in use.

I'm confused here... the only way close() can fail is if the file
descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
a known-to-be-bad descriptor?


--
Stevie-O

Real programmers use COPY CON PROGRAM.EXE

2002-07-17 04:36:36

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
> At 07:22 PM 7/16/2002 -0700, Elladan wrote:
> > 1. Thread 1 performs close() on a file descriptor. close fails.
> > 2. Thread 2 performs open().
> >* 3. Thread 1 performs close() again, just to make sure.
> >
> >
> >open() may return any file descriptor not currently in use.
>
> I'm confused here... the only way close() can fail is if the file
> descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
> a known-to-be-bad descriptor?

Well, obviously, if that's the case. However, the man page for close(2)
doesn't agree (see below). close() is allowed to return EBADF, EINTR,
or EIO.

The question is, does the OS standard guarantee that the fd is closed,
even if close() returns EINTR or EIO? Just going by the normal usage of
EINTR, one might think otherwise. It doesn't appear to be documented
one way or another.

Alan said you could just issue close again to make sure - the example
shows that this is not the case. A second close is either required or
forbidden in that example - and the behavior has to be well defined or
you won't know which to do.

-J

NAME
close - close a file descriptor

SYNOPSIS
#include <unistd.h>

int close(int fd);

DESCRIPTION
close closes a file descriptor, so that it no longer refers
to any file and may be reused. Any locks held on the file it
was associated with, and owned by the process, are removed
(regardless of the file descriptor that was used to obtain the
lock).

If fd is the last copy of a particular file descriptor the
resources associated with it are freed; if the descriptor was the
last reference to a file which has been removed using unlink(2)
the file is deleted.

RETURN VALUE
close returns zero on success, or -1 if an error occurred.

ERRORS
EBADF fd isn't a valid open file descriptor.

EINTR The close() call was interrupted by a signal.

EIO An I/O error occurred.

CONFORMING TO
SVr4, SVID, POSIX, X/OPEN, BSD 4.3. SVr4 documents an
additional ENOLINK error condition.

NOTES
Not checking the return value of close is a common but
nevertheless serious programming error. File system
implementations which use techniques as `write-behind' to
increase performance may lead to write(2) succeeding, although
the data has not been written yet. The error status may be
reported at a later write operation, but it is guaranteed to be
reported on closing the file. Not checking the return value when
closing the file may lead to silent loss of data. This can
especially be observed with NFS and disk quotas.

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2) or
sync(2), they will get you closer to that goal (it will
depend on the disk hardware at this point).

SEE ALSO
open(2), fcntl(2), shutdown(2), unlink(2), fclose(3)

2002-07-17 07:56:17

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On 2002-07-16T17:10:32,
Zack Weinberg <[email protected]> said:

> Therefore, if you've checked the return value of fsync, there's no
> point in checking the subsequent close; and if you don't care to call
> fsync, the close return value is useless since it isn't guaranteed to
> detect anything.

There is _always_ a point in checking a return value of non void functions.

EOD.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister

2002-07-17 11:42:12

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Shawn wrote:

> In this case, can you use a RAID mirror or something, then break it?
>
> Also, there's the LVM snapshot at the block layer someone already
> mentioned, which, when used with smaller partitions, has less overhead
> (less FS delta)

All these "solutions" don't work out, I cannot remount R/O my partition,
and LVM low-level snapshots or breaking a RAID mirror simply won't work
out. I would have to remount r/o the partition to get a consistent image
in the first place, so the first step must fail already...

2002-07-17 11:44:16

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Andreas Dilger wrote:

> This is all done already for both LVM and EVMS snapshots. The filesystem
> (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
> frozen, the snapshot is created, and the filesystem becomes active again.
> It takes a second or less. Then dump will guarantee 100% correct backups
> of the snapshot filesystem. You would have to do a backup on the snapshot
> to guarantee 100% correctness even with tar.

Sure. On some machines, they will go with dsmc anyhow which reads the
file and rereads if it changes under dsmc's hands.

--
Matthias Andree

2002-07-17 14:36:34

by Andreas Schwab

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Elladan <[email protected]> writes:

|> On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
|> > At 07:22 PM 7/16/2002 -0700, Elladan wrote:
|> > > 1. Thread 1 performs close() on a file descriptor. close fails.
|> > > 2. Thread 2 performs open().
|> > >* 3. Thread 1 performs close() again, just to make sure.
|> > >
|> > >
|> > >open() may return any file descriptor not currently in use.
|> >
|> > I'm confused here... the only way close() can fail is if the file
|> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
|> > a known-to-be-bad descriptor?
|>
|> Well, obviously, if that's the case. However, the man page for close(2)
|> doesn't agree (see below). close() is allowed to return EBADF, EINTR,
|> or EIO.
|>
|> The question is, does the OS standard guarantee that the fd is closed,
|> even if close() returns EINTR or EIO? Just going by the normal usage of
|> EINTR, one might think otherwise. It doesn't appear to be documented
|> one way or another.

POSIX says the state of the file descriptor when close fails (with errno
!= EBADF) is unspecified, which means:

The value or behavior may vary among implementations that conform to
IEEE Std 1003.1-2001. An application should not rely on the existence
or validity of the value or behavior. An application that relies on
any particular value or behavior cannot be assured to be portable
across conforming implementations.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2002-07-17 15:48:19

by Thunder from the hill

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Hi,

On Tue, 16 Jul 2002, Zack Weinberg wrote:
> the close return value is useless since it isn't guaranteed to detect
> anything.

"Isn't guaranteed to detect anything" is still a lot more encouraging to
see if it does detect anything than "Is guaranteed not to detect anything".

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-17 16:47:14

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 04:39:28PM +0200, Andreas Schwab wrote:
> Elladan <[email protected]> writes:
>
> |> On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
> |> > At 07:22 PM 7/16/2002 -0700, Elladan wrote:
> |> > > 1. Thread 1 performs close() on a file descriptor. close fails.
> |> > > 2. Thread 2 performs open().
> |> > >* 3. Thread 1 performs close() again, just to make sure.
> |> > >
> |> > >
> |> > >open() may return any file descriptor not currently in use.
> |> >
> |> > I'm confused here... the only way close() can fail is if the file
> |> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
> |> > a known-to-be-bad descriptor?
> |>
> |> Well, obviously, if that's the case. However, the man page for close(2)
> |> doesn't agree (see below). close() is allowed to return EBADF, EINTR,
> |> or EIO.
> |>
> |> The question is, does the OS standard guarantee that the fd is closed,
> |> even if close() returns EINTR or EIO? Just going by the normal usage of
> |> EINTR, one might think otherwise. It doesn't appear to be documented
> |> one way or another.
>
> POSIX says the state of the file descriptor when close fails (with errno
> != EBADF) is unspecified, which means:
>
> The value or behavior may vary among implementations that conform to
> IEEE Std 1003.1-2001. An application should not rely on the existence
> or validity of the value or behavior. An application that relies on
> any particular value or behavior cannot be assured to be portable
> across conforming implementations.

This doesn't mean an OS shouldn't specify the behavior. Just because
the cross-platform standard leaves it unspecified doesn't mean the OS
should.

Consider what this says, if a particular OS doesn't pick a standard
which the application can port to. It means that the *only way* to
correctly close a file descriptor is like this:

	int ret;
	do {
		ret = close(fd);
	} while (ret == -1 && errno != EBADF);

That means, if we get an error, we have to loop until the kernel throws
a BADF error! We can't detect that the file is closed from any other
error value, because only BADF has a defined behavior.

This would sort of work, though of course be hideous, for a single
threaded app. Now consider a multithreaded app. To correctly implement
this we have to lock around all calls to close and
open/socket/dup/pipe/creat/etc...

This is clearly ridiculous, and not at all as intended. Either standard
will work for an OS (though guaranteeing close the first time is much
simpler all around), but it needs to be specified and stuck to, or you
get horrible things like this to work around a bad spec:


void lock_syscalls();
void unlock_syscalls();

int threadsafe_open(const char *file, int flags, mode_t mode)
{
	int fd;
	lock_syscalls();
	fd = open(file, flags, mode);
	unlock_syscalls();
	return fd;
}

int threadsafe_close(int fd)
{
	int ret;
	lock_syscalls();
	do {
		ret = close(fd);
	} while (ret == -1 && errno != EBADF);
	unlock_syscalls();
	return ret;
}

int threadsafe_socket() ...
int threadsafe_pipe() ...
int threadsafe_dup() ...
int threadsafe_creat() ...
int threadsafe_socketpair() ...
int threadsafe_accept() ...

-J

2002-07-17 17:05:13

by kaih

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

[email protected] (Elladan) wrote on 16.07.02 in <[email protected]>:

> I believe this demonstrates the need for a standard, one way, or the
> other. :-)

So then let's see what the actual standard says ...

--- snip ---

The Open Group Base Specifications Issue 6
IEEE Std 1003.1-2001
Copyright © 2001 The IEEE and The Open Group, All Rights reserved.
_________________________________________________________________

NAME

close - close a file descriptor

SYNOPSIS

#include <unistd.h>
int close(int fildes);

DESCRIPTION

The close() function shall deallocate the file descriptor indicated
by fildes. To deallocate means to make the file descriptor
available for return by subsequent calls to open() or other
functions that allocate file descriptors. All outstanding record
locks owned by the process on the file associated with the file
descriptor shall be removed (that is, unlocked).

If close() is interrupted by a signal that is to be caught, it
shall return -1 with errno set to [EINTR] and the state of fildes
is unspecified. If an I/O error occurred while reading from or
writing to the file system during close(), it may return -1 with
errno set to [EIO]; if this error is returned, the state of fildes
is unspecified.

When all file descriptors associated with a pipe or FIFO special
file are closed, any data remaining in the pipe or FIFO shall be
discarded.

When all file descriptors associated with an open file description
have been closed, the open file description shall be freed.

If the link count of the file is 0, when all file descriptors
associated with the file are closed, the space occupied by the file
shall be freed and the file shall no longer be accessible.

[XSR] [Option Start] If a STREAMS-based fildes is closed and the
calling process was previously registered to receive a SIGPOLL
signal for events associated with that STREAM, the calling process
shall be unregistered for events associated with the STREAM. The
last close() for a STREAM shall cause the STREAM associated with
fildes to be dismantled. If O_NONBLOCK is not set and there have
been no signals posted for the STREAM, and if there is data on the
module's write queue, close() shall wait for an unspecified time
(for each module and driver) for any output to drain before
dismantling the STREAM. The time delay can be changed via an
I_SETCLTIME ioctl() request. If the O_NONBLOCK flag is set, or if
there are any pending signals, close() shall not wait for output to
drain, and shall dismantle the STREAM immediately.

If the implementation supports STREAMS-based pipes, and fildes is
associated with one end of a pipe, the last close() shall cause a
hangup to occur on the other end of the pipe. In addition, if the
other end of the pipe has been named by fattach(), then the last
close() shall force the named end to be detached by fdetach(). If
the named end has no open file descriptors associated with it and
gets detached, the STREAM associated with that end shall also be
dismantled. [Option End]

[XSI] [Option Start] If fildes refers to the master side of a
pseudo-terminal, and this is the last close, a SIGHUP signal shall
be sent to the process group, if any, for which the slave side of
the pseudo-terminal is the controlling terminal. It is unspecified
whether closing the master side of the pseudo-terminal flushes all
queued input and output. [Option End]

[XSR] [Option Start] If fildes refers to the slave side of a
STREAMS-based pseudo-terminal, a zero-length message may be sent to
the master. [Option End]

[AIO] [Option Start] When there is an outstanding cancelable
asynchronous I/O operation against fildes when close() is called,
that I/O operation may be canceled. An I/O operation that is not
canceled completes as if the close() operation had not yet
occurred. All operations that are not canceled shall complete as if
the close() blocked until the operations completed. The close()
operation itself need not block awaiting such I/O completion.
Whether any I/O operation is canceled, and which I/O operation may
be canceled upon close(), is implementation-defined. [Option End]

[MF|SHM] [Option Start] If a shared memory object or a memory
mapped file remains referenced at the last close (that is, a
process has it mapped), then the entire contents of the memory
object shall persist until the memory object becomes unreferenced.
If this is the last close of a shared memory object or a memory
mapped file and the close results in the memory object becoming
unreferenced, and the memory object has been unlinked, then the
memory object shall be removed. [Option End]

If fildes refers to a socket, close() shall cause the socket to be
destroyed. If the socket is in connection-mode, and the SO_LINGER
option is set for the socket with non-zero linger time, and the
socket has untransmitted data, then close() shall block for up to
the current linger interval until all data is transmitted.

RETURN VALUE

Upon successful completion, 0 shall be returned; otherwise, -1
shall be returned and errno set to indicate the error.

ERRORS

The close() function shall fail if:
[EBADF]
The fildes argument is not a valid file descriptor.
[EINTR]
The close() function was interrupted by a signal.

The close() function may fail if:
[EIO]
An I/O error occurred while reading from or writing to the file
system.
_________________________________________________________________

The following sections are informative.

EXAMPLES

Reassigning a File Descriptor

The following example closes the file descriptor associated with
standard output for the current process, re-assigns standard output
to a new file descriptor, and closes the original file descriptor
to clean up. This example assumes that the file descriptor 0 (which
is the descriptor for standard input) is not closed.
#include <unistd.h>
...
int pfd;
...
close(1);
dup(pfd);
close(pfd);
...

Incidentally, this is exactly what could be achieved using:
dup2(pfd, 1);
close(pfd);

Closing a File Descriptor

In the following example, close() is used to close a file
descriptor after an unsuccessful attempt is made to associate that
file descriptor with a stream.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define LOCKFILE "/etc/ptmp"
...
int pfd;
FILE *fpfd;
...
if ((fpfd = fdopen(pfd, "w")) == NULL) {
    close(pfd);
    unlink(LOCKFILE);
    exit(1);
}
...

APPLICATION USAGE

An application that had used the stdio routine fopen() to open a
file should use the corresponding fclose() routine rather than
close(). Once a file is closed, the file descriptor no longer
exists, since the integer corresponding to it no longer refers to a
file.

RATIONALE

The use of interruptible device close routines should be
discouraged to avoid problems with the implicit closes of file
descriptors by exec and exit(). This volume of IEEE Std 1003.1-2001
only intends to permit such behavior by specifying the [EINTR]
error condition.

FUTURE DIRECTIONS

None.

SEE ALSO

STREAMS , fattach() , fclose() , fdetach() , fopen() , ioctl() ,
open() , the Base Definitions volume of IEEE Std 1003.1-2001,
<unistd.h>

CHANGE HISTORY

First released in Issue 1. Derived from Issue 1 of the SVID.

Issue 5

The DESCRIPTION is updated for alignment with the POSIX Realtime
Extension.

Issue 6

The DESCRIPTION related to a STREAMS-based file or pseudo-terminal
is marked as part of the XSI STREAMS Option Group.

The following new requirements on POSIX implementations derive from
alignment with the Single UNIX Specification:
* The [EIO] error condition is added as an optional error.
* The DESCRIPTION is updated to describe the state of the fildes
file descriptor as unspecified if an I/O error occurs and an [EIO]
error condition is returned.

Text referring to sockets is added to the DESCRIPTION.

The DESCRIPTION is updated for alignment with IEEE Std 1003.1j-2000
by specifying that shared memory objects and memory mapped files
(and not typed memory objects) are the types of memory objects to
which the paragraph on last closes applies.

End of informative text.
_________________________________________________________________

UNIX® is a registered Trademark of The Open Group.
POSIX® is a registered Trademark of The IEEE.
_________________________________________________________________
--- snip ---

The standard is very explicit here: When close() returns an error,
*YOU LOSE*.

MfG Kai

2002-07-17 17:14:25

by Andries Brouwer

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Tue, Jul 16, 2002 at 09:38:53PM -0700, Elladan wrote:

> The question is, does the OS standard guarantee that the fd is closed,
> even if close() returns EINTR or EIO? Just going by the normal usage of
> EINTR, one might think otherwise. It doesn't appear to be documented
> one way or another.
>
> Alan said you could just issue close again to make sure - the example
> shows that this is not the case. A second close is either required or
> forbidden in that example - and the behavior has to be well defined or
> you won't know which to do.

No, the behaviour is not well-defined at all.
The standard explicitly leaves undefined what happens when close returns
EINTR or EIO.

2002-07-17 17:48:39

by Richard Gooch

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Andries Brouwer writes:
> On Tue, Jul 16, 2002 at 09:38:53PM -0700, Elladan wrote:
>
> > The question is, does the OS standard guarantee that the fd is closed,
> > even if close() returns EINTR or EIO? Just going by the normal usage of
> > EINTR, one might think otherwise. It doesn't appear to be documented
> > one way or another.
> >
> > Alan said you could just issue close again to make sure - the example
> > shows that this is not the case. A second close is either required or
> > forbidden in that example - and the behavior has to be well defined or
> > you won't know which to do.
>
> No, the behaviour is not well-defined at all.
> The standard explicitly leaves undefined what happens when close
> returns EINTR or EIO.

However, the only sane thing to do is to explicitly define one way or
another. The standard is broken. Consider a threaded application,
where one thread tries to call close(), gets an error and re-tries,
because it's not sure if the fd was closed or not. If the fd *is*
closed, and the thread loops calling close(), checking for EBADF,
there is a race if another thread tries calling open()/creat()/dup().

The ambiguity in the standard thus results in the impossibility of
writing a race-free application. And no, forcing the application to
protect system calls with mutexes isn't a solution.
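
To make the race concrete, here is a minimal sketch of the hazard
(mine, not from the thread; /dev/null stands in for any file, and it
needs -lpthread to build). If the first close() already released the
descriptor, the retry loop in one thread can close a descriptor number
the kernel has just handed to another thread:

#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int fd;

/* Thread A: the "retry until EBADF" idiom under discussion. */
static void *closer(void *unused)
{
    int ret;
    do {
        ret = close(fd);    /* a 2nd close may hit a reused number */
    } while (ret == -1 && errno != EBADF);
    return NULL;
}

/* Thread B: allocates a descriptor concurrently. */
static void *opener(void *unused)
{
    /* If A's first close() released fd, this open() may legally
     * return the same small integer, which A then closes again. */
    int newfd = open("/dev/null", O_RDONLY);
    if (newfd != -1)
        printf("thread B got fd %d\n", newfd);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    fd = open("/dev/null", O_RDONLY);
    pthread_create(&a, NULL, closer, NULL);
    pthread_create(&b, NULL, opener, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}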

Linux should define explicitly what happens on error return from
close(). Let that be the new standard.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2002-07-17 17:44:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

In article <[email protected]>,
Elladan <[email protected]> wrote:
>
>Consider what this says, if a particular OS doesn't pick a standard
>which the application can port to. It means that the *only way* to
>correctly close a file descriptor is like this:
>
>int ret;
>do {
> ret = close(fd);
>} while(ret == -1 && errno != EBADF);

NO.

The above is
(a) not portable
(b) not current practice

The "not portable" part comes from the fact that (as somebody pointed
out), a threaded environment in which the kernel _does_ close the FD on
errors, the FD may have been validly re-used (by the kernel) for some
other thread, and closing the FD a second time is a BUG.

The "not practice" comes from the fact that applications do not do what
you suggest.

The fact is, what Linux does and has always done is the only reasonable
thing to do: the close _will_ tear down the FD, and the error value is
nothing but a warning to the application that there may still be IO
pending (or there may have been failed IO) on the file that the (now
closed) descriptor pointed to.

The application may want to take evasive action (ie try to write the
file again, make a backup, or just warn the user), but the file
descriptor is _gone_.
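
Following that reading, here is a hedged sketch of the careful pattern
(the helper name and error policy are illustrative, not something the
thread prescribes): surface I/O errors with fsync() while the
descriptor still exists, then call close() exactly once and treat any
error from it as a warning:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative helper: write a buffer and learn about I/O errors
 * before the descriptor is torn down. */
int save_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len   /* ENOSPC/EDQUOT here */
        || fsync(fd) == -1) {                 /* ...or EIO here */
        int saved = errno;
        close(fd);         /* fd is gone whatever close() returns */
        errno = saved;
        return -1;
    }

    if (close(fd) == -1)   /* called once; error is only a warning */
        fprintf(stderr, "close: %s (file may be incomplete)\n",
                strerror(errno));
    return 0;
}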

>That means, if we get an error, we have to loop until the kernel throws
>a BADF error! We can't detect that the file is closed from any other
>error value, because only BADF has a defined behavior.

But your loop is _provably_ incorrect for a threaded application. Your
explicit system call locking approach doesn't work either, because I'm
pretty certain that POSIX already states that open/close are thread
safe, so you can't just invalidate that _other_ standard.

Linus

2002-07-17 18:25:22

by Zack Weinberg

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 02:45:40AM +0100, Alan Cox wrote:
> On Wed, 2002-07-17 at 01:10, Zack Weinberg wrote:
> > My first point is that a portable application cannot rely on close to
> > detect any error. Only fsync guarantees to detect any errors at all
> > (except ENOSPC/EDQUOT, which should come back on write; yes, I know
> > about the buggy NFS implementations that report them only on close).
>
> They are not buggy, merely inconvenient. The reality of the NFS
> protocol makes it the only viable way to do it.

You are referring to the way NFSv2 lacks any way to request space
allocation on the server without also flushing data to disk? It was
my understanding that NFSv2 clients that did not accept the
performance hit and do all writes synchronously were considered
broken (since, for instance, POSIX write-visibility guarantees are
violated if writes are delayed on the client).

In v3 or v4, the WRITE/COMMIT separation lets the implementor generate
prompt ENOSPC and EDQUOT errors without performance penalty.

Another thing to keep in mind is that an application is often in a
much better position to recover from an error, particularly a
disk-full error, if it's reported on write rather than on close.
That's just a quality-of-implementation question, though.

> > > If it bothers you close it again 8)
> >
> > And watch it come back with an error again, repeat ad infinitum?
>
> The use of intelligence doesn't help. Come on I know you aren't a cobol
> programmer. Check for -EBADF ...

I wasn't talking about EBADF. How does the application know the
kernel will ever succeed in closing the file?

> Disagree. It says
>
> It is quite possible that errors on a previous write(2) operation
> are first reported at the final close
>
> Not checking the return value when closing the file may lead to silent
> loss of data.
>
> A successful close does not guarantee that the data has
> been successfully saved to disk, as the kernel defers
> writes. It is not common for a filesystem to flush the
> buffers when the stream is closed. If you need to be sure
> that the data is physically stored use fsync(2). (It will
> depend on the disk hardware at this point.)
>
> None of which guarantee what you say, and which agree about the use of
> fsync being appropriate now and then

That is not the text quoted upthread. Looks like the manpage did get
fixed, although I think the current wording is still suboptimal.

zw

2002-07-17 18:44:22

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

In article <[email protected]>,
Andreas Dilger <[email protected]> wrote:

| Well, the dump can only be inconsistent for files that are being changed
| during the dump itself. As for hanging the system, that would be a bug
| regardless of whether it was dump or "dd" reading from the block device.
| A bug related to this was fixed, probably in 2.4.19-preX somewhere.

Any dump on a live f/s would seem to have the problem that files are
changing as they are read and may not be consistent. I suppose there
could be some kind of "fsync and journal lock" on a file, allowing all
writes to a file to be journaled while the file is backed up. However,
such things don't scale well for big files with lots of writes, and the
file, while unchanging, may not be valid.

Backups of running files are best done by the application itself,
Oracle being a for-instance. Neither the o/s nor the backup tool can
be sure when/if the data is in a valid state.

Tar has the same problem, though without the added issue of data still
in flight in buffers.


--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-17 18:53:48

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

In article <[email protected]>,
Matthias Andree <[email protected]> wrote:

| dsmc fstat()s the file it is currently reading regularly and retries
| the dump as it changes, and gives up if it is updated too often. Not
| sure about the server side, and certainly not a useful option for
| sequential devices that you write to directly. Looks like a cache for
| the biggest file is necessary.

Which doesn't address the issue of data in files A, B and C, with
indices in X and Y. This only works if you flush and freeze all the
files at one time; making a perfect backup of one file at a time
results in corruption if the database is busy.

My favorite example is usenet news on INN, a bunch of circular spools, a
linear history with two index files, 30-40k overview files, and all of
it changing with perhaps 3.5MB/sec data and 20-50/sec index writes. Far
better done with an application backup!

The point is, backups are hard, for many systems dump is optimal because
it's fast. After that I like cpio (-Hcrc) but that's personal
preference. All have fail cases on volatile data.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-17 19:01:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 17, 2002 13:45 +0200, Matthias Andree wrote:
> On Tue, 16 Jul 2002, Shawn wrote:
> > In this case, can you use a RAID mirror or something, then break it?
> >
> > Also, there's the LVM snapshot at the block layer someone already
> > mentioned, which when used with smaller partitions is less overhead.
> > (less FS delta)
>
> All these "solutions" don't work out, I cannot remount R/O my partition,
> and LVM low-level snapshots or breaking a RAID mirror simply won't work
> out. I would have to remount r/o the partition to get a consistent image
> in the first place, so the first step must fail already...

Have you been reading my emails at all? LVM snapshots DO ensure that
the snapshot filesystem is consistent for journaled filesystems.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-17 19:44:12

by Lew Wolfgang

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks (whither dump?)

Hi Folks,

As an old dump user (dumpster?) I have to admit that we've
avoided ext3 and Reiserfs because of this issue. We couldn't
live without the "Tower of Hanoi".

I remember using, many years ago (SunOS 3.4), a patched
dump binary that allowed safe dumps from live UFS filesystems.
I don't remember all the details (it was 16 years ago) but
this dump would somehow compare files before and after writing
to tape. If there was a difference it would back out the
dumped file and preserve the consistency of the tape. I don't
remember if it would go back and try the file again.

I haven't the foggiest notion whether this would work in these
modern times; I'm just offering it as food for thought.

Regards,
Lew Wolfgang


2002-07-17 22:04:43

by Elladan

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, Jul 17, 2002 at 05:43:57PM +0000, Linus Torvalds wrote:
> In article <[email protected]>,
> Elladan <[email protected]> wrote:
> >
> >Consider what this says, if a particular OS doesn't pick a standard
> >which the application can port to. It means that the *only way* to
> >correctly close a file descriptor is like this:
> >
> >int ret;
> >do {
> > ret = close(fd);
> >} while(ret == -1 && errno != EBADF);
>
> NO.
>
> The above is
> (a) not portable
> (b) not current practice
>
> The "not portable" part comes from the fact that (as somebody pointed
> out), a threaded environment in which the kernel _does_ close the FD on
> errors, the FD may have been validly re-used (by the kernel) for some
> other thread, and closing the FD a second time is a BUG.

That somebody was me. It appears we're in extremely violent agreement
on this issue. We both agree the code I wrote is crap. :-)

-J

2002-07-18 09:26:08

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Wed, 17 Jul 2002, Andreas Dilger wrote:

> On Jul 17, 2002 13:45 +0200, Matthias Andree wrote:
> > On Tue, 16 Jul 2002, Shawn wrote:
> > > In this case, can you use a RAID mirror or something, then break it?
> > >
> > > Also, there's the LVM snapshot at the block layer someone already
> > > mentioned, which when used with smaller partitions is less overhead.
> > > (less FS delta)
> >
> > All these "solutions" don't work out, I cannot remount R/O my partition,
> > and LVM low-level snapshots or breaking a RAID mirror simply won't work
> > out. I would have to remount r/o the partition to get a consistent image
> > in the first place, so the first step must fail already...
>
> Have you been reading my emails at all? LVM snapshots DO ensure that
> the snapshot filesystem is consistent for journaled filesystems.

My apologies; I have been busy and only reading parts of threads, and
had not come across your LVM-snapshot-related mails when I wrote the
previous one.

--
Matthias Andree

2002-07-18 09:29:30

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Wed, 17 Jul 2002, bill davidsen wrote:

> In article <[email protected]>,
> Matthias Andree <[email protected]> wrote:
>
> | dsmc fstat()s the file it is currently reading regularly and retries
> | the dump as it changes, and gives up if it is updated too often. Not
> | sure about the server side, and certainly not a useful option for
> | sequential devices that you write to directly. Looks like a cache for
> | the biggest file is necessary.
>
> Which doesn't address the issue of data in files A, B and C, with
> indices in X and Y. This only works if you flush and freeze all the
> files at one time; making a perfect backup of one file at a time
> results in corruption if the database is busy.

Right, but this would have to be taken up with Tivoli: "do a snapshot
as dsmc starts, back up from the snapshot, and discard the snapshot on
exit".

> My favorite example is usenet news on INN, a bunch of circular spools, a
> linear history with two index files, 30-40k overview files, and all of
> it changing with perhaps 3.5MB/sec data and 20-50/sec index writes. Far
> better done with an application backup!

In that case, when you are restoring from backups, you can also
regenerate index files (at least with tradspool; I never looked at the
"News in Dosen" ("canned news") aggregated spools like CNFS or
whatever). It's really hard if you have .dir/.pag style dbm databases
that don't mirror some other single-file format.

2002-07-18 09:44:27

by Ketil Froyn

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Wed, 17 Jul 2002, Linus Torvalds wrote:

> >int ret;
> >do {
> > ret = close(fd);
> >} while(ret == -1 && errno != EBADF);
>
> NO.
>
> The above is
> (a) not portable
> (b) not current practice
>
> The "not portable" part comes from the fact that (as somebody pointed
> out), a threaded environment in which the kernel _does_ close the FD on
> errors, the FD may have been validly re-used (by the kernel) for some
> other thread, and closing the FD a second time is a BUG.
>
> The "not practice" comes from the fact that applications do not do what
> you suggest.
>
> The fact is, what Linux does and has always done is the only reasonable
> thing to do: the close _will_ tear down the FD, and the error value is
> nothing but a warning to the application that there may still be IO
> pending (or there may have been failed IO) on the file that the (now
> closed) descriptor pointed to.

Is this what happens when EINTR is received as well? If so, is there
any point to EINTR? I.e., close() was interrupted, but finished anyway.
Would any application care?

If there is any pending IO when this happens, is it possible to find out
when this is finished? If not, an MTA getting this would have to
temporarily defer the mail it received and hope it doesn't get an EINTR on
close() next time, I guess.

Ketil


2002-07-18 14:52:59

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Tue, 16 Jul 2002, Andreas Dilger wrote:

> This is all done already for both LVM and EVMS snapshots. The filesystem
> (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
> frozen, the snapshot is created, and the filesystem becomes active again.
> It takes a second or less. Then dump will guarantee 100% correct backups
> of the snapshot filesystem. You would have to do a backup on the snapshot
> to guarantee 100% correctness even with tar.

I think I'm missing a part of this; "a snapshot is created" sounds a
lot like "here a miracle occurs." Where is this snapshot saved? And how
do you take it in one second regardless of f/s size? Is this one of
those theoretical things which requires two mirrored copies of the f/s
so you will still have RAID-1 after you break one? Or are changes
journaled somewhere until the snapshot is transferred to external
media? And how do you force applications to stop with their files in a
valid state?

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-18 15:07:04

by Rik van Riel

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Thu, 18 Jul 2002, Bill Davidsen wrote:

> I think I'm missing a part of this, the "a snapshot is created" sounds a
> lot like "here a miracle occurs." Where is this snapshot saved? And how
> do you take it in one sec regardless of f/s size?

LVM. Systems like LVM already provide a logical->physical block
mapping on disk, so they might as well provide multiple mappings.

If the live filesystem writes to a particular disk block, the
snapshot will keep referencing the old blocks while the filesystem
gets to work on its own data. Copy on Write snapshots for block
devices...
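
As a toy model (mine; nothing like the real LVM code, which keeps its
exception mapping on disk), the whole trick fits in a few lines:
preserve the old contents of a block the first time the live volume
overwrites it, and let snapshot reads prefer the preserved copy:

#include <string.h>

#define BLOCKS     1024
#define BLOCK_SIZE 4096

/* Toy copy-on-write snapshot over an in-memory "disk". */
struct snapshot {
    char (*origin)[BLOCK_SIZE];  /* live volume blocks */
    char (*store)[BLOCK_SIZE];   /* backing store for old contents */
    int saved[BLOCKS];           /* has this block been copied out? */
};

/* Live write path: preserve the old contents once, then overwrite. */
void live_write(struct snapshot *s, int blk, const char *data)
{
    if (!s->saved[blk]) {
        memcpy(s->store[blk], s->origin[blk], BLOCK_SIZE);
        s->saved[blk] = 1;
    }
    memcpy(s->origin[blk], data, BLOCK_SIZE);
}

/* Snapshot read path: the frozen view of the volume. */
const char *snap_read(struct snapshot *s, int blk)
{
    return s->saved[blk] ? s->store[blk] : s->origin[blk];
}

A backup reading only through snap_read() never sees a block newer than
the instant the snapshot was taken; and since no data is copied up
front, creating the snapshot is near-instant.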

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-18 15:06:26

by Bill Davidsen

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Tue, 16 Jul 2002 [email protected] wrote:

> 3a. lock mirrored volume, flush any outstanding transactions, break
> mirror.
> --or--
> 3b. snapshot filesystem to another volume.

Good summary. The problem is that 3a either requires a double mirror or
leaving the f/s unmirrored, and 3b can take a very long time for a big
f/s.

In general much of this can be addressed by only backing up small f/s
and using an application backup utility to back up the big stuff.
Fortunately the most common problem apps are databases, and they
include this capability.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-18 15:24:37

by Rik van Riel

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Thu, 18 Jul 2002, Bill Davidsen wrote:
> On Tue, 16 Jul 2002 [email protected] wrote:
>
> > 3a. lock mirrored volume, flush any outstanding transactions, break
> > mirror.
> > --or--
> > 3b. snapshot filesystem to another volume.
>
> Good summary. The problem is that 3a either requires a double mirror
> or leaving the f/s unmirrored, and 3b can take a very long time for a
> big f/s.

3b should be fairly quick since you only need to do an in-memory
copy of some LVM metadata.

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-18 15:48:01

by John Stoffel

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)


Bill> On Tue, 16 Jul 2002 [email protected] wrote:
>> 3a. lock mirrored volume, flush any outstanding transactions, break
>> mirror.
>> --or--
>> 3b. snapshot filesystem to another volume.

Bill> Good summary. The problem is that 3a either requires a double
Bill> mirror or leaving the f/s unmirrored, and 3b can take a very
Bill> long time for a big f/s.

Yup, 3a isn't a totally perfect solution, though triple mirrors (if
you can afford them) work well. We actually do this for some servers
where we can't afford the application down time of locking the DB for
extended times, but we also don't have triple mirrors either. It's a
tradeoff.

I really prefer 3b, since it's more efficient, faster, and more
robust. To snapshot a filesystem, all you need to do is:

- create backing store for the snapshot, usually around 10-15% of the
size of the original volume. Depends on volatility of data.
- lock the app(s).
- lock the filesystem and flush pending transactions.
- copy the metadata describing the filesystem
- insert a COW handler into the FS block write path
- mount the snapshot elsewhere
- unlock the FS
- unlock the app

Whenever the app writes a block into the FS, copy the original block
to the backing store, then write the new block to storage.

All the backup sees is the quiescent data store, so it can do a clean
backup.

When you're done, just unmount the snapshot and delete it, then remove
the backing store. There is an overhead for doing this, but it's
better than having to unmirror/remirror whole block devices to do a
backup. And cheaper in terms of disk space too.

Bill> In general much of this can be addressed by only backing up
Bill> small f/s and using an application backup utility to back up the
Bill> big stuff. Fortunately the most common problem apps are
Bill> databases, and they include this capability.

Define what a small file system is these days, since it could be 100GB
for some people. *grin* It's a matter of making the tools scale
well so that the data can be secured properly.

To do a proper backup requires that all layers talk to each other, and
have some means of doing a RW lock and flush of pending transactions.
If you have that, you can do it. If you don't, you need to either go
to single-user mode, re-mount RO, or pray.

John

2002-07-18 16:31:51

by Bill Davidsen

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

On Thu, 18 Jul 2002 [email protected] wrote:

> I really prefer 3b, since it's more efficient, faster, and more
> robust. To snapshot a filesystem, all you need to do is:
>
> - create backing store for the snapshot, usually around 10-15% of the
> size of the original volume. Depends on volatility of data.
> - lock the app(s).
> - lock the filesystem and flush pending transactions.
> - copy the metadata describing the filesystem
> - insert a COW handler into the FS block write path
> - mount the snapshot elsewhere
> - unlock the FS
> - unlock the app
>
> Whenever the app writes a block into the FS, copy the original block
> to the backing store, then write the new block to storage.

Okay, other than the overhead, and having enough filespace for Tbackup
seconds (or minutes, hours, days) of operation, this is practical. In
general you would most often be doing an incremental, and the time
would not be much.

> Bill> In general much of this can be addressed by only backing up
> Bill> small f/s and using an application backup utility to back up
> Bill> the big stuff. Fortunately the most common problem apps are
> Bill> databases, and they include this capability.
>
> Define what a small file system is these days, since it could be 100gb
> for some people. *grin*. It's a matter of making the tools scale
> well so that the data can be secured properly.

Obviously a small f/s is one you can back up without operator
intervention to change media, and in a reasonable time, which might be
10 minutes to a few hours depending on your taste. That's kind of my
rule of thumb; you're welcome to suggest others, but if someone has to
change media I can't call it small any more.

> To do a proper backup requires that all layers talk to each other, and
> have some means of doing a RW lock and flush of pending transactions.
> If you have that, you can do it. If you don't, you need to either
> goto single user mode, re-mount RO, or pray.

With some people, pray or ignore the problem are popular.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-07-19 08:26:12

by Matthias Andree

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Wed, 17 Jul 2002, Andreas Dilger wrote:

> On Jul 17, 2002 13:45 +0200, Matthias Andree wrote:
> > On Tue, 16 Jul 2002, Shawn wrote:
> > > In this case, can you use a RAID mirror or something, then break it?
> > >
> > > Also, there's the LVM snapshot at the block layer someone already
> > > mentioned, which when used with smaller partitions is less overhead.
> > > (less FS delta)
> >
> > All these "solutions" don't work out, I cannot remount R/O my partition,
> > and LVM low-level snapshots or breaking a RAID mirror simply won't work
> > out. I would have to remount r/o the partition to get a consistent image
> > in the first place, so the first step must fail already...
>
> Have you been reading my emails at all? LVM snapshots DO ensure that
> the snapshot filesystem is consistent for journaled filesystems.

What kernel version is necessary to achieve this on production kernels
(i. e. 2.4)?

Does "consistent" mean "fsck proof"?

Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs):

(from memory, history not available, different machine):
lvcreate --snapshot snap /dev/vg0/home
e2fsck -f /dev/vg0/snap
dump -0 ...

It reported zero dtime for one file and two bitmap differences.

Does "consistent" mean "consistent after you replay the log?" If so,
that's still a losing game, because I cannot fsck the snapshot (it's R/O
in the LVM case at least) to replay the journal -- and I don't assume
dump 0.4b29 (which I'm using) goes fishing in the journal, but did not
use the dump source code.

dump did not complain, however, and given what e2fsck had to complain
about, I'd happily force-mount such a file system when just a deletion
has not completed.

--
Matthias Andree

2002-07-19 15:25:42

by Sam Vilain

[permalink] [raw]
Subject: Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

[email protected] wrote:

> 1. lock application(s), flush any outstanding transactions.
> 2. lock filesystems, flush any outstanding transactions.
> 3a. lock mirrored volume, flush any outstanding transactions, break
> mirror.
> 3b. snapshot filesystem to another volume.

Or, to avoid the penalty of locking everything and bringing it down
and stuff:

1. set a flag.

2. start backing up blocks (read them raw of course, don't want to load
those stressed higher level systems)

3. If something wants to write to a block, quickly back up the old
contents of the block before you write the new contents. Unless of
course you've already backed up that block.

Of course, step 3 does place a bit more unschedulable load on the
disk. Heck, when the backups have just started, you're doubling the
latency of the devices. You can avoid this with a transaction
journal; in fact, the cockier RDBMSes out there (eg, DMSII) don't even
bother to do this and assume that your transaction journal is on a
mirrored device - and hence there's no point in backing up the old
data, you just want to do one sweep of the disk - and replay the
journal to get current.

(note: implicit assumption: you're dealing with applications using
synchronous I/O, where it needs to be written to all mirrors before
it's trusted to be stored)

Ah, moot points - the Linux MD/LVM drivers are far too unsophisticated
to have journal devices ;-)
--
Sam Vilain, [email protected] WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

Law of Computability Applied to Social Sciences:
If at first you don't succeed, transform your data set.

2002-07-19 16:37:44

by Andreas Dilger

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 19, 2002 10:29 +0200, Matthias Andree wrote:
> What kernel version is necessary to achieve this on production kernels
> (i. e. 2.4)?
>
> Does "consistent" mean "fsck proof"?
>
> Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs):
>
> (from memory, history not available, different machine):
> lvcreate --snapshot snap /dev/vg0/home
> e2fsck -f /dev/vg0/snap
> dump -0 ...
>
> It reported zero dtime for one file and two bitmap differences.

That is because one critical piece is missing from 2.4, the VFS lock
patch. It is part of the LVM sources at sistina.com. Chris Mason has
been trying to get it in, but it is delayed until 2.4.19 is out.

> dump did not complain however, and given what e2fsck had to complain,
> I'd happily force mount such a file system when just a deletion has not
> completed.

You cannot mount a dirty ext3 filesystem from read-only media.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-19 20:00:18

by Shawn

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On 07/19, Andreas Dilger said something like:
> On Jul 19, 2002 10:29 +0200, Matthias Andree wrote:
> > What kernel version is necessary to achieve this on production kernels
> > (i. e. 2.4)?
> >
> > Does "consistent" mean "fsck proof"?
> >
> > Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs):
> >
> > (from memory, history not available, different machine):
> > lvcreate --snapshot snap /dev/vg0/home
> > e2fsck -f /dev/vg0/snap
> > dump -0 ...
> >
> > It reported zero dtime for one file and two bitmap differences.
>
> That is because one critical piece is missing from 2.4, the VFS lock
> patch. It is part of the LVM sources at sistina.com. Chris Mason has
> been trying to get it in, but it is delayed until 2.4.19 is out.
>
> > dump did not complain however, and given what e2fsck had to complain,
> > I'd happily force mount such a file system when just a deletion has not
> > completed.
>
> You cannot mount a dirty ext3 filesystem from read-only media.

I thought you could "mount -t ext2" ext3 volumes, and thought you could
force mount ext2.

I'm no Andreas Dilger, so don't take this like I'm disagreeing...

--
Shawn Leas
[email protected]

I went to the bank and asked to borrow a cup of money. They
said, "What for?" I said, "I'm going to buy some sugar."
-- Stephen Wright

2002-07-19 20:46:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

On Jul 19, 2002 15:01 -0500, Shawn wrote:
> On 07/19, Andreas Dilger said something like:
> > You cannot mount a dirty ext3 filesystem from read-only media.
>
> I thought you could "mount -t ext2" ext3 volumes, and thought you could
> force mount ext2.

This is true if the ext3 filesystem is unmounted cleanly. Otherwise
there is a flag in the superblock which tells the kernel it can't
mount the filesystem because there is something there it doesn't
understand (namely the dirty journal with all of the recent changes).

This flag (EXT3_FEATURE_INCOMPAT_RECOVERY) is cleared when the
filesystem is unmounted properly, when e2fsck or a r/w mount
recovers the journal, and not coincidentally when an LVM snapshot
is created.

In case you are more curious, there are a couple of paragraphs in
linux/Documentation/filesystems/ext2.txt about the compat flags,
which are really one of the great features of ext2. You may think
that an overstatement, but without the feature flags, none of the
other enhancements that have been added to ext2 over the last few
years (and in the next few years too) would have been so easily done.
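
For the curious, those feature words live at fixed offsets in the
on-disk superblock, so they are easy to inspect by hand. A hedged
sketch (offsets per the classic ext2 superblock layout as I read it;
the fields are little-endian on disk, and 0x0004 in the incompat word
is the recovery bit discussed above):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read the ext2/ext3 superblock (1024 bytes into the device) and
 * report the incompat feature word; bit 0x0004 means the journal
 * still needs recovery. */
int main(int argc, char **argv)
{
    unsigned char sb[1024];
    uint32_t magic, incompat;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <device-or-image>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1 || pread(fd, sb, sizeof(sb), 1024) != sizeof(sb))
        return 1;
    close(fd);

    magic = sb[56] | (sb[57] << 8);          /* s_magic */
    incompat = sb[96] | (sb[97] << 8)        /* s_feature_incompat */
             | ((uint32_t)sb[98] << 16) | ((uint32_t)sb[99] << 24);

    if (magic != 0xEF53) {
        fprintf(stderr, "no ext2/ext3 superblock found\n");
        return 1;
    }
    printf("incompat flags: 0x%08x%s\n", incompat,
           (incompat & 0x0004) ? "  (dirty journal: recovery needed)" : "");
    return 0;
}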

As for mounting a dirty ext2 filesystem, yes that is possible with
only a warning at mount time. That is why nobody has put much effort
into adding the snapshot hooks into ext2 yet.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-07-22 16:39:09

by Rogier Wolff

[permalink] [raw]
Subject: Re: close return value (was Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks)

Alan Cox wrote:
> > And watch it come back with an error again, repeat ad infinitum?
>
> The use of intelligence doesn't help. Come on, I know you aren't a
> COBOL programmer. Check for -EBADF ...

Huh? My mgetty/sendfax setup did something interesting lately.

I had not finished installing it, and I got a fax. It received it into
/tmp, tried moving it to /var/spool/fax/incoming, failed, and left the
tempfile in /tmp. It then mailed me about the received fax in /tmp.

This is EXACTLY the intelligent behaviour that an application writer
can choose when checking error codes. Especially "don't unlink
your tempfiles" is easy if you get errors on conversion or copying....

Roger.


--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.