2009-03-16 13:36:20

by Jan Kara

Subject: Same magic in statfs() call for ext?

Hi,

I've just noticed that EXT2_SUPER_MAGIC == EXT3_SUPER_MAGIC ==
EXT4_SUPER_MAGIC. That is just fine for the disk format, but as a result we
also return the same magic in the statfs() syscall, and thus a simple
application has a hard time recognizing whether it works on ext2, ext3 or
ext4 (it would have to parse /proc/mounts, and that is non-trivial if not
impossible when it comes to bind mounts). So shouldn't we return different
magic numbers depending on how the filesystem is currently mounted?
Now you may ask why the application should care - and I agree that in an
ideal world it should not. But for example there's a thread on the GTK mailing
list [1] where they discuss the problem that with delayed allocation and
ext4, a user can easily lose data after a crash (Ted wrote about it here in
some other mail some time ago). So they would like to call fsync() after
the file is written, but on ext3 that is quite heavy, and because of autosave,
saving happens quite often. So they'd do fsync() only if the filesystem
is mounted as ext4...
So I'm writing here to hear some opinions on returning different magic
numbers from statfs().
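
For comparison, the /proc/mounts route mentioned above could look roughly like the sketch below (again not from the original mail). It uses the glibc getmntent() family and picks the longest mount point that is a prefix of the path. The function names mount_covers() and fstype_from_mounts() are invented for this example, the path is assumed to be absolute and already canonical, and nothing here tries to cope with symlinks or bind mounts, which is exactly where this approach gets unreliable.

```c
#include <mntent.h>   /* setmntent(), getmntent(), endmntent() */
#include <stdio.h>
#include <string.h>

/* Returns 1 if mount point `dir` is a path prefix of `path`, else 0. */
int mount_covers(const char *dir, const char *path)
{
    size_t n = strlen(dir);

    if (n == 0 || strncmp(dir, path, n) != 0)
        return 0;
    if (dir[n - 1] == '/')          /* e.g. the root mount "/" */
        return 1;
    /* Require a component boundary so "/home" does not match "/homework". */
    return path[n] == '/' || path[n] == '\0';
}

/* Copies the filesystem type of the mount covering `path` into `type`.
 * `mounts_file` is normally "/proc/mounts". Returns 0 on success, -1 if
 * the file cannot be opened or no entry covers the path. */
int fstype_from_mounts(const char *mounts_file, const char *path,
                       char *type, size_t type_len)
{
    FILE *fp = setmntent(mounts_file, "r");
    struct mntent *m;
    size_t best = 0;
    int found = 0;

    if (fp == NULL)
        return -1;
    while ((m = getmntent(fp)) != NULL) {
        size_t n = strlen(m->mnt_dir);

        /* ">=" lets a later mount on the same directory win. */
        if (mount_covers(m->mnt_dir, path) && n >= best) {
            best = n;
            snprintf(type, type_len, "%s", m->mnt_type);
            found = 1;
        }
    }
    endmntent(fp);
    return found ? 0 : -1;
}
```

A caller would pass "/proc/mounts" and compare the resulting string against "ext3" or "ext4".
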

Honza

[1] http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-03-16 16:13:31

by Eric Sandeen

Subject: Re: Same magic in statfs() call for ext?

Jan Kara wrote:
> Hi,
>
> I've just noticed that EXT2_SUPER_MAGIC == EXT3_SUPER_MAGIC ==
> EXT4_SUPER_MAGIC.

Just noticed? *grin*

> That is just fine for the disk format but as a result we
> also return the same magic in statfs() syscall and thus a simple
> application has hard time recognizing whether it works on ext2, ext3 or
> ext4 (it would have to parse /proc/mounts and that is non-trivial if not
> impossible when it comes to bind mounts).

I have a guess as to why they want to know, and ...

> So should not we return different
> magic numbers depending on how the filesystem is currently mounted?
> Now you may ask why should the application care - and I agree that in the
> ideal world it should not. But for example there's a thread on GTK mailing
> list [1] where they discuss the problem that with delayed allocation and
> ext4, user can easily lose his data after crash

... sadly I was right. :)

> (Ted wrote about it here in
> some other mail some time ago). So they would like to call fsync() after
> the file is written but on ext3 that is quite heavy and because of autosave
> saving happens quite often. So they'd do fsync() only if the filesystem
> is mounted as ext4...
> So I'm writing here to hear some opinions on returning different magic
> numbers from statfs().
>
> Honza
>
> [1] http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html

As an aside, Ted also pointed out that ext4-without-delalloc also hurts
on fsync just like ext3 does, so testing "ext3 vs. ext4" isn't quite
enough in general.

I have been a bit dismayed that app writers just want the old ext3
behavior (which still has a window for loss, doesn't it?) so that they
can get away without fsyncing. And talking to KDE folks and others, I
think that if ext3 didn't hurt so much w/ fsync, they would just happily
do the right POSIX-defined thing and add fsync() when needed.

But instead, since they are now justifiably afraid of fsync, we are in
this quandary. (maybe this is over-simplifying a bit).

But off the top of my head, I think that I would prefer to see
applications generally do the right, POSIX-conformant thing w.r.t. data
integrity (i.e. fsync()) unless, via statfs, they find out "fsync hurts,
and we're likely to be reasonably safe without it".

IOW, adding exceptions for ext3 sounds better to me than munging ext4,
xfs, btrfs, and all future filesystems to conform to some behavior which
isn't in any API or spec ...
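
As a rough illustration of what "add fsync() when needed" means for an application saving a file, here is a minimal sketch (not from the original mail). The write-to-a-temporary-file-then-rename() step is an assumption about how such an application replaces a file; it is a common pattern but is not spelled out in this thread. The function name save_with_fsync() and its parameters are made up for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>    /* rename() */
#include <unistd.h>   /* write(), fsync(), close(), unlink() */

/* Writes `len` bytes from `buf` to `tmp_path`, forces them to disk with
 * fsync(), then renames the temporary file over `path`.
 * Returns 0 on success, -1 on any error (the temporary file is removed). */
int save_with_fsync(const char *path, const char *tmp_path,
                    const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* This is the call that is expensive on ext3 in ordered mode. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

An application following the exception-based approach would skip only the fsync() call on filesystems it knows to be slow, and keep everything else the same.
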

-Eric


2009-03-16 16:27:41

by Jan Kara

Subject: Re: Same magic in statfs() call for ext?

On Mon 16-03-09 11:13:13, Eric Sandeen wrote:
> Jan Kara wrote:
> > Hi,
> >
> > I've just noticed that EXT2_SUPER_MAGIC == EXT3_SUPER_MAGIC ==
> > EXT4_SUPER_MAGIC.
> Just noticed? *grin*
;-)

> > That is just fine for the disk format but as a result we
> > also return the same magic in statfs() syscall and thus a simple
> > application has hard time recognizing whether it works on ext2, ext3 or
> > ext4 (it would have to parse /proc/mounts and that is non-trivial if not
> > impossible when it comes to bind mounts).
>
> I have a guess as to why they want to know, and ...
>
> > So should not we return different
> > magic numbers depending on how the filesystem is currently mounted?
> > Now you may ask why should the application care - and I agree that in the
> > ideal world it should not. But for example there's a thread on GTK mailing
> > list [1] where they discuss the problem that with delayed allocation and
> > ext4, user can easily lose his data after crash
>
> ... sadly I was right. :)
>
> > (Ted wrote about it here in
> > some other mail some time ago). So they would like to call fsync() after
> > the file is written but on ext3 that is quite heavy and because of autosave
> > saving happens quite often. So they'd do fsync() only if the filesystem
> > is mounted as ext4...
> > So I'm writing here to hear some opinions on returning different magic
> > numbers from statfs().
> >
> > Honza
> >
> > [1] http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html
>
> As an aside, Ted also pointed out that ext4-without-delalloc also hurts
> on fsync just like ext3 does, so testing "ext3 vs. ext4" isn't quite
> enough in general.
Yes, I know, but it's at least some approximation.

> I have been a bit dismayed that app writers just want the old ext3
> behavior (which still has a window for loss, doesn't it?) so that they
> can get away without fsyncing. And talking to KDE folks and others, I
> think that if ext3 didn't hurt so much w/ fsync, they would just happily
> do the right posix-defined thing and add fsync() when needed.
>
> But instead, since they are now justifiably afraid of fsync, we are in
> this quandary. (maybe this is over-simplifying a bit).
>
> But off the top of my head, I think that I would prefer to see
> applications generally do the right, posix-conformant thing w.r.t. data
> integrity (i.e. fsync()) unless, via statfs, they find out "fsync hurts,
> and we're likely to be reasonably safe without it"
>
> IOW, adding exceptions for ext3 sounds better to me than munging ext4,
> xfs, btrfs, and all future filesystems to conform to some behavior which
> isn't in any API or spec ...
Yes, I agree that if they want data on disk, they should use fsync(). But
as you say, for ext3 this is not really usable, so they have to somehow
recognize that "they are on a filesystem where fsync() sucks" and avoid it
as much as possible. And I am slightly in favor of giving them enough rope
(i.e., different magic numbers in statfs) to hang themselves ;-).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-30 18:24:06

by Andreas Dilger

Subject: Re: Same magic in statfs() call for ext?

On Mar 16, 2009 17:27 +0100, Jan Kara wrote:
> On Mon 16-03-09 11:13:13, Eric Sandeen wrote:
> > But off the top of my head, I think that I would prefer to see
> > applications generally do the right, posix-conformant thing w.r.t. data
> > integrity (i.e. fsync()) unless, via statfs, they find out "fsync hurts,
> > and we're likely to be reasonably safe without it"
> >
> > IOW, adding exceptions for ext3 sounds better to me than munging ext4,
> > xfs, btrfs, and all future filesystems to conform to some behavior which
> > isn't in any API or spec ...
>
> Yes, I agree that if they want data on disk, they should use fsync(). But
> as you say for ext3 this is not really usable so they have to somehow
> recognize that "they are on a filesystem where fsync() sucks" and avoid it
> as much as possible. And I feel slightly in favor of giving them enough rope
> (i.e., different magic numbers in statfs) to hang themselves ;-).

One possibility that I've thought of in the past is to have a "dynamic
data=journal" mode when fsync is being called and files are small.
What this means is that small file data will be written to the journal
on fsync, instead of journaling only the metadata and flushing the data
to the filesystem in ordered mode.

While it means data is written twice to disk (once to journal, once to
fs), if there is a lot of fsync going on and the files are small then
it may still be faster than doing the seeks.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.