2009-03-27 12:53:35

by Artem Bityutskiy

[permalink] [raw]
Subject: EXT4-ish "fixes" in UBIFS

UBIFS has exactly the same properties as ext4 in case
of power cuts:

1. truncate/write/close leads to empty files
2. create/write/rename leads to empty files
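
For illustration, the two update patterns above look roughly like this
in C. This is only a sketch: the function names and the ".tmp" suffix
are made up for the example, and neither function calls fsync(), which
is what leaves a window for empty files after a power cut.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Pattern 1: truncate/write/close. The old contents are dropped by
 * O_TRUNC, while the new data may still sit only in the page cache. */
int update_by_truncate(const char *path, const char *data)
{
    assert(path && data);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_all(fd, data, strlen(data));
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Pattern 2: create/write/rename, with no fsync() before the rename. */
int update_by_rename_nosync(const char *path, const char *data)
{
    char tmp[4096];

    assert(path && data);
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_all(fd, data, strlen(data));
    if (close(fd) < 0)
        rc = -1;
    if (rc < 0) {
        unlink(tmp);
        return -1;
    }
    return rename(tmp, path);
}
```

Both functions succeed and show the new contents as long as the system
keeps running; the difference only becomes visible after a power cut.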

UBIFS is used in hand-held devices, and power cuts are very
frequent there, because users often just remove the battery.

I understand the "reality is different" argument, and have already
concluded that we need changes similar to those Theo has made
for ext4:
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705

Our problem is that user-space people do not want to
use 'fsync()', even when they are pointed at their own code
which does create/write/rename/close without fsync().

They just say: this is a file-system bug, it is fixed in
ext4 now, just fix the bug in UBIFS.

I tell them that it is not a fix but a band-aid, because
ext4 only issues an asynchronous write, and a power cut can
still lead to corruption.

I tell them we can do the same in UBIFS, but please add
fsync() to your application anyway. They say: no, we
will not, you fix your UBIFS.

And because there is such a flood of discussion about this, it
is very difficult to have a reasonable argument. I want to tell
people: please still use fsync(), and if this is about the
performance/reliability trade-off, make it optional.
But instead they say: respected people are on our side,
go away. And they point me to this:
http://www.advogato.org/person/mjg59/diary/195.html
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811700
http://article.gmane.org/gmane.comp.lang.perl.perl5.porters/67352

And they say that BTRFS and XFS are going to fix userspace
as well, and point me at this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/175

This has all become so messy and controversial. What should I do
to persuade userspace to use 'fsync()' even if we hack UBIFS
similarly to ext4? Suggestions?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


2009-03-28 01:22:58

by Kyungmin Park

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

Hi,

I also get these requests: the file is empty after a rename operation
in case of sudden power off.
They say it is different from JFFS2; with JFFS2, the name still points
to the old file even after a power off.
"Then why is UBIFS different? Fix it so it works as before." I said it
is not a filesystem bug; it is expected behavior.

In my case, I persuade the application people to change their
application to use fsync, and, if fsync does not solve the problem,
to add a mirror scheme: duplicate the file to avoid the empty-file problem.
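
A minimal sketch of the reader side of such a mirror scheme, assuming
the application keeps two copies and treats a missing or empty file as
damaged. pick_copy() and usable() are hypothetical names, and a real
scheme would probably also want a checksum, since a non-empty file can
still be incomplete.

```c
#include <assert.h>
#include <stdio.h>      /* NULL */
#include <sys/stat.h>

/* Return 1 if `path` exists and is a non-empty regular file. */
static int usable(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0;
}

/* Prefer the primary copy; fall back to the mirror if the primary is
 * missing or empty. Returns NULL if neither copy looks usable. */
const char *pick_copy(const char *primary, const char *mirror)
{
    assert(primary && mirror);
    if (usable(primary))
        return primary;
    if (usable(mirror))
        return mirror;
    return NULL;
}
```

The writer side would update the two copies one after the other, so
that a power cut can damage at most the copy being written.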

Frankly, I'm not sure which approach is better, or how much the
filesystem should support this. But remember that application
programmers also don't want to change their application when the
filesystem changes:
"The application has not changed, only the filesystem has changed. So
it is a filesystem problem, not ours."

Thank you,
Kyungmin Park

On Fri, Mar 27, 2009 at 9:48 PM, Artem Bityutskiy
<[email protected]> wrote:
> UBIFS has exactly the same properties like ext4 - in case
> of power cuts:
>
> 1. truncate/write/close leads to empty files
> 2. create/write/rename leads to empty files
>
> UBIFS is used in hand-held and and power-cuts are very
> often there, because users just remove battery often.
>
> I realize the "reality is different" argument, and already
> concluded that we need a similar changes as Theo has done
> for ext4:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>
> We have a problem that user-space people do not want to
> use 'fsync()', even when they are pointed to their code
> which is doing create/write/rename/close without fsync().
>
> They just say - this is file-system bug, it is fixed in
> ext4 now, just fix the bug in UBIFS.
>
> I tell them, that is not a fix, that is band-aid, because
> ext4 issues asynchronous write, and a power cut can lead
> to corruptions anyway.
>
> I tell them, we can make this in UBIFS, but please, anyway
> add fsync() to your application. They say - now, we will
> will not - you fix your UBIFS.
>
> And because there is so much flood and about this, it is
> so difficult to have reasonable arguments. I want to say
> people - please, still use fsync(), if this is about the
> performance/reliability trade-off - make it optional.
> But they instead say - respected people are on our side,
> go away. And point me this:
> http://www.advogato.org/person/mjg59/diary/195.html
> http://thread.gmane.org/gmane.linux.kernel/811167/focus=811700
> http://article.gmane.org/gmane.comp.lang.perl.perl5.porters/67352
>
> And they say that BTRFS and XFS are going to fix userspace
> as well, and point me at this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/175
>
> This all became so messy and controversial. What should I do
> to persuade userspace to use 'fsync()' even if we hack UBIFS
> similarly to ext4? Suggestions?
>
> --
> Best Regards,
> Artem Bityutskiy (Артём Битюцкий)

2009-03-29 12:26:20

by Pavel Machek

[permalink] [raw]
Subject: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
> UBIFS has exactly the same properties like ext4 - in case
> of power cuts:
>
> 1. truncate/write/close leads to empty files
> 2. create/write/rename leads to empty files
>
> UBIFS is used in hand-held and and power-cuts are very
> often there, because users just remove battery often.
>
> I realize the "reality is different" argument, and already
> concluded that we need a similar changes as Theo has done
> for ext4:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>
> We have a problem that user-space people do not want to
> use 'fsync()', even when they are pointed to their code
> which is doing create/write/rename/close without fsync().

Well... they really don't want to spin the disk up for the
fsync(). I'm not sure fsync() is really a sensible operation to use
there.

> 1. truncate/write/close leads to empty files

this is buggy.

> 2. create/write/rename leads to empty files

...but this should not be. If we want to make that explicit, we should
provide a "replace()" operation, where replace is a rename that makes sure
that the source file is completely on media before committing the rename.

It is somewhat similar to fsync()/rename(), but does not force disk
spin up immediately -- it only inserts a "barrier" between data blocks
and rename. (And yes, it should be implemented as fsync()+rename() for
filesystems like xfs. It can be implemented as plain rename for ext3
and ext4 after the fixes...)
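
A minimal sketch of the fsync()+rename() fallback mentioned above, as it
might look in userspace. replace_fallback() is a hypothetical name, not
an existing libc or kernel interface, and unlike the proposed barrier
semantics this version forces the I/O immediately.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical fallback for the proposed replace(src, dst):
 * flush the source file to the media, then rename it over dst.
 * Returns 0 on success, -1 on error (src is left in place). */
int replace_fallback(const char *src, const char *dst)
{
    assert(src && dst);

    /* Opened for writing because some UNIX systems (not Linux)
     * only accept fsync() on a writable descriptor. */
    int fd = open(src, O_WRONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    if (rc < 0)
        return -1;

    return rename(src, dst);
}
```

After a power cut, dst should then name either the complete old file or
the complete new one, at the cost of waiting for the flush on every call.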

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-29 12:31:46

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

Kyungmin Park wrote:
> I also got these request. the file is empty at rename operatoin in
> case of sudden power off.
> they say it's different from jffs2. in case of jffs2, it points old
> files even though power off.

Right, because JFFS2 is synchronous :-)

> then why is UBIFS different. fix it as before. I said it's not
> filesystem bug. it's expected behaviors.

Right, this is what I have always thought: the FS gives no
guarantees, and if you want a 100% guarantee, use fsync() before
renaming. Frankly, I still think so. But we'll make ext4-like
changes in UBIFS as well, to help the applications which do not
sync.

> Frankly I'm not sure which one is better. how much filesystem support
> it. but remember that application programmer also don't want to change
> their application when filesystem is changed.
> "The application is not changed, only filesystem is changed. so it's
> filesystem problem, not us"

I hope Linux gurus will put it clearly after all - to fsync() or to
not fsync(). We do need clear rules of the game. For now, I still
assume the following:

1. If applications want atomic update which gives 100% guarantee,
they should fsync before rename.
2. If the application does not use fsync, FS should try to minimize
the probability of data loss by running asynchronous write-back
on rename which unlinks a direntry.
3. All this performance vs. reliability hassle should be solved
by fixing the FS, by having good defaults, and by having
"fsync/not fsync" knobs in applications.

Indeed, people mostly talk about ext3, desktops, etc. But there
is also the embedded world, where battery is removed randomly.
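
Rules 1 and 3 from the list above could be sketched in an application
like this. atomic_update() and its `durable` flag are hypothetical; the
flag stands in for the "fsync/not fsync" knob, and with durable=false
the outcome after a power cut depends entirely on what the FS does on
rename (rule 2).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace `path` with `data` via a temporary file and rename().
 * durable=true : fsync() the temporary file before the rename.
 * durable=false: skip the fsync() and rely on the FS behavior. */
int atomic_update(const char *path, const char *data, bool durable)
{
    char tmp[4096];

    assert(path && data);
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, data, strlen(data));
    if (rc == 0 && durable)
        rc = fsync(fd);         /* data on media before the rename */
    if (close(fd) < 0)
        rc = -1;
    if (rc == 0)
        rc = rename(tmp, path);
    if (rc < 0) {
        unlink(tmp);            /* do not leave a stale temp file */
        return -1;
    }
    return 0;
}
```

This only covers the file data; making sure the rename itself has
reached the media would additionally need an fsync() on the containing
directory.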

But we will see where this all leads. I really want clean rules
for this.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 12:43:22

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Pavel Machek wrote:
> On Fri 2009-03-27 14:48:10, Artem Bityutskiy wrote:
>> UBIFS has exactly the same properties like ext4 - in case
>> of power cuts:
>>
>> 1. truncate/write/close leads to empty files
>> 2. create/write/rename leads to empty files
>>
>> UBIFS is used in hand-held and and power-cuts are very
>> often there, because users just remove battery often.
>>
>> I realize the "reality is different" argument, and already
>> concluded that we need a similar changes as Theo has done
>> for ext4:
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=bf1b69c0db7f9b9d8f02e94d40b19fca8336b991
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=f32b730a69bd56c5c9d704d8b75f03e90e290971
>> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=8411e347c3306ed36b8ca88611bf5fbf4d27d705
>>
>> We have a problem that user-space people do not want to
>> use 'fsync()', even when they are pointed to their code
>> which is doing create/write/rename/close without fsync().
>
> Well... they really don't want to spin the disk up for the
> fsync(). I'm not sure if fsync() is really sensible operation to use
> there.

I'm personally concerned about hand-helds, and in the case of UBIFS
fsync is not too expensive: we work on flash, and on fsync() we
write back only the data belonging to the inode in question, and
nothing else.

>> 1. truncate/write/close leads to empty files
>
> this is buggy.

In FS, or in application?

>> 2. create/write/rename leads to empty files
>
> ..but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.

Well, OK, we can fsync() before rename, we just need clean rules
for this, so that all Linux FSes would follow them. Would be nice
to have final agreement on all this stuff.

> It is somehow similar to fsync()/rename(), but does not force disk
> spin up immediately -- it only inserts "barrier" between data blocks
> and rename. (And yes, it should be implemented as fsync()+rename() for
> filesystems like xfs. It can be implemented as plain rename for ext3
> and ext4 after the fixes...)

Right. But I guess only a few file-systems would really implement
this, because it is complex.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 12:50:37

by Pavel Machek

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)


>>> We have a problem that user-space people do not want to
>>> use 'fsync()', even when they are pointed to their code
>>> which is doing create/write/rename/close without fsync().
>>
>> Well... they really don't want to spin the disk up for the
>> fsync(). I'm not sure if fsync() is really sensible operation to use
>> there.
>
> I'm personally concerned about hand-held, and in case of UBIFS
> fsync is not too expensive - we work on flash and on fsync() we
> write back only the stuff belonging to inode in question, and
> nothing else.

Well, I'm more concerned about spinning disks, having one even in my
Zaurus. And I do believe that fsync() will write more data than
necessary even in the flash case.

>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?

Application is buggy; no way kernel can help there.

>>> 2. create/write/rename leads to empty files
>>
>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
>
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.

My proposal is

rename() stays.

replace(src, bar) is rename that ensures that bar will contain valid
data after powerfail.

>> It is somehow similar to fsync()/rename(), but does not force disk
>> spin up immediately -- it only inserts "barrier" between data blocks
>> and rename. (And yes, it should be implemented as fsync()+rename() for
>> filesystems like xfs. It can be implemented as plain rename for ext3
>> and ext4 after the fixes...)
>
> Right. But I guess only few file-systems would really implement
> this, because this is complex.

Complex yes, but at least ext3+ext4+btrfs should, and they really have
90% of "market share" :-). ext3 and ext4 implementations are already
done :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-29 12:55:33

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

Artem Bityutskiy wrote:
> Kyungmin Park wrote:
>> I also got these request. the file is empty at rename operatoin in
>> case of sudden power off.
>> they say it's different from jffs2. in case of jffs2, it points old
>> files even though power off.
>
> Right, because JFFS2 is synchronous :-)
>
>> then why is UBIFS different. fix it as before. I said it's not
>> filesystem bug. it's expected behaviors.
>
> Right, this is what I've been always thinking. I've always been
> thinking the FS gives no guarantees, and if you want a 100%
> guarantee, use fsync() before renaming. Frankly, I still think
> so. But we'll make ext4-like changes in UBIFS as well to help
> the applications which do not do the sync.
>
>> Frankly I'm not sure which one is better. how much filesystem support
>> it. but remember that application programmer also don't want to change
>> their application when filesystem is changed.
>> "The application is not changed, only filesystem is changed. so it's
>> filesystem problem, not us"
>
> I hope Linux gurus will put it clearly after all - to fsync() or to
> not fsync(). We do need clear rules of the game. For now, I still
> assume the following:
>
> 1. If applications want atomic update which gives 100% guarantee,
> they should fsync before rename.
> 2. If the application does not use fsync, FS should try to minimize
> the probability of data loss by running asynchronous write-back
> on rename which unlinks a direntry.
> 3. All this performance vs. reliability hassle should be solved
> by fixing the FS, by having good defaults, by having a
> "fsync/not fsync" knobs in applications.
>
> Indeed, people mostly talk about ext3, desktops, etc. But there
> is also the embedded world, where battery is removed randomly.

Let me elaborate on why I bring up embedded. Looking at the
"Linux-2.6.29" thread, it _seems_ people assume that it is enough
if the FS starts _asynchronous_ write-back after rename, so that
dirty data does not sit in the cache for a long time. E.g., many
people are happy with ext3's 5 seconds. So it seems to me that
some people do not care about 100% atomicity guarantees; they are
fine with just a low probability of data loss.

So what I am saying is that in embedded we need 100% atomic updates,
because our power cuts may be frequent and random. And at the
moment only fsync() before rename can guarantee this.

And updating a file using truncate/rewrite does not guarantee
anything at all.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 13:01:15

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Pavel Machek wrote:
>>>> 2. create/write/rename leads to empty files
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before commiting the rename.
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>
> My proposal is
>
> rename() stays.

It stays and:

1. does _not_ fsync
2. has synchronous fsync added
3. has asynchronous fsync added?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 13:01:53

by Andreas T.Auer

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)



On 29.03.2009 14:42 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>
>>> 1. truncate/write/close leads to empty files
>>
>> this is buggy.
>
> In FS, or in application?
In the application, of course. If you rewrite a huge file that way, you
have a long window of risk of losing data in a crash, even with
synchronous writes.
>
>>> 2. create/write/rename leads to empty files
In that case the window of risk is reduced to the rename itself, at least
from the viewpoint of application developers who are not familiar with
modern re-ordering filesystems.

>> ..but this should not be. If we want to make that explicit, we should
>> provide "replace()" operation; where replace is rename that makes sure
>> that source file is completely on media before commiting the rename.
It is a hard task to change all the applications; there are a lot of
orphaned projects which are still in use.
> Well, OK, we can fsync() before rename, we just need clean rules
> for this, so that all Linux FSes would follow them. Would be nice
> to have final agreement on all this stuff.
>
This slows down things, but you could also delay the writing of the
metadata pointing to non-existing data. Or is there any use for it after
the crash?

2009-03-29 13:03:17

by Pavel Machek

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>>>> 2. create/write/rename leads to empty files
>>>> ..but this should not be. If we want to make that explicit, we should
>>>> provide "replace()" operation; where replace is rename that makes sure
>>>> that source file is completely on media before commiting the rename.
>>> Well, OK, we can fsync() before rename, we just need clean rules
>>> for this, so that all Linux FSes would follow them. Would be nice
>>> to have final agreement on all this stuff.
>>
>> My proposal is
>>
>> rename() stays.
>
> It stays and:
>
> 1. does _not_ fsync

Does not fsync. If someone wants to make sure one of the files is on
the disk, he should use replace(). [On non-linux systems, replace()
should be implemented as fsync/rename in libc or something.]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-29 13:06:50

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Andreas T.Auer wrote:
>
> On 29.03.2009 14:42 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>
>>>> 1. truncate/write/close leads to empty files
>>> this is buggy.
>> In FS, or in application?
> In application of course. If you rewrite a huge file that way, you have
> a long-time risk of loosing data in a crash, even with sychronous writes.

You know, after reading all these blogs and discussions,
I will not be surprised if someone says this is an FS bug.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 13:07:56

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Pavel Machek wrote:
> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>>>>> 2. create/write/rename leads to empty files
>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>> that source file is completely on media before commiting the rename.
>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>> to have final agreement on all this stuff.
>>> My proposal is
>>>
>>> rename() stays.
>> It stays and:
>>
>> 1. does _not_ fsync
>
> Does not fsync. If someone wants to make sure one of the files is on
> the disk, he should use replace(). [On non-linux systems, replace()
> should be implemented as fsync/rename in libc or something.]

I would be happy with these rules. But the fact is, application
people just refuse to add fsync before rename. They say that the
FS has to do this. And they say that even Linus supports them,
which is an argument I find difficult to fight against. This is
why I want clean rules.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 13:23:23

by Andreas T.Auer

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)



On 29.03.2009 15:07 Artem Bityutskiy wrote:
> Pavel Machek wrote:
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
As a user I will avoid using any fs that requires tons of
applications to be changed for a reasonable amount of data safety.
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename.
Because it slows down the performance.
> They say that the
> FS has to do this.
They say that the FS should not write metadata for non-existing data,
let alone overwrite "clean" metadata with "dirty" metadata. It is up to
the fs to decide whether fsync is needed to achieve this.

2009-03-29 13:40:28

by Pavel Machek

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
> Pavel Machek wrote:
>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>>>>> 2. create/write/rename leads to empty files
>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>> that source file is completely on media before commiting the rename.
>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>> to have final agreement on all this stuff.
>>>> My proposal is
>>>>
>>>> rename() stays.
>>> It stays and:
>>>
>>> 1. does _not_ fsync
>>
>> Does not fsync. If someone wants to make sure one of the files is on
>> the disk, he should use replace(). [On non-linux systems, replace()
>> should be implemented as fsync/rename in libc or something.]
>
> I would be happy with these rules. But the fact is, application
> people just refuse to add fsync before rename. They say that the
> FS has to do this. And they say that even Linus supports them,

That's good. fsync before rename would be an ugly regression (on ext3 at
least). We should get them to use a replace() syscall, not get them to
add fsyncs. [Of course, that means we need a replace syscall first. :-)]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-29 13:55:57

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Andreas T.Auer wrote:
> On 29.03.2009 15:07 Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
> As a user I will avoid using any fs, which requires the tons of
> applications to be changed for a reasonable amount of data safety.
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename.
> Because it slows down the performance.
>> They say that the
>> FS has to do this.
> They say that FS should not write metadata for non-existing data and
> even overwrite "clean" metadata with "dirty" metadata. It is up to the
> fs to decide, whether fsync is needed to achieve this.

Well, this makes sense, but the fact is that FS developers did
not keep this in mind. And when we have been developing UBIFS,
we also naively assumed that user-space would just call fsync
if needed. And it was easier to implement stuff this way. And
it looked like POSIX and other Linux FSes assumed that.

But well, we can change UBIFS behavior, but it would be nice
to have some agreement on all this.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 13:57:32

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Pavel Machek wrote:
> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>> Pavel Machek wrote:
>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>> Pavel Machek wrote:
>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>> that source file is completely on media before commiting the rename.
>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>> to have final agreement on all this stuff.
>>>>> My proposal is
>>>>>
>>>>> rename() stays.
>>>> It stays and:
>>>>
>>>> 1. does _not_ fsync
>>> Does not fsync. If someone wants to make sure one of the files is on
>>> the disk, he should use replace(). [On non-linux systems, replace()
>>> should be implemented as fsync/rename in libc or something.]
>> I would be happy with these rules. But the fact is, application
>> people just refuse to add fsync before rename. They say that the
>> FS has to do this. And they say that even Linus supports them,
>
> That's good. fsync before rename would be ugly regression (on ext3 at
> least). We should get them to use replace() syscall, not get them to
> add fsyncs. [Of course, that means we need replace syscall first. :-)]

I'd say it is better to fix ext3 then.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-03-29 14:00:27

by Pavel Machek

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

On Sun 2009-03-29 16:57:06, Artem Bityutskiy wrote:
> ext Pavel Machek wrote:
>> On Sun 2009-03-29 16:07:35, Artem Bityutskiy wrote:
>>> Pavel Machek wrote:
>>>> On Sun 2009-03-29 16:00:45, Artem Bityutskiy wrote:
>>>>> Pavel Machek wrote:
>>>>>>>>> 2. create/write/rename leads to empty files
>>>>>>>> ..but this should not be. If we want to make that explicit, we should
>>>>>>>> provide "replace()" operation; where replace is rename that makes sure
>>>>>>>> that source file is completely on media before commiting the rename.
>>>>>>> Well, OK, we can fsync() before rename, we just need clean rules
>>>>>>> for this, so that all Linux FSes would follow them. Would be nice
>>>>>>> to have final agreement on all this stuff.
>>>>>> My proposal is
>>>>>>
>>>>>> rename() stays.
>>>>> It stays and:
>>>>>
>>>>> 1. does _not_ fsync
>>>> Does not fsync. If someone wants to make sure one of the files is on
>>>> the disk, he should use replace(). [On non-linux systems, replace()
>>>> should be implemented as fsync/rename in libc or something.]
>>> I would be happy with these rules. But the fact is, application
>>> people just refuse to add fsync before rename. They say that the
>>> FS has to do this. And they say that even Linus supports them,
>>
>> That's good. fsync before rename would be ugly regression (on ext3 at
>> least). We should get them to use replace() syscall, not get them to
>> add fsyncs. [Of course, that means we need replace syscall first. :-)]
>
> I'd say it is better to fix ext3 then.

? I don't get this.

ext3's rename() is already equivalent to proposed replace(). The
problem is that btrfs's and ubifs's renames are not.

So doing an extra fsync() on ext3 is actually a performance regression
-> we do not want applications to randomly add open-coded fsyncs().

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-30 15:55:54

by Diego Calleja

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

On Sunday 29 March 2009 14:26:00 Pavel Machek wrote:

> ...but this should not be. If we want to make that explicit, we should
> provide "replace()" operation; where replace is rename that makes sure
> that source file is completely on media before commiting the rename.

An "ad Linus-em" counterexample:

"And if we have a Linux-specific magic system call or sync action, it's
going to be even more rarely used than fsync(). Do you think anybody
really uses the OS X FSYNC_FULL ioctl? Nope. Outside of a few databases,
it is almost certainly not going to be used, and fsync() will not be
reliable in general.

So rather than come up with new barriers that nobody will use, filesystem
people should aim to make "badly written" code "just work" unless people
are really really unlucky. Because like it or not, that's what 99% of all
code is."

2009-03-30 17:19:44

by Ric Wheeler

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Pavel Machek wrote:
>>>> We have a problem that user-space people do not want to
>>>> use 'fsync()', even when they are pointed to their code
>>>> which is doing create/write/rename/close without fsync().
>>>>
>>> Well... they really don't want to spin the disk up for the
>>> fsync(). I'm not sure if fsync() is really sensible operation to use
>>> there.
>>>
>> I'm personally concerned about hand-held, and in case of UBIFS
>> fsync is not too expensive - we work on flash and on fsync() we
>> write back only the stuff belonging to inode in question, and
>> nothing else.
>>
>
> Well, I'm more concerned about spinning disks, having one even in my
> zaurus. And I do believe that fsync() will write more data than
> neccessary even in flash case.
>
>
>>>> 1. truncate/write/close leads to empty files
>>>>
>>> this is buggy.
>>>
>> In FS, or in application?
>>
>
> Application is buggy; no way kernel can help there.
>
>
>>>> 2. create/write/rename leads to empty files
>>>>
>>> ..but this should not be. If we want to make that explicit, we should
>>> provide "replace()" operation; where replace is rename that makes sure
>>> that source file is completely on media before committing the rename.
>>>
>> Well, OK, we can fsync() before rename, we just need clean rules
>> for this, so that all Linux FSes would follow them. Would be nice
>> to have final agreement on all this stuff.
>>
>
> My proposal is
>
> rename() stays.
>
> replace(src, bar) is rename that ensures that bar will contain valid
> data after powerfail.
>

Surely the only way to "ensure" this is to spin up the drive, write the
meta-data and data back and make sure that it is not held in volatile
write cache?

Why would calling this replace be better or more power efficient than
what you need to do today?

ric

>
>>> It is somehow similar to fsync()/rename(), but does not force disk
>>> spin up immediately -- it only inserts "barrier" between data blocks
>>> and rename. (And yes, it should be implemented as fsync()+rename() for
>>> filesystems like xfs. It can be implemented as plain rename for ext3
>>> and ext4 after the fixes...)
>>>
>> Right. But I guess only few file-systems would really implement
>> this, because this is complex.
>>
>
> Complex yes, but at least ext3+ext4+btrfs should, and they really have
> 90% of "market share" :-). ext3 and ext4 implementations are already
> done :-).
> Pavel
>

2009-03-30 22:12:26

by Pavel Machek

[permalink] [raw]
Subject: Re: replace() system call needed (was Re: EXT4-ish "fixes" in UBIFS)

Hi!

>> My proposal is
>>
>> rename() stays.
>>
>> replace(src, bar) is rename that ensures that bar will contain valid
>> data after powerfail.
>>
>
> Surely the only way to "ensure" this is to spin up the drive, write the
> meta-data and data back and make sure that it is not held in volatile
> write cache?

Well, no. "will contain valid data" but may contain _old_ valid data.

So the way to do that would be "wait until you have to spin disk up
anyway or until timeout, then write data first, then do rename".
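
For comparison, the closest thing userspace can do today is the
immediate variant: force the new data out, then rename. A minimal shell
sketch, assuming a dd that supports conv=fsync (GNU coreutils and
busybox do; it is not in POSIX). replace_file is a hypothetical helper
name, and unlike the replace() proposed here it does not defer the
writeout until the disk is spinning anyway:

```sh
#!/bin/sh
# replace_file DEST: read new contents on stdin, then swap them into DEST.
# Hypothetical helper; relies on dd's conv=fsync extension.
replace_file() {
    dest=$1
    tmp="$dest.tmp.$$"      # same directory, so mv is a plain rename()
    # dd calls fsync() on the output file before it exits.
    dd of="$tmp" conv=fsync 2>/dev/null || { rm -f "$tmp"; return 1; }
    # The intent: after a crash DEST holds the old or the new contents,
    # never an empty file. The rename itself may still be lost unless
    # the directory is synced as well.
    mv -f "$tmp" "$dest"
}
```

Usage would look like: printf 'new contents\n' | replace_file some.conf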

AFAICT those are the semantics GNOME (etc.) wants.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-04-03 00:09:54

by Christian Kujau

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
> They just say - this is file-system bug, it is fixed in
> ext4 now, just fix the bug in UBIFS.

Would *mounting* the filesystem with "-o sync" help? This way no
filesystem "fixes" are needed and userland would not have to be rewritten.

Christian.
--
Alice and Bob met for the first time at Bruce Schneier's pool-party

2009-04-03 00:24:19

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 6:09 PM, Christian Kujau <[email protected]> wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>> They just say - this is file-system bug, it is fixed in
>> ext4 now, just fix the bug in UBIFS.
>
> Would *mounting* the filesystem with "-o sync" help? This way no
> filesystem "fixes" are needed and userland would not have to be rewritten.
>
> Christian.

Yes, mounting "-o sync" does improve ext3 performance. It sucks
though, because I do want quick writes. And mounting with sync option
slows down to disk io speeds. In my case, that's between 20 and 23
megabytes per second *big frown, quivering lip, and tears in my eyes*.
:P

2009-04-03 00:28:58

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 6:09 PM, Christian Kujau <[email protected]> wrote:
>> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>>> They just say - this is file-system bug, it is fixed in
>>> ext4 now, just fix the bug in UBIFS.
>>
>> Would *mounting* the filesystem with "-o sync" help? This way no
>> filesystem "fixes" are needed and userland would not have to be rewritten.
>>
>> Christian.
>
> Yes, mounting "-o sync" does improve ext3 performance. It sucks
> though, because I do want quick writes. And mounting with sync option
> slows down to disk io speeds. In my case, that's between 20 and 23
> megabytes per second *big frown, quivering lip, and tears in my eyes*.
> :P
>

Oh, I should have clarified. It improves performance under heavy
load. Under normal load, mounting without sync is fine. What I tend
to do is mount with "remount,rw,sync" when heavy load is starting.
Then my system goes slowly, but latency is good. Then, when it's all
done (say a big compile, or job, or whatever), I remount without sync
again.

I'm thinking of writing a script that monitors performance, and
remounts as needed, lol. WHAT A HACK. hehe.

2009-04-03 00:38:43

by Christian Kujau

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, 2 Apr 2009, Trenton D. Adams wrote:
> Oh, I should have clarified. It improves performance under heavy
> load. Under normal load, mounting without sync is fine. What I tend
> to do is mount with "remount,rw,sync" when heavy load is starting.

Really? How does mounting with "-o sync" *improve* performance? I am
certainly aware that mounting with "-o sync" has severe performance
impacts, but was proposing it anyway *only* to tackle the data integrity
problem. However, I'm curious if use cases in the embedded world are
equally affected by this.

> I'm thinking of writing a script that monitors performance, and
> remounts as needed, lol. WHAT A HACK. hehe.

Ugh....my brain hurts :-\

Christian.
--
Bruce Schneier once found three distinct natural number divisors of a prime
number.

2009-04-03 00:54:25

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <[email protected]> wrote:
> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>> Oh, I should have clarified. It improves performance under heavy
>> load. Under normal load, mounting without sync is fine. What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>
> Really? How does mounting with "-o sync" *improve* performance? I am
> certainly aware that mounting with "-o sync" has severe performance
> impacts, but was proposing it anyway *only* to tackle the data integrity
> problem. However, I'm curious if use cases in the embedded world are
> equally affected by this.
>

Oh, well for my system, if I do heavy IO, my *fsync* performance drops
like a rock. fsync on even 1M takes 15-20 seconds at times. I have
even seen 50 seconds. If I mount with sync option, the fsyncs of 1M
take only a couple hundred milliseconds, while the other heavy IO is
happening.
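
A rough way to take that measurement from the shell. This is only a
sketch: it assumes GNU date (for %N nanoseconds) and a dd that accepts
bs=1M and conv=fsync, fsync_latency_ms is a hypothetical name, and the
figure includes process start-up time:

```sh
#!/bin/sh
# fsync_latency_ms [DIR]: time a 1 MiB write + fsync in DIR, print ms.
fsync_latency_ms() {
    dir=${1:-.}
    probe="$dir/.fsync-probe.$$"
    start=$(date +%s%N)
    # conv=fsync makes dd fsync() the output file before exiting.
    dd if=/dev/zero of="$probe" bs=1M count=1 conv=fsync 2>/dev/null
    end=$(date +%s%N)
    rm -f "$probe"
    echo $(( (end - start) / 1000000 ))
}

fsync_latency_ms "${TMPDIR:-/tmp}"
```

Point it at a directory on the filesystem under load; a tmpfs /tmp
would make the fsync trivially fast.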

2009-04-03 00:54:56

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <[email protected]> wrote:
> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol. WHAT A HACK. hehe.
>
> Ugh....my brain hurts :-\
>
> Christian.

Yeah, mine too.

2009-04-03 00:59:37

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 6:54 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 6:38 PM, Christian Kujau <[email protected]> wrote:
>> On Thu, 2 Apr 2009, Trenton D. Adams wrote:
>>> I'm thinking of writing a script that monitors performance, and
>>> remounts as needed, lol. WHAT A HACK. hehe.
>>
>> Ugh....my brain hurts :-\
>>
>> Christian.
>
> Yeah, mine too.
>

Just to make it hurt more for you, here you go...

On one console I run...
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2000

On another I run...
perf-mon.sh
remounting with sync option, performance dropping
remounting without sync option, performance has stabilized

It may be better to write a C program that does a 1M fsync, and if
it's taking too long, then remount, lol. Also, this script here,
using 1 min load average, will catch CPU intensity as well, which is
not really what I want. Ah, it is a hack indeed. ROFL

#!/bin/sh
# Remount / with "sync" while the 1-minute load average is above 1,
# and without it once the load has dropped again.

while true; do
    # The first field of /proc/loadavg is the 1-minute load average.
    LOAD=$(cut -d ' ' -f1 /proc/loadavg)
    if [ "$(echo "$LOAD > 1" | bc)" -eq 1 ]; then
        # Only remount if the root device is not already mounted sync.
        if ! mount | grep -Eq 's-sys.*sync'; then
            echo "remounting with sync option, performance dropping"
            mount -o remount,rw,sync /dev/s/sys /
        fi
    else
        if mount | grep -Eq 's-sys.*sync'; then
            echo "remounting without sync option, performance has stabilized"
            mount -o remount,rw /dev/s/sys /
        fi
    fi
    sleep 1
done

2009-04-03 01:55:32

by David Rees

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
> <[email protected]> wrote:
>> Yes, mounting "-o sync" does improve ext3 performance. It sucks
>> though, because I do want quick writes. And mounting with sync option
>> slows down to disk io speeds. In my case, that's between 20 and 23
>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>> :P
>>
>
> Oh, I should have clarified. It improves performance under heavy
> load. Under normal load, mounting without sync is fine. What I tend
> to do is mount with "remount,rw,sync" when heavy load is starting.
> Then my system goes slowly, but latency is good. Then, when it's all
> done (say a big compile, or job, or whatever), I remount without sync
> again.
>
> I'm thinking of writing a script that monitors performance, and
> remounts as needed, lol. WHAT A HACK. hehe.

All you're doing here is implementing the lowering of dirty data
limits in the VM dynamically based on how long fsyncs take.

Linus outlined this specific strategy as "the ideal situation"
somewhere in the depths of "That filesystem thread".

Look at the new in 2.6.29 dirty*bytes parameters in
Documentation/sysctl/vm.txt for more info. By lowering those values,
you can effectively turn normal writes into synchronous writes which
will greatly reduce latency of fsync under heavy write load.

In previous kernels you can tweak dirty_ratio and
dirty_background_ratio, but they don't have the granularity of the new
knobs. Although if you are talking about just remounting in sync
mode, they may work for you at least as a proof of concept. ;-)
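
A small sketch of how to inspect these knobs, assuming a kernel that
exposes them under /proc/sys/vm (the *_bytes files only exist from
2.6.29 on). show_dirty_limits is a hypothetical helper name and the
values in the comments are illustrative, not recommendations:

```sh
#!/bin/sh
# show_dirty_limits: print whichever dirty-data knobs this kernel exposes.
show_dirty_limits() {
    for knob in dirty_bytes dirty_background_bytes \
                dirty_ratio dirty_background_ratio; do
        f=/proc/sys/vm/$knob
        [ -r "$f" ] && printf 'vm.%s = %s\n' "$knob" "$(cat "$f")"
    done
    return 0
}

show_dirty_limits

# Lowering the limits needs root; values are in bytes:
#   sysctl -w vm.dirty_background_bytes=524288
#   sysctl -w vm.dirty_bytes=1048576
# Writing a *_bytes knob makes its *_ratio counterpart read back as 0,
# and vice versa, since only one of each pair is in effect at a time.
```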

-Dave

2009-04-03 02:05:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 02, 2009 at 05:09:39PM -0700, Christian Kujau wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
> > They just say - this is file-system bug, it is fixed in
> > ext4 now, just fix the bug in UBIFS.
>
> Would *mounting* the filesystem with "-o sync" help? This way no
> filesystem "fixes" are needed and userland would not have to be rewritten.

It will, but you might not like the performance.... the reason why
it's there is that some users might want the particular tradeoff, but
it probably wouldn't make a good default.

- Ted

2009-04-03 02:05:58

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 7:55 PM, David Rees <[email protected]> wrote:
> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
> <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
>> <[email protected]> wrote:
>>> Yes, mounting "-o sync" does improve ext3 performance. It sucks
>>> though, because I do want quick writes. And mounting with sync option
>>> slows down to disk io speeds. In my case, that's between 20 and 23
>>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>>> :P
>>>
>>
>> Oh, I should have clarified. It improves performance under heavy
>> load. Under normal load, mounting without sync is fine. What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>> Then my system goes slowly, but latency is good. Then, when it's all
>> done (say a big compile, or job, or whatever), I remount without sync
>> again.
>>
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol. WHAT A HACK. hehe.
>
> All you're doing here is implementing the lowering of dirty data
> limits in the VM dynamically based on how long fsyncs take.
>
> Linus outlined this specific strategy as "the ideal situation"
> somewhere in the depths of "That filesystem thread".

I thought he said it was a HORRIBLE solution. :D I recall him
slamming Andrew over it. Unless you're referring to the kernel
actually doing it on the fly.

>
> Look at the new in 2.6.29 dirty*bytes parameters in
> Documentation/sysctl/vm.txt for more info. By lowering those values,
> you can effectively turn normal writes into synchronous writes which
> will greatly reduce latency of fsync under heavy write load.
>
> In previous kernels you can tweak dirty_ratio and
> dirty_background_ratio, but they don't have the granularity of the new
> knobs. Although if you are talking about just remounting in sync
> mode, they may work for you at least as a proof of concept. ;-)
>
> -Dave
>

dirty_ratio and dirty_background never really had any effect for me.
I'll look into the other parameters. Waiting for the checkout again,
as I am currently under a heavy rsync load (*rolls eyes*).

Thanks.

2009-04-03 02:20:29

by David Rees

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
>> <[email protected]> wrote:
>>> Oh, I should have clarified. It improves performance under heavy
>>> load. Under normal load, mounting without sync is fine. What I tend
>>> to do is mount with "remount,rw,sync" when heavy load is starting.
>>> Then my system goes slowly, but latency is good. Then, when it's all
>>> done (say a big compile, or job, or whatever), I remount without sync
>>> again.
>>>
>>> I'm thinking of writing a script that monitors performance, and
>>> remounts as needed, lol. WHAT A HACK. hehe.
>>
>> All you're doing here is implementing the lowering of dirty data
>> limits in the VM dynamically based on how long fsyncs take.
>>
>> Linus outlined this specific strategy as "the ideal situation"
>> somewhere in the depths of "That filesystem thread".
>
> I thought he said it was a HORRIBLE solution. :D I recall him
> slamming Andrew over it. Unless you're referring to the kernel
> actually doing it on the fly.

Yes - you are correct - doing it in userspace isn't the best place to
put it - but if you can do it there, the same ideas could then be
pushed into the kernel and further enhanced.

>> Look at the new in 2.6.29 dirty*bytes parameters in
>> Documentation/sysctl/vm.txt for more info. By lowering those values,
>> you can effectively turn normal writes into synchronous writes which
>> will greatly reduce latency of fsync under heavy write load.
>>
>> In previous kernels you can tweak dirty_ratio and
>> dirty_background_ratio, but they don't have the granularity of the new
>> knobs. Although if you are talking about just remounting in sync
>> mode, they may work for you at least as a proof of concept. ;-)
>
> dirty_ratio and dirty_background never really had any effect for me.
> I'll look into the other parameters. Waiting for the checkout again,
> as I am currently under a heavy rsync load (*rolls eyes*).

How low have you set them? Try setting them to 2 and 1 respectively.
It cuts down fsync latencies by a significant amount in my experience.

-Dave

2009-04-03 02:26:18

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 7:55 PM, David Rees <[email protected]> wrote:
> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
> <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 6:24 PM, Trenton D. Adams
>> <[email protected]> wrote:
>>> Yes, mounting "-o sync" does improve ext3 performance. It sucks
>>> though, because I do want quick writes. And mounting with sync option
>>> slows down to disk io speeds. In my case, that's between 20 and 23
>>> megabytes per second *big frown, quivering lip, and tears in my eyes*.
>>> :P
>>>
>>
>> Oh, I should have clarified. It improves performance under heavy
>> load. Under normal load, mounting without sync is fine. What I tend
>> to do is mount with "remount,rw,sync" when heavy load is starting.
>> Then my system goes slowly, but latency is good. Then, when it's all
>> done (say a big compile, or job, or whatever), I remount without sync
>> again.
>>
>> I'm thinking of writing a script that monitors performance, and
>> remounts as needed, lol. WHAT A HACK. hehe.
>
> All you're doing here is implementing the lowering of dirty data
> limits in the VM dynamically based on how long fsyncs take.
>
> Linus outlined this specific strategy as "the ideal situation"
> somewhere in the depths of "That filesystem thread".
>
> Look at the new in 2.6.29 dirty*bytes parameters in
> Documentation/sysctl/vm.txt for more info. By lowering those values,
> you can effectively turn normal writes into synchronous writes which
> will greatly reduce latency of fsync under heavy write load.

WOW, that makes a huge difference. If I set it to 100M, I get the
10-15 second delay I was talking about. But, if I set it to 1M, I get
0.3 to 0.4 second delay on a 1M fsync. That is way better. Perhaps I
should auto-tune based on that parameter then. Although I do agree
with Linus that it sucks to do userland auto-tuning. :P

2009-04-03 02:29:15

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 8:19 PM, David Rees <[email protected]> wrote:
> On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams
> <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <[email protected]> wrote:
>>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams
>>> <[email protected]> wrote:
>> dirty_ratio and dirty_background never really had any effect for me.
>> I'll look into the other parameters. Waiting for the checkout again,
>> as I am currently under a heavy rsync load (*rolls eyes*).
>
> How low have you set them? Try setting them to 2 and 1 respectively.
> It cuts down fsync latencies by a significant amount in my experience.
>
> -Dave
>

That's the odd thing, I was setting them to 2 and 1. I was just
looking at the 2.6.29 code, and it should have made a difference. I
don't know what version of the kernel I was using at the time. And,
I'm not sure if I had the 1M fsync tests in place at the time either,
to be sure about what I was testing. It could be that I wasn't being
very scientific about it at the time. Thanks though, that setting
makes a huge difference.

2009-04-03 02:45:43

by Christian Kujau

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, 2 Apr 2009, Theodore Tso wrote:
> It will, but you might not like the performance.... the reason why
> it's there is that some users might want the particular tradeoff, but
> it probably wouldn't make a good default.

Thanks for confirming this. Yes, I know about the performance impact, but
perhaps it's feasible for some setups.

Christian.

PS: I was curious *how* bad the impact was and so I tried generating a
477 MB tarball, first on an async, then on a sync-mounted partition:

/dev/md0 /mnt/md0 ext4 rw,noatime,barrier=1,data=ordered
$ time tar -cf /mnt/md0/test.tar /usr
real 1m36.615s

/dev/md0 /mnt/md0 ext4 rw,sync,noatime,barrier=1,data=ordered
$ time tar -cf /mnt/md0/test.tar /usr
real 5m23.793s
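
Worked out from those two timings (a small sketch; the constants are
copied from the runs above and ratio is a hypothetical helper name):

```sh
#!/bin/sh
SIZE_MB=477
ASYNC_S=96.615     # 1m36.615s
SYNC_S=323.793     # 5m23.793s

# ratio A B: print B / A with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'; }

echo "slowdown:   $(ratio "$ASYNC_S" "$SYNC_S")x"
echo "async MB/s: $(ratio "$ASYNC_S" "$SIZE_MB")"
echo "sync MB/s:  $(ratio "$SYNC_S" "$SIZE_MB")"
```

That is a slowdown of roughly 3.35x, or about 4.94 MB/s versus
1.47 MB/s for this particular workload.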

--
Bruce Schneier does not get kidney stones. He gets Rosetta Stones.

2009-04-03 02:50:11

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 8:45 PM, Christian Kujau <[email protected]> wrote:
> On Thu, 2 Apr 2009, Theodore Tso wrote:
>> It will, but you might not like the performance.... the reason why
>> it's there is that some users might want the particular tradeoff, but
>> it probably wouldn't make a good default.
>
> Thanks for confirming this. Yes, I know about the performance impact, but
> perhaps it's feasible for some setups.
>
> Christian.
>
> PS: I was curious *how* bad the impact was and so I tried generating a
> 477 MB tarball, first on an async, then on a sync-mounted partition:
>
> /dev/md0 /mnt/md0 ext4 rw,noatime,barrier=1,data=ordered
> $ time tar -cf /mnt/md0/test.tar /usr
> real 1m36.615s
>
> /dev/md0 /mnt/md0 ext4 rw,sync,noatime,barrier=1,data=ordered
> $ time tar -cf /mnt/md0/test.tar /usr
> real 5m23.793s

lol, yep sounds about right. Probably much worse on my machine, given
the disk speed is around 20-23M/sec.

2009-04-03 02:58:44

by David Rees

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 8:19 PM, David Rees <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 7:05 PM, Trenton D. Adams <[email protected]> wrote:
>>> On Thu, Apr 2, 2009 at 7:55 PM, David Rees <[email protected]> wrote:
>>>> On Thu, Apr 2, 2009 at 5:28 PM, Trenton D. Adams <[email protected]> wrote:
>>> dirty_ratio and dirty_background never really had any effect for me.
>>> I'll look into the other parameters. Waiting for the checkout again,
>>> as I am currently under a heavy rsync load (*rolls eyes*).
>>
>> How low have you set them? Try setting them to 2 and 1 respectively.
>> It cuts down fsync latencies by a significant amount in my experience.
>
> That's the odd thing, I was setting them to 2 and 1. I was just
> looking at the 2.6.29 code, and it should have made a difference. I
> don't know what version of the kernel I was using at the time. And,
> I'm not sure if I had the 1M fsync tests in place at the time either,
> to be sure about what I was testing. It could be that I wasn't being
> very scientific about it at the time. Thanks though, that setting
> makes a huge difference.

Well, it depends on how much memory you have. Keep in mind that those
are percentages - so if you have 2GB RAM, that's the same as setting
it to 40MB and 20MB respectively - both are a lot larger than the 1M
you were setting the dirty*bytes vm knobs to.

I've got a problematic server with 8GB RAM. Even if I set both to 1,
that's 80MB and the crappy disks I have in it will often only write
10-20MB/s or less due to the seekiness of the workload. That means
delays of 5-10 seconds worst case which isn't fun.
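
The arithmetic behind those figures, as a quick sketch. It takes 1GB as
1000MB, as the round numbers above do, and treats the percentage as
applying to total RAM; the kernel computes against its own notion of
available memory, so real limits will differ somewhat. Both function
names are hypothetical:

```sh
#!/bin/sh
# dirty_limit_mb RAM_MB PERCENT: PERCENT% of RAM_MB, in whole MB.
dirty_limit_mb() { echo $(( $1 * $2 / 100 )); }
# flush_seconds MB MB_PER_S: whole seconds to write MB at MB_PER_S.
flush_seconds() { echo $(( $1 / $2 )); }

dirty_limit_mb 2000 2            # 2GB box, dirty_ratio=2
dirty_limit_mb 2000 1            # 2GB box, dirty_background_ratio=1
limit=$(dirty_limit_mb 8000 1)   # 8GB box, both ratios at 1
echo "$limit"
flush_seconds "$limit" 20        # disks managing 20MB/s
flush_seconds "$limit" 10        # disks managing 10MB/s
```

This prints 40, 20, 80, 4 and 8: the 40MB/20MB and 80MB limits, and a
4 to 8 second flush, in the same ballpark as the 5-10 seconds quoted
once the disks drop below 10MB/s.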

-Dave

2009-04-03 03:13:39

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 8:58 PM, David Rees <[email protected]> wrote:
> On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
>> That's the odd thing, I was setting them to 2 and 1. I was just
>> looking at the 2.6.29 code, and it should have made a difference. I
>> don't know what version of the kernel I was using at the time. And,
>> I'm not sure if I had the 1M fsync tests in place at the time either,
>> to be sure about what I was testing. It could be that I wasn't being
>> very scientific about it at the time. Thanks though, that setting
>> makes a huge difference.
>
> Well, it depends on how much memory you have. Keep in mind that those
> are percentages - so if you have 2GB RAM, that's the same as setting
> it to 40MB and 20MB respectively - both are a lot larger than the 1M
> you were setting the dirty*bytes vm knobs to.
>
> I've got a problematic server with 8GB RAM. Even if I set both to 1,
> that's 80MB and the crappy disks I have in it will often only write
> 10-20MB/s or less due to the seekiness of the workload. That means
> delays of 5-10 seconds worst case which isn't fun.
>
> -Dave
>

Yeah, I just finished doing the calculation. :P 40M is what I'm
seeing. Yeah, that sounds like the same as my problem. Even setting
it to 10M dirty_bytes has a very serious latency problem. I'm glad
that option was added, because 1M works much better. I'll have to
change my shell script to dynamically tune on that. Because under
normal load, I want the 40M+ of queueing. It's just when things get
really heavy, and stuff starts getting flushed, that this problem
starts happening.
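
The tuning decision could be as small as the sketch below. The policy,
the 1000 ms default threshold and the name pick_dirty_bytes are all
hypothetical; only the 1M and 40M figures come from the discussion
above:

```sh
#!/bin/sh
# pick_dirty_bytes LATENCY_MS [THRESHOLD_MS]: choose a vm.dirty_bytes value.
# 1 MiB when the measured fsync latency exceeds the threshold,
# 40 MiB (the "normal load" queueing) otherwise.
pick_dirty_bytes() {
    latency_ms=$1
    threshold_ms=${2:-1000}
    if [ "$latency_ms" -gt "$threshold_ms" ]; then
        echo $(( 1 * 1024 * 1024 ))
    else
        echo $(( 40 * 1024 * 1024 ))
    fi
}

# Applying the result needs root, e.g.:
#   sysctl -w vm.dirty_bytes="$(pick_dirty_bytes "$measured_ms")"
```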

2009-04-03 03:14:19

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

I'm really sorry, I just realized I hijacked this thread. I'll stop now.

On Thu, Apr 2, 2009 at 9:13 PM, Trenton D. Adams
<[email protected]> wrote:
> On Thu, Apr 2, 2009 at 8:58 PM, David Rees <[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 7:28 PM, Trenton D. Adams
>>> That's the odd thing, I was setting them to 2 and 1. I was just
>>> looking at the 2.6.29 code, and it should have made a difference. I
>>> don't know what version of the kernel I was using at the time. And,
>>> I'm not sure if I had the 1M fsync tests in place at the time either,
>>> to be sure about what I was testing. It could be that I wasn't being
>>> very scientific about it at the time. Thanks though, that setting
>>> makes a huge difference.
>>
>> Well, it depends on how much memory you have. Keep in mind that those
>> are percentages - so if you have 2GB RAM, that's the same as setting
>> it to 40MB and 20MB respectively - both are a lot larger than the 1M
>> you were setting the dirty*bytes vm knobs to.
>>
>> I've got a problematic server with 8GB RAM. Even if I set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload. That means
>> delays of 5-10 seconds worst case which isn't fun.
>>
>> -Dave
>>
>
> Yeah, I just finished doing the calculation. :P 40M is what I'm
> seeing. Yeah, that sounds like the same as my problem. Even setting
> it to 10M dirty_bytes has a very serious latency problem. I'm glad
> that option was added, because 1M works much better. I'll have to
> change my shell script to dynamically tune on that. Because under
> normal load, I want the 40M+ of queueing. It's just when things get
> really heavy, and stuff starts getting flushed, that this problem
> starts happening.
>

2009-04-03 05:03:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
>
> I've got a problematic server with 8GB RAM. Even if I set both to 1,
> that's 80MB and the crappy disks I have in it will often only write
> 10-20MB/s or less due to the seekiness of the workload. That means
> delays of 5-10 seconds worst case which isn't fun.
>

Well, one solution is data=writeback. If you're confident your server
isn't going to randomly crash (i.e., it's on a UPS, and you're not
running unstable video drivers), that might be a solution. It has
tradeoffs, though.

One thing which I'll probably implement is some patches to ext3 so
that when it's in data=writeback mode, it will use the same
replace-via-rename and replace-via-truncate heuristics that I added in
ext4 so that it will start an asynchronous writeout on the rename() or
close() w/ truncate(). That should avoid existing files getting
corrupted when they are replaced right before the system crashes.
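
For reference, the two application patterns those heuristics are aimed
at look roughly like this from userspace. This is an illustration of
what applications do, not of the ext4 patches; the function names are
hypothetical, and neither function calls fsync(), which is exactly the
gap under discussion:

```sh
#!/bin/sh
# replace_via_truncate FILE TEXT: open with O_TRUNC, write, close.
replace_via_truncate() {
    printf '%s\n' "$2" > "$1"        # ">" truncates FILE before writing
}

# replace_via_rename FILE TEXT: write a new file, then rename it over FILE.
replace_via_rename() {
    printf '%s\n' "$2" > "$1.new"
    mv -f "$1.new" "$1"              # rename() within one directory
}
```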

People will still be better off moving to ext4, but for people who
aren't quite confident in ext4's stability yet and who want to stick
with ext3, maybe it's a good short-term solution. Maybe
data=writeback with the rename heuristic would be a better default
than data=ordered for ext3.

- Ted

2009-04-03 05:15:42

by Trenton D. Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 11:02 PM, Theodore Tso <[email protected]> wrote:
> On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
>>
>> I've got a problematic server with 8GB RAM. Even if I set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload. That means
>> delays of 5-10 seconds worst case which isn't fun.
>>
>
> People will still be better off moving to ext4, but for people who
> aren't quite confident in ext4's stability yet and who want to stick
> with ext3, maybe it's a good short-term solution. Maybe
> data=writeback with the rename heuristic would be a better default
> than data=ordered for ext3.
>
> - Ted
>

I've tried that before...

tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
mount: / not mounted already, or bad option

Does it have to be done on initial mount?

2009-04-03 06:31:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 02, 2009 at 11:15:29PM -0600, Trenton D. Adams wrote:
>
> tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
> mount: / not mounted already, or bad option
>
> Does it have to be done on initial mount?

Yes, which means you have to use the rootflags boot command-line
option.

It's a pain that we can't switch data= modes on the fly. I believe
the problematic transitions are between data=journal and
data=!journal. Transitions between data=ordered and data=writeback
should be easy to add.
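
A small sketch for checking which mode a mount ended up in, by parsing
its option string. data_mode is a hypothetical helper; note that a
filesystem running in its default mode may not list data= at all, hence
the "unspecified" case:

```sh
#!/bin/sh
# data_mode OPTIONS: print the data= mode named in a comma-separated
# mount option string, or "unspecified" if the string has none.
data_mode() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | grep . \
        || echo unspecified
}

# Example kernel command line fragment (the root device is a placeholder):
#   root=/dev/sdXN rootflags=data=writeback
```

To check the root filesystem, feed it the fourth field of the "/" entry
in /proc/mounts, e.g. data_mode "rw,noatime,data=writeback".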

- Ted


2009-04-03 06:53:59

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

ext Christian Kujau wrote:
> On Fri, 27 Mar 2009, Artem Bityutskiy wrote:
>> They just say - this is file-system bug, it is fixed in
>> ext4 now, just fix the bug in UBIFS.
>
> Would *mounting* the filesystem with "-o sync" help? This way no
> filesystem "fixes" are needed and userland would not have to be rewritten.

It would, but the overall FS performance would suffer a lot too.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-04-03 18:06:26

by David Rees

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

On Thu, Apr 2, 2009 at 10:02 PM, Theodore Tso <[email protected]> wrote:
> On Thu, Apr 02, 2009 at 07:58:17PM -0700, David Rees wrote:
>>
>> I've got a problematic server with 8GB RAM. Even if I set both to 1,
>> that's 80MB and the crappy disks I have in it will often only write
>> 10-20MB/s or less due to the seekiness of the workload. That means
>> delays of 5-10 seconds worst case which isn't fun.
>
> Well, one solution is data=writeback. If you're confident your server
> isn't going to randomly crash (i.e., it's on a UPS, and you're not
> running unstable video drivers), that might be a solution. It has
> tradeoffs, though.

Yeah, that's probably a good workaround for the server in question. I
don't recall it ever crashing.

> One thing which I'll probably implement is some patches to ext3 so
> that when it's in data=writeback mode, it will use the same
> replace-via-rename and replace-via-truncate heuristics that I added in
> ext4 so that it will start an asynchronous writeout on the rename() or
> close() w/ truncate(). That should avoid existing files getting
> corrupted when they are replaced right before the system crashes.

I think that would be a welcome addition to the writeback mode of ext3.

> People will still be better off moving to ext4, but for people who
> aren't quite confident in ext4's stability yet and who want to stick
> with ext3, maybe it's a good short-term solution. Maybe
> data=writeback with the rename heuristic would be a better default
> than data=ordered for ext3.

I've been waiting for Fedora to ship either the latest stable 2.6.28
or 2.6.29 kernel before putting any serious data on ext4 - from what
I've seen it seems like those kernels should have the vast majority of
stability bugs fixed in them. Last I remember reading, 2.6.27
doesn't quite have all the fixes due to difficulties in backporting
those fixes to that kernel.

-Dave

2009-04-03 19:05:21

by Chris Adams

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

Once upon a time, Theodore Tso <[email protected]> said:
>On Thu, Apr 02, 2009 at 11:15:29PM -0600, Trenton D. Adams wrote:
>>
>> tdamac ~ # mount -t ext3 -o data=writeback,remount,rw /dev/s/sys /
>> mount: / not mounted already, or bad option
>>
>> Does it have to be done on initial mount?
>
>Yes, which means you have to use the rootflags boot command-line
>option.

Can't you also set this in the superblock options (e.g. "tune2fs -o
+journal_data_writeback /dev/sda1")?

--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

2009-04-09 20:17:51

by Pavel Machek

[permalink] [raw]
Subject: Re: EXT4-ish "fixes" in UBIFS

Hi!

> > I've got a problematic server with 8GB RAM. Even if I set both to 1,
> > that's 80MB and the crappy disks I have in it will often only write
> > 10-20MB/s or less due to the seekiness of the workload. That means
> > delays of 5-10 seconds worst case which isn't fun.
> >
>
> Well, one solution is data=writeback. If you're confident your server
> isn't going to randomly crash (i.e., it's on a UPS, and you're not
> running unstable video drivers), that might be a solution. It has
> tradeoffs, though.
>
> One thing which I'll probably implement is some patches to ext3 so
> that when it's in data=writeback mode, it will use the same
> replace-via-rename and replace-via-truncate heuristics that I added in
> ext4 so that it will start an asynchronous writeout on the rename() or
> close() w/ truncate(). That should avoid existing files getting
> corrupted when they are replaced right before the system crashes.

Truncate case is unfixable, but would it be possible to only do rename
after data are on disk? Because async writeout only makes catastrophic
data loss 'less probable'...
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html