2009-03-31 06:04:57

by Alberto Gonzalez

Subject: Ext4 and the "30 second window of death"

Hi,

Reading this discussion about the fsync performance problems, the reliability
of delayed allocation, etc... made me a bit confused, so as a normal user I
would like to ask a clear question with an example so I can get a clear answer
and understand the implications of all this.

- Let's say I'm a writer and I like to take my laptop to a cafe every day to
write there for a few hours.

- As such, I want to get good battery life so I'm fine with my data being
written to disk say every 30 seconds instead of waking up the disk
immediately if I save the document I'm working on.

- I use Ext4 as my filesystem (default in next Fedora release).

- Let's say I've been working on my book for the last 14 months and I've
written about 400 pages on an ODF file.

- My usual workflow is that every time I finish a paragraph, say every 2-3
minutes, I hit Ctrl+S to save the changes.

- So one day, while I'm working on the book the following happens: I finish a
paragraph and hit Ctrl+S to save it. 5 seconds later the system freezes for
some reason. Let's suppose that in that 5-second window between pressing
Ctrl+S and the crash the data has not been written to disk (which happens
every 30 seconds). So as a result I:

A - Lose that last paragraph
B - Lose the whole book

If it's 'A', then that's ok, as expected. Bad luck. But if it's 'B', then I
think that's totally unexpected by any user, and totally unacceptable too.
Sure I want good performance and good battery life, but not at such cost.
(Yes, you can argue I should have a recent backup at home, and you'd be right,
but that doesn't change things fundamentally).

As far as I understand, with Ext3 (defaults), the behavior was A. Will this
change to B with Ext4 and all "modern" filesystems (XFS, Btrfs,...)?

Thanks for any answer.


2009-03-31 12:25:55

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> Hi,
>
> - I use Ext4 as my filesystem (default in next Fedora release).

Fedora will have the patches so that applications that do
replace-via-truncate (a bad idea, these applications are buggy, and
will lose data sometimes even with ext3), or replace-via-rename
without the fsync(), will force the blocks out to disk with the
commit.

> - Let's say I've been working on my book for the last 14 months and I've
> written about 400 pages on an ODF file.

OpenOffice, being a portable application that has to work on other
operating systems and filesystems (for example, Solaris's UFS),
does do open/write/fsync/close/rename. So you're safe if you're using
OpenOffice (and emacs, and vim).
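
A minimal sketch of the safe replace pattern described above, written in
Python for a POSIX system. This is not OpenOffice's actual code; the
function name atomic_replace and the ".tmp-" prefix are made up for
illustration. The fsync() has to happen while the temporary file is still
open, and the rename happens only after the new data is on disk, so a crash
should leave either the complete old file or the complete new one.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` via write-to-temp, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # Python buffer -> kernel page cache
            os.fsync(tmp.fileno())  # kernel page cache -> stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Assumption: POSIX. Syncing the directory makes the rename itself
    # durable; this step does not work on Windows.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file with owner-only permissions, so a real
editor would also copy the old file's mode and ownership before the rename.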

The replace-via-truncate and replace-via-rename workarounds are there
for the benefit of KDE, and GNOME, which in some configurations
apparently will replace hundreds of dot files when the desktop is
started up, for no reason that I can understand. (Not such a great
idea for SSD write endurance!) Some people apparently spend hours
making sure that their windows are exactly positioned the way they
want it when their desktop starts up, and if the system crashes while
their desktop is starting up, they could lose their window
positions, which apparently made a whole bunch of users cranky. In
practice, most of the editors that I'm familiar with have been around
for a while, have needed to make sure that cases such as yours
wouldn't result in data loss, and so are pretty good about using
fsync() so that users' files wouldn't be lost, no matter what the
filesystem or operating system being used.

The problem has been mostly with newer applications, especially the
newer desktop ones, which have been written to assume that they only
have to work safely on Linux and ext3. The replace-via-truncate and
replace-via-rename workarounds provide this safety for ext4.

Best regards,

- Ted

2009-03-31 12:52:43

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

On Tuesday 31 March 2009 14:25:40 Theodore Tso wrote:
> On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> > Hi,
> >
> > - I use Ext4 as my filesystem (default in next Fedora release).
>
> Fedora will have the patches so that applications that do
> replace-via-truncate (a bad idea, these applications are buggy, and
> will lose data sometimes even with ext3), or replace-via-rename
> without the fsync(), will force the blocks out to disk with the
> commit.
>
> > - Let's say I've been working on my book for the last 14 months and I've
> > written about 400 pages on an ODF file.
>
> Openoffice, being a portable application, that has to work on other
> operating systems and filesystems (for example, like Solaris's UFS),
> does do open/write/close/fsync/rename. So you're safe if you're using
> OpenOffice (and emacs, and vim).

Ah, good to know, that's quite a relief for normal users like me who were
getting lost with this discussion. But one other doubt:

You've proposed that in laptop mode, fsync's should be held until next write
cycle (say every 30 seconds) so that the disk is not spun up unnecessarily,
wasting battery and shortening its lifespan too. I absolutely agree with
this, and as a trade-off I'm ok with losing my last paragraph even if I did hit
Ctrl+S to save it a few seconds before a crash. But again, with Ext4 will I
just lose that last paragraph or the whole book in this case?

Thanks,
Alberto.

> Best regards,
>
> - Ted

2009-03-31 13:46:04

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
>
> You've proposed that in laptop mode, fsync's should be held until next write
> cycle (say every 30 seconds) so that the disk is not spun up unnecessarily,
> wasting battery and shortening it's lifespan too. I absolutely agree with
> this, and as a trade-off I'm ok with losing my last paragraph even if I did hit
> Ctrl+S to save it a few seconds before a crash. But again, with Ext4 will I
> just lose that last paragraph or the whole book in this case?

Laptop mode is already set up such that the moment the disk spins up,
any pending writes are immediately flushed to disk --- the idea being
that if the disk is spinning, we might as well take advantage of it to
get everything pushed out to disk. As long as we actually keep a
linked list of those fsync's which were "held up", and we make sure
all of the delayed allocation blocks are also allocated before we push
them out, the right thing will happen. If we just ignore the fsync's,
then we might not allocate the delayed allocation blocks. So
basically, we need to be careful about how we implement this addition
to laptop_mode.

Jeff Garzik has also pointed out that there are additional concerns
for databases which may have issued multiple fsync()'s while the disk
has been spun down, where we wouldn't want to mix writes between
fsync()'s. This basically boils down to how much protection do we
want to give for the case where the system crashes while the disk
blocks are being pushed out to disk. (Which isn't that farfetched;
consider the case where the laptop is very low on battery, and runs
out when the disk is woken up and crashes before all of the writes
could be processed.)
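
The "list of held-up fsyncs" idea can be pictured with a toy user-space
model. This is only an illustration, not kernel code and not the proposed
laptop_mode implementation; the class and method names (HeldFsyncQueue,
spin_up) are hypothetical. It shows just one property discussed above:
requests are remembered rather than ignored, and they are flushed in the
order they were issued, so writes belonging to a later fsync() never land
ahead of an earlier one. It does not model delayed block allocation or a
crash in the middle of the flush.

```python
from collections import deque


class HeldFsyncQueue:
    """Toy model of fsync requests held back while the disk is spun down."""

    def __init__(self):
        self.pending = deque()  # FIFO of (file name, data) fsync requests
        self.on_disk = []       # what has reached the disk, in write order

    def fsync(self, name, data, disk_spinning=False):
        # Record the request instead of dropping it.
        self.pending.append((name, data))
        if disk_spinning:
            # The disk is already awake, so nothing is gained by waiting.
            self.spin_up()

    def spin_up(self):
        # Flush strictly in issue order, oldest request first.
        while self.pending:
            self.on_disk.append(self.pending.popleft())


queue = HeldFsyncQueue()
queue.fsync("journal", b"txn 1")
queue.fsync("journal", b"txn 2")
queue.spin_up()  # the disk wakes up and both requests go out in order
```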

So there are some things that would be tricky in terms of implementing
this perfectly, and maybe we would disable the fsync suppression
machinery if the battery level is getting critical --- and then do
either a clean shutdown or a suspend-to-disk (although here too there
had better be enough juice in the battery to write all of memory to
your swap partition).

The bottom line is that it *can* be implemented safely, but there are
some things that we would need to pay attention to in order to make
sure it *was* safe.

- Ted

2009-03-31 14:46:00

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

On Tuesday 31 March 2009 15:45:47 Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
> > You've proposed that in laptop mode, fsync's should be held until next
> > write cycle (say every 30 seconds) so that the disk is not spun up
> > unnecessarily, wasting battery and shortening it's lifespan too. I
> > absolutely agree with this, and as a trade-off I'm ok with losing my last
> > paragraph even if I did hit Ctrl+S to save it a few seconds before a
> > crash. But again, with Ext4 will I just lose that last paragraph or the
> > whole book in this case?
>
> Laptop mode is already set up such that the moment the disk spins up,
> any pending writes are immediately flushed to disk --- the idea being
> that if the disk is spinning, we might as well take advantage of it to
> get everything pushed out to disk. As long as we actually keep a
> linked list of those fsync's which were "held up", and we make sure
> all of the delayed allocation blocks are also allocated before we push
> them out, the right thing will happen. If we just ignore the fsync's,
> then we might not allocate the delayed allocation blocks. So
> basically, we need to be careful about how we implement this addition
> to laptop_mode.
>
> Jeff Garzik has also pointed out that there are additional concerns
> for databases which may have issued multiple fsync()'s while the disk
> has been spun down, where we wouldn't want to mix writes between
> fsync()'s. This basically boils down to how much protection do we
> want to give for the case where the system crashes while the disk
> blocks are being pushed out to disk. (Which isn't that farfetched;
> consider the case where the laptop is very low on battery, and runs
> out when the disk is woken up and crashes before all of the writes
> could be processed.)
>
> So there are some things that would be tricky in terms of implementing
> this perfectly, and maybe we would disable the fsync suppression
> machinery if the battery level isgetting critical --- and then do
> either a clean shutdown or a suspend-to-disk (although here too there
> had better be enough juice in the battery to write all of memory to
> your swap partition).
>
> The bottom line is that it *can* be implemented safely, but there are
> some things that we would need to pay attention to in order to make
> sure it *was* safe.
>
> - Ted

I see. Thanks for the explanation. Right now, laptop-mode (if you use
laptop-mode-tools) is disabled when battery reaches critical level. Anyway,
regardless of corner cases, I think that what we "normal" users want is to
have the choice between:

A - Writing data to disk immediately and lose no work at all, but get worse
performance/battery life/HDD lifespan (this is what happens when an
application uses fsync, right?). Or
B - Delay writes for X seconds (30, 60, 120,...) and get better
performance/battery life/HDD lifespan, but risk losing X seconds of work.

What is not acceptable is having to choose between A and:

C - Delay writes for X seconds and get better performance/battery life/HDD
lifespan, but risk losing _all_ your work (instead of just the last X
seconds).

The problem I guess is that right now application writers targeting Ext4 must
choose between using fsync and giving users the 'A' behaviour or not using
fsync and giving them the 'C' behaviour. But what most users would like is
'B', I'm afraid (at least, it's what I want, I might be an exception).

Regards,
Alberto.

2009-03-31 22:03:09

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

On Tuesday 31 March 2009 15:45:47 Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
> > You've proposed that in laptop mode, fsync's should be held until next
> > write cycle (say every 30 seconds) so that the disk is not spun up
> > unnecessarily, wasting battery and shortening it's lifespan too. I
> > absolutely agree with this, and as a trade-off I'm ok with losing my last
> > paragraph even if I did hit Ctrl+S to save it a few seconds before a
> > crash. But again, with Ext4 will I just lose that last paragraph or the
> > whole book in this case?
>
> Laptop mode is already set up such that the moment the disk spins up,
> any pending writes are immediately flushed to disk --- the idea being
> that if the disk is spinning, we might as well take advantage of it to
> get everything pushed out to disk. As long as we actually keep a
> linked list of those fsync's which were "held up", and we make sure
> all of the delayed allocation blocks are also allocated before we push
> them out, the right thing will happen. If we just ignore the fsync's,
> then we might not allocate the delayed allocation blocks. So
> basically, we need to be careful about how we implement this addition
> to laptop_mode.

In fact, thinking about it, this option would be the ideal one for desktops
and especially laptops (servers running databases are a different thing). What
we need is that _no_ application uses fsync. The decision as to when the data
should be written to disk should be left to the filesystem. And then the user
can choose how often they want this to happen (every 5, 15, 30, 60...
seconds). So if Ext4 could have a "nofsync" mount option that would disable
fsync from applications (i.e, it wouldn't honor an fsync call), that would be
wonderful. But then of course we have to make sure that if the kernel crashes
(or there's a power-off, etc..), we will just lose the new data that hasn't
been written to disk, but the old data will still be there. So maybe this
could be achieved with mounting the filesystem with nofsync, nodelalloc?

> The bottom line is that it *can* be implemented safely, but there are
> some things that we would need to pay attention to in order to make
> sure it *was* safe.

If you could do this, many of us would be willing to buy you a beer :)

>
> - Ted

And of course, thanks for your patience with this issue. And sorry for all
you're having to take from us uninformed but somewhat worried users (I run Ext4
now, but added the nodelalloc option when all this started).

Alberto.

2009-03-31 23:37:31

by Andreas T.Auer

Subject: Re: Ext4 and the "30 second window of death"

On 01.04.2009 00:02 Alberto Gonzalez wrote:
> In fact, thinking about it, this option would be the ideal one for desktops
> and especially laptops (servers running databases are a different thing). What
> we need is that _no_ application uses fsync. The decision as to when the data
> should be written to disk should be left to the filesystem. And then the user
> can choose how often they want this to happen (every 5, 15, 30, 60...
> seconds). So if Ext4 could have a "nofsync" mount option that would disable
> fsync from applications (i.e, it wouldn't honor an fsync call), that would be
> wonderful. But then of course we have to make sure that if the kernel crashes
> (or there's a power-off, etc..), we will just lose the new data that hasn't
> been written to disk, but the old data will still be there. So maybe this
> could be achieved with mounting the filesystem with nofsync, nodelalloc?
>
>
You are always thinking about the few seconds/minutes of work you are going
to lose, but there are other situations, too.

E.g. your POP3 client receives a very important mail, saves it to disk,
uses fsync to make sure it is out and tells the server to delete it. If
you delay the fsync, you will have a long window in which the
mail can get lost instead of a minimal window. Or are there any POP3
clients that can synchronize the mail-polling with a spinning disk?

Some tasks are not very important and should not spin up the
disk, and some tasks had better do so. Which tasks should or should not
spin up the disk is the user's preference, but the
application developer has to decide globally whether or not to use
fsync(), and the filesystem can't distinguish the tasks at all,
except by whether it receives fsyncs or not.

So fine-tuning the system to the ideal disk-writing policy is really
problematic, especially given a lot of different people turning knobs:
- different filesystem developers using different methods and default
behaviors, which can be changed by distros and sys admins.
- different applications trying to use or not use fsync() and other
methods to get the best policies for any kind of fs. Or the developers
are incompetent enough to expect features from the filesystem which are
not always given, whether trained by ext3 data=ordered or trained by
reiserfs or just bare of any better fs knowledge.
- different users having different preferences on what data is how
important, but usually they can not change the fsync-policy of the
applications.

Andreas

2009-04-01 00:05:07

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Tue, Mar 31, 2009 at 04:45:28PM +0200, Alberto Gonzalez wrote:
>
> A - Writing data to disk immediately and lose no work at all, but get worse
> performance/battery life/HDD lifespan (this is what happens when an
> application uses fsync, right?).

People are stressing over the battery usage of spinning up the disk
when you write a file, but in practice, if you're writing an
OpenOffice file, you're probably only going to be typing ^S every 45
seconds? Every couple of minutes? So the fsync() caused by
OpenOffice saving out your 300 page Magnum Opus really isn't going to
make that big of a difference to your battery life --- whether it
happens right away when you hit ^S, or whether it happens some 30 or
120 seconds later, isn't really a big deal.

The problem comes when you have lots of applications open on the
desktop, and for some reason they all decide they need to be writing a
huge number of files every few seconds. That seems to be the concern
that people have with respect to waiting to batch spinning up the disk
in order to save power. So for example, if every time you get an
instant message via AIM or IRC, your Pidgin client wants to write the
message to a log file, should Pidgin try to fsync() that write? Right
now, if Pidgin doesn't call fsync(), with ext3, in practice your IM
will be written to disk after 5 seconds. With ext4, your IM might not
get written to disk until around 30 seconds. Since Pidgin isn't
replacing the log file, but rather appending to it, it's not a case of
losing the previous work, but rather simply not getting the latest
IM's pushed to stable storage as quickly.
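
To make the append case concrete, here is a small sketch of an append-only
log writer with an optional fsync(). It is not Pidgin's code; append_log
and its durable flag are hypothetical names. Without durable=True the line
sits in the page cache until the filesystem writes it back, so a crash can
lose the most recent lines but not the earlier contents of the log.

```python
import os


def append_log(path, line, durable=False):
    """Append one line to a log file, optionally forcing it to disk."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
        if durable:
            log.flush()             # Python buffer -> kernel page cache
            os.fsync(log.fileno())  # kernel page cache -> stable storage
```

Calling append_log("im.log", "hello") leaves the timing to the filesystem;
passing durable=True trades a possible disk spin-up for the guarantee that
the line has been handed to stable storage before the call returns.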

Quite frankly, the people who are complaining about "fsync() will burn
too much power" are really protesting way too much. How often,
really, should applications be replacing files? Apparently KDE
replaces hundreds of files in some configurations at desktop startup,
but most people seem to agree this is a bug.

Firefox wants to replace a large number of files (and in practice
writes 2.5 megabytes of data) each time you click on a link. (This is
not great for SSD write endurance; after browsing 400 links, you've
written over a gigabyte to your SSD.) But let's be realistic here; if
you're browsing the web, the power used by the web browser running Flash
animations, not to mention the power cost of the WiFi, is
probably at least as much as, if not more than, the cost of spinning up the
disk.

At least when I'm running on batteries, I keep the number of
applications down to a minimum, and regardless of whether we are
batching I/O's using laptop mode or not, it's *always* going to save
more power to not do file I/O at all than to do file I/O with some
kind of batching scheme. So the folks who are saying that they can't
afford to fsync() every single file for power reasons really are
making an excuse; the reality is that if they were really worried
about power consumption, they would be going out of their way to avoid
file writes unless it's really necessary. It's one thing if a user
wants to save their Open Office document; when the user wants to save
it, they should save it, and it should go to disk pretty fast --- how
much work the user is willing to risk should be based on how often the
user manually types ^S, or how the user configures their application
to do periodic auto-saves --- whether that's once a minute, or every 3
minutes, or every 5 minutes, or every 10 minutes.

But if there's some application which is replacing hundreds of files a
minute, then that's the real problem, whether they use fsync() or not.

Now, while I think the whole "we can't use fsync() for power reasons"
argument is an excuse, it's also true that we're not going to be able to
change all applications at the drop of a hat, and it may in fact be
impossible to fix all applications, perhaps for years to come. It is
for that reason that ext4 has the replace-via-truncate and
replace-via-rename workarounds. These currently start I/O as soon as
the file is closed (if it had been previously truncated), or renamed
(if it overwrites a target file). From a power perspective, it would
have been better to wait until the next commit boundary to initiate
the I/O (although doing it right away is better from an I/O smoothing
perspective and to reduce fsync latencies). But again, if the
application is replacing a huge number of files on a frequent basis,
that's what's going to suck the most power; batching to
allow the disk to spin down might save a little, but fundamentally the
application is doing something that's going to be a massive power
drain anyway.

> The problem I guess is that right now application writers targeting
> Ext4 must choose between using fsync and giving users the 'A'
> behaviour or not using fsync and giving them the 'C' behaviour. But
> what most users would like is 'B', I'm afraid (at least, it's what I
> want, I might be an exception).

So no, application programmers don't have to choose; if they do things
the broken (old) way, assuming ext3 semantics, users won't lose
existing files, thanks to the workaround patches. Those applications
will be unsafe for many other filesystems and operating systems, but
maybe those application writers don't care. Unfortunately, I confused
a lot of people by telling people they should use fsync(), instead of
saying, "that's OK, ext4 will take care of it for you", because I care
about application portability. But I implemented the application
workarounds *first* because I knew that it would take a long time for
people to fix their applications. Users will be protected either way.

If applications use fsync(), they really won't be using much in the
way of extra power, really! If they are replacing hundreds of files
in a very short time interval, and doing that all the time, then that's
going to burn power no matter what the filesystem tries to do.

Regards,

- Ted

2009-04-01 01:15:19

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

Ted,

I agree with all you've said and now I really think we're making way too much
fuss about a quite simple issue (we, stupid users).

On Wednesday 01 April 2009 02:04:47 Theodore Tso wrote:
> Quite frankly, the people who are complaining about "fsync() will burn
> too much problem" are really protesting way too much.

Yes, I guess you're right. Filesystem behaviour is not going to make that much
difference; it's user and application behaviour that will determine battery
life (plus hardware capabilities, obviously).

> Firefox wants to replace a large number of files (and in practice
> writes 2.5 megabytes of data) each time you click on a link. (This is
> not great for SSD write endurance; after browsing 400 links, you've
> written over a gigabyte to your SSD.)

Agreed. In fact I always thought that the ext3+fsync problem with Firefox was
mostly a myth. The fact is that Firefox 3 has some rather unrealistic settings
that cause an insane amount of I/O (disk, but also network I/O). I was using
an old computer with a very slow 40 GB, 5400 rpm IDE HD at the time Firefox 3
came out and had some problems. After going through all the options and choosing
reasonable settings the problems went away forever (but then I use Firefox
reasonably, not with a couple of hundred tabs open at the same time - no
filesystem can fix that).

> But let's be realistic here; if
> you're browsing the web, the power used by running flash animations by
> the web browser, not to mention the power costs of the WiFi is
> probably at least as much if not more than the cost of spinning up the
> disk.

Since I just tested this the other day, I'll post the numbers: With Flash
enabled, Konqueror visiting 3 pages, one of them with one small Flash ad, my
battery lasted for 184 minutes (for an average of 8.5 watts from my 26 Wh
battery). Without Flash, 205 minutes, an average of 7.6 watts (this is on an HP
mini netbook).
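
The averages above follow from dividing the battery capacity by the runtime
in hours. A short check of that arithmetic, assuming the full 26 Wh was
drained in each run (the function name average_watts is just for
illustration):

```python
def average_watts(capacity_wh, runtime_minutes):
    """Average power draw if the whole battery is used over the runtime."""
    return capacity_wh / (runtime_minutes / 60)


with_flash = average_watts(26, 184)     # about 8.48 W
without_flash = average_watts(26, 205)  # about 7.61 W
print(f"{with_flash:.1f} W with Flash, {without_flash:.1f} W without")
```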

Anyway, I agree with all the below too. Thanks again for the detailed
explanation.

Regards,
Alberto.

>
> At least when I'm running on batteries, I keep the number of
> applications down to a minimum, and regardless of whether we are
> batching I/O's using laptop mode or not, it's *always* going to save
> more power to not do file I/O at all than to do file I/O with some
> kind of batching scheme. So the folks who are saying that they can't
> afford to fsync() every single file for power reasons really are
> making an excuse; the reality is that if they were really worried
> about power consumption, they would be going out of their way to avoid
> file writes unless it's really necessary. It's one thing if a user
> wants to save their Open Office document; when the user wants to save
> it, they should save it, and it should go to disk pretty fast --- how
> much work the user is willing to risk should be based on how often the
> user manually types ^S, or how the user configures their application
> to do periodic auto-saves --- whether that's once a minute, or every 3
> minutes, or every 5 minutes, or every 10 minutes.
>
> But if there's some application which is replacing hundreds of files a
> minute, then that's the real problem, whether they use fsync() or not.
>
> Now, while I think the whole, "we can't use fsync() for power reasons
> is an excuse", it's also true that we're not going to be able to
> change all applications at a drop of a hat, and may in fact be
> impossible to fix all applications, perhaps for years to come. It is
> for that reason that ext4 has the replace-via-truncate and
> replace-via-rename workarounds. These currently start I/O as soon as
> the file is closed (if it had been previously truncated), or renamed
> (if it overwrites a target file). From a power perspective, it would
> have been better to wait until the next commit boundary to initiate
> the I/O (although doing it right away is better from an I/O smoothing
> perspective and to reduce fsync latencies). But again, if the
> application is replacing a huge number of files on a frequent basis,
> that's what's going to suck the most amount of power; batching to
> allow the disk to spin down might save a little, but fundamentally the
> application is doing something that's going to be a massive power
> drain anyway.
>
> > The problem I guess is that right now application writers targeting
> > Ext4 must choose between using fsync and giving users the 'A'
> > behaviour or not using fsync and giving them the 'C' behaviour. But
> > what most users would like is 'B', I'm afraid (at least, it's what I
> > want, I might be an exception).
>
> So no, application programmers don't have to choose; if they do things
> the broken (old) way, assuming ext3 semantics, users won't lose
> existing files, thanks to the workaround patches. Those applications
> will be unsafe for many other filesystems and operating systems, but
> maybe those application writers don't care. Unfortunately, I confused
> a lot of people by telling people they should use fsync(), instead of
> saying, "that's OK, ext4 will take care of it for you", because I care
> about application portability. But I implemented the application
> workarounds *first* because I knew that it would take a long time for
> people to fix their applications. Users will be protected either way.
>
> If applications use fsync(), they really won't be using much in the
> way of extra power, really! If they are replacing hundreds of files
> in a very short time interval, and doing that all the time, then that's
> going to burn power no matter what the filesystem tries to do.
>
> Regards,
>
> - Ted

2009-04-01 01:26:01

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

On Wednesday 01 April 2009 01:22:19 Andreas T.Auer wrote:
> On 01.04.2009 00:02 Alberto Gonzalez wrote:
> > In fact, thinking about it, this option would be the ideal one for
> > desktops and especially laptops (servers running databases are a
> > different thing). What we need is that _no_ application uses fsync. The
> > decision as to when the data should be written to disk should be left to
> > the filesystem. And then the user can choose how often they want this to
> > happen (every 5, 15, 30, 60... seconds). So if Ext4 could have a
> > "nofsync" mount option that would disable fsync from applications (i.e,
> > it wouldn't honor an fsync call), that would be wonderful. But then of
> > course we have to make sure that if the kernel crashes (or there's a
> > power-off, etc..), we will just lose the new data that hasn't been
> > written to disk, but the old data will still be there. So maybe this
> > could be achieved with mounting the filesystem with nofsync, nodelalloc?
>
> You are always thinking about the few seconds/minutes of work you gonna
> lose, but there are different situations, too.
>
> E.g. your POP3 client receives a very important mail, saves it to disk,
> uses fsync to make sure it is out and tells the server to delete it. If
> you are gonna delay the fsync, you will have a long window in which the
> mail can get lost instead of a minimum window. Or are there any POP3
> clients, which can synchronize the mail-polling with a spinning a disk?

Yes, I guess this is a clear example of data that needs to be written to disk
straight away.

>
> There are tasks that are not very important, that should not spin up the
> disk and there are tasks, that might better do so. It is the preference
> of the user, which tasks should or should not spin up the disk, but the
> application developer has to decide globally, whether or not to use
> fsync() and the filesystem can't even distinguish the tasks at all,
> except that it receives fsyncs or not.
>
> So fine-tuning the system to the ideal disk-writing policy is really
> problematic, especially given a lot of different people turning knobs:
> - different filesystem developers using different methods and default
> behaviors, which can be changed by distros and sys admins.
> - different applications trying to use or not use fsync() and other
> methods to get the best policies for any kind of fs. Or the developers
> are incompetent enough to expect features from the filesystem which are
> not always given, whether trained by ext3 data=ordered or trained by
> reiserfs or just bare of any better fs knowledge.
> - different users having different preferences on what data is how
> important, but usually they can not change the fsync-policy of the
> applications.

Yes, I agree with all the above. There's no magic recipe for any filesystem,
and honestly, I've never had problems with reiserfs in the past or ext3 later
on. I don't know why I got scared with all this "ext4 will give you
zero-length files on every crash unless all applications start to fsync like
crazy and kill your hard drive in a year's time" thing. Filesystem developers
must have a bit of knowledge about how this works to not do something too
stupid.

> Andreas

Alberto.

2009-04-01 01:50:30

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 01:22:19AM +0200, Andreas T.Auer wrote:
> You are always thinking about the few seconds/minutes of work you gonna
> lose, but there are different situations, too.
>
> E.g. your POP3 client receives a very important mail, saves it to disk,
> uses fsync to make sure it is out and tells the server to delete it. If
> you are gonna delay the fsync, you will have a long window in which the
> mail can get lost instead of a minimum window. Or are there any POP3
> clients, which can synchronize the mail-polling with a spinning a disk?

True, but consider --- this is a laptop we're talking about, right?
What if the laptop hard drive crashes after you accidentally drop your
laptop? Even if you're using an SSD, what if someone steals your
laptop? Your first mistake was using POP3. :-)

Personally, what I do is create a local *copy* of my IMAP mailbox, and
I delete messages on the local copy of the mail spool --- and then
periodically I run a program called "mbsync"
(http://isync.sourceforge.net) to propagate deletes back to the IMAP
server, and download new mail to my local Maildir copy of my mail spool.

But still, you're right. In some cases, you really want "fsync()" to
mean "fsync()". I'm not sure how often such applications _should_ be
running on a laptop which is prone to being dropped and/or stolen.
This would have to be something that a user chooses to do on their
system, and they would have to take into account whether they are
running some workloads that really can't tolerate data loss or not.

If all they are doing is browsing the web, and the issue is firefox's
desire to constantly write to their home directory, the user should be
able to say, "you know, my battery life is more important than making
sure that every last web page I visit is saved away in some file ---
Firefox's 'Awesome Bar' really isn't worth that much to me."

Of course, there is the question whether most users will be able to
understand the risks of doing things like using POP3 and fetchmail as
described in your scenario above. And that's a valid question --- so
it's worth asking whether suppressing fsync()'s really saves enough
power to be worth it, as opposed to say, fixing applications that are
write-happy, or choosing not to use applications which are write-happy
when you are running on battery.

- Ted

2009-04-01 05:21:22

by Sitsofe Wheeler

Subject: Re: Ext4 and the "30 second window of death"

On Tue, Mar 31, 2009 at 09:50:10PM -0400, Theodore Tso wrote:
>
> But still, you're right. In some cases, you really want "fsync()" to
> mean "fsync()". I'm not sure how often such applications _should_ be

Hmm. This is starting to sound a lot like the OSX fsync (
http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fsync.2.html
) where there is effectively a "fsync harder" syscall
(the F_FULLFSYNC fcntl).
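
For illustration, a minimal Python sketch of what an "fsync harder" call
could look like from user space. It assumes the fcntl module exposes
F_FULLFSYNC on the running platform (that constant is only defined on
macOS) and falls back to plain fsync() everywhere else. flush_to_media is
a hypothetical helper name, not an existing API.

```python
import os

try:
    import fcntl  # not available on every platform
except ImportError:
    fcntl = None


def flush_to_media(fd):
    """Ask for the strongest flush the platform offers for this descriptor.

    Returns the name of the mechanism that was used.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None) if fcntl else None
    if full is not None:
        try:
            # macOS: also ask the drive to empty its own write cache.
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems reject the request; fall back below.
            pass
    os.fsync(fd)
    return "fsync"
```

How much more durable the first path is than the second depends on the
drive and the filesystem, so the return value is only a hint about which
call was issued.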

> If all they are doing is browsing the web, and the issue is firefox's
> desire to constantly write to their home directory, the user should be
> able to say, "you know, my battery life is more important that making
> sure that every last web page I visit is saved away in some file ---
> Firefox's 'Awesome Bar' really isn't worth that much to me."

The "Awesome(bar) Firefox 3 fsync Problem" isn't that you are missing a
day's worth of browsing. The issue is that the sqlite database might
become corrupt and lose _all history_ if fsync lies/doesn't happen and a
crash occurs ( https://bugzilla.mozilla.org/show_bug.cgi?id=435712#c10).
With Firefox 2 there was a file swap happening so an fsync wasn't vital.
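
A "file swap" of this kind is commonly done as write-to-temporary-then-rename.
The following is a hedged sketch under POSIX assumptions, not Firefox's
actual code; atomic_write and its durable flag are illustrative names.
Whether the durable=False variant survives a crash depends on the
filesystem writing the data blocks before the rename, which is the very
behaviour this thread is arguing about.

```python
import os
import tempfile


def atomic_write(path, data, durable=True):
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the new contents out before the rename.
                os.fsync(tmp.fileno())
        # rename() atomically swaps the directory entry on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if durable:
        # Flush the directory so the rename itself is on disk as well.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

With durable=False the disk is never forced to spin up, at the price of
relying on the filesystem's write ordering.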

Just out of curiosity, when laptop mode is happening is there a
guarantee that writes to other files won't be reordered to before the
fsync?

--
Sitsofe | http://sucs.org/~sits/

2009-04-01 08:51:53

by Andreas T.Auer

Subject: Re: Ext4 and the "30 second window of death"



On 01.04.2009 03:50 Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 01:22:19AM +0200, Andreas T.Auer wrote:
>
>> E.g. your POP3 client receives a very important mail, saves it to disk,
>> uses fsync to make sure it is out and tells the server to delete it. If
>> you are gonna delay the fsync, you will have a long window in which the
>> mail can get lost instead of a minimum window. Or are there any POP3
>> clients, which can synchronize the mail-polling with a spinning a disk?
>>
>
> True, but consider --- this is a laptop we're talking about, right?
> What if the laptop hard drive crashes after you accidentally drop your
> laptop. Even if you're using an SSD, what if someone steals your
> laptop.
Well, there is always a worst case, but I had quite a lot of system crashes
with unstable versions without dropping the laptop once.

> Your first mistake was using POP3. :-)
>

I agree. :-) I am using IMAP, but a lot of people have only their
POP3 account on their only laptop.

> If all they are doing is browsing the web, and the issue is firefox's
> desire to constantly write to their home directory, the user should be
> able to say, "you know, my battery life is more important that making
> sure that every last web page I visit is saved away in some file ---
> Firefox's 'Awesome Bar' really isn't worth that much to me."
>

AFAIK FF in particular doesn't use fsync that often anymore by default, and
the user has to know about the now-hidden config entry
toolkit.storage.synchronous to raise the fsync level. But there are
surely enough applications that use fsync too much, and enough
applications that don't use it often enough.

> Of course, there is the question whether most users will be able to
> understand the risks of doing things like using POP3 and fetchmail as
> described in your scenario above. And that's a valid question --- so
> it's worth asking whether suppressing fsync()'s really saves enough
> power to be worth it, as opposed to say, fixing applications that are
> write-happy, or choosing not to use applications which are write-happy
> when you are running on battery.
>
>
Surely a lot of users don't understand all the risks or downsides of any
write-out policy. But there are users who do understand. For those it
would be good if they could define the policies for fsynced and
non-fsynced writes on a per-application basis (with a global default). E.g.:
The POP3 client should write synchronously with fsync, but can wait for two
minutes for non-fsynced data. Firefox should have these values and
OpenOffice those values, etc.
But I guess the implementation effort is too high.

Andreas

2009-04-01 15:12:40

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote:

> Just out of curiosity, when laptop mode is happening is there a
> guarantee that writes to other files won't be reordered to before the
> fsync?

laptop-mode does two things - tweak the dirty page semantics slightly
(not in an interestingly relevant way) and call sys_sync() a few seconds
after something hits disk rather than cache. In contrast to Ted's
suggestion that laptop-mode reduces data integrity, it actually enhances
it by opportunistically ensuring that data hits disk. It's the
lengthening of the commit intervals that usually accompanies it that
increases the risk of data loss.
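
To check how a given machine is tuned, the relevant knobs can be read from
procfs. A small sketch, assuming a Linux system that exposes laptop_mode,
dirty_writeback_centisecs and dirty_expire_centisecs under /proc/sys/vm;
read_vm_knobs is a hypothetical helper that reports None for any knob it
cannot read. Filesystem journal commit intervals are set elsewhere (as
mount options) and are not covered here.

```python
import os

# Writeback-related sysctls; the names are assumed to match the kernel's.
KNOBS = ("laptop_mode", "dirty_writeback_centisecs", "dirty_expire_centisecs")


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob name: integer value, or None if missing or unparsable}."""
    values = {}
    for name in KNOBS:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None
    return values
```

Calling read_vm_knobs() with no arguments reads the live system; the base
parameter exists only so the function can be pointed at a test directory.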

--
Matthew Garrett | [email protected]

2009-04-01 17:35:44

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote:
> On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote:
>
> > Just out of curiosity, when laptop mode is happening is there a
> > guarantee that writes to other files won't be reordered to before the
> > fsync?
>
> laptop-mode does two things - tweak the dirty page semantics slightly
> (not in an interestingly relevant way) and call sys_sync() a few seconds
> after something hits disk rather than cache. In contrast to Ted's
> suggestion that laptop-mode reduces data integrity, it actually enhances
> it by opportunistically ensuring that data hits disk. It's the
> lengthening of the commit intervals that usually accompanies it that
> increases the risk of data loss.

It *can* reduce data integrity; it really depends on how it's tuned
and what scenario you're talking about. To the extent that it uses
sys_sync(), it could help in some cases as well, since filesystems
that do delayed allocation will wake up when the commit interval
fires, and then force out all writes to the disk, yes. But before the
commit interval, there is an increased risk of data loss --- which the
user requested.

The other subtlety comes if we add fsync() suppression to laptop mode
--- which is something that Bart Samwel is very interested in doing
and I talked to him at FOSDEM about this. As Jeff Garzik recently
pointed out, however, if we let the system reorder writes across
fsync() boundaries, or if we combine two writes to the same block
separated by an fsync(), and the system crashes in the middle of
pushing all of these blocks out to the disk, we can end up trashing
the consistency guarantees of a database such as mysql or postgres.
It's a good point, but it only applies if we add fsync() suppression
to laptop mode --- which we haven't done yet.
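
The reordering hazard can be made concrete with a toy model. This is only
a sketch under simplifying assumptions: each write touches a distinct
block, writes between two fsync() calls may reach the disk in any order,
and a crash can happen after any single block write. The names
crash_states, "data" and "commit" are invented for the example and do not
describe how any real database lays out its files.

```python
from itertools import permutations


def crash_states(ops, honor_fsync):
    """Return every on-disk state a crash could leave behind.

    `ops` is a list of (block, value) writes and "fsync" markers. With
    honor_fsync=False the markers are ignored, so all writes form one
    group that may be flushed in any order.
    """
    epochs = [[]]
    for op in ops:
        if op == "fsync":
            if honor_fsync:
                epochs.append([])
        else:
            epochs[-1].append(op)

    states = set()
    durable = {}  # everything from fully flushed epochs
    for epoch in epochs:
        for order in permutations(epoch):
            disk = dict(durable)
            states.add(frozenset(disk.items()))
            for block, value in order:
                disk[block] = value
                states.add(frozenset(disk.items()))
        durable.update(epoch)
    return states


ops = [("data", "new row"), "fsync", ("commit", "txn 1")]
torn = frozenset({("commit", "txn 1")})  # commit record without its data
```

In this model the torn state is reachable only when the fsync() marker is
ignored, which is the kind of broken guarantee the paragraph above
describes.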

- Ted

2009-04-01 17:43:54

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 01:35:21PM -0400, Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote:
> > On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote:
> >
> > > Just out of curiosity, when laptop mode is happening is there a
> > > guarantee that writes to other files won't be reordered to before the
> > > fsync?
> >
> > laptop-mode does two things - tweak the dirty page semantics slightly
> > (not in an interestingly relevant way) and call sys_sync() a few seconds
> > after something hits disk rather than cache. In contrast to Ted's
> > suggestion that laptop-mode reduces data integrity, it actually enhances
> > it by opportunistically ensuring that data hits disk. It's the
> > lengthening of the commit intervals that usually accompanies it that
> > increases the risk of data loss.
>
> It *can* reduce data integrity; it really depends on how it's tuned
> and what scenario you're talking about. To the extent that it uses
> sys_sync(), it could help in some cases as well, since filesystems
> that do delayed allocation will wake up when the commit interval
> fires, and then force out all writes to the disk, yes. But before the
> commit interval, there is an increased risk of data loss --- which the
> user requested.

Not from laptop-mode. Let's separate the functionality from the typical
use case.

> The other subtlety comes if we add fsync() suppression to laptop mode
> --- which is something that Bart Samwel is very interested in doing
> and I talked to him at FOSDEM about this. As Jeff Garzik recently
> pointed out, however, if we let the system reorder writes across
> fsync() boundaries, or if we combine two writes to the same block
> separated by an fsync(), and the system crashes in the middle of
> pushing all of these blocks out to the disk, we can end up trashing
> the consistency guarantees of a database such as mysql or postgres.
> It's a good point, but it only applies if we add fsync() suppression
> to laptop mode --- which we haven't done yet.

I've got absolutely no idea why anyone would want fsync() to stop
meaning "Put my data on the disk please". laptop-mode isn't intended to
reduce data integrity - it's intended to batch disk write-outs such that
there's a lower risk of needing to perform further write-outs in future.
It makes sense for applications which really desperately want
information on disk to fsync() (for instance, saving a file in
OpenOffice).

laptop-mode is something that makes sense as a default behaviour under a
lot of circumstances. Adding fsync() suppression means it's utterly
impossible to use it in that way. An additional mode would be perfectly
reasonable, as long as it's made clear that it's really a request for
data to be discarded at some point. The current mode isn't.

--
Matthew Garrett | [email protected]

2009-04-01 21:22:22

by Ray Lee

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 1, 2009 at 10:43 AM, Matthew Garrett <[email protected]> wrote:
> I've got absolutely no idea why anyone would want fsync() to stop
> meaning "Put my data on the disk please".

Some guy named Andrew used to run a kernel with 'return 0' at the top
of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88

It's that the latency penalty of apps using *sync() on common hardware
sucks. That's all, and finding a way to fix that would make this
entire thread go away, I think.

2009-04-01 21:26:28

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 02:21:43PM -0700, Ray Lee wrote:
> On Wed, Apr 1, 2009 at 10:43 AM, Matthew Garrett <[email protected]> wrote:
> > I've got absolutely no idea why anyone would want fsync() to stop
> > meaning "Put my data on the disk please".
>
> Some guy named Andrew used to run a kernel with 'return 0' at the top
> of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88
>
> It's that the latency penalty of apps using *sync() on common hardware
> sucks. That's all, and finding a way to fix that would make this
> entire thread go away, I think.

And also disk spinups.
--
Matthew Garrett | [email protected]

2009-04-02 11:25:29

by Sitsofe Wheeler

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 02:21:43PM -0700, Ray Lee wrote:
>
> Some guy named Andrew used to run a kernel with 'return 0' at the top
> of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88

(Quoting out of context from Andrew's mail)

"hm, fsync.

Aside: why the heck do applications think that their data is so
important that they need to fsync it all the time."

So the advice/complaint is that apps shouldn't fsync unless absolutely
necessary because syncing will always be slow?

--
Sitsofe | http://sucs.org/~sits/

2009-04-02 11:37:50

by Sitsofe Wheeler

Subject: Re: Ext4 and the "30 second window of death"

On Wed, Apr 01, 2009 at 01:35:21PM -0400, Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote:
> > On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote:
> >
> > > Just out of curiosity, when laptop mode is happening is there a
> > > guarantee that writes to other files won't be reordered to before the
> > > fsync?
> >
> > laptop-mode does two things - tweak the dirty page semantics slightly
> > (not in an interestingly relevant way) and call sys_sync() a few seconds
> > after something hits disk rather than cache. In contrast to Ted's
> > suggestion that laptop-mode reduces data integrity, it actually enhances
> > it by opportunistically ensuring that data hits disk. It's the
> > lengthening of the commit intervals that usually accompanies it that
> > increases the risk of data loss.
>
> It *can* reduce data integrity; it really depends on how it's tuned
> and what scenario you're talking about. To the extent that it uses
> sys_sync(), it could help in some cases as well, since filesystems
> that do delayed allocation will wake up when the commit interval
> fires, and then force out all writes to the disk, yes. But before the
> commit interval, there is an increased risk of data loss --- which the
> user requested.

That's fair enough and always seemed to be part of the bargain (let the
disk spin down for longer but risk losing 30 seconds of non-synced
recent data in a crash). The result shouldn't be corruption though.

> The other subtlety comes if we add fsync() suppression to laptop mode
> --- which is something that Bart Samwel is very interested in doing
> and I talked to him at FOSDEM about this. As Jeff Garzik recently
> pointed out, however, if we let the system reorder writes across
> fsync() boundaries, or if we combine two writes to the same block
> separated by an fsync(), and the system crashes in the middle of
> pushing all of these blocks out to the disk, we can end up trashing
> the consistency guarantees of a database such as mysql or postgres.
> It's a good point, but it only applies if we add fsync() suppression
> to laptop mode --- which we haven't done yet.

eek.

If this goes in it needs to come with scary warnings so a distro doesn't
enable it by default (think of all those sqlite databases that are
springing up). I know my system is crummy, that all of this only matters
if the system crashes uncontrollably (which it shouldn't do), and that I
don't do things that would make it safer (like mounting with sync) because
I like the speed, but there's a risk limit. I don't want my chances of
corruption (as opposed to "just" loss of recent non-synced data) to
be too high...

--
Sitsofe | http://sucs.org/~sits/

2009-04-02 18:23:25

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Wed, 1 Apr 2009, Matthew Garrett wrote:

>> The other subtlety comes if we add fsync() suppression to laptop mode
>> --- which is something that Bart Samwel is very interested in doing
>> and I talked to him at FOSDEM about this. As Jeff Garzik recently
>> pointed out, however, if we let the system reorder writes across
>> fsync() boundaries, or if we combine two writes to the same block
>> separated by an fsync(), and the system crashes in the middle of
>> pushing all of these blocks out to the disk, we can end up trashing
>> the consistency guarantees of a database such as mysql or postgres.
>> It's a good point, but it only applies if we add fsync() suppression
>> to laptop mode --- which we haven't done yet.
>
> I've got absolutely no idea why anyone would want fsync() to stop
> meaning "Put my data on the disk please". laptop-mode isn't intended to
> reduce data integrity - it's intended to batch disk write-outs such that
> there's a lower risk of needing to perform further write-outs in future.
> It makes sense for applications which really desperately want
> information on disk to fsync() (for instance, saving a file in
> OpenOffice).
>
> laptop-mode is something that makes sense as a default behaviour under a
> lot of circumstances. Adding fsync() suppression means it's utterly
> impossible to use it in that way. An additional mode would be perfectly
> reasonable, as long as it's made clear that it's really a request for
> data to be discarded at some point. The current mode isn't.

this issue seems pretty straightforward to me

the apps do fsync (and similar) to the degree that they think their data
is important (potentially with config options if they acknowledge that
their data isn't _always_ that important)

the system allows the admin to override the application and say "I'm
willing to lose up to X seconds of data for other benefits"

if this can work cleanly (with the ordering issue that was identified,
which may involve having multiple versions of the metadata cached) it
seems like a very clean interface.

David Lang

2009-04-02 18:30:10

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 11:22:48AM -0700, [email protected] wrote:
> On Wed, 1 Apr 2009, Matthew Garrett wrote:
> >laptop-mode is something that makes sense as a default behaviour under a
> >lot of circumstances. Adding fsync() suppression means it's utterly
> >impossible to use it in that way. An additional mode would be perfectly
> >reasonable, as long as it's made clear that it's really a request for
> >data to be discarded at some point. The current mode isn't.
>
> this issue seems pretty straightforward to me
>
> the apps do fsync (and similar) to the degree that they think their data
> is important (potentially with config options if they acknowlege that
> their data isn't _always_ that important)
>
> the system allows the admin to override the application and say "I'm
> willing to loose up to X seconds of data for other benifits"
>
> if this can work cleanly (with the ordering issue that was identified,
> which may involve having multiple versions of the metadata cached) it
> seems like a very clean interface.

It does, but it's a different interface to the current one with a
different aim and a different set of tradeoffs. The current behaviour of
laptop-mode is that fsync() results in things hitting disk. The only
configurability of laptop-mode is how long it then waits to flush out
everything else as well.

The solution to "fsync() causes disk spinups" isn't "ignore fsync()".
It's "ensure that applications only use fsync() when they really need
it", which requires us to also be able to say "fsync() should not be
required to ensure that events occur in order".

--
Matthew Garrett | [email protected]

2009-04-02 18:35:29

by Nick Piggin

Subject: Re: Ext4 and the "30 second window of death"

On Friday 03 April 2009 05:22:48 [email protected] wrote:
> On Wed, 1 Apr 2009, Matthew Garrett wrote:
>
> >> The other subtlety comes if we add fsync() suppression to laptop mode
> >> --- which is something that Bart Samwel is very interested in doing
> >> and I talked to him at FOSDEM about this. As Jeff Garzik recently
> >> pointed out, however, if we let the system reorder writes across
> >> fsync() boundaries, or if we combine two writes to the same block
> >> separated by an fsync(), and the system crashes in the middle of
> >> pushing all of these blocks out to the disk, we can end up trashing
> >> the consistency guarantees of a database such as mysql or postgres.
> >> It's a good point, but it only applies if we add fsync() suppression
> >> to laptop mode --- which we haven't done yet.
> >
> > I've got absolutely no idea why anyone would want fsync() to stop
> > meaning "Put my data on the disk please". laptop-mode isn't intended to
> > reduce data integrity - it's intended to batch disk write-outs such that
> > there's a lower risk of needing to perform further write-outs in future.
> > It makes sense for applications which really desperately want
> > information on disk to fsync() (for instance, saving a file in
> > OpenOffice).
> >
> > laptop-mode is something that makes sense as a default behaviour under a
> > lot of circumstances. Adding fsync() suppression means it's utterly
> > impossible to use it in that way. An additional mode would be perfectly
> > reasonable, as long as it's made clear that it's really a request for
> > data to be discarded at some point. The current mode isn't.
>
> this issue seems pretty straightforward to me
>
> the apps do fsync (and similar) to the degree that they think their data
> is important (potentially with config options if they acknowlege that
> their data isn't _always_ that important)
>
> the system allows the admin to override the application and say "I'm
> willing to loose up to X seconds of data for other benifits"
>
> if this can work cleanly (with the ordering issue that was identified,
> which may involve having multiple versions of the metadata cached) it
> seems like a very clean interface.

It isn't just about ordering of writes to a filesystem. A database program
commits a transaction and then tells the client that it is safe. Client
then goes and does <something> in response to that, which may or may not
involve more writes to the filesystem.

Shouldn't applications have a mode to avoid spinning up the disk if it is
so important?

2009-04-02 18:38:51

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote:

> Shouldn't applications have a mode to avoid spinning up the disk if it is
> so important?

They do. It's called "Don't use fsync() unless your data needs to be on
disk". I'm not sure why you'd ever want an application to be in anything
but this mode.

--
Matthew Garrett | [email protected]

2009-04-02 18:44:48

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Thu, 2 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 11:22:48AM -0700, [email protected] wrote:
>> On Wed, 1 Apr 2009, Matthew Garrett wrote:
>>> laptop-mode is something that makes sense as a default behaviour under a
>>> lot of circumstances. Adding fsync() suppression means it's utterly
>>> impossible to use it in that way. An additional mode would be perfectly
>>> reasonable, as long as it's made clear that it's really a request for
>>> data to be discarded at some point. The current mode isn't.
>>
>> this issue seems pretty straightforward to me
>>
>> the apps do fsync (and similar) to the degree that they think their data
>> is important (potentially with config options if they acknowlege that
>> their data isn't _always_ that important)
>>
>> the system allows the admin to override the application and say "I'm
>> willing to loose up to X seconds of data for other benifits"
>>
>> if this can work cleanly (with the ordering issue that was identified,
>> which may involve having multiple versions of the metadata cached) it
>> seems like a very clean interface.
>
> It does, but it's a different interface to the current one with a
> different aim and a different set of tradeoffs. The current behaviour of
> laptop-mode is that fsync() results in things hitting disk. The only
> configurability of laptop-mode is how long it then waits to flush out
> everything else as well.
>
> The solution to "fsync() causes disk spinups" isn't "ignore fsync()".
> It's "ensure that applications only use fsync() when they really need
> it", which requires us to also be able to say "fsync() should not be
> required to ensure that events occur in order".

ignore the issue of order on the local disk for the moment.

what should an application do to make sure its data isn't lost?

let's not talk about a database here, let's talk about something simpler, like a POP3
mail client (even though I strongly favor IMAP ;-)

it wants to have the message saved before it deletes it from the server.

how should it try to do this?

the only portable method is to fsync the file after it's written and
before sending the delete to the server.

so your mail client _should_ issue fsync calls.
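
a minimal sketch of that ordering, assuming a POSIX system and a spool
directory with one file per message. store_then_delete and the
delete_on_server callback are hypothetical names standing in for whatever
the mail client really uses to talk to the server.

```python
import os


def store_then_delete(spool_dir, name, message, delete_on_server):
    """Save one message durably, and only then ask the server to drop it."""
    path = os.path.join(spool_dir, name)
    # "x" mode refuses to overwrite an existing message file.
    with open(path, "xb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())  # message contents on disk
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # new directory entry on disk
    finally:
        os.close(dir_fd)
    # Nothing is deleted remotely until both flushes have returned.
    delete_on_server(name)
    return path
```

if either fsync() is suppressed or delayed by the system, the delete can
reach the server while the only local copy is still in the page cache.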

however, some (many, most??) users would probably be willing to lose a
little e-mail to gain a significant increase in battery life on their
laptops.

today they have no choice (other than picking a mail client that doesn't
try to protect its local data)

with the proposed addition to laptop mode (delaying fsync until the disk
is awake), the user (or more precisely the admin) gains the ability to
define this trade-off rather than depending on the application developers
all doing this right.

without this, we end up in a situation like the powertop wakeups. it only
takes one 'buggy' application to destroy your power management and
performance. but in this case, the application that is 'buggy' from a
power management point of view may be entirely correct from a data safety
point of view.

David Lang

2009-04-02 18:57:00

by Nick Piggin

Subject: Re: Ext4 and the "30 second window of death"

On Friday 03 April 2009 05:38:34 Matthew Garrett wrote:
> On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote:
>
> > Shouldn't applications have a mode to avoid spinning up the disk if it is
> > so important?
>
> They do. It's called "Don't use fsync() unless your data needs to be on
> disk". I'm not sure why you'd ever want an application to be in anything
> but this mode.
>

Well you might decide you are willing to sacrifice timely storage of
logs, or reduce backups in your editor or something. But obviously
the kernel can't decide which of those fsyncs is safe to omit (or
turn into a barrier) while staying within the advertised semantics of
the app. The application obviously can.
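
A sketch of what such an application-level decision might look like. The
names append_record, critical and power_saving are invented for this
example, and how the application learns that it is in a power-saving
situation is left open.

```python
import os


def append_record(f, record, critical, power_saving, sync=os.fsync):
    """Append `record` to the open text file `f`; fsync only when warranted.

    Returns True if a sync was issued, False if it was skipped.
    """
    f.write(record)
    f.flush()  # hand the data to the kernel in either case
    if critical or not power_saving:
        sync(f.fileno())
        return True
    # Non-critical data while saving power: leave it to normal writeback.
    return False
```

Only the application knows which records are critical, so the skip happens
here rather than in the kernel; the sync parameter exists so the policy
can be exercised without touching a real disk.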

2009-04-02 20:08:20

by Ray Lee

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 2, 2009 at 11:44 AM, <[email protected]> wrote:
> let's not talk a database here, let's talk something simpler, like a POP3
> mail client (even though I strongly favor IMAP ;-)
>
> it wants to have the message saved before it deletes it from the server.
>
> how should it try to do this?
>
> the only portable method is to fsync the file after it's written and before
> sending the delete to the server.
>
> so your mail client _should_ issue fsync calls.

That's just not the case. Every POP fetcher I've seen offers an option
to leave seen messages on the server for some period measured in days.
Setting it to one day means that the data will eventually get flushed
by the time the message is deleted.

So, no, the mail client does not have to issue fsync()s at all. (If
dirty data can hang around unwritten for 24 hours, I'd argue that's a
misfeature of the filesystem or kernel.)

Alternately, a client could fetch once every half hour at which point
the cost of an fsync is amortized over all the fetched messages.
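
The batching idea can be sketched as follows, assuming a POSIX system
where fsync() is accepted on a read-only descriptor (true on Linux).
fetch_batch and delete_on_server are hypothetical names. Every file is
still synced, but all the syncs happen in one burst, so the disk only
needs to be awake once per polling interval.

```python
import os


def fetch_batch(spool_dir, messages, delete_on_server):
    """Store a dict of {name: bytes}, flush once, then delete remotely."""
    paths = []
    for name, body in messages.items():
        path = os.path.join(spool_dir, name)
        with open(path, "xb") as f:
            f.write(body)  # buffered only; no disk access forced yet
        paths.append(path)

    # One burst of flushing for the whole batch.
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Remote deletes start only after everything local is flushed.
    for name in messages:
        delete_on_server(name)
    return len(paths)
```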

2009-04-02 20:59:59

by Andreas T.Auer

Subject: Re: Ext4 and the "30 second window of death"



On 02.04.2009 22:07 Ray Lee wrote:
> On Thu, Apr 2, 2009 at 11:44 AM, <[email protected]> wrote:
>> let's not talk a database here, let's talk something simpler, like a POP3
>> mail client (even though I strongly favor IMAP ;-)
>>
>> it wants to have the message saved before it deletes it from the server.
>>
>> how should it try to do this?
>>
>> the only portable method is to fsync the file after it's written and before
>> sending the delete to the server.
>>
>> so your mail client _should_ issue fsync calls.
>
> That's just not the case. Every POP fetcher I've seen offers an option
> to leave seen messages on the server for some period measured in days.
> Setting it to one day means that the data will eventually get flushed
> by the time the message is deleted.

Yes, but a lot of users (and I assume >90% of POP3 users) don't use this
option.

> So, no, the mail client does not have to issue fsync()s at all.

Except when operating in immediate-delete mode.

> Alternately, a client could fetch once every half hour at which point
> the cost of an fsync is amortized over all the fetched messages.

Again this is forcing a policy on how users should configure their clients.

And don't forget: POP3 was just an example. There can be a lot of other
applications as well. E.g. what about an application for the reception
of SMS or other mobile text messages? This is pushed to the client, not
polled as with POP3 AFAIK.

Andreas

2009-04-02 21:47:38

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Nick Piggin wrote:

> On Friday 03 April 2009 05:22:48 [email protected] wrote:
>> On Wed, 1 Apr 2009, Matthew Garrett wrote:
>>
>>>> The other subtlety comes if we add fsync() suppression to laptop mode
>>>> --- which is something that Bart Samwel is very interested in doing
>>>> and I talked to him at FOSDEM about this. As Jeff Garzik recently
>>>> pointed out, however, if we let the system reorder writes across
>>>> fsync() boundaries, or if we combine two writes to the same block
>>>> separated by an fsync(), and the system crashes in the middle of
>>>> pushing all of these blocks out to the disk, we can end up trashing
>>>> the consistency guarantees of a database such as mysql or postgres.
>>>> It's a good point, but it only applies if we add fsync() suppression
>>>> to laptop mode --- which we haven't done yet.
>>>
>>> I've got absolutely no idea why anyone would want fsync() to stop
>>> meaning "Put my data on the disk please". laptop-mode isn't intended to
>>> reduce data integrity - it's intended to batch disk write-outs such that
>>> there's a lower risk of needing to perform further write-outs in future.
>>> It makes sense for applications which really desperately want
>>> information on disk to fsync() (for instance, saving a file in
>>> OpenOffice).
>>>
>>> laptop-mode is something that makes sense as a default behaviour under a
>>> lot of circumstances. Adding fsync() suppression means it's utterly
>>> impossible to use it in that way. An additional mode would be perfectly
>>> reasonable, as long as it's made clear that it's really a request for
>>> data to be discarded at some point. The current mode isn't.
>>
>> this issue seems pretty straightforward to me
>>
>> the apps do fsync (and similar) to the degree that they think their data
>> is important (potentially with config options if they acknowledge that
>> their data isn't _always_ that important)
>>
>> the system allows the admin to override the application and say "I'm
>> willing to lose up to X seconds of data for other benefits"
>>
>> if this can work cleanly (with the ordering issue that was identified,
>> which may involve having multiple versions of the metadata cached) it
>> seems like a very clean interface.
>
> It isn't just about ordering of writes to a filesystem. A database program
> commits a transaction and then tells the client that it is safe. Client
> then goes and does <something> in response to that, which may or may not
> involve more writes to the filesystem.
>
> Shouldn't applications have a mode to avoid spinning up the disk if it is
> so important?

why should every application have to have a "I'm mobile" config option?

what about a user that's only mobile sometimes and wants full protection
the rest of the time? how can they easily switch every application between
'keep the data as safe as you can' and 'save battery' modes? will you have
to restart all the apps when you unplug power to switch their modes?

allowing the user to tell the system to override the applications when the
user wants to is _much_ easier.

David Lang

2009-04-02 22:37:54

by Bron Gondwana

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 11:44:20AM -0700, [email protected] wrote:
> let's not talk a database here, let's talk something simpler, like a POP3
> mail client (even though I strongly favor IMAP ;-)
>
> it wants to have the message saved before it deletes it from the server.
>
> how should it try to do this?
>
> the only portable method is to fsync the file after it's written and
> before sending the delete to the server.
>
> so your mail client _should_ issue fsync calls.
>
> however, some (many, most??) users would probably be willing to lose a
> little e-mail to gain a significant increase in battery life on their
> laptops.

Obviously it should do a spamminess test. If the sender is in your
addressbook/whitelist then fsync it, otherwise if it looks spammy,
don't bother.

Seriously, there's no way of telling which emails are the really
important job offer/flight confirmation/invitation from that really
cute girl you met that one time...

... lots of data is like that. It's usually not important except
when it really, really is - and the average user doesn't want to be
babysitting every single decision about importance.

"Your email program wants to spin up the disk to store a message,
confirm or deny"

Bron.

2009-04-02 23:38:37

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 10:59:39PM +0200, Andreas T.Auer wrote:
>
> Yes, but a lot of users (and I assume >90% of POP3 users) don't use this
> option.
>

Sometimes, the filesystem isn't the best place to solve all problems.

What's been frustrating about this whole controversy is the implicit
assumption that users and applications should never change, and the
filesystem should magically accommodate and Do The Right Thing.

If you're *never* going to want to risk ever losing mail, then fine,
fsync() it to disk before you send the POP3 DELETE command. If you
don't like the performance delay, or the battery consumption
implications, tough. I'm fresh out of magic pixie dust.
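
A minimal sketch of that ordering, in Python for brevity. The `retrieve` and `delete` callables are hypothetical stand-ins for whatever the client uses to issue the POP3 RETR and DELE commands, and the `msg-N.eml` naming is likewise made up for illustration. The directory fsync assumes Linux, where a newly created file's directory entry needs its own fsync() to be durable.

```python
import os


def store_message_durably(maildir, name, payload):
    """Write payload to maildir/name; return only after it has been fsync()ed."""
    path = os.path.join(maildir, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(payload)
        while view:                      # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # file data and inode
    finally:
        os.close(fd)

    dir_fd = os.open(maildir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                 # the new directory entry (Linux assumption)
    finally:
        os.close(dir_fd)
    return path


def fetch_and_delete(retrieve, delete, maildir, msg_number):
    """Fetch one message, make it durable locally, and only then delete it remotely.

    retrieve(n) -> bytes and delete(n) are hypothetical callables, not a real API.
    """
    payload = retrieve(msg_number)
    path = store_message_durably(maildir, "msg-%d.eml" % msg_number, payload)
    delete(msg_number)                   # never reached if the write or fsync failed
    return path
```

The cost is one disk wake-up per message, which is exactly the trade-off being argued about in this thread.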

If the application is smarter about not deleting the messages from the
POP spool, then you can afford not to fsync(). But (oh, horrors!) it
might involve making the application smarter, and playing a
synchronization game between the local POP spool and IMAP. It's more
efficient to do this with IMAP, but there are POP clients that do this.

If you are a mail client developer, and the user says, "I want the
advantages of IMAP, but I refuse to switch to an ISP that provides
IMAP; you must give me *all* the advantages of IMAP even though I'm using
POP3", you'd probably tell the user, "Yes, and do you want a pony,
too?"

The problem is, this is what the application programmers are telling
the filesystem developers. They refuse to change their programs; and
the features they want are sometimes mutually contradictory, or at
least result in an overconstrained problem --- and then they throw the
whole mess at the filesystem developers' feet and say, "you fix it!"

I'm not saying the filesystems are blameless, but give us a little
slack, guys; we NEED some help from the application developers here.

- Ted

2009-04-02 23:46:42

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 11:44:20AM -0700, [email protected] wrote:
> On Thu, 2 Apr 2009, Matthew Garrett wrote:
> >The solution to "fsync() causes disk spinups" isn't "ignore fsync()".
> >It's "ensure that applications only use fsync() when they really need
> >it", which requires us to also be able to say "fsync() should not be
> >required to ensure that events occur in order".
>
> ignore the issue of order on the local disk for the moment.
>
> what should an application do to make sure its data isn't lost?

fsync().

> however, some (many, most??) users would probably be willing to lose a
> little e-mail to gain a significant increase in battery life on their
> laptops.

Then they shouldn't use a mail client that fsync()s.

> today they have no choice (other than picking a mail client that doesn't
> try to protect its local data)
>
> with the proposed addition to laptop mode (delaying fsync until the disk
> is awake), the user (or more precisely the admin) gains the ability to
> define this trade-off rather than depending on the application developers
> all doing this right.

No. Ignoring fsync() makes it difficult for an application to
inappropriately spin up a disk - but it also makes it *impossible* for
an application to save data that it genuinely needs to. Doing this in
kernel means that you have no granularity. You ignore the inappropriate
fsync()s, but you also ignore the ones that are vitally important. I've
no objection to the kernel supporting this functionality, but it should
be /proc/sys/vm/fuck-my-data-harder rather than
/proc/sys/vm/laptop-mode.

Power management is a tradeoff. Sometimes providing correct
functionality costs more than providing incorrect functionality. In
general we strive to carry on providing applications the behaviour they
expect even if it costs us more power - the alternative leads to users
disabling power management functionality because they can't trust it.
Throwing data away isn't an acceptable tradeoff for an extra three
minutes of battery life for most users.

--
Matthew Garrett | [email protected]

2009-04-02 23:48:20

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Fri, Apr 03, 2009 at 05:56:40AM +1100, Nick Piggin wrote:
> On Friday 03 April 2009 05:38:34 Matthew Garrett wrote:
> > On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote:
> >
> > > Shouldn't applications have a mode to avoid spinning up the disk if it is
> > > so important?
> >
> > They do. It's called "Don't use fsync() unless your data needs to be on
> > disk". I'm not sure why you'd ever want an application to be in anything
> > but this mode.
> >
>
> Well you might decide you are willing to sacrifice timely storage of
> logs, or reducing backups in your editor or something. But obviously
> the kernel can't decide which of those fsyncs is safe to omit (or
> turn into a barrier) while staying within the advertised semantics of
> the app. Application obviously can.

I'd argue that if the user cares enough that they want it fsync()ed on
ext3 then they probably also want it fsync()ed if they're on battery.
But yes, if anything is going to make a distinction between grades of
"Must be saved" then it has to be the application - the kernel certainly
doesn't have that information.

--
Matthew Garrett | [email protected]

2009-04-03 00:01:18

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 07:38:06PM -0400, Theodore Tso wrote:

> What's been frustrating about this whole controversy is the implicit
> assumption that users and applications should never change, and the
> filesystem should magically accommodate and Do The Right Thing.

This is the attitude that I have a significant problem with. Filesystems
exist to serve applications. Without applications, there's no reason to
have a filesystem. If a filesystem doesn't provide the behaviour that
applications want then that filesystem has no reason to exist. The aim
isn't to produce a platonically ideal filesystem. The aim is to produce
a filesystem that behaves well given the applications that use it.

Disagreeing with the behaviour of applications is a perfectly sensible
thing to do. However, it's something that should be done at the *start*
of a filesystem development cycle. Getting agreement from a broad
section of application developers means that you get to write a
filesystem that embodies a different set of assumptions and everyone
wins. Writing a filesystem and then bitching about application behaviour
after it's been merged to mainline is just pathological.

> The problem is, this is what the application programmers are telling
> the filesystem developers. They refuse to change their programs; and
> the features they want are sometimes mutually contradictory, or at
> least result in an overconstrained problem --- and then they throw the
> whole mess at the filesystem developers' feet and say, "you fix it!"

Which application developers did you speak to? Because, frankly, the
majority of the ones I know felt that ext3 embodied the pony that they'd
always dreamed of as a five year old. Stephen gave them that pony almost
a decade ago and now you're trying to take it to the glue factory. I
remember almost crying at that bit in Animal Farm, so I'm really not
surprised that you're getting pushback here.

> I'm not saying the filesystems are blameless, but give us a little
> slack, guys; we NEED some help from the application developers here.

Then having a discussion with application developers over the
expectations they can have would be a good first step. Just pointing at
POSIX isn't good enough - POSIX allows a bunch of behaviours
sufficiently pathological that a filesystem implementing them would be
less useful than /dev/null. We need to have a worthwhile conversation
about what guarantees Linux will provide above and beyond POSIX. The
filesystem summit next week isn't going to be that conversation. Perhaps
something to try at Plumbers?

--
Matthew Garrett | [email protected]

2009-04-03 00:56:09

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 11:44:20AM -0700, [email protected] wrote:
>> On Thu, 2 Apr 2009, Matthew Garrett wrote:
>>> The solution to "fsync() causes disk spinups" isn't "ignore fsync()".
>>> It's "ensure that applications only use fsync() when they really need
>>> it", which requires us to also be able to say "fsync() should not be
>>> required to ensure that events occur in order".
>>
>> ignore the issue of order on the local disk for the moment.
>>
>> what should an application do to make sure its data isn't lost?
>
> fsync().
>
>> however, some (many, most??) users would probably be willing to lose a
>> little e-mail to gain a significant increase in battery life on their
>> laptops.
>
> Then they shouldn't use a mail client that fsync()s.

so they need to use one mail client when they want to have good battery
life and a different one when they are plugged in to power?

>> today they have no choice (other than picking a mail client that doesn't
>> try to protect its local data)
>>
>> with the proposed addition to laptop mode (delaying fsync until the disk
>> is awake), the user (or more precisely the admin) gains the ability to
>> define this trade-off rather than depending on the application developers
>> all doing this right.
>
> No. Ignoring fsync() makes it difficult for an application to
> inappropriately spin up a disk - but it also makes it *impossible* for
> an application to save data that it genuinely needs to. Doing this in
> kernel means that you have no granularity. You ignore the inappropriate
> fsync()s, but you also ignore the ones that are vitally important. I've
> no objection to the kernel supporting this functionality, but it should
> be /proc/sys/vm/fuck-my-data-harder rather than
> /proc/sys/vm/laptop-mode.
>
> Power management is a tradeoff. Sometimes providing correct
> functionality costs more than providing incorrect functionality. In
> general we strive to carry on providing applications the behaviour they
> expect even if it costs us more power - the alternative leads to users
> disabling power management functionality because they can't trust it.
> Throwing data away isn't an acceptable tradeoff for an extra three
> minutes of battery life for most users.

I would agree with you if it was three minutes of battery life, but what
if it's an extra hour? (easily possible if the fsyncs make the difference
between the drive running all the time and waking up every 5 min for a few
seconds)

David Lang

2009-04-03 01:00:39

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Fri, Apr 03, 2009 at 05:56:40AM +1100, Nick Piggin wrote:
>> On Friday 03 April 2009 05:38:34 Matthew Garrett wrote:
>>> On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote:
>>>
>>>> Shouldn't applications have a mode to avoid spinning up the disk if it is
>>>> so important?
>>>
>>> They do. It's called "Don't use fsync() unless your data needs to be on
>>> disk". I'm not sure why you'd ever want an application to be in anything
>>> but this mode.
>>>
>>
>> Well you might decide you are willing to sacrifice timely storage of
>> logs, or reducing backups in your editor or something. But obviously
>> the kernel can't decide which of those fsyncs is safe to omit (or
>> turn into a barrier) while staying within the advertised semantics of
>> the app. Application obviously can.
>
> I'd argue that if the user cares enough that they want it fsync()ed on
> ext3 then they probably also want it fsync()ed if they're on battery.
> But yes, if anything is going to make a distinction between grades of
> "Must be saved" then it has to be the application - the kernel certainly
> doesn't have that information.

but is it the user who's deciding today or the application developer?

I agree that the kernel has no way of saying 'this fsync is important,
that one can be ignored'

but I don't think anyone is suggesting that (everyone who has mentioned it
in a proposal has done so saying 'this obviously is too complicated to try
to do')

however, there is one thing about laptop mode that I need clarification
on.

is laptop mode

A. "write everything now, don't delay writes" in the hope that the drive
will be idle enough later to spin down

or

B. "delay all writes until later, then when the drive wakes up do all
pending writes at that time" so that the drive can go to sleep in the
meantime?

I've heard things in these threads that would indicate both behaviors.

David Lang

2009-04-03 01:06:20

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 05:55:11PM -0700, [email protected] wrote:
> On Fri, 3 Apr 2009, Matthew Garrett wrote:
> >Then they shouldn't use a mail client that fsync()s.
>
> so they need to use one mail client when they want to have good battery
> life and a different one when they are plugged in to power?

They need to make a decision about whether they care about their mailbox
being precisely in sync with their server or not, and either use a
client that adapts appropriately or choose a client that behaves
appropriately. It's certainly not the kernel's business.

> >No. Ignoring fsync() makes it difficult for an application to
> >inappropriately spin up a disk - but it also makes it *impossible* for
> >an application to save data that it genuinely needs to. Doing this in
> >kernel means that you have no granularity. You ignore the inappropriate
> >fsync()s, but you also ignore the ones that are vitally important. I've
> >no objection to the kernel supporting this functionality, but it should
> >be /proc/sys/vm/fuck-my-data-harder rather than
> >/proc/sys/vm/laptop-mode.
> >
> >Power management is a tradeoff. Sometimes providing correct
> >functionality costs more than providing incorrect functionality. In
> >general we strive to carry on providing applications the behaviour they
> >expect even if it costs us more power - the alternative leads to users
> >disabling power management functionality because they can't trust it.
> >Throwing data away isn't an acceptable tradeoff for an extra three
> >minutes of battery life for most users.
>
> I would agree with you if it was three minutes of battery life, but what
> if it's an extra hour? (easily possible if the fsyncs make the difference
> between the drive running all the time and waking up every 5 min for a few
> seconds)

If you can demonstrate a real world use case where the hard drive
(typically well under a watt of power consumption on modern systems)
spindown policy will be affected sufficiently pathologically by a mail
client that you lose an hour of battery life, then I'd rethink this. But
mostly I'd conclude that this was an example of an inappropriate
spindown policy.

--
Matthew Garrett | [email protected]

2009-04-03 01:09:20

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 05:59:53PM -0700, [email protected] wrote:

> is laptop mode
>
> A. "write everything now, don't delay writes" in the hope that the drive
> will be idle enough later to spin down

laptop-mode doesn't delay writes. Ever.

> or
>
> B. "delay all writes until later, then when the drive wakes up do all
> pending writes at that time" so that the drive can go to sleep in the
> meantime?

Yes.

> I've heard things in these threads that would indicate both behaviors.

The code's pretty trivial. The only real functional differences
laptop-mode brings are to write out all dirty pages (rather than just
writing down to the watermark) and to call sys_sync() a few seconds
after the last thing that hit disk rather than being satisfied from
cache. It's entirely a mechanism to opportunistically take advantage of
the disk being spun up.
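
For anyone who wants to see what their own machine is configured to do, here is a minimal sketch that reads the relevant writeback tunables. It assumes the usual Linux sysctl files under /proc/sys/vm; note that the file is spelled `laptop_mode`, with an underscore. The helper name `read_vm_knobs` is made up for this example, and a knob that is missing or unreadable is reported as None rather than treated as an error.

```python
import os

VM_DIR = "/proc/sys/vm"
KNOBS = ("laptop_mode", "dirty_writeback_centisecs", "dirty_expire_centisecs")


def read_vm_knobs(vm_dir=VM_DIR, knobs=KNOBS):
    """Return {knob: int value, or None if the file is missing or unparsable}."""
    values = {}
    for knob in knobs:
        try:
            with open(os.path.join(vm_dir, knob)) as f:
                values[knob] = int(f.read().strip())
        except (OSError, ValueError):
            values[knob] = None
    return values


if __name__ == "__main__":
    for knob, value in read_vm_knobs().items():
        print("%s = %s" % (knob, value))
```

None of these knobs changes what fsync() does; they only influence when the kernel writes back dirty pages on its own.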

--
Matthew Garrett | [email protected]

2009-04-03 01:16:58

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 05:55:11PM -0700, [email protected] wrote:
>> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>>> Then they shouldn't use a mail client that fsync()s.
>>
>> so they need to use one mail client when they want to have good battery
>> life and a different one when they are plugged in to power?
>
> They need to make a decision about whether they care about their mailbox
> being precisely in sync with their server or not, and either use a
> client that adapts appropriately or choose a client that behaves
> appropriately. It's certainly not the kernel's business.

the kernel is not deciding this, the kernel would be implementing the
user's choice

>>> No. Ignoring fsync() makes it difficult for an application to
>>> inappropriately spin up a disk - but it also makes it *impossible* for
>>> an application to save data that it genuinely needs to. Doing this in
>>> kernel means that you have no granularity. You ignore the inappropriate
>>> fsync()s, but you also ignore the ones that are vitally important. I've
>>> no objection to the kernel supporting this functionality, but it should
>>> be /proc/sys/vm/fuck-my-data-harder rather than
>>> /proc/sys/vm/laptop-mode.
>>>
>>> Power management is a tradeoff. Sometimes providing correct
>>> functionality costs more than providing incorrect functionality. In
>>> general we strive to carry on providing applications the behaviour they
>>> expect even if it costs us more power - the alternative leads to users
>>> disabling power management functionality because they can't trust it.
>>> Throwing data away isn't an acceptable tradeoff for an extra three
>>> minutes of battery life for most users.
>>
>> I would agree with you if it was three minutes of battery life, but what
>> if it's an extra hour? (easily possible if the fsyncs make the difference
>> between the drive running all the time and waking up every 5 min for a few
>> seconds)
>
> If you can demonstrate a real world use case where the hard drive
> (typically well under a watt of power consumption on modern systems)
> spindown policy will be affected sufficiently pathologically by a mail
> client that you lose an hour of battery life, then I'd rethink this. But
> mostly I'd conclude that this was an example of an inappropriate
> spindown policy.

remember that the mail client was an example.

you want another example, think of anything that uses sqlite (like the
firefox history stuff, although that was weakened drastically due to the
ext3 problems).

David Lang

2009-04-03 01:17:38

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 05:59:53PM -0700, [email protected] wrote:
>
>> is laptop mode
>>
>> A. "write everything now, don't delay writes" in the hope that the drive
>> will be idle enough later to spin down
>
> laptop-mode doesn't delay writes. Ever.
>
>> or
>>
>> B. "delay all writes until later, then when the drive wakes up do all
>> pending writes at that time" so that the drive can go to sleep in the
>> meantime?
>
> Yes.

you just contradicted yourself in these two statements.

David Lang

>> I've heard things in these threads that would indicate both behaviors.
>
> The code's pretty trivial. The only real functional differences
> laptop-mode brings are to write out all dirty pages (rather than just
> writing down to the watermark) and to call sys_sync() a few seconds
> after the last thing that hit disk rather than being satisfied from
> cache. It's entirely a mechanism to opportunistically take advantage of
> the disk being spun up.
>
>

2009-04-03 01:20:37

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 06:16:20PM -0700, [email protected] wrote:
> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>
> >On Thu, Apr 02, 2009 at 05:55:11PM -0700, [email protected] wrote:
> >>On Fri, 3 Apr 2009, Matthew Garrett wrote:
> >>>Then they shouldn't use a mail client that fsync()s.
> >>
> >>so they need to use one mail client when they want to have good battery
> >>life and a different one when they are plugged in to power?
> >
> >They need to make a decision about whether they care about their mailbox
> >being precisely in sync with their server or not, and either use a
> >client that adapts appropriately or choose a client that behaves
> >appropriately. It's certainly not the kernel's business.
>
> the kernel is not deciding this, the kernel would be implementing the
> user's choice

No it wouldn't. The kernel would be implementing an administrator's
choice about whether fsync() is important or not. That's something that
would affect the mail client, but it's hardly a decision based on the
mail client. Sucks to be that user if they do anything involving mysql.

> >If you can demonstrate a real world use case where the hard drive
> >(typically well under a watt of power consumption on modern systems)
> >spindown policy will be affected sufficiently pathologically by a mail
> >client that you lose an hour of battery life, then I'd rethink this. But
> >mostly I'd conclude that this was an example of an inappropriate
> >spindown policy.
>
> remember that the mail client was an example.
>
> you want another example, think of anything that uses sqlite (like the
> firefox history stuff, although that was weakened drastically due to the
> ext3 problems).

Benchmarks please.
--
Matthew Garrett | [email protected]

2009-04-03 01:22:31

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 06:17:12PM -0700, [email protected] wrote:
> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>
> >On Thu, Apr 02, 2009 at 05:59:53PM -0700, [email protected] wrote:
> >
> >>is laptop mode
> >>
> >>A. "write everything now, don't delay writes" in the hope that the drive
> >>will be idle enough later to spin down
> >
> >laptop-mode doesn't delay writes. Ever.
> >
> >>or
> >>
> >>B. "delay all writes until later, then when the drive wakes up do all
> >>pending writes at that time" so that the drive can go to sleep in the
> >>meantime?
> >
> >Yes.
>
> you just contradicted yourself in these two statements.

That's because I'm horribly drunk and managed to confuse the order of
your statements. My substantive point stands - the laptop-mode code
doesn't delay writes, and it's pretty easy for anyone to prove this to
themselves. Neither of your options actually describes its behaviour.
--
Matthew Garrett | [email protected]

2009-04-03 01:24:55

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 06:16:20PM -0700, [email protected] wrote:
>> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>>
>>> On Thu, Apr 02, 2009 at 05:55:11PM -0700, [email protected] wrote:
>>>> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>>>>> Then they shouldn't use a mail client that fsync()s.
>>>>
>>>> so they need to use one mail client when they want to have good battery
>>>> life and a different one when they are plugged in to power?
>>>
>>> They need to make a decision about whether they care about their mailbox
>>> being precisely in sync with their server or not, and either use a
>>> client that adapts appropriately or choose a client that behaves
>>> appropriately. It's certainly not the kernel's business.
>>
>> the kernel is not deciding this, the kernel would be implementing the
>> user's choice
>
> No it wouldn't. The kernel would be implementing an administrator's
> choice about whether fsync() is important or not. That's something that
> would affect the mail client, but it's hardly a decision based on the
> mail client. Sucks to be that user if they do anything involving mysql.

in the case of laptops, in 99+% of the cases the user and the
administrator are the same person. in the other cases that's something the
user should take up with the administrator, because the administrator can
do a lot of things to the system that will affect the safety of their data
(including loading a kernel that turns fsync into a noop, but more likely
involving enabling or disabling write caches on disks)

>>> If you can demonstrate a real world use case where the hard drive
>>> (typically well under a watt of power consumption on modern systems)
>>> spindown policy will be affected sufficiently pathologically by a mail
>>> client that you lose an hour of battery life, then I'd rethink this. But
>>> mostly I'd conclude that this was an example of an inappropriate
>>> spindown policy.
>>
>> remember that the mail client was an example.
>>
>> you want another example, think of anything that uses sqlite (like the
>> firefox history stuff, although that was weakened drastically due to the
>> ext3 problems).
>
> Benchmarks please.

if spinning down a drive saves so little power that it wouldn't make a
significant difference to battery life to leave it on, why does anyone
bother to spin the drive down?

David Lang

2009-04-03 01:36:25

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 06:24:28PM -0700, [email protected] wrote:
> On Fri, 3 Apr 2009, Matthew Garrett wrote:
> >No it wouldn't. The kernel would be implementing an administrator's
> >choice about whether fsync() is important or not. That's something that
> >would affect the mail client, but it's hardly a decision based on the
> >mail client. Sucks to be that user if they do anything involving mysql.
>
> in the case of laptops, in 99+% of the cases the user and the
> administrator are the same person. in the other cases that's something the
> user should take up with the administrator, because the administrator can
> do a lot of things to the system that will affect the safety of their data
> (including loading a kernel that turns fsync into a noop, but more likely
> involving enabling or disabling write caches on disks)

Well, yes, the administrator could hate the user. They could achieve the
same effect by just LD_PRELOADING something that stubbed out fsync() and
inserted random data into every other write(). We generally trust that
admins won't do that.

> >Benchmarks please.
>
> if spinning down a drive saves so little power that it wouldn't make a
> significant difference to battery life to leave it on, why does anyone
> bother to spin the drive down?

There's various circumstances in which it's beneficial. The difference
between an optimal algorithm for typical use and an optimal algorithm
for typical use where there's an fsync() every 5 minutes isn't actually
that great.

--
Matthew Garrett | [email protected]

2009-04-03 02:22:59

by Ric Wheeler

Subject: Re: Ext4 and the "30 second window of death"

Nick Piggin wrote:
> On Friday 03 April 2009 05:38:34 Matthew Garrett wrote:
>
>> On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote:
>>
>>
>>> Shouldn't applications have a mode to avoid spinning up the disk if it is
>>> so important?
>>>
>> They do. It's called "Don't use fsync() unless your data needs to be on
>> disk". I'm not sure why you'd ever want an application to be in anything
>> but this mode.
>>
>>
>
> Well you might decide you are willing to sacrifice timely storage of
> logs, or reducing backups in your editor or something. But obviously
> the kernel can't decide which of those fsyncs is safe to omit (or
> turn into a barrier) while staying within the advertised semantics of
> the app. Application obviously can.
>
>
One thing that you can do at the application level is to try and batch
up your fsync() requests - running one fsync (especially on the most
recently written file) can push the earlier files out to disk along with it.

Clearly, this does require some application level complexity, but you
get the same strong fsync() semantics that you are used to and can run
almost at non-fsync speeds if the batch size is large enough. Your
application should not acknowledge it has safely stored any of the files
locally until it has done an fsync on that particular file.
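
As a rough illustration of that batching pattern (a sketch under stated assumptions, not what fs_mark itself does): write every file in the batch first, then fsync each one starting with the most recently written, and only report a file as stored once its own fsync has returned. The function name and the dict-of-bytes interface are made up for this example, and the final directory fsync assumes Linux semantics for newly created files. Every descriptor in the batch stays open until the end, so the batch size is bounded by the process's open-file limit.

```python
import os


def write_batch_then_fsync(directory, files):
    """Write all files first, then fsync each; return names in acknowledgement order.

    files maps a file name to its bytes payload.
    """
    fds = []
    try:
        # Phase 1: write everything without forcing anything to disk.
        for name, payload in files.items():
            fd = os.open(os.path.join(directory, name),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append((name, fd))
            view = memoryview(payload)
            while view:                      # handle short writes
                view = view[os.write(fd, view):]

        # Phase 2: fsync the most recently written file first. Depending on
        # the filesystem, that may push the earlier data out as well, making
        # the remaining fsync() calls cheap. Each file still gets its own.
        synced = []
        for name, fd in reversed(fds):
            os.fsync(fd)
            synced.append(name)

        # Make the new directory entries durable too (Linux assumption).
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
        return synced                        # safe to acknowledge these now
    finally:
        for _, fd in fds:
            os.close(fd)
```

A mail client could call this once per fetched batch and send its deletes to the server only for the names that come back.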

This technique would work great for an application like rsync, tar, etc.
For a mail client, you would see a benefit only when you were pulling
down batches of messages which clearly is a common case if you are still
reading this thread :-)

The fs_mark program I wrote plays around with the various ways to do
this if someone is interested in playing around a bit.

Ric

2009-04-03 03:09:10

by David Lang

Subject: Re: Ext4 and the "30 second window of death"

On Fri, 3 Apr 2009, Matthew Garrett wrote:

> On Thu, Apr 02, 2009 at 06:24:28PM -0700, [email protected] wrote:
>> On Fri, 3 Apr 2009, Matthew Garrett wrote:
>>> No it wouldn't. The kernel would be implementing an administrator's
>>> choice about whether fsync() is important or not. That's something that
>>> would affect the mail client, but it's hardly a decision based on the
>>> mail client. Sucks to be that user if they do anything involving mysql.
>>
>> in the case of laptops, in 99+% of the cases the user and the
>> administrator are the same person. in the other cases that's something the
>> user should take up with the administrator, because the administrator can
>> do a lot of things to the system that will affect the safety of their data
>> (including loading a kernel that turns fsync into a noop, but more likely
>> involving enabling or disabling write caches on disks)
>
> Well, yes, the administrator could hate the user. They could achieve the
> same effect by just LD_PRELOADING something that stubbed out fsync() and
> inserted random data into every other write(). We generally trust that
> admins won't do that.

then trust the admins to make a reasonable decision for or with the user
on this as well.

>>> Benchmarks please.
>>
>> if spinning down a drive saves so little power that it wouldn't make a
>> significant difference to battery life to leave it on, why does anyone
>> bother to spin the drive down?
>
> There's various circumstances in which it's beneficial. The difference
> between an optimal algorithm for typical use and an optimal algorithm
> for typical use where there's an fsync() every 5 minutes isn't actually
> that great.

mixing some sub-threads a bit to combine thoughts

you object to calling something like this 'laptop mode'

Ted's statements about laptop mode indicate that he believes that it
delays writes for a configurable time rather than accelerating writes.

what would you think of something like the following

at the block device level an option called something like "delay_writes"

delays writes (including fsync) up to the configurable number of seconds.

if an fsync or barrier is issued the block driver figures out what pages
would be written by that fsync/barrier, puts them in its queue (but
doesn't start the write), puts a barrier in its queue following the pages
and marks the pages COW.

if the timeout expires (or the drive spins up for other reasons) and the
pages have not been modified, they get written and released by the block
driver (which should take them out of COW mode).

if the pages get written to prior to the write taking place, COW kicks in
and new pages are allocated for the changes. since the device driver
already has those pages queued the filesystem just ends up with the copied
pages and continues operation. when the drive finally gets spun up, the
queued pages get written prior to anything else (preserving order in case
of a crash)

doing this could cost memory (as there may be multiple copies of something
queued), so it may be worth having some trigger that if more than X pages
are queued by the block driver, it should go ahead and spin up the drive
to write them.
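
to make the idea concrete, here is a toy user-space model of it in Python.
it is only a sketch, not kernel code: the class name DelayedWriteQueue,
its methods and the max_queued_pages knob are all invented for
illustration. immutable bytes objects stand in for COW pages: once a page
is queued, a later write replaces the cached copy and leaves the queued
snapshot alone.

```python
class DelayedWriteQueue:
    """Toy model of a block-level "delay_writes" option (hypothetical)."""

    def __init__(self, max_queued_pages=1024):
        self.cache = {}     # page number -> current contents (page cache)
        self.dirty = set()  # pages modified since the last fsync
        self.queue = []     # ("write", page_no, data) or ("barrier",)
        self.disk = {}      # what the spun-down drive currently holds
        self.max_queued_pages = max_queued_pages

    def write(self, page_no, data):
        # Storing a new bytes object never alters a snapshot that is
        # already queued, which plays the role of copy-on-write here.
        self.cache[page_no] = bytes(data)
        self.dirty.add(page_no)

    def queued_pages(self):
        return sum(1 for entry in self.queue if entry[0] == "write")

    def fsync(self):
        # Queue a snapshot of every dirty page followed by a barrier,
        # but do not start the write.
        for page_no in sorted(self.dirty):
            self.queue.append(("write", page_no, self.cache[page_no]))
        self.queue.append(("barrier",))
        self.dirty.clear()
        # Memory-pressure trigger: too many queued copies, so spin up.
        if self.queued_pages() > self.max_queued_pages:
            self.spin_up()

    def spin_up(self):
        # Timeout expired or the drive woke up for another reason:
        # replay the queue strictly in order, before anything else.
        for entry in self.queue:
            if entry[0] == "write":
                _, page_no, data = entry
                self.disk[page_no] = data
        self.queue.clear()
```

in this model nothing reaches self.disk until spin_up() runs, and the
queued snapshots are replayed in the order the fsyncs were issued.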

thoughts?

David Lang

2009-04-03 04:54:37

by Theodore Ts'o

Subject: Re: Ext4 and the "30 second window of death"

On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote:
> > if spinning down a drive saves so little power that it wouldn't make a
> > significant difference to battery life to leave it on, why does anyone
> > bother to spin the drive down?
>
> There's various circumstances in which it's beneficial. The difference
> between an optimal algorithm for typical use and an optimal algorithm
> for typical use where there's an fsync() every 5 minutes isn't actually
> that great.

More to the point, if an application is insane enough to push 2.5
megabytes to disk every single time you click on a web page (this is
excluding the cache; I had my firefox cache pointed at /tmp when I did
this measurement), *and* you are running the WiFi for the browser,
*and* the browser is running flash applications, etc., whether you
defer the writes or not, you're going to be burning a lot of power.
Fundamentally, if an application needs to be writing hundreds of files
or hundreds of kilobytes or more of data all the time, there's
something wrong with the application.

If some KDE application needs to rewrite hundreds of files at desktop
startup, when the user hasn't even changed any configuration options
yet (this is at desktop **startup**, mind you, where this was
reported), then you're going to be burning a lot of power. Anything we
do at the filesystem level is really going to be at the margins.

The annoying thing is the applications programmers aren't willing to
fix their d*mn applications, and instead heap all of the blame on the
filesystem. I will be the first to admit that filesystem designers
have to do their part, and once I realized how bad and sloppy people
had gotten with fsync(), and needlessly rewriting files, I implemented
the ext4 workaround patches *first*. I only started talking about how
application programmers might make changes to obey the established
standards and work with other filesystems after I had put my own house
in order. These are system-wide problems we are talking about, that
will require system-wide solutions. I can provide workarounds for
existing application behaviours, but claiming that applications can
never change, and we must always accommodate the way applications are
currently working and are designed is going to be a losing strategy
for us all.

- Ted

2009-04-03 07:14:18

by Bojan Smojver

Subject: Re: Ext4 and the "30 second window of death"

Theodore Tso <tytso <at> mit.edu> writes:

> The replace-via-truncate and replace-via-rename workarounds are there
> for the benefit of KDE, and GNOME, which in some configurations
> apparently will replace hundreds of dot files when the desktop is
> started up, for no reason that I can understand.

Maybe it would be useful if we had IN_SYNC event in inotify (meaning all buffers
of a closed file have been synced to disk, either implicitly or by fsync() - not
important). Then we could have these apps to do something like this on
configuration change:

1. Backup by link("foo","foo~"), unless we are watching "foo" for IN_SYNC event.
2. Open "foo" and read it.
3. Create "foo.new" and put new stuff in it.
4. Close "foo.new".
5. Rename "foo.new" into "foo".
6. Put a watch on "foo" for IN_SYNC, unless we already have one.

In the regular loop of the app:

1. When the event IN_SYNC turns up for "foo", remove "foo~".
2. Remove the watch.

No fsync() in sight, all atomic and no chance of losing data. If things go
haywire, we shall have fully committed "foo~" on startup, which we then just
rename into most likely broken "foo" and continue. If we don't have "foo~", it
must mean "foo" is OK.
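
A rough Python sketch of the flow above. IN_SYNC does not exist in
inotify, so the watch is only represented by a boolean that the caller
keeps, and on_in_sync() is a hypothetical handler that the application
would call when such an event arrived. All function names are made up
for illustration.

```python
import os


def save_config(path, new_text, watching):
    """Configuration-change steps 1-6. Returns the new 'watching' state."""
    backup = path + "~"
    if not watching and os.path.exists(path):
        # Step 1: keep the last known-synced version as a hard link.
        if os.path.exists(backup):
            os.unlink(backup)
        os.link(path, backup)
    # Steps 3-5: write the replacement and rename it into place.
    tmp = path + ".new"
    with open(tmp, "w") as f:
        f.write(new_text)
    os.rename(tmp, path)
    # Step 6: a real implementation would register the watch here.
    return True


def on_in_sync(path):
    """Hypothetical IN_SYNC handler: the new file is on disk, drop the backup."""
    os.unlink(path + "~")
    return False  # the watch is removed


def recover_on_startup(path):
    """If a backup survived, the newer file may be broken: restore the backup."""
    backup = path + "~"
    if os.path.exists(backup):
        os.rename(backup, path)
```

Note that while a watch is pending, repeated saves leave "foo~" alone, so
it always holds the last version known to be on disk.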

Something like this may even work for rsync (slightly different flow of events,
probably watching from another thread).

When throwing stones, please limit yourself to less than 5kg specimens... :-)

--
Bojan

2009-04-03 07:34:26

by Pavel Machek

Subject: Re: Ext4 and the "30 second window of death"


> If you are a mail client developer, and the user says, "I want the
> advantages of IMAP, but I refuse to switch to an ISP that provides
> IMAP; you must give me *all* the advantages IMAP even though I'm using
> POP3", you'd probably tell the user, "Yes, and do you want a pony,
> too?"

Somebody wants a pony?

> The problem is, this is what the application programmers are telling
> the filesystem developers. They refuse to change their programs; and
> the features they want are sometimes mutually contradictory, or at
> least result in an overconstrained problem --- and then they throw the
> whole mess at the filesystem developers' feet and say, "you fix it!"
>
> I'm not saying the filesystems are blameless, but give us a little
> slack, guys; we NEED some help from the application developers here.

From what I have seen on the gtk lists, application developers are willing
to change their code. _But_ we should make sure that it does not
regress. fsync() is a regression: spins the disk up too much, slow on
ext3. (They may be willing to do that, but I believe that's a very bad
idea). And yes, I hope your "lets add fsync() everywhere, then break
the fsync with eat-my-data-^W-laptop-mode" plan does not
happen. (Please acknowledge that it is a stupid idea...)

If you give them fbarrier() or replace() or something that is nop or
nearly so on ext3 data=ordered and fixes ext4/btrfs, they'll happily
use it. But we do not have such thing now, and we should not be really
asking them to regress on existing setups.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-04-03 08:15:01

by Andreas T.Auer

Subject: Re: Ext4 and the "30 second window of death"


On 03.04.2009 01:38 Theodore Tso wrote:
> On Thu, Apr 02, 2009 at 10:59:39PM +0200, Andreas T.Auer wrote:
>> Yes, but a lot of users (and I assume >90% of POP3 users) don't use this
>> option.
>>
>
> Sometimes, the filesystem isn't the best place to solve all problems.

Surely you cannot solve all problems in the filesystem. Especially the
delay-spin-up vs. keep-all-important-recent-data problem simply can't be
done by the filesystem. It can't be done by the application either,
because it is the decision of the user, which data are important enough
to do a spin-up. But it's not possible to tell the filesystem, which
applications should spin-up at fsync(). And even within applications
there are differences between the love-mail from the girl you met
recently and the love-mail from that "russian girl", which isn't a girl,
but just a bunch of fraudsters.

> What's been frustrating about this whole controversy is this implicit
> assumption that users and applications should never change, and the
> filesystem should magically accommodate and Do The Right Thing.

It's not that they should never change, it's that you can't expect them
to change. There are just a few filesystems in the kernel and you need
some level of competence to maintain the code or contribute to it. But
you have no such filter in the application world, which is much much
bigger than the controlled area of the kernel. The application can be
crappy and would still have its users as long as there is no better
alternative for a special task. Even after the project is orphaned it
still can be used by the users. I had such a tool to get the log data
out of my PBX. It was orphaned long before and it had no alternative.

> If you're *never* going want to risk ever losing mail, then fine,
> fsync() it to disk before you send the POP3 DELETE command.

The *user* wants his data safe, but the *application* has to decide
whether or not to fsync(). Well, in case of a POP3 client fsync() should
be common sense before a DELETE.
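
As a sketch of that ordering (Python, for illustration only): the message
is appended to a local spool and fsynced before the server is told to
delete it. The function name and the fetch/delete callables are
assumptions made for this example; with Python's poplib they would
typically wrap POP3.retr() and POP3.dele().

```python
import os


def fetch_then_delete(fetch, delete, spool_path):
    """Store one message durably, and only then delete it on the server."""
    message = fetch()  # bytes of the retrieved message
    fd = os.open(spool_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        offset = 0
        while offset < len(message):
            offset += os.write(fd, message[offset:])
        # If this raises, delete() below is never reached and the
        # message stays on the server.
        os.fsync(fd)
    finally:
        os.close(fd)
    delete()
```

When pulling down a whole batch, the fsync could be done once for the
batch, with the deletes issued only after it returns.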

> The problem is, this is what the application programmers are telling
> the filesystem developers. They refuse to change their programs; and
> the features they want are sometimes mutually contradictory, or at
> least result in an overconstrained problem --- and then they throw the
> whole mess at the filesystem developers' feet and say, "you fix it!"

I think the users are complaining more than the application developers.
If the application developers would complain for their piece of
software, they would probably be smart enough to change their code using
some new function calls (like barrier() or whatever). But the problem
is the non-complaining developers that simply don't have a clue about
all this.

> I'm not saying the filesystems are blameless, but give us a little
> slack, guys; we NEED some help from the application developers here.

You have to find a _reasonable_ default integrity/performance trade-off
for those applications that are not aware of the filesystem levels. "I
just write out the data to disk with fprintf()."

For laptop-mode a global reasonable default doesn't seem to exist, so a
"perfect system" would have the possibility to tell the users, which
applications triggered a spin-up and provide the users with methods to
suppress/fine-tune the spin-up for the applications he wants to. The
distros could pre-configure it to some reasonable defaults for each
application.

Andreas

2009-04-03 11:09:47

by Sitsofe Wheeler

Subject: Re: Ext4 and the "30 second window of death"

On Fri, Apr 03, 2009 at 12:54:14AM -0400, Theodore Tso wrote:
> On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote:
> >
> > There's various circumstances in which it's beneficial. The difference
> > between an optimal algorithm for typical use and an optimal algorithm
> > for typical use where there's an fsync() every 5 minutes isn't actually
> > that great.
>
> More to the point, if an application is insane enough to push 2.5
> megabytes to disk every single time you click on a web page (this is
> excluding the cache; I had my firefox cache pointed at /tmp when I did

I no longer know what is being debated here. Is it one or more of the
following:

a) Laptop mode (as it stands today).
b) Laptop mode with fsync-nop.
c) Apps that should be using fsync.
d) Apps that should not be using fsync.
e) Apps writing to the disk too frequently.
f) Apps writing too many files to the disk.
g) Userland constraining kernel changes.
h) Increasing battery life.
i) "Acceptable" chance of new data loss after a crash.
j) "Acceptable" chance of data corruption after a crash.
k) Support for a new filesystem barrier() syscall to indicate the order
that data has to be written.

Note some of the above points are in conflict with each other...

> The annoying thing is the applications programmers aren't willing to
> fix their d*mn applications, and instead heap all of the blame on the
> filesystem. I will be the first to admit that filesystem designers

Isn't this the problem that other systems that place a high value on
backwards compatibility face that the Linux kernel was not supposed to? If
some piece of userland depends on every last bit of behaviour (whether
it was intended/promised or not) then the only way anything can be
changed is with massive effort expended on shims...

--
Sitsofe | http://sucs.org/~sits/

2009-04-03 13:09:18

by Alberto Gonzalez

Subject: Re: Ext4 and the "30 second window of death"

On Friday 03 April 2009 06:54:14 Theodore Tso wrote:
> On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote:
> > > if spinning down a drive saves so little power that it wouldn't make a
> > > significant difference to battery life to leave it on, why does anyone
> > > bother to spin the drive down?
> >
> > There's various circumstances in which it's beneficial. The difference
> > between an optimal algorithm for typical use and an optimal algorithm
> > for typical use where there's an fsync() every 5 minutes isn't actually
> > that great.
>
> More to the point, if an application is insane enough to push 2.5
> megabytes to disk every single time you click on a web page (this is
> excluding the cache; I had my firefox cache pointed at /tmp when I did
> this measurement), *and* you are running the WiFi for the browser,
> *and* the browser is running flash applications, etc., whether you
> defer the writes or not, you're going to be burning a lot of power.
> Fundamentally, if an application needs to be writing hundreds of files
> or hundreds of kilobytes or more of data all the time, there's
> something wrong with the application.

I really have to agree. Looking at this thread (that unfortunately I started)
it seems that if Linux is going to improve its power consumption at all it
depends on the filesystem.

Firefox has some unrealistic settings that stress the hard drive and the
network, then some people open a couple hundred tabs at the same time, and
then even the most simple flash animation proved to increase power by 0.9 watts
on my Atom processor that has a 2.5 watt TDP, and there are many other
problems to solve first. Linux is still trying to catch up with Windows when it
comes to battery life. It's still clearly behind in "normal" setups (I know,
you can tweak Linux to use little power, but a default install of a mainstream
distro will use clearly more power than Windows while providing similar
functionality). And then Windows can use up to twice as much power as OS X [1].
So clearly there is a lot of room for improvement when it comes to power usage
in Linux. But honestly, if we all start blaming the filesystem for it, I don't
think we're going to find the real problems.

Besides, with SSDs getting better and cheaper, I'm sure that from 2010 on,
most (if not all) laptops are going to be shipping with an SSD by default. And
all the spin-up/spin-down problem will go away by itself. And yes, SSDs have
proven to save some battery, but in most real-world tests I've seen it's
by about 5%, so I guess that even with the most powersaving filesystem for a
mechanical HD we could just save about 3% - 4% battery. Not too bad, but still
far from the 40% needed.

So for all having performance problems with ext3 + fsync, let's see if ext4
works for them. For those worried about battery life, let's at least start
looking elsewhere before we want to optimize the filesystem to the last
milliwatt. And as I feel guilty myself for contributing to this, I'd beg for
us all to leave a bit of slack (as Ted said) to filesystem developers. It's
been a hard week for them already.

Regards,
Alberto.

1 - http://www.anandtech.com/mac/showdoc.aspx?i=3435&p=13
- http://www.anandtech.com/mobile/showdoc.aspx?i=3540&p=10

2009-04-03 13:42:47

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Thu, Apr 02, 2009 at 08:08:36PM -0700, [email protected] wrote:
> On Fri, 3 Apr 2009, Matthew Garrett wrote:
> >Well, yes, the administrator could hate the user. They could achieve the
> >same effect by just LD_PRELOADING something that stubbed out fsync() and
> >inserted random data into every other write(). We generally trust that
> >admins won't do that.
>
> then trust the admins to make a reasonable decision for or with the user
> on this as well.

What a reasonable decision is here depends on what software the user is
running. There simply isn't a reasonable default other than to allow
fsync() to work. Changing that requires auditing every single piece of code
the user may run.

> >There's various circumstances in which it's beneficial. The difference
> >between an optimal algorithm for typical use and an optimal algorithm
> >for typical use where there's an fsync() every 5 minutes isn't actually
> >that great.
>
> mixing some sub-threads a bit to combine thoughts
>
> you object to calling something like this 'laptop mode'
>
> Ted's statements about laptop mode indicate that he believes that it
> delays writes for a configurable time rather than accelerating writes.

As I said, the code is pretty easy to read.

(snip)

> thoughts?

I've certainly got no objection to the addition of a mode that changes
the behaviour of fsync() - personally I think it would be an error for
almost anyone to use it, but that's really up to the individual
situation. But it would have a different goal to the existing
laptop-mode and so should have a different name in order to avoid
confusion.

--
Matthew Garrett | [email protected]

2009-04-03 13:45:59

by Matthew Garrett

Subject: Re: Ext4 and the "30 second window of death"

On Fri, Apr 03, 2009 at 12:54:14AM -0400, Theodore Tso wrote:

> More to the point, if an application is insane enough to push 2.5
> megabytes to disk every single time you click on a web page (this is
> excluding the cache; I had my firefox cache pointed at /tmp when I did
> this measurement), *and* you are running the WiFi for the browser,
> *and* the browser is running flash applications, etc., whether you
> defer the writes or not, you're going to be burning a lot of power.
> Fundamentally, if an application needs to be writing hundreds of files
> or hundreds of kilobytes or more of data all the time, there's
> something wrong with the application.

Yes. If applications are fsync()ing too often then the obvious fix is to
deal with those applications, and that's something we've been successful
with in other fields of power management.

> If some KDE application needs to rewrite hundreds of files at desktop
> startup, when the user hasn't even changed any configuration options
> yet (this is at desktop **startup**, mind you, where this was
> reported), then you're going to be burning a lot of power. Anything we
> do at the filesystem level is really going to be at the margins.

Not really. Desktop startup is a one-off cost and has no significant
impact on your overall power budget. There's little worthwhile
optimisation there from a power management point of view.

> The annoying thing is the applications programmers aren't willing to
> fix their d*mn applications, and instead heap all of the blame on the
> filesystem. I will be the first to admit that filesystem designers
> have to do their part, and once I realized how bad and sloppy people
> had gotten with fsync(), and needlessly rewriting files, I implemented
> the ext4 workaround patches *first*. I only started talking about how
> application programmers might make changes to obey the established
> standards and work with other filesystems after I had put my own house
> in order. These are system-wide problems we are talking about, that
> will require system-wide solutions. I can provide workarounds for
> existing application behaviours, but claiming that applications can
> never change, and we must always accommodate the way applications are
> currently working and are designed is going to be a losing strategy
> for us all.

From a power management perspective, anything that requires applications
to call fsync() more frequently is a bad thing. So filesystems that
reorder metadata operations are a bad thing. The fix isn't to add
fsync() to applications, the fix is to ensure that filesystems don't
force applications to do so.

--
Matthew Garrett | [email protected]

2009-04-05 04:10:55

by Bojan Smojver

Subject: Re: Ext4 and the "30 second window of death"

Bojan Smojver <bojan <at> rexursive.com> writes:

> Maybe it would be useful if we had IN_SYNC event in inotify

Or, maybe we can just (ab)use aio_fsync() for all this. This could be useful for
renaming of configuration files, less so for rsync (although it could be done
there too, I guess; rsync would just have to wait for synchronisation at the end
of the run). It would work like this:

1. Open "foo" and read it.
2. Open mktemp()-ed "foo.XXXXXX".
3. Write into the temp file.
4. Call aio_fsync().

Then, in the signal handler or the thread created on completion we'd have:

1. Rename the fully synced temp file into "foo".
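
A sketch of this flow in Python, for illustration only. Python's standard
library has no aio_fsync() binding, so a worker thread calling os.fsync()
stands in for the asynchronous sync and its completion notification; the
function name replace_async() is made up, and mkstemp() is used in place
of mktemp() so the temp file is created and opened in one step.

```python
import os
import tempfile
import threading


def replace_async(path, data):
    """Write data to a temp file; sync and rename it off the caller's thread."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 2: create and open "foo.XXXXXX" next to the target.
    fd, tmp = tempfile.mkstemp(prefix=os.path.basename(path) + ".",
                               dir=directory)
    # Step 3: write into the temp file.
    offset = 0
    while offset < len(data):
        offset += os.write(fd, data[offset:])

    def complete():
        # Step 4 plus the completion handler: the rename only happens
        # once the sync has returned without an error.
        try:
            os.fsync(fd)  # stand-in for aio_fsync() finishing
        finally:
            os.close(fd)
        os.rename(tmp, path)

    worker = threading.Thread(target=complete)
    worker.start()
    return worker  # the caller carries on and may join() later
```

The caller is not blocked while the sync is in flight, which is the
property the proposal relies on.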

If we made aio_fsync() wait in laptop mode for the regular commit interval,
instead of writing to disk right away (because it is an async interface after
all, so nobody expects it to finish immediately), we could preserve the normal
fsync() in laptop mode to mean write to disk now. DBs and similar stuff would
then get what they needed too, without complications.

For machines that are not laptops, with a constantly spinning disk and a decent
file system (such as ext4 :-), this should not be a problem performance wise.
And, the program asking for aio_fsync() could still continue without blocking,
therefore being fully interactive.

PS. Disclaimer: I never used this call in any of my programs, so I'm just
guessing that it works the way I understood the docs.

--
Bojan

2009-04-05 04:51:42

by Bojan Smojver

Subject: Re: Ext4 and the "30 second window of death"

Bojan Smojver <bojan <at> rexursive.com> writes:

> 1. Rename the fully synced temp file into "foo".

Forgot to mention... At which point the current config kept in memory would be
dumped, if the reference count of temp files associated with it reached zero.

--
Bojan



2009-04-05 05:41:48

by Bojan Smojver

Subject: Re: Ext4 and the "30 second window of death"

Bojan Smojver <bojan <at> rexursive.com> writes:

> 1. Open "foo" and read it.

Of course, this step would be skipped if we had config still in memory.

--
Bojan


2009-04-05 17:27:21

by Ed Tomlinson

Subject: Re: Ext4 and the "30 second window of death"

On Tuesday 31 March 2009 08:25:40 Theodore Tso wrote:
> On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> > Hi,
> >
> > - I use Ext4 as my filesystem (default in next Fedora release).
>
> Fedora will have the patches so that applications that do
> replace-via-truncate (a bad idea, these applications are buggy, and
> will lose data sometimes even with ext3), or replace-via-rename
> without the fsync(), will force the blocks out to disk with the
> commit.
>
> > - Let's say I've been working on my book for the last 14 months and I've
> > written about 400 pages on an ODF file.
>
> Openoffice, being a portable application, that has to work on other
> operating systems and filesystems (for example, like Solaris's UFS),
> does do open/write/close/fsync/rename. So you're safe if you're using
> OpenOffice (and emacs, and vim).
>
> The replace-via-truncate and replace-via-rename workarounds are there
> for the benefit of KDE, and GNOME, which in some configurations
> apparently will replace hundreds of dot files when the desktop is
> started up, for no reason that I can understand. (Not such a great
> idea for SSD write endurance!) Some people apparently spend hours
> making sure that their windows are exactly positioned the way they
> want it when their desktop starts up, and if the system crashes while
> their desktop is starting up, they could lose their window
> positions, which apparently made a whole bunch of users cranky. In
> practice, most of the editors that I'm familiar with have been around
> for a while, have needed to make sure that cases such as yours
> wouldn't result in data loss, and so are pretty good about using
> fsync() so that users' files wouldn't be lost, no matter what the
> filesystem or operating system being used.

It's more than losing window positions. I've been using ext4 with kde 4.2.1
along with some experimental modules (drm for xorg for r600 support, btrfs)
and a few patches. As expected this has caused a few crashes. I have had
kde lose desktop setup info (eg. it forgot it was using xrender accel). I
have also had kmail lose all its configuration - which is a pita to rebuild.
Note that these crashes occur long after kde has been started...

> The problem has been mostly with newer applications, especially the
> newer desktop ones, which have been written to assume that they only
> have to work safely on Linux and ext3. The replace-via-truncate and
> replace-via-rename workarounds provide this safety for ext4.

When there are patches out to improve this (bad) behavior I would love
to try them.

TIA
Ed Tomlinson

2009-04-05 18:13:58

by Tomasz Chmielewski

Subject: Re: Ext4 and the "30 second window of death"

>> The replace-via-truncate and replace-via-rename workarounds are there
>> for the benefit of KDE, and GNOME, which in some configurations
>> apparently will replace hundreds of dot files when the desktop is
>> started up, for no reason that I can understand. (Not such a great
>> idea for SSD write endurance!) Some people apparently spend hours
>> making sure that their windows are exactly positioned the way they
>> want it when their desktop starts up, and if the system crashes while
>> their desktop is starting up, they could lose their window
>> positions, which apparently made a whole bunch of users cranky. In
>> practice, most of the editors that I'm familiar with have been around
>> for a while, have needed to make sure that cases such as yours
>> wouldn't result in data loss, and so are pretty good about using
>> fsync() so that users' files wouldn't be lost, no matter what the
>> filesystem or operating system being used.
>
> It's more than losing window positions. I've been using ext4 with kde 4.2.1
> along with some experimental modules (drm for xorg for r600 support, btrfs)
> and a few patches. As expected this has caused a few crashes. I have had
> kde lose desktop setup info (eg. it forgot it was using xrender accel). I
> have also had kmail lose all its configuration - which is a pita to rebuild.
> Note that these crashes occur long after kde has been started...

I've lost contents of my /etc/shadow file some time ago.
Great fun after reboot.

But it also means that the problem begins not with KDE and Gnome, but
much, much earlier.


--
Tomasz Chmielewski
http://wpkg.org

2009-04-06 22:01:27

by L A Walsh

Subject: supporting laptops fs-semantic changes (was Re: Ext4 and the "30 second window of death")

Matthew Garrett wrote:
>> The other subtlety comes if we add fsync() suppression to laptop mode
-----
Perhaps this has already been suggested, but rather than
adding all these semantics to the core file-system / kernel routines,
wouldn't it be preferable to allow some 'layering' of a pseudo,
memory-based file-system, OVER some 'real' file system (OR), definable
set of files (under a subdir...or same device...or whatever).

The semantics of when the virtual-fs would sync to the physical-fs/files
controlled via mount options. Physical disk writes would be controlled by
selectively ignoring or honoring various "sync" events (time expired,
sync, fsync).

This could allow file-systems with different 'needs' (DB, or otherwise)
to be treated differently.

The advantage of another layer, is you could define _how much_ buffering
you wanted to allocate to a filesystem (or file-set). Maybe it's tolerable
losing a audio-recording of a talk, so large buff + don't sync 'cept when
full is fine. Sensitive filesystems(or sets) (i.e. db's), could be set
with buffers to hold largest 'single-writes', but sync/fsyncs are what
they are.

An optimization could provide for read/writes through the user-mem
controlled buffered 'fs', to do direct I/O rather than into normal
file-buffs where possible, since presumably all accesses to a file would
go through the layer or not.

Wouldn't require application changing, and wouldn't require changing
well defined, lower-level kernel-filesystem operations.

Just a thought.
Linda