2005-04-27 18:42:54

by Mike Miller

Subject: [Question] Does the kernel ignore errors writing to disk?

Hello All,
I have observed some behavior under certain failure conditions that suggests the kernel may be ignoring write errors to disk.
During very heavy read/write I/O, if we force a disk to fail, requests continue to be submitted until the controller's queue is full. Ultimately, the controller times out the requests. When this happens we see filesystem corruption. Sometimes it's file data, other times it's filesystem metadata that has timed out and failed. Either way it's obviously undesirable behavior.
It looks like the OS/filesystem (ext2/3 and reiserfs) does not wait for a successful completion. Is this assumption correct?

Thanks,
mikem


2005-04-27 19:14:10

by linux-os (Dick Johnson)

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On Wed, 27 Apr 2005 [email protected] wrote:

> Hello All,
> I have observed some behavior under certain failure conditions that seems
> as if the kernel may be ignoring write errors to disk.
> During very heavy read/write io if we force a disk to fail requests
> continue to be submitted until the controllers queue is full.
> Ultimately, the requests are timed out by the controller. When this
> happens we see filesystem corruption. Sometimes it's the file data,
> other times it's filesystem metadata that has been timed out and
> failed. Either way its obviously undesirable behavior.
> It looks like the OS/filesystem (ext2/3 and reiserfs) does not
> wait for for a successful completion. Is this assumption correct?
>
> Thanks,
> mikem

It depends. Obviously, if you disconnect your hard drive, the writes
will fail with a timeout. But they fail only after a number of retries
(how many depends upon the type of disk and its driver). So, if you
"force" a timeout by disconnecting a drive, you don't have
the same situation as a normally failed write.

Disk/file writes go like this (assuming no sync() or fsync()).

(1) File data gets flushed to a queue.
(2) When the queue gets nearly full, based upon an LRU mechanism,
    data are written to the disk.
(3) If the disk write fails, the driver retries the write.
(4) If the write continues to fail (timeout, no disk, etc.), the
    kernel gives up and does not hang forever. If you have
    disconnected the drive, you won't have any syslog writes to
    the device, so your next boot won't show the event. It looks
    as though it was ignored.

You can observe the behavior by mounting a floppy disk and
then removing it while it is being written. There are many
attempts to write to the device and then that write is discarded.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

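As an illustration of steps (3) and (4) in the message above, here is a toy user-space model of "retry a failed write a bounded number of times, then give up and discard". It is only a sketch of the idea and not kernel code: the names flush_block, block_write_fn, and flaky_dev, and the retry limit passed by the caller, are all made up for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callback type: returns true if the block was written. */
typedef bool (*block_write_fn)(int block, void *ctx);

enum flush_result { FLUSH_OK, FLUSH_DISCARDED };

/* Toy model of steps (3) and (4): retry a failing write up to
 * max_attempts times, then give up instead of hanging forever.
 * The caller that originally queued the data is not told about it. */
enum flush_result flush_block(int block, block_write_fn write_fn,
                              void *ctx, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (write_fn(block, ctx))
            return FLUSH_OK;
    }
    fprintf(stderr, "block %d: giving up after %d attempts, data discarded\n",
            block, max_attempts);
    return FLUSH_DISCARDED;
}

/* Stand-in "device" that fails a configurable number of times. */
struct flaky_dev {
    int failures_left;   /* writes that will still fail */
    int attempts_seen;   /* total write attempts observed */
};

bool flaky_write(int block, void *ctx)
{
    struct flaky_dev *dev = ctx;
    (void)block;
    dev->attempts_seen++;
    if (dev->failures_left > 0) {
        dev->failures_left--;
        return false;
    }
    return true;
}
```

With a device that fails twice, flush_block() succeeds on the third attempt. With a device that never recovers (the "disconnected drive" case), it stops after max_attempts and the data is simply dropped, which is why the failure can look as though it was ignored.
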
2005-04-28 15:00:06

by Alan

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On Wed, 2005-04-27 at 19:40, [email protected] wrote:
> It looks like the OS/filesystem (ext2/3 and reiserfs) does not wait for for a successful completion. Is this assumption correct?

Of course it doesn't. At 250 ops/second for a decent disk no OS waits
for completions, all batch and asynchronously queue I/O. See man fsync
and also O_DIRECT if you need specific "to disk" support. If you do that
be aware that you must also turn write caching off on the IDE disk. I've
repeatedly asked the "maintainer" of the IDE layer to do this
automatically but gave up bothering long ago. Without that setting users
are playing with fire quite honestly.

The alternative with latest 2.6 stuff is to turn on Jens Axboe's barrier
work which seems to give better performance on a drive new enough to
have cache flush operations.

Alan

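A minimal sketch of the fsync() approach mentioned above, assuming a POSIX system. The helper name write_and_sync is made up for this example. Whether a successful fsync() really means the data is on the platter still depends on the filesystem, the driver, and the drive's write cache, as discussed elsewhere in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and ask the kernel to
 * flush them before reporting success.
 * Returns 0 on success, -1 on error with errno set. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, try again */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                     /* handle short writes */
        len -= (size_t)n;
    }

    /* A successful write() normally only means the data reached the
     * page cache. fsync() is where a delayed I/O error can surface. */
    if (fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    return close(fd);
}
```

The point of the sketch is that the return values of fsync() and close() are checked as well as that of write(); an application that checks only write() has, in Bryan Henderson's later words, "optimistically walked away".
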
2005-04-28 15:06:08

by Mike Miller

Subject: RE: [Question] Does the kernel ignore errors writing to disk?

> -----Original Message-----
> From: Alan Cox [mailto:[email protected]]
> Sent: Thursday, April 28, 2005 9:58 AM
> To: Miller, Mike (OS Dev)
> Cc: Linux Kernel Mailing List; [email protected];
> [email protected]
> Subject: Re: [Question] Does the kernel ignore errors writng to disk?
>
> On Mer, 2005-04-27 at 19:40, [email protected] wrote:
> > It looks like the OS/filesystem (ext2/3 and reiserfs) does
> not wait for for a successful completion. Is this assumption correct?
>
> Of course it doesn't. At 250 ops/second for a decent disk no
> OS waits for completions, all batch and asynchronously queue
> I/O. See man fsync and also O_DIRECT if you need specific "to
> disk" support. If you do that be aware that you must also
> turn write caching off on the IDE disk. I've repeatedly asked
> the "maintainer" of the IDE layer to do this automatically
> but gave up bothering long ago. Without that setting users
> are playing with fire quite honestly.
>
> The alternative with latest 2.6 stuff is to turn on Jens
> Axboe's barrier work which seems to give better performance
> on a drive new enough to have cache flush operations.
>
> Alan
Thanks, Alan. I'll try Jens's barrier work.

2005-04-28 18:19:13

by Bryan Henderson

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

>See man fsync
>and also O_DIRECT if you need specific "to disk" support

Probably the most common way to get the simple but slow write function
where the write() call actually writes to stable storage, and fails if it
can't, is the O_SYNC open flag.

But even that, in some versions of Linux, can miss write errors. It's not
easy for Linux to catch them because the code that sees the I/O fail
doesn't know if it's part of some synchronous procedure where the user
will eventually find out about the error or the more common case where the
application has optimistically walked away and nothing can be done but
write off the loss.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

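A minimal sketch of the O_SYNC variant described above, assuming a POSIX system. The helper name append_record_sync is invented for this example. As the message itself notes, some kernel versions can still miss write errors even with O_SYNC, so this shows the intended usage and is no guarantee.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append one record to path using O_SYNC, so that
 * write() itself is supposed to return only after the data (and the
 * metadata needed to read it back) has been handed to stable storage.
 * Returns 0 on success, -1 on any failure. */
int append_record_sync(const char *path, const void *rec, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    /* With O_SYNC the error, if the kernel catches it, is reported by
     * write() rather than by a later fsync(). A short write is treated
     * as a failure here to keep the sketch simple. */
    ssize_t n = write(fd, rec, len);
    int ok = (n == (ssize_t)len);

    if (close(fd) < 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

This is the "simple but slow" form: every write() pays the full cost of a synchronous flush, which is why applications that write many small records often batch them and call fsync() once instead.
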
2005-04-28 22:45:53

by Alan

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On Thu, 2005-04-28 at 19:14, Bryan Henderson wrote:
> Probably the most common way to get the simple but slow write function
> where the write() call actually writes to stable storage, and fails if it
> can't, is the O_SYNC open flag.

O_SYNC doesn't work completely on several file systems and only on the
latest kernels with some of the common ones.

> But even that, in some versions of Linux, can miss write errors. It's not
> easy for Linux to catch them because the code that sees the I/O fail
> doesn't know if it's part of some synchronous procedure where the user
> will eventually find out about the error or the more common case where the
> application has optimistically walked away and nothing can be done but
> write off the loss.

Or because the error is reported out of order and there are ordering
guarantees in the fs. SCSI is OK here; other controllers are not always
right.

2005-04-28 23:15:40

by Bryan Henderson

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

>O_SYNC doesn't work completely on several file systems and only on the
>latest kernels with some of the common ones.

Hmmm. You didn't mention such a restriction when you suggested fsync()
before. Does fsync() work completely on these kernels where O_SYNC
doesn't? Considering that a simple implementation of O_SYNC just does the
equivalent of an fsync() inside every write(), that would be hard to
understand.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On 4/28/05, Alan Cox <[email protected]> wrote:
> On Mer, 2005-04-27 at 19:40, [email protected] wrote:
> > It looks like the OS/filesystem (ext2/3 and reiserfs) does not wait for for a successful completion. Is this assumption correct?
>
> Of course it doesn't. At 250 ops/second for a decent disk no OS waits
> for completions, all batch and asynchronously queue I/O. See man fsync
> and also O_DIRECT if you need specific "to disk" support. If you do that
> be aware that you must also turn write caching off on the IDE disk. I've
> repeatedly asked the "maintainer" of the IDE layer to do this
> automatically but gave up bothering long ago. Without that setting users

WTF is wrong with you Alan?

We agreed on this but it is you to do coding, if you want it,
not me (and there was never any patch from you).

It is not my (unpaid) job to fulfill any requirement you come up with.

BTW I was supposed to push a git update today but I wasted the time
on replying to your complaints (didn't even bother with personal insults).

> are playing with fire quite honestly.
>
> The alternative with latest 2.6 stuff is to turn on Jens Axboe's barrier
> work which seems to give better performance on a drive new enough to
> have cache flush operations.

2005-04-28 23:51:38

by Alan

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

> We agreed on this but it is you to do coding, if you want it,
> not me (and there was never any patch from you).

I gave up sending you patches because they never got applied and all I
got was "change this", or I'd send a security fix and get told it's got
the wrong white spacing for your personal religion.

The bug is still there, and the users still need to know it's dangerous.
Perhaps that way someone will fix it.

Alan

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On 4/29/05, Alan Cox <[email protected]> wrote:
> > We agreed on this but it is you to do coding, if you want it,
> > not me (and there was never any patch from you).
>
> I gave up sending you patches because they never got applied and all I
> got was "change this" or send a security fix and get told its got wrong

First, to make it clear: you never ever sent any patch
for this _particular_ issue.

Oh, and you've never changed "this" or even explained why it is so,
so no wonder _some_ of your patches don't get applied.

> white spacing for your personal religion.

Sure I complain about your exotic whitespace and coding
style but I _never_ reject patches because of this.

> The bug is still there, and the users still need to know its dangerous.
> Perhaps that way someone will fix it.

Patches, as usual, are welcome.

2005-04-29 07:25:31

by Anton Altaparmakov

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On Thu, 28 Apr 2005, Bryan Henderson wrote:
> >O_SYNC doesn't work completely on several file systems and only on the
> >latest kernels with some of the common ones.
>
> Hmmm. You didn't mention such a restriction when you suggested fsync()
> before. Does fsync() work completely on these kernels where O_SYNC
> doesn't? Considering that a simple implementation of O_SYNC just does the
> equivalent of an fsync() inside every write(), that would be hard to
> understand.

Some file systems implement their fsync() function as "return 0;" so no,
you cannot rely on it at all.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-04-29 19:14:18

by Bryan Henderson

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

>On Thu, 28 Apr 2005, Bryan Henderson wrote:
>> >O_SYNC doesn't work completely on several file systems and only on the
>> >latest kernels with some of the common ones.
>>
>> Hmmm. You didn't mention such a restriction when you suggested fsync()
>> before. Does fsync() work completely on these kernels where O_SYNC
>> doesn't? Considering that a simple implementation of O_SYNC just does the
>> equivalent of an fsync() inside every write(), that would be hard to
>> understand.
>
>Some file systems implement their fsync() function as "return 0;" so no,
>you cannot rely on it at all.

It's pretty clear Alan isn't talking about those cases. I don't think he
would have suggested fsync() to address the delayed write error problem in
a case where fsync() is "return 0;".

But let's talk about the no-op fsync() cases: fsync() is supposed to
cause data to be written to stable storage. "stable" is a relative
concept that the individual filesystem type or driver has to define for
itself. In an ordinary disk-based filesystem, we usually expect it to
mean the data has gone onto the oxide. But that's not really stable --
the disk drive could break and the data would be gone. For some, just
getting into the buffers of the disk drive is stable enough, since then
rebooting Linux wouldn't cause the data to be lost. For ramfs, the Linux
page cache is as stable as you can hope for.

So I view it as correct even if fsync() does nothing on a disk-based
filesystem because the programmer was lazy (or because the user wants to
defeat the performance-busting behavior of some paranoid application). But
when Alan speaks of a "not completely correct" version of synchronization,
which makes me think of something that doesn't implement any consistent
form of "stable," I want to hear more.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-29 22:03:04

by Alan

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

On Fri, 2005-04-29 at 20:11, Bryan Henderson wrote:
> So I view it as correct even if fsync() does nothing on a disk-based
> filesystem because the programmer was lazy (or because the user wants to
> defeat the performance-busting behavior of some paranoid application). But
> when Alan speaks of a "not completely correct" version of synchronization,
> which makes me think of something that doesn't implement any consistent
> form of "stable," I want to hear more.

On the main filesystems people use, with a current kernel, fsync guarantees
the data went somewhere. What it guarantees beyond that depends on the
filesystem properties, the driver properties and the media properties.

So ext3 with data=journal, or jffs, which are the strongest guarantee
cases, mean that your fsync() data should be on media and stable. Ditto, I
believe, default ext3 behaviour, because fsync has stronger rules than
fdatasync.

The next question is what the I/O device does with the data. SCSI disks
will cache, but the SCSI layer uses tags and if necessary turns the
cache off on the drive. In other words, you should get that behaviour
correctly on SCSI media.

The default IDE behaviour doesn't turn write cache off, and the IDE
device may re-order writes and ack them before they hit storage. IDE
lacks tags, and tends to have poor performance on cache flush commands.
With the barrier support on, the right thing should occur, as it should
with hdparm used to turn the write cache off.

RAID controllers will cache data in their writeback caches. They will
also write and rewrite stripes, which can mean a critical failure loses
the cache or involves a whole stripe loss, but that is very unlikely in
most modes. The good ones either write through or have battery-backed
caches. The really good ones even let you move the battery/RAM unit onto
another card.

Underlying all of this is the fact that disks aren't really disks any
more but NAS devices on funky cables, which means you can lose blocks
to drive faults, and they might not be the block you are currently writing.

Alan

2005-04-30 00:42:08

by Bryan Henderson

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

Thanks for the info on how stability works with SCSI and ATA, but I think
you lost the context of my question.

You said earlier that fsync() and O_DIRECT are ways to deal with the
problem of delayed write errors. I added that O_SYNC is another way. You
then said that O_SYNC doesn't work completely correctly in some recent
(but not current) kernels. You didn't say the same about fsync().

I'd like to know if you mean to say that O_SYNC has some problems in some
kernels that fsync() does not have.

And if it isn't too much trouble, it would be nice to hear details of how
O_SYNC is partially correct in some kernels.

2005-05-01 09:00:56

by Mogens Valentin

Subject: Re: [Question] Does the kernel ignore errors writing to disk?

Alan Cox wrote:
> The next question is what the I/O device does with the data. SCSI disks
> will cache but the scsi layer uses tags and if neccessary turns the
> cache off on the drive. In other words you should get that behaviour
> correctly on SCSI media.
>
> The default IDE behaviour doesn't turn write cache off and the IDE
> device may re-order writes and ack them before they hit storage. IDE
> lacks tags, and tends to have poor performance on cache flush commands.
> With the barrier support on the right thing should occur, or with hdparm
> used to turn the write cache off.

Is this behaviour confined to IDE drives only?
SATA, when using libata, will solely be part of the SCSI chain, and
hence not subject to the write cache problem you mention, right?

--
Kind regards,
Mogens Valentin