2002-11-08 20:23:04

by Ross Biro

Subject: [BUG] Failed writes marked clean?


Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18
and 2.5.46 it looks to me like in the case of a write, ll_rw_block
always clears the dirty bit. In the event of an error, nothing resets
the dirty bit and the uptodate flag is cleared. This means that if the
same block needs to be read again, the buffer cache will see that the
buffer is not uptodate and attempt to read the old contents of the
buffer off of the device. If the read succeeds, the kernel ends up
corrupting data.

It seems to me that a better solution would be to mark the buffer as
dirty and uptodate and then attempt to propagate the error as far back
as possible. Ideally something can be done to correct the problem at a
higher level. Before I dive in and attempt to do something about this,
I wanted to make sure I was not missing anything important. So am I
full of it, or could this really be a problem?
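The failure sequence can be sketched with a miniature model of the buffer
flags (the names and helpers here are illustrative only, not the actual
kernel code paths):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the buffer_head state Ross describes; BH_UPTODATE and
 * BH_DIRTY stand in for the real buffer flags. */
enum { BH_UPTODATE = 1, BH_DIRTY = 2 };

struct buf {
    unsigned flags;
    char mem[16];   /* in-memory contents (new data awaiting write-back) */
    char disk[16];  /* old contents still on the device */
};

/* ll_rw_block-style write submission: the dirty bit is cleared up front;
 * on error only the uptodate bit is cleared, and dirty is never restored. */
static void submit_write(struct buf *b, int io_fails)
{
    b->flags &= ~BH_DIRTY;
    if (io_fails)
        b->flags &= ~BH_UPTODATE;   /* buffer now looks clean but stale */
    else
        memcpy(b->disk, b->mem, sizeof(b->disk));
}

/* Later access: a clean, not-uptodate buffer gets refilled from the
 * device, silently resurrecting the old on-disk contents. */
static void read_if_stale(struct buf *b)
{
    if (!(b->flags & BH_UPTODATE) && !(b->flags & BH_DIRTY)) {
        memcpy(b->mem, b->disk, sizeof(b->mem));
        b->flags |= BH_UPTODATE;
    }
}
```

Running the two steps back to back shows the new data being replaced by
the stale on-disk copy after a failed write.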

Ross



2002-11-08 20:47:25

by Linus Torvalds

Subject: Re: [BUG] Failed writes marked clean?

In article <[email protected]>, Ross Biro <[email protected]> wrote:
>
>Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18
>and 2.5.46 it looks to me like in the case of a write, ll_rw_block
>always clears the dirty bit. In the event of an error, nothing resets
>the dirty bit and the uptodate flag is cleared.

Correct.

There's not all that much else it could do. Keeping the dirty bit set is
not an option - that would bring the whole system down on IO errors.

As it is, higher layers that care _can_ figure the IO error out, simply
by noticing that the page is not up-to-date after the write. It's then
totally up to the higher layers (ie user space) to write the thing anew
if it cares about the data.

(In other words: this is why we have fsync() and error codes).
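The userspace side of that contract looks roughly like the following
sketch: data is not durable until fsync() succeeds, so a careful writer
checks both write() and fsync() and reacts to either failing (the
function name and error convention here are made up for illustration):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer to a file and insist on durability; returns 0 on
 * success or a negative errno. fsync() is where a failed write-back
 * surfaces to the application. */
static int durable_write(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    ssize_t n = write(fd, data, len);
    if (n < 0 || (size_t)n != len) {
        int err = n < 0 ? -errno : -EIO;
        close(fd);
        return err;
    }

    if (fsync(fd) < 0) {            /* IO error on write-back lands here */
        int err = -errno;
        close(fd);
        return err;
    }

    return close(fd) < 0 ? -errno : 0;
}
```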

Linus

2002-11-08 20:50:44

by Andrew Morton

Subject: Re: [BUG] Failed writes marked clean?

Ross Biro wrote:
>
> Perhaps I'm reading the code incorrectly, but in kernel versions 2.4.18
> and 2.5.46 it looks to me like in the case of a write, ll_rw_block
> always clears the dirty bit. In the event of an error, nothing resets
> the dirty bit and the uptodate flag is cleared. This means that if the
> same block needs to be read again, the buffer cache will see that the
> buffer is not uptodate and attempt to read the old contents of the
> buffer off of the device. If the read succeeds, the kernel ends up
> corrupting data.

That's correct, for metadata. It may not be fully accurate for
file data, where the page state comes into play as well.

The handling of IO errors is very weird. Especially for writes.
And poorly tested. It needs a big revamp and testing.

> It seems to me that a better solution would be to mark the buffer as
> dirty and uptodate and then attempt to propagate the error as far back
> as possible. Ideally something can be done to correct the problem at a
> higher level. Before I dive in and attempt to do something about this,
> I wanted to make sure I was not missing anything important. So am I
> full of it, or could this really be a problem?
>

Well before going and changing stuff, we need to decide what to
change it _to_. What do we want to happen if there's a read error?
And a write error?

For reads, it makes sense for the page/buffer to be left not uptodate,
and return an error.

For write errors, marking the page/buffer not uptodate doesn't make
a lot of sense. Marking it clean makes sense if we're not going to retry
the write. Marking it dirty, uptodate and unmapped would make sense
if we want to go and try a different part of the disk. But it
doesn't make sense if the whole disk is dead.

Also, think about what a write error _means_. Unless the disk is truly
ancient, it means that the device has run out of alternate space for
the block, or all writes are failing. ie: it is a serious failure.

So perhaps the appropriate strategy on write errors is to mark the
device readonly and to drop all write data on the floor. That means
clean+mapped+uptodate.

So yes, I think I agree with myself. Write errors should leave the
page/buffer clean, uptodate, mapped, PageError (whatever the latter
means...)
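The proposed end states can be written down as a small decision function
(flag names here mirror, but are not, the kernel's buffer/page bits; the
encoding is purely illustrative):

```c
#include <assert.h>

/* Illustrative encoding of the per-outcome buffer/page end states
 * discussed above. */
enum { F_UPTODATE = 1, F_DIRTY = 2, F_MAPPED = 4, F_ERROR = 8 };

/* State a buffer/page should be left in when IO completes. */
static unsigned io_end_state(int is_write, int failed)
{
    if (!failed)
        return F_UPTODATE | F_MAPPED;     /* clean and valid */

    if (!is_write)
        return F_MAPPED | F_ERROR;        /* read error: not uptodate,
                                             caller sees an error */

    /* Write error: drop the data on the floor but keep the in-memory
     * contents valid and flag the error; switching the device readonly
     * would happen at a higher level. */
    return F_UPTODATE | F_MAPPED | F_ERROR;
}
```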

2002-11-08 21:24:05

by Ross Biro

Subject: Re: [BUG] Failed writes marked clean?

Andrew Morton wrote:

>Also, think about what a write error _means_. Unless the disk is truly
>ancient, it means that the device has run out of alternate space for
>the block, or all writes are failing. ie: it is a serious failure.
>
>
I've seen all sorts of interesting drive failure modes, including losing
communications with the drive for a short period and then having it come
back almost as good as new. We've had some data corruption on flaky
drives and I'm guessing this has something to do with it.

I'm going to sit down with our application developers and see what they
want to see from their end and see what I can do.

Ross


2002-11-08 23:29:05

by Theodore Ts'o

Subject: Re: [BUG] Failed writes marked clean?

On Fri, Nov 08, 2002 at 12:57:19PM -0800, Andrew Morton wrote:
> Well before going and changing stuff, we need to decide what to
> change it _to_. What do we want to happen if there's a read error?
> And a write error?
>
> For reads, it makes sense for the page/buffer to be left not uptodate,
> and return an error.

In some circumstances, it may actually make sense to try writing a
random block of data to the disk, since that may force the disk to
remap the block. (Disks generally only remap a block from the pool of
spare blocks on writes, not on reads.)

Unfortunately, if the error was just a transient one, you might end up
smashing the block when you write random garbage in an attempt to
remap the block. So perhaps the answer is to retry the read, and if
that fails, *then* try to do a forced rewrite of the block.
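The recovery order Ted suggests, retry first, force a rewrite only as a
last resort, can be sketched as follows (the ops structure and the tiny
simulated device are hypothetical, just enough to show the control flow):

```c
#include <assert.h>

/* Hypothetical block-device operations; return 0 on success. */
struct block_ops {
    int (*read)(void *dev, long blk, void *buf);
    int (*write)(void *dev, long blk, const void *buf);
};

/* Retry the read a few times in case the error was transient; only if
 * every attempt fails, give up on the old contents and rewrite the
 * block, which is what typically makes the drive remap the sector. */
static int recover_block(const struct block_ops *ops, void *dev,
                         long blk, void *buf, int retries)
{
    for (int i = 0; i < retries; i++)
        if (ops->read(dev, blk, buf) == 0)
            return 0;               /* transient error, data recovered */

    return ops->write(dev, blk, buf);
}

/* Tiny simulated device: reads fail 'fail_reads' times, then succeed;
 * 'rewritten' records whether the forced rewrite happened. */
struct sim_dev { int fail_reads; int rewritten; };

static int sim_read(void *d, long blk, void *buf)
{
    (void)blk; (void)buf;
    struct sim_dev *s = d;
    return s->fail_reads-- > 0 ? -1 : 0;
}

static int sim_write(void *d, long blk, const void *buf)
{
    (void)blk; (void)buf;
    ((struct sim_dev *)d)->rewritten = 1;
    return 0;
}
```

A transient error is absorbed by the retries; only a persistently failing
block gets smashed by the rewrite.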

The next question is whether to do this in userspace or in the kernel.
And if in the kernel, whether it should be done at the device driver
layer, or in the block I/O layer, or in the filesystem?

I can make a case for doing it in userspace, since that gives us the
most amount of flexibility, and it gives us ample opportunity to do
special things, such as paging an operator for help, etc. On the
other hand, there are arguments for doing it in the kernel. It may be
that an appropriately clever filesystem might be able to do more
intelligent recovery while keeping the filesystem mounted.

- Ted

2002-11-09 01:23:08

by Bernd Eckenfels

Subject: Re: [BUG] Failed writes marked clean?

In article <[email protected]> you wrote:
> The next question is whether to do this in userspace or in the kernel.

An idea would be to lock/mark the block in the buffer cache so it won't
be used by the kernel. Userspace could then read out the locked buffers
and decide what to do (like writing to them). It would be especially
good if user space could get all the details about the expected content
(like the inode/redir/dentry/data block of file x).

Greetings
Bernd

2002-11-15 11:31:22

by Pavel Machek

Subject: Re: [BUG] Failed writes marked clean?

Hi!

> In some circumstances, it may actually make sense to try writing a
> random block of data to the disk, since that may force the disk to
> remap the block. (Disks generally only remap a block from the pool of
> spare blocks on writes, not on reads.)
>
> Unfortunately, if the error was just a transient one, you might end up
> smashing the block when you write random garbage in an attempt to
> remap the block. So perhaps the answer is to retry the read, and if
> that fails, *then* try to do a forced rewrite of the block.
>

Retrying is not enough. I've seen a notebook overheating: its CPU was
still okay but the HDD was too hot and started acting crazy. I got away
with 2 bad blocks and the FS survived. If the kernel had tried to do
something clever, it would probably have made the corruption much worse.
Pavel

--
Pavel
My velo broke, so I got Zaurus. If you have Philips Velo 1 you don't need...