2005-05-16 06:06:00

by fs

Subject: [RFD] What error should FS return when I/O failure occurs?

Hi, ML subscribers,

I'm taking part in the project DOUBT[1], and my sub-project
focuses on the consistency and coherency of FS[2].
There is something I'm still confused about: what error should an FS
return when an I/O failure occurs? It seems there are no relevant
documents or standards on this issue.

I'll just show some examples to make things clear:
1. For an EXT3 partition mounted RW, when an I/O failure occurs, the
I/O-related functions return EROFS (read-only?), while other FSes
return EIO.
2. Assume a program doing the following: open - write(async) - close
When the user-mode app calls sys_write, EXT2/JFS return no error,
EXT3 returns EROFS, and XFS/ReiserFS return EIO.

I know each FS has its own implementation, but from the users'
perspective, they don't care which FS they're using. So, when
handling errors from a syscall, they shouldn't have to do the following (p-code):
ret = sys_write(fd, buf, size);
if (ret < 0) {
        /* the following is I/O-failure related */
        if ((IsEXT3() && errno == EROFS) ||
            ((IsXFS() || IsReiserFS()) && errno == EIO)) {
                /* do something about the I/O failure */
                ....
        } else {
                ....
        }
}
When an I/O failure occurs, there should be some standard which
defines the ONLY error that should be returned from the VFS, right?
What they should do is (p-code):
ret = sys_write(fd, buf, size);
if (ret < 0) {
        if (errno == EIO) {
                ...
        } else {
                ...
        }
}


Ref
[1]http://developer.osdl.jp/projects/doubt/
[2]http://developer.osdl.jp/projects/doubt/fs-consistency-and-coherency/index.html

regards,
----
Qu Fuping



2005-05-16 06:55:42

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 2005-05-16 at 02:35, [email protected] wrote:
> On Mon, 16 May 2005 13:14:25 EDT, fs said:
>
> > 1. For an EXT3 partition mounted RW, when an I/O failure occurs, the
> > I/O-related functions return EROFS (read-only?), while other FSes
> > return EIO.
>
> Only the request that actually caused the I/O error (and thus caused the
> system to re-mount the ext3 partition R/O) should get EIO. EROFS is
> the proper error for subsequent requests - because they're being rejected
> because the filesystem is R/O. EIO would be incorrect, because the I/O
> wasn't even tried, much less errored - and there's a good chance that
> subsequent I/O requests *wouldn't* pull an error. Meanwhile, subsequent
> requests don't even *know* whether the filesystem was remounted R/O due to
> an error, or if some root user said 'mount -o remount,ro'.
The point is (from the user's perspective, not the FS developer's):
if you open a file with O_RDWR and sys_open returns success, then
the next sys_write call returns EROFS? The two return values are
self-contradictory.
> > 2. Assume a program doing the following: open - write(async) - close
> > When the user-mode app calls sys_write, EXT2/JFS return no error,
> > EXT3 returns EROFS, and XFS/ReiserFS return EIO.
>
> Remember that the request that actually hits an error could be from a
> process that isn't even in existence anymore, if the page has been sitting
> in the cache for a while and we're finally sending it to disk. If you don't
> believe me, try this on a machine with lots (1G or 2G or so) memory:
>
> 1) cd /usr/src/linux
> 2) tar cf - . | cat > /dev/null # just to prime the disk cache
> 3) make # wait a few minutes for it to complete.
> 4) Now that the 'make' is done, type 'sync' and watch the disk lights blink.
>
> Notice you're syncing the disk blocks written by the various sub-processes
> of 'make', all of which are done and long gone. Who do you report the EIO
> to, on what write() request?
>
> (For even more fun - what happens if it's kjournald pushing the blocks out,
> not the 'sync' command? ;)

Thanks for your example, but it seems you misunderstood my point.
I just used async write as an example, to show that different FSes
return different errors. Here is another example:
stat(2) - open(2) - read(2) - close(2)
When an I/O failure occurs between stat(2) and open(2),
EXT2/JFS/XFS/ReiserFS return EIO, but EXT3 returns ENOENT.
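
To make the sequence concrete, here is a minimal sketch (the path is made
up; unplug the USB cable between the stat and the open):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
        const char *path = "/mnt/usb/somefile"; /* hypothetical test file */
        struct stat st;

        if (stat(path, &st) < 0)        /* usually succeeds from cached metadata */
                perror("stat");
        /* ... unplug the USB cable here ... */
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                perror("open");         /* EXT2/JFS/XFS/ReiserFS: EIO; EXT3: ENOENT */
        else
                close(fd);
        return 0;
}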

The purpose of this RFD is to get the community to understand that
all I/O-related syscalls should return a VFS error, not an FS error.
User-mode apps should not have to care which FS they are using.
So, the community should define the ONLY VFS error first.

regards,
---
Qu Fuping


2005-05-16 08:33:46

by Valdis Klētnieks

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 16 May 2005 13:14:25 EDT, fs said:

> 1. For an EXT3 partition mounted RW, when an I/O failure occurs, the
> I/O-related functions return EROFS (read-only?), while other FSes
> return EIO.

Only the request that actually caused the I/O error (and thus caused the
system to re-mount the ext3 partition R/O) should get EIO. EROFS is
the proper error for subsequent requests - because they're being rejected
because the filesystem is R/O. EIO would be incorrect, because the I/O
wasn't even tried, much less errored - and there's a good chance that
subsequent I/O requests *wouldn't* pull an error. Meanwhile, subsequent
requests don't even *know* whether the filesystem was remounted R/O due to
an error, or if some root user said 'mount -o remount,ro'.

> 2. Assume a program doing the following: open - write(async) - close
> When the user-mode app calls sys_write, EXT2/JFS return no error,
> EXT3 returns EROFS, and XFS/ReiserFS return EIO.

Remember that the request that actually hits an error could be from a
process that isn't even in existence anymore, if the page has been sitting
in the cache for a while and we're finally sending it to disk. If you don't
believe me, try this on a machine with lots (1G or 2G or so) memory:

1) cd /usr/src/linux
2) tar cf - . | cat > /dev/null # just to prime the disk cache
3) make # wait a few minutes for it to complete.
4) Now that the 'make' is done, type 'sync' and watch the disk lights blink.

Notice you're syncing the disk blocks written by the various sub-processes
of 'make', all of which are done and long gone. Who do you report the EIO
to, on what write() request?

(For even more fun - what happens if it's kjournald pushing the blocks out,
not the 'sync' command? ;)

This isn't as easy as it looks....



2005-05-16 17:36:17

by Hans Reiser

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

fs wrote:

>Ref
>[1]http://developer.osdl.jp/projects/doubt/
>[2]http://developer.osdl.jp/projects/doubt/fs-consistency-and-coherency/index.html
>
>
>
Sounds like a great project, good luck with it. If you find
improvements of this kind that ReiserFS has need of, we will be happy to
hear of them.

2005-05-16 17:59:22

by Valdis Klētnieks

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 16 May 2005 14:04:04 EDT, fs said:

> The point is (from the user's perspective, not the FS developer's):
> if you open a file with O_RDWR and sys_open returns success, then
> the next sys_write call returns EROFS? The two return values are
> self-contradictory.

You'd be better off pointing out that 'man 2 write' lists the errors that
might be returned as: EBADF, EINVAL, EFAULT, EFBIG, EPIPE, EAGAIN, EINTR,
ENOSPC, and EIO.

Does the POSIX spec allow write() to return -EROFS?

What happens if you're writing to an NFS-mounted file system, and the remote
system remounts the disk R/O? What is reported in that case?

> The purpose of this RFD is to get the community to understand that
> all I/O-related syscalls should return a VFS error, not an FS error.

All fine and good, until you hit a case like ext3 where reporting
the FS error code will better explain the *real* problem than forcing
it to fit into one of the provided VFS errors.

> User-mode apps should not have to care which FS they are using.
> So, the community should define the ONLY VFS error first.

I think that's been done, and the VFS behavior is "if the FS reports
an error we pass it up to userspace".



2005-05-16 20:16:39

by Kenichi Okuyama

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

Dear Valdis,

>>>>> "VK" == Valdis Kletnieks <[email protected]> writes:
>> 1. For an EXT3 partition mounted RW, when an I/O failure occurs, the
>> I/O-related functions return EROFS (read-only?), while other FSes
>> return EIO.
VK> Only the request that actually caused the I/O error (and thus caused the
VK> system to re-mount the ext3 partition R/O) should get EIO. EROFS is
VK> the proper error for subsequent requests - because they're being rejected
VK> because the filesystem is R/O.

I don't see your point.

According to Qu Fuping's test, the USB cable was UNPLUGGED. That means
the device is gone, and the device driver instantly (well... within a second
or two) detected that fact. How could ext3 remount a device that does
not exist as Read Only?



>> 2. Assume a program doing the following: open - write(async) - close
>> When the user-mode app calls sys_write, EXT2/JFS return no error,
>> EXT3 returns EROFS, and XFS/ReiserFS return EIO.
VK> Remember that the request that actually hits an error could be from a
VK> process that isn't even in existence anymore, if the page has been sitting
VK> in the cache for a while and we're finally sending it to disk.

I don't see the reason why the cache is still available.
# I mean, why such an implementation is valid.

If the storage is known by the device driver to be lost, we should not use
that cache anymore.


Think about what the cached data means.

A cache image is an image of data whose original exists on some
device. The image in memory can be used as a cache because consistency is
managed by the device driver.

If the device no longer exists within reach of the OS, the device driver
will not be able to manage the consistency between the cache image and what
the device really has. Hence, if the device driver loses control over the
device somehow, THE CACHE IMAGE SHOULD BECOME INVALID.


So, even for asynchronous I/O, or read, or open, or close, which may
only require the cached image, IF THE DEVICE DRIVER HAS ALREADY DETECTED
THE HW FAILURE (please keep in mind that I did not include the case where
the device driver has not detected the HW failure yet; I think this is
important to meet the ASYNC requirement), the system should invalidate the
cache image related to that storage beforehand. That means, even for an
asynchronous I/O request, the file system should, at least, ask the device
driver whether it has ALREADY detected any HW failure.
# And that means the device driver should have such an interface.

Since the device driver has already detected the HW failure, whether you
really will cause I/O or not doesn't matter; EIO is the correct error to
return in this case.
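
As a sketch of the interface I mean (pure p-code: the function names are
made up, and no such kernel interface exists today):

/* hypothetical driver-side query: has the driver already seen HW failure? */
int blkdev_hw_failed(struct block_device *bdev);

/* hypothetical FS-side check before serving a request from the cache */
if (blkdev_hw_failed(sb->s_bdev)) {
        drop_cached_pages(sb->s_bdev);  /* made-up helper: invalidate cache images */
        return -EIO;                    /* HW already known bad: EIO, not EROFS */
}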

EXT3 should never succeed in remounting a lost device as Read Only.

regards,
----
Kenichi Okuyama

2005-05-16 20:36:36

by Valdis Klētnieks

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tue, 17 May 2005 05:11:13 +0900, Kenichi Okuyama said:

> According to Qu Fuping's test, the USB cable was UNPLUGGED. That means
> the device is gone, and the device driver instantly (well... within a second
> or two) detected that fact. How could ext3 remount a device that does
> not exist as Read Only?

I thought we were talking about write requests - which were getting
short-circuited because the file system was R/O before we even tried to
talk to the actual device. No sense in queueing a write I/O when the
filesystem is known to be R/O.

If you're trying to *read* from the now-absent disk and encounter a page
that's not already in the cache, yes, you'll probably be returning an EIO.

> I don't see the reason why cache is still available.
> # I mean why such a implementation is valid.
>
> If storage is known to be lost by device driver, we should not use
> that cache anymore.

Why? If the disk disappeared out from under us because it was an unplugged USB
device, there's at least a possibility of it reappearing via hotplug - presumably
if you verify the UUID that it's the *same* file system, hotplug could do a
'mount -o remount' and recover the situation....

(Of course, this may not be practical if we've already tried a write-out due to
memory pressure or the like, and may not fit well into the innards of the VFS - but
it's certainly not an outrageously daft thing to attempt - "User unplugged before
we finished writing, but we still have all the needed pages, so we can re-drive
the sync to disk as if nothing happened"....)



2005-05-16 21:45:43

by Kenichi Okuyama

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

>>>>> "Valdis" == Valdis Kletnieks <[email protected]> writes:

Valdis> On Tue, 17 May 2005 05:11:13 +0900, Kenichi Okuyama said:
>> According to Qu Fuping's test, the USB cable was UNPLUGGED. That means
>> the device is gone, and the device driver instantly (well... within a second
>> or two) detected that fact. How could ext3 remount a device that does
>> not exist as Read Only?

Valdis> I thought we were talking about write requests - which were getting
Valdis> short-circuited because the file system was R/O before we even tried to
Valdis> talk to the actual device. No sense in queueing a write I/O when the
Valdis> filesystem is known to be R/O.

Wrong. Did you check what Qu said?

1) The USB storage existed, mounted READ/WRITE.
2) Then he unplugged the USB cable, making the USB storage unavailable.
3) The EXT3 FS reported the error EROFS.

So, it was somewhere between "after the USB cable unplug" and
"write(2) returning" that EXT3 remounted the file system as RO.
It was not RO from the beginning.


Valdis> If you're trying to *read* from the now-absent disk and encounter a page
Valdis> that's not already in the cache, yes, you'll probably be returning an EIO.
>> I don't see the reason why cache is still available.
>> # I mean why such a implementation is valid.
>>
>> If storage is known to be lost by device driver, we should not use
>> that cache anymore.

Valdis> Why? If the disk disappeared out from under us because it was an unplugged USB
Valdis> device, there's at least a possibility of it reappearing via hotplug - presumably
Valdis> if you verify the UUID that it's the *same* file system, hotplug could do a
Valdis> 'mount -o remount' and recover the situation....

I don't think that's a good idea.

The USB storage is gone. And it SEEMS to have come back.
But how do you know that its image was not changed?

The blocks you have cached might have a different image. If you remount
the file system, the cache image should be updated as well.

But the very fact that *the cache image should be updated* means the old
cache image was invalid. And when did it become invalid?

When it was gone.

Think about it this way. There was a USB storage device and its cached
image. The storage is somehow gone. It never returned before reboot.
Was the cache image valid after the storage was gone? Of course not. That
cache is nothing more than old data which came from a LOST, NEVER COMING
BACK device.

If the device did come back, but with changes, we must read the data from
the storage again. The old cache image was useless, and was harmful.
If the device did come back without changes, we can read the data from
the storage again.

There is no need to keep the cache image, taking the risk of the cache not
being valid, especially while you have no control over the storage.


By the way.

Try umount, and then mount it again manually, for any device. You'll
find all the cache images for that file system are gone.
If your assumption about the cache is correct, why isn't this
umount/mount sequence keeping the cache image?

You'll, at least, see that there is some inconsistency in cache
handling between *umount->mount* and *remount*.

regards,
----
Kenichi Okuyama

2005-05-16 22:08:07

by Brad Boyer

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tue, May 17, 2005 at 06:39:31AM +0900, Kenichi Okuyama wrote:
> The USB storage is gone. And it SEEMS to have come back.
> But how do you know that its image was not changed?
>
> The blocks you have cached might have a different image. If you remount
> the file system, the cache image should be updated as well.
>
> But the very fact that *the cache image should be updated* means the old
> cache image was invalid. And when did it become invalid?
>
> When it was gone.
>
> Think about it this way. There was a USB storage device and its cached
> image. The storage is somehow gone. It never returned before reboot.
> Was the cache image valid after the storage was gone? Of course not. That
> cache is nothing more than old data which came from a LOST, NEVER COMING
> BACK device.
>
> If the device did come back, but with changes, we must read the data from
> the storage again. The old cache image was useless, and was harmful.
> If the device did come back without changes, we can read the data from
> the storage again.
>
> There is no need to keep the cache image, taking the risk of the cache not
> being valid, especially while you have no control over the storage.

This is a difficult problem, but it's not as completely invalid as
you seem to think. The use case I remember taking advantage of in
actual experience is from the classic MacOS. The way the Mac handled
floppies was very interesting. There was a way to eject an HFS floppy
without unmounting it. Using this trick, you could have multiple disks
mounted using the same physical drive. It kept as much as it could
in RAM to be able to use the files, and the system would block on
unknown sectors until the correct disk was reinserted. However, it's
very difficult to get this level of usage without full knowledge all
the way from the device driver up to the UI. Since Apple controlled
the whole thing, they could get away with this. I'm not sure we could
do an equivalent thing in as different an environment as we have.
They could tell each filesystem apart, notify the user when a different
disk was needed, and do everything else needed for a seamless experience.

Brad Boyer
[email protected]

2005-05-16 22:35:51

by Elladan

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tue, May 17, 2005 at 06:39:31AM +0900, Kenichi Okuyama wrote:
> >>>>> "Valdis" == Valdis Kletnieks <[email protected]> writes:
>
> Valdis> Why? If the disk disappeared out from under us because it was an unplugged USB
> Valdis> device, there's at least a possibility of it reappearing via hotplug - presumably
> Valdis> if you verify the UUID that it's the *same* file system, hotplug could do a
> Valdis> 'mount -o remount' and recover the situation....
>
> I don't think that's a good idea.
>
> The USB storage is gone. And it SEEMS to have come back.
> But how do you know that its image was not changed?
>
> The blocks you have cached might have a different image. If you remount
> the file system, the cache image should be updated as well.
>
> But the very fact that *the cache image should be updated* means the old
> cache image was invalid. And when did it become invalid?

[...]

> You'll, at least, see that there is some inconsistency in cache
> handling between *umount->mount* and *remount*.

This is basically the problem people have had with removable storage for
years... You can't really solve it perfectly, since as you note one
could always place the storage in another machine and change it.

But I think it's instructive to note what most other systems have done
in this situation... The solution seems similar in most cases, e.g. on
Mac, Amiga, DOS, Windows, etc.

The typical solution, when a removable device is yanked while dirty
blocks exist, is to keep the dirty blocks around and put the device
into some sort of pending-reinsert state.

Then most systems typically display a large message to the user of the
form: "You idiot! Put the disk/cd/flash/etc. back in!"

The cache and dirty blocks would then only be cleared on a user cancel.
If the same device (according to some ID test) reappears, then it's
reactivated and usage continues normally.

Obviously, this sort of approach requires some user interaction to get
right. It has the distinct advantage of not throwing away the data the
user wrote after an inadvertant disconnect, for example if they thought
the device was done writing when it really wasn't. It can also keep
from corrupting the FS metadata.

The downside is that it might not really work if there isn't a good
way to know when sectors actually are in stable storage, since a few
blocks could be lost around the time the device was pulled.

-J

2005-05-16 22:55:21

by Coywolf Qi Hunt

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On 5/17/05, Kenichi Okuyama <[email protected]> wrote:
> Dear Valdis,
>
> >>>>> "VK" == Valdis Kletnieks <[email protected]> writes:
> >> 1. For an EXT3 partition mounted RW, when an I/O failure occurs, the
> >> I/O-related functions return EROFS (read-only?), while other FSes
> >> return EIO.
> VK> Only the request that actually caused the I/O error (and thus caused the
> VK> system to re-mount the ext3 partition R/O) should get EIO. EROFS is
> VK> the proper error for subsequent requests - because they're being rejected
> VK> because the filesystem is R/O.
>
> I don't see your point.
>
> According to Qu Fuping's test, the USB cable was UNPLUGGED. That means
> the device is gone, and the device driver instantly (well... within a second
> or two) detected that fact. How could ext3 remount a device that does
> not exist as Read Only?
>
>
> >> 2. Assume a program doing the following: open - write(async) - close
> >> When the user-mode app calls sys_write, EXT2/JFS return no error,
> >> EXT3 returns EROFS, and XFS/ReiserFS return EIO.
> VK> Remember that the request that actually hits an error could be from a
> VK> process that isn't even in existence anymore, if the page has been sitting
> VK> in the cache for a while and we're finally sending it to disk.
>
> I don't see the reason why the cache is still available.
> # I mean, why such an implementation is valid.
>
> If the storage is known by the device driver to be lost, we should not use
> that cache anymore.
>
> Think about what the cached data means.
>
> A cache image is an image of data whose original exists on some
> device. The image in memory can be used as a cache because consistency is
> managed by the device driver.
>
> If the device no longer exists within reach of the OS, the device driver
> will not be able to manage the consistency between the cache image and what
> the device really has. Hence, if the device driver loses control over the
> device somehow, THE CACHE IMAGE SHOULD BECOME INVALID.
>
> So, even for asynchronous I/O, or read, or open, or close, which may
> only require the cached image, IF THE DEVICE DRIVER HAS ALREADY DETECTED
> THE HW FAILURE (please keep in mind that I did not include the case where
> the device driver has not detected the HW failure yet; I think this is
> important to meet the ASYNC requirement), the system should invalidate the
> cache image related to that storage beforehand. That means, even for an
> asynchronous I/O request, the file system should, at least, ask the device
> driver whether it has ALREADY detected any HW failure.
> # And that means the device driver should have such an interface.
>
> Since the device driver has already detected the HW failure, whether you
> really will cause I/O or not doesn't matter; EIO is the correct error to
> return in this case.
>
> EXT3 should never succeed in remounting a lost device as Read Only.


There are two kinds of HW failure:

1. still readable, only writes fail.
2. unreadable and unwritable.

For the first case, if the mount option errors=remount-ro is given or implied,
EROFS is appropriate; otherwise EIO. For the second case, always EIO.

The current VFS design does not try to hide problems in its
underlying filesystems.
There is no need to make it transparent. Userland programs need to consider
both EROFS and EIO.
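
For example (p-code, treating both errors as fatal storage trouble):

ssize_t n = write(fd, buf, size);
if (n < 0) {
        if (errno == EIO || errno == EROFS) {
                /* storage trouble: stop writing, report, clean up */
        } else {
                /* ENOSPC, EINTR, ...: handle as usual */
        }
}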


--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

2005-05-16 22:57:51

by Coywolf Qi Hunt

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On 5/17/05, Kenichi Okuyama <[email protected]> wrote:
> >>>>> "Valdis" == Valdis Kletnieks <[email protected]> writes:
>
> Valdis> On Tue, 17 May 2005 05:11:13 +0900, Kenichi Okuyama said:
> >> According to Qu Fuping's test, the USB cable was UNPLUGGED. That means
> >> the device is gone, and the device driver instantly (well... within a second
> >> or two) detected that fact. How could ext3 remount a device that does
> >> not exist as Read Only?
>
> Valdis> I thought we were talking about write requests - which were getting
> Valdis> short-circuited because the file system was R/O before we even tried to
> Valdis> talk to the actual device. No sense in queueing a write I/O when the
> Valdis> filesystem is known to be R/O.
>
> Wrong. Did you check what Qu said?
>
> 1) The USB storage existed, mounted READ/WRITE.
> 2) Then he unplugged the USB cable, making the USB storage unavailable.
> 3) The EXT3 FS reported the error EROFS.
>
> So, it was somewhere between "after the USB cable unplug" and
> "write(2) returning" that EXT3 remounted the file system as RO.
> It was not RO from the beginning.
>
> Valdis> If you're trying to *read* from the now-absent disk and encounter a page
> Valdis> that's not already in the cache, yes, you'll probably be returning an EIO.
> >> I don't see the reason why the cache is still available.
> >> # I mean, why such an implementation is valid.
> >>
> >> If the storage is known by the device driver to be lost, we should not use
> >> that cache anymore.
>
> Valdis> Why? If the disk disappeared out from under us because it was an unplugged USB
> Valdis> device, there's at least a possibility of it reappearing via hotplug - presumably
> Valdis> if you verify the UUID that it's the *same* file system, hotplug could do a
> Valdis> 'mount -o remount' and recover the situation....
>
> I don't think that's a good idea.
>
> The USB storage is gone. And it SEEMS to have come back.
> But how do you know that its image was not changed?
>
> The blocks you have cached might have a different image. If you remount
> the file system, the cache image should be updated as well.
>
> But the very fact that *the cache image should be updated* means the old
> cache image was invalid. And when did it become invalid?
>
> When it was gone.
>
> Think about it this way. There was a USB storage device and its cached
> image. The storage is somehow gone. It never returned before reboot.
> Was the cache image valid after the storage was gone? Of course not. That
> cache is nothing more than old data which came from a LOST, NEVER COMING
> BACK device.
>
> If the device did come back, but with changes, we must read the data from
> the storage again. The old cache image was useless, and was harmful.
> If the device did come back without changes, we can read the data from
> the storage again.
>
> There is no need to keep the cache image, taking the risk of the cache not
> being valid, especially while you have no control over the storage.
>
> By the way.
>
> Try umount, and then mount it again manually, for any device. You'll
> find all the cache images for that file system are gone.
> If your assumption about the cache is correct, why isn't this
> umount/mount sequence keeping the cache image?

When there's an umount, the kernel has no way to know whether the device
will come back (be mounted again) or not. With mount -o remount, the device
has never gone away.

>
> You'll, at least, see that there is some inconsistency in cache
> handling between *umount->mount* and *remount*.


--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

2005-05-17 04:34:50

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 2005-05-16 at 11:42, Chris Siebenmann wrote:
> You write:
> | When I/O failure occurs, there should be some standards which
> | define the ONLY error that should be returned from VFS, right?
>
> In practice there is no standard and there never will be any standard.
> In general the only thing code can do on any write error is to abort
> the operation, regardless of what errno is. (The exceptions are for
> things like nonblocking IO, where 'EAGAIN' and 'EWOULDBLOCK' are not
> real errors.)
Yes, we're sure to abort the operation, but we can't just call
exit(EXIT_FAILURE) directly. In an HA environment, we should
identify the cause of the error and take the corresponding action,
right? So we need to get the right error.
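
For example (a sketch only; the action names are placeholders for whatever
the HA monitor actually does):

#include <errno.h>

enum ha_action { HA_RETRY, HA_FAILOVER, HA_GIVE_UP };

/* hypothetical policy: map a failed write(2)'s errno to an HA action */
static enum ha_action classify_write_error(int err)
{
        switch (err) {
        case EINTR:
        case EAGAIN:
                return HA_RETRY;     /* transient: just try again */
        case EIO:
        case EROFS:                  /* ext3's remount-ro shows up as this */
                return HA_FAILOVER;  /* storage is bad: switch to the mirror */
        default:
                return HA_GIVE_UP;   /* ENOSPC, EBADF, ...: abort */
        }
}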
> ---
> "I shall clasp my hands together and bow to the corners of the world."
> Number Ten Ox, "Bridge of Birds"
> [email protected] utgpu!cks

regards,
----
Qu Fuping


2005-05-17 04:58:10

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 2005-05-16 at 18:54, Coywolf Qi Hunt wrote:

> There are two kinds of HW failure:
>
> 1. still readable, only writes fail.
> 2. unreadable and unwritable.
>
> For the first case, if the mount option errors=remount-ro is given or implied,
> EROFS is appropriate; otherwise EIO. For the second case, always EIO.
>
> The current VFS design does not try to hide problems in its
> underlying filesystems.
> There is no need to make it transparent. Userland programs need to consider
> both EROFS and EIO.
What you said is based on the FS implementor's perspective.
But from the user's perspective: they open a file with O_RDWR, get
success, then write returns EROFS?
Besides, EXT3 ALWAYS returns EROFS in both the 1st and 2nd cases; even if
you specify errors=continue, things are still the same.

regards,
----
Qu Fuping


2005-05-17 05:37:10

by Hua Zhong (hzhong)

Subject: RE: [RFD] What error should FS return when I/O failure occurs?

> What you said is based on the FS implementor's perspective.
> But from the user's perspective: they open a file with O_RDWR, get
> success, then write returns EROFS?
> Besides, EXT3 ALWAYS returns EROFS in both the 1st and 2nd cases; even if
> you specify errors=continue, things are still the same.

Which version of the kernel are you using?

That was probably the case in kernels before 2.4.20. The old ext3 had a
problem in that it ignored I/O errors at journal commit time. I submitted a
patch to fix that around the time of 2.4.20. 2.6 should be fine too,
unless someone else broke it again.

Hua

2005-05-17 05:39:23

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Mon, 2005-05-16 at 13:58, [email protected] wrote:

> You'd be better off pointing out that 'man 2 write' lists the errors that
> might be returned as: EBADF, EINVAL, EFAULT, EFBIG, EPIPE, EAGAIN, EINTR,
> ENOSPC, and EIO.
>
> Does the POSIX spec allow write() to return -EROFS?
If there were a POSIX spec on this issue, I wouldn't have posted this RFD.
> What happens if you're writing to an NFS-mounted file system, and the
> remote system remounts the disk R/O? What is reported in that case?
So, it's necessary to define the right error in this case.
Each FS would follow this standard and give the defined error;
users could follow this standard without caring which FS they're using.

> > The purpose of this RFD is to get the community to understand that
> > all I/O-related syscalls should return a VFS error, not an FS error.
>
> All fine and good, until you hit a case like ext3 where reporting
> the FS error code will better explain the *real* problem than forcing
> it to fit into one of the provided VFS errors.

So, if Linux supports a new FS which returns yet another error,
does that mean the app should be rewritten to include the new
error? There should be some standard constraining this behaviour.

> > User-mode apps should not have to care which FS they are using.
> > So, the community should define the ONLY VFS error first.
>
> I think that's been done, and the VFS behavior is "if the FS reports
> an error we pass it up to userspace".

Then, from userspace, the V (of VFS) loses its meaning, because the
error is FS-dependent, not FS-independent.

regards,
----
Qu Fuping


2005-05-17 05:47:57

by fs

Subject: RE: [RFD] What error should FS return when I/O failure occurs?

On Tue, 2005-05-17 at 01:36, Hua Zhong (hzhong) wrote:
> > What you said is based on the FS implementor's perspective.
> > But from the user's perspective: they open a file with O_RDWR, get
> > success, then write returns EROFS?
> > Besides, EXT3 ALWAYS returns EROFS in both the 1st and 2nd cases; even if
> > you specify errors=continue, things are still the same.
>
> Which version of the kernel are you using?
My test environment is based on the 2.6.11 kernel.
> That was probably the case in kernels before 2.4.20. The old ext3 had a
> problem in that it ignored I/O errors at journal commit time. I submitted a
> patch to fix that around the time of 2.4.20. 2.6 should be fine too,
> unless someone else broke it again.
>
> Hua


2005-05-17 06:00:42

by Hua Zhong (hzhong)

Subject: RE: [RFD] What error should FS return when I/O failure occurs?

The thing is the EIO almost always happens at background so there is no
way to return it to the user space. If you want to see EIO, do fsync
explicitly.
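
Something like this (a fragment; error handling trimmed):

ssize_t n = write(fd, buf, len);    /* may "succeed": data only reached the cache */
if (n < 0)
        perror("write");
if (fsync(fd) < 0)
        perror("fsync");            /* background write-out errors surface here */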

> > Which version of the kernel are you using?
> My test environment is based on the 2.6.11 kernel.
> > That was probably the case in kernels before 2.4.20. The old ext3 had a
> > problem in that it ignored I/O errors at journal commit time. I submitted a
> > patch to fix that around the time of 2.4.20. 2.6 should be fine too,
> > unless someone else broke it again.
> >
> > Hua
>

2005-05-17 06:13:13

by fs

Subject: RE: [RFD] What error should FS return when I/O failure occurs?

On Tue, 2005-05-17 at 02:00, Hua Zhong (hzhong) wrote:
> The thing is, the EIO almost always happens in the background, so there is
> no way to return it to user space. If you want to see the EIO, do an fsync
> explicitly.
Even with fsync, the result is still EROFS.
You can visit
http://developer.osdl.jp/projects/doubt/fs-consistency-and-coherency/
to see the results.



2005-05-17 06:18:36

by Denis Vlasenko

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tuesday 17 May 2005 01:30, Elladan wrote:
> > I don't think that's a good idea.
> >
> > The USB storage is gone. And it SEEMS to have come back.
> > But how do you know that its image was not changed?

The original poster seems to be a bit confused about what the device driver
is responsible for and what the FS is responsible for.

IIRC, currently the FS has no code to understand that the device was removed.
It just gets failures on read/write ops, but does not do much with the
error codes. It does not 'unmount on the fly', etc. It could do
something like this, but currently it is not coded.
Errors are simply propagated up to the callers.

> > The blocks you have cached might have a different image. If you remount
> > the file system, the cache image should be updated as well.
> >
> > But the very fact that *the cache image should be updated* means the old
> > cache image was invalid. And when did it become invalid?
>
> [...]
>
> > You'll, at least, see that there is some inconsistency in cache
> > handling between *umount->mount* and *remount*.
>
> This is basically the problem people have had with removable storage for
> years... You can't really solve it perfectly, since as you note one
> could always place the storage in another machine and change it.

Linux is worse than "other OSes" because it either treats removable storage
the same as ordinary disks - caching writes and starting writeback
sometime in the future (thus we do not notice removals ASAP) -
or mounts removables O_SYNC, where writes are not cached at all
(too aggressive/slow in many circumstances) and you can be positively
sure data is on disk once the I/O indicator LED is off.

A new, less aggressive sync option, meaning "okay to cache writes,
but start writeback at once and write out all dirty data for this device",
would bring the best of both worlds.

Users of USB sticks, no longer dying from O_SYNC mounts, would be
quite happy, too :)

A further step could be a way to keep FSes such as unjournalled ext2
in a 'clean' state between write bursts (if the FS is mounted with a
'lazy sync' flag), and auto-unmounting of removed media. Although
I suspect even the current hotplug can be configured to do auto-unmount;
I just did not try it myself.

> But I think it's instructive to note what most other systems have done
> in this situation... The solution seems similar in most cases, e.g. on
> Mac, Amiga, DOS, Windows, etc.
>
> The typical solution, when a removable device is yanked while dirty
> blocks exist, is to keep the dirty blocks around and put the device
> into some sort of pending-reinsert state.
>
> Then most systems typically display a large message to the user of the
> form: "You idiot! Put the disk/cd/flash/etc. back in!"

Such a message may be doable with hotplug. We 'only' need the
'pending-reinsert state' bits coded.

> The cache and dirty blocks would then only be cleared on a user cancel.
> If the same device (according to some ID test) reappears, then it's
> reactivated and usage continues normally.
>
> Obviously, this sort of approach requires some user interaction to get
> right. It has the distinct advantage of not throwing away the data the
> user wrote after an inadvertent disconnect, for example if they thought
> the device was done writing when it really wasn't. It can also keep
> from corrupting the FS metadata.
>
> The downside is that it might not really work if there isn't a good
> way to know when sectors actually are in stable storage, since a few
> blocks could be lost around the time the device was pulled.

I think it is sensible to try to improve clean removals first,
then tackle dirty ones.
--
vda

2005-05-17 08:33:38

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tue, 2005-05-17 at 03:57, Denis Vlasenko wrote:
> On Tuesday 17 May 2005 19:47, fs wrote:
> I think you want too much from fs developers. Use this:
>
> if (error)
>         if (errno == ...) { ... }
>         else if (errno == ...) { ... }
>         else { ... }            <--- handle any other errors
>
> and be happy.
For users, the OS is a black box that provides FS services.
The OS should hide the differences between FSes, so a user-mode app can
run happily on every FS. For the same reason, the OS should return
the same error no matter which FS it comes from. Users only care
about the interface, not the implementation. So, the OS should
_AT LEAST_ make the interface clear (here meaning the syscall should
return a definite error).

> --
> vda
>

regards,
----
Qu Fuping


2005-05-17 21:28:20

by Kenichi Okuyama

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

>>>>> "J" == Elladan <[email protected]> writes:

J> On Tue, May 17, 2005 at 06:39:31AM +0900, Kenichi Okuyama wrote:
>> >>>>> "Valdis" == Valdis Kletnieks <[email protected]> writes:
>>
Valdis> Why? If the disk disappeared out from under us because it was an unplugged USB
Valdis> device, there's at least a possibility of it reappearing via hotplug - presumably
Valdis> if you verify the UUID that it's the *same* file system, hotplug could do a
Valdis> 'mount -o remount' and recover the situation....
>>
>> I don't think that's a good idea.
>>
>> The USB storage is gone. And it SEEMS to have come back.
>> But how do you know that its image was not changed?
>>
>> The blocks you have cached might have a different image. If you remount
>> the file system, the cache image should be updated as well.
>>
>> But the very fact that *the cache image should be updated* means the old
>> cache image was invalid. And when did it become invalid?

J> [...]

>> You'll, at least, see that there is some inconsistency in cache
>> handling between *umount->mount* and *remount*.

J> This is basically the problem people have had with removable storage for
J> years... You can't really solve it perfectly, since as you note one
J> could always place the storage in another machine and change it.

Unfortunately, this problem happens even in the case of
non-removable storage. An HDD will break, and will (accidentally) be
removed from the OS's perspective. The FS does not treat them differently.


In the old days, neither HDDs nor any other devices had a way to detect
problems at all, except by detecting timeouts. THAT was the reason old OSes
kept using the cache for read/write even after a human had detected the HW
failure. They simply DID NOT KNOW about the HW failure, and therefore
optimistically assumed the cache image was still valid.

But look at the USB case. If you look at /var/log/messages, you will
find the USB device driver detecting your cable unplug as soon as you
unplug it.

The File System should not ASSUME the HW to be healthy without asking the
Device Driver about it. It is the device driver that is responsible for the
health check of the HW, not the FS.


J> But I think it's instructive to note what most other systems have done
J> in this situation... The solution seems similar in most cases, e.g. on
J> Mac, Amiga, DOS, Windows, etc.

I do agree with this.
Yes, everyone is crossing the street even though the signals are red.
In the old days, there was an excuse. Today, I don't believe there is.


J> The typical solution, when a removable device is yanked while dirty
J> blocks exist, is to keep the dirty blocks around and put the device
J> into some sort of pending-reinsert state.

You should not call an IMPLEMENTATION without CAREFUL THOUGHT a SOLUTION.


OK. Let me say it this way.

If THIS is how Linux is implemented, it should be implemented like
this across ALL the FSes, for an application should not have to care
which FS it is standing on.

That is, someone (Linus?) has to declare this as the DESIGN of Linux.
Things should not simply be implementation dependent.


Don't take me wrong. I'm saying Linux is great.
Because Linux is great, this new problem has arisen.


In a legacy system, like *BSD, there was only one FS supported as the
local FS, or at least for its major use.
# Though FreeBSD does support VFAT, you won't want to implement
# a DB on top of VFAT. It must be UFS for this use.

Since only one FS was the default local FS, its implementation was
also singular. Hence, how it was implemented was the one and only
behaviour the OS served to you.


But in the case of Linux, there are:
EXT2, EXT3, JFS, XFS, ReiserFS
at LEAST as default local FSes. There was no other OS like this.
And the very fact that they are ALL defaults means users will ask ALL of
these FSes to react exactly the same to the same HW failure, unless an
FS somehow has a way to recover 100%, transparently.


It is no wonder there is no standard about this. This is a NEW
issue that no other OS really needed to face in the past. It's not an
old issue.

And that, I believe, is the reason Qu Fuping raised this problem.


J> Then most systems typically display a large message to the user of the
J> form: "You idiot! Put the disk/cd/flash/etc. back in!"

Like in Solaris 10, you mean.

I don't see why this is a good solution. The OS is giving up
control over the HW, and yet not stopping the service related to
that HW. This is nothing more than risk.


Human actions and software actions are not on the same timescale. I
mean, a human needs several seconds (if not minutes) before he can do a
thing when he sees this message. On the other hand, an application can
issue, say, 1000 requests within a second.

If the OS does not stop the service related to the lost storage immediately,
those several thousand requests will be replied to with success. And at
some point later, all of a sudden, they become invalid.

I don't think this is a RELIABLE service. Rather, I would worry about the
chance of a security vulnerability.


regards,
----
Kenichi Okuyama

2005-05-18 06:02:19

by fs

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Tue, 2005-05-17 at 14:26, Bryan Henderson wrote:
> > Yes, we're sure to abort the operation, but we can't just call
> > exit(EXIT_FAILURE) directly. In an HA environment, we should
> > identify the cause of the error and take the corresponding action,
> > right? So we need to get the right error.
>
> You mean a computer program will take the corresponding action? I think
> it would take a remarkably intelligent program to respond appropriately to
> particular failures -- especially if the program isn't tailored to a very
> specific environment. In practice, all I ever see is a binary response --
> one for success, one for failure. The errno is used at most for giving a
> three-word explanation to a human so the human can respond. That's why
> people don't take this issue seriously.
>
> "pass the errno up" is definitely a layering violation and cheap
> architecture. It's why the 3 word description you get is often
> meaningless -- it's telling you about a failure deep in computations you
> aren't even supposed to know about. I myself stay away from errnos where
> possible and produce error information in English text, with each layer
> adding information meaningful at that layer. But where we're sticking
> with classic errnos, it just doesn't make sense to work really hard on it.
>
> Nonetheless, I think there's broad agreement, and the current discussion
> is consistent with it, that if write() fails due to an I/O error, the
> errno should be EIO. Whether it's formally specified or not, the standard
> is there. That ext3 returns EROFS is either a bug or an implementation
What standard do you mean?
> convenience compromise, or a case where the actual failure is more
> complicated than you imagine (maybe an operation fails and gets retried --
> the original failure caused an automatic switch to R/O, and the retry
> failed because of the R/O status). Errnos are definitely not sufficient to
> give you the whole chain of causation for a failure -- if they give you
> even the immediate cause, you should feel fortunate.
I suggest you visit our project and see the testing results:
http://developer.osdl.jp/projects/doubt/fs-consistency-and-coherency
For each test case, different FSes return different results.
From the user's perspective, it's really annoying, so there should be a
standard which constrains the error type. Otherwise, different FSes
can return whatever they want, regardless of the user's need.
> --
> Bryan Henderson IBM Almaden Research Center
> San Jose CA Filesystems
>


2005-05-18 07:58:14

by Valdis Klētnieks

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Wed, 18 May 2005 13:10:24 EDT, fs said:

> For each test case, different FSes return different results.
> From the user's perspective, it's really annoying, so there should be a
> standard which constrains the error type. Otherwise, different FSes
> can return whatever they want, regardless of the user's need.

Which does the user "need":

a) an 'errno' value that's forced to be one of a specific subset of values,
even if none of them explains what's going on

or

b) an 'errno' value that actually tells you about the error?

Remember - if the *kernel* forces a -EROFS to become a -EIO, then userspace
is stuck with that value. If the kernel passes -EROFS back to userspace,
then after glibc stashes an EROFS into errno, either glibc or the application
program can insert an 'if (errno == EROFS) {errno = EIO;}' if it feels that
EROFS is unnatural.
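
E.g. a trivial userspace wrapper (a sketch; nothing kernel-side is needed):

#include <errno.h>
#include <unistd.h>

/* hypothetical wrapper: fold EROFS into EIO for callers wanting one code */
static ssize_t xwrite(int fd, const void *buf, size_t count)
{
        ssize_t n = write(fd, buf, count);
        if (n < 0 && errno == EROFS)
                errno = EIO;
        return n;
}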

And in any case, that's what the *application programmer* needs. What the *user*
needs is for the file to either be safely stored, or a dialog box put up saying
that it failed....



2005-05-19 15:45:11

by Elladan

Subject: Re: [RFD] What error should FS return when I/O failure occurs?

On Wed, May 18, 2005 at 06:26:40AM +0900, Kenichi Okuyama wrote:
> >>>>> "J" == Elladan <[email protected]> writes:
>
> J> This is basically the problem people have had with removable storage for
> J> years... You can't really solve it perfectly, since as you note one
> J> could always place the storage in another machine and change it.
>
> Unfortunately, this problem happens even in the case of
> non-removable storage. An HDD will break, and will (accidentally) be
> removed from the OS's perspective. The FS does not treat them differently.
>
> In the old days, neither HDDs nor any other devices had a way to detect
> problems at all, except by detecting timeouts. THAT was the reason old OSes
> kept using the cache for read/write even after a human had detected the HW
> failure. They simply DID NOT KNOW about the HW failure, and therefore
> optimistically assumed the cache image was still valid.
>
> But look at the USB case. If you look at /var/log/messages, you will
> find the USB device driver detecting your cable unplug as soon as you
> unplug it.
>
> The File System should not ASSUME the HW to be healthy without asking the
> Device Driver about it. It is the device driver that is responsible for the
> health check of the HW, not the FS.

This isn't an issue that can be fixed perfectly. If you just pull a USB
device out, data damage is likely. However, if the device isn't
modified before being re-inserted, attempting to finish the write will
often work fine.

Other OSes, even older ones, typically do have near-immediate
notification that the device has gone away. For example, old floppy-disk
based systems such as the Amiga may have had manual eject, but they
did have the capability to detect floppy disk presence. Yelling at the
user is a way to (possibly) complete the I/O and prevent FS corruption.
Sometimes it works, sometimes it doesn't. If you hold the last few
seconds of I/O in memory as well as the remaining dirty buffers, the
probability of avoiding corruption (provided the device wasn't placed in
another machine) is fairly good.

You're right, though: other than possibly having some start/stop support,
this does not need much FS support. It's a driver and UI issue for the
most part.

Implementing some of the fancier versions of this, such as being able to
pull out a floppy, place a different one in the drive, and have two apps
using two floppies at once (as some systems have implemented), would
require placing a volume manager on top of the device driver as well,
and implementing this sort of logic in there. The problem there is that
the volume manager needs to understand the disk label well enough to
identify a particular device.

-J