2008-01-18 15:35:50

by Peter Staubach

Subject: [PATCH 0/3] enhanced ESTALE error handling

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server. This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client. It can also happen
when the file resides upon a file system which is no longer
exported.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server. This can easily
happen in system calls such as stat(2) when the pathname is
converted to a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this. These system calls
should either fail with an ENOENT error if the pathname cannot
be successfully translated to a dentry/inode pair or succeed
or fail based on their own semantics.

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care can be taken to ensure that forward progress is always
being made in order to avoid infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again. Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ. In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
application environments continue to work.

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these system calls like
there is with system calls which take pathnames as arguments.

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks. When two or more copies of this program
are running, many ESTALE errors can be seen over the network.

Comments?

Thanx...

ps


Attachments:
syscallgen.c (14.65 kB)
exec_test.c (42.00 B)

2008-01-18 16:55:47

by Peter Staubach

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

Chuck Lever wrote:
> Hi Peter-
>
> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>> Hi.
>>
>> Here is a patch set which modifies the system to enhance the
>> ESTALE error handling for system calls which take pathnames
>> as arguments.
>
> The VFS already handles ESTALE.
>
> If a pathname resolution encounters an ESTALE at any point, the
> resolution is restarted exactly once, and an additional flag is passed
> to the file system during each lookup that forces each component in
> the path to be revalidated on the server. This has no possibility of
> causing an infinite loop.
>
> Is there some part of this logic that is no longer working?

The VFS does not fully handle ESTALE. An ESTALE error can occur
during the second pathname resolution attempt. There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.

Also, there was no support for ESTALE errors which occur during
operations subsequent to the pathname resolution process. For
example, during a mkdir(2) operation, the ESTALE can occur from
the over-the-wire MKDIR operation after the LOOKUP operations
have all succeeded.

Thanx...

ps

2008-01-18 17:18:52

by Chuck Lever III

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
> Chuck Lever wrote:
>> Hi Peter-
>>
>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>> Hi.
>>>
>>> Here is a patch set which modifies the system to enhance the
>>> ESTALE error handling for system calls which take pathnames
>>> as arguments.
>>
>> The VFS already handles ESTALE.
>>
>> If a pathname resolution encounters an ESTALE at any point, the
>> resolution is restarted exactly once, and an additional flag is
>> passed to the file system during each lookup that forces each
>> component in the path to be revalidated on the server. This has
>> no possibility of causing an infinite loop.
>>
>> Is there some part of this logic that is no longer working?
>
> The VFS does not fully handle ESTALE. An ESTALE error can occur
> during the second pathname resolution attempt.

If an ESTALE occurs during the second resolution attempt, we should
give up. When I addressed this issue two years ago, the two-try
logic was the only acceptable solution because there's no way to
guarantee the pathname resolution will ever finish unless we put a
hard limit on it.

> There are lots of
> reasons, some of which are the 1 second resolution from some file
> systems on the server

Which is a server bug, AFAICS. It's simply impossible to close all
the windows that result from sloppy file time stamps without
completely disabling client-side caching. The NFS protocol relies on
file time stamps to manage cache coherence. If the server is lying
about time stamps, there's no way the client can cache coherently.

> and the window in between the revalidation
> and the actual use of the file handle associated with each
> dentry/inode pair.

A use case or two would be useful to explore (on linux-nfs or linux-
fsdevel, rather than lkml).

> Also, there was no support for ESTALE errors which occur during
> subsequent operations to the pathname resolution process. For
> example, during a mkdir(2) operation, the ESTALE can occur from
> the over the wire MKDIR operation after the LOOKUP operations
> have all succeeded.

If the final operation fails after a pathname resolution, then it's a
real error. Is there a fixed and valid recovery script for the
client in this case that will allow the mkdir to proceed?

Admittedly, the NFS client could recover more cleanly from some of
these problems, but given the architecture of the Linux VFS, it will
be difficult to address some of the corner cases.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-01-18 17:30:15

by Peter Staubach

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

Chuck Lever wrote:
> On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
>> Chuck Lever wrote:
>>> Hi Peter-
>>>
>>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>>> Hi.
>>>>
>>>> Here is a patch set which modifies the system to enhance the
>>>> ESTALE error handling for system calls which take pathnames
>>>> as arguments.
>>>
>>> The VFS already handles ESTALE.
>>>
>>> If a pathname resolution encounters an ESTALE at any point, the
>>> resolution is restarted exactly once, and an additional flag is
>>> passed to the file system during each lookup that forces each
>>> component in the path to be revalidated on the server. This has no
>>> possibility of causing an infinite loop.
>>>
>>> Is there some part of this logic that is no longer working?
>>
>> The VFS does not fully handle ESTALE. An ESTALE error can occur
>> during the second pathname resolution attempt.
>
> If an ESTALE occurs during the second resolution attempt, we should
> give up. When I addressed this issue two years ago, the two-try logic
> was the only acceptable solution because there's no way to guarantee
> the pathname resolution will ever finish unless we put a hard limit on
> it.
>

I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.

>> There are lots of
>> reasons, some of which are the 1 second resolution from some file
>> systems on the server
>
> Which is a server bug, AFAICS. It's simply impossible to close all
> the windows that result from sloppy file time stamps without
> completely disabling client-side caching. The NFS protocol relies on
> file time stamps to manage cache coherence. If the server is lying
> about time stamps, there's no way the client can cache coherently.
>

Server bug or not, it is something that the client has to live
with. We can't get the server file system fixed, so it is
something that we should find a way to live with. This support
can help.

>> and the window in between the revalidation
>> and the actual use of the file handle associated with each
>> dentry/inode pair.
>
> A use case or two would be useful to explore (on linux-nfs or
> linux-fsdevel, rather than lkml).
>

I created a bunch of use cases in the gensyscall.c program that
I attached to the original description of the problem and my
proposed solution. It was very useful in generating many, many
ESTALE errors over the wire from a variety of different
over-the-wire operations, which were originally being returned to
user level.

>> Also, there was no support for ESTALE errors which occur during
>> subsequent operations to the pathname resolution process. For
>> example, during a mkdir(2) operation, the ESTALE can occur from
>> the over the wire MKDIR operation after the LOOKUP operations
>> have all succeeded.
>
> If the final operation fails after a pathname resolution, then it's a
> real error. Is there a fixed and valid recovery script for the client
> in this case that will allow the mkdir to proceed?
>

Why do you think that it is an error?

It can easily occur if the directory in which the new directory
is to be created disappears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.

> Admittedly, the NFS client could recover more cleanly from some of
> these problems, but given the architecture of the Linux VFS, it will
> be difficult to address some of the corner cases.

Could you outline some of these corner cases that this proposal
would not address, please?

I ran the test program for many hours, against several different
servers, and although I can't prove completeness, was not able to
show any ESTALE errors being returned unexpectedly.

Thanx...

ps

2008-01-18 17:52:45

by Chuck Lever III

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
> Chuck Lever wrote:
>> On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
>>> Chuck Lever wrote:
>>>> Hi Peter-
>>>>
>>>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>>>> Hi.
>>>>>
>>>>> Here is a patch set which modifies the system to enhance the
>>>>> ESTALE error handling for system calls which take pathnames
>>>>> as arguments.
>>>>
>>>> The VFS already handles ESTALE.
>>>>
>>>> If a pathname resolution encounters an ESTALE at any point, the
>>>> resolution is restarted exactly once, and an additional flag is
>>>> passed to the file system during each lookup that forces each
>>>> component in the path to be revalidated on the server. This has
>>>> no possibility of causing an infinite loop.
>>>>
>>>> Is there some part of this logic that is no longer working?
>>>
>>> The VFS does not fully handle ESTALE. An ESTALE error can occur
>>> during the second pathname resolution attempt.
>>
>> If an ESTALE occurs during the second resolution attempt, we
>> should give up. When I addressed this issue two years ago, the
>> two-try logic was the only acceptable solution because there's no
>> way to guarantee the pathname resolution will ever finish unless
>> we put a hard limit on it.
>>
>
> I can probably imagine a situation where the pathname resolution
> would never finish, but I am not sure that it could ever happen
> in nature.

Unless someone is doing something malicious. Or if the server is
repeatedly returning ESTALE for some reason.

>>> There are lots of
>>> reasons, some of which are the 1 second resolution from some file
>>> systems on the server
>>
>> Which is a server bug, AFAICS. It's simply impossible to close
>> all the windows that result from sloppy file time stamps without
>> completely disabling client-side caching. The NFS protocol relies
>> on file time stamps to manage cache coherence. If the server is
>> lying about time stamps, there's no way the client can cache
>> coherently.
>>
>
> Server bug or not, it is something that the client has to live
> with. We can't get the server file system fixed, so it is
> something that we should find a way to live with. This support
> can help.

We haven't identified a server-side solution yet, but that doesn't
mean it doesn't exist.

If we address the time stamp problem in the client, should we also go
to lengths to address it in every other corner of the NFS client?
Should we also address every other server bug we discover with a
client side fix?

>>> Also, there was no support for ESTALE errors which occur during
>>> subsequent operations to the pathname resolution process. For
>>> example, during a mkdir(2) operation, the ESTALE can occur from
>>> the over the wire MKDIR operation after the LOOKUP operations
>>> have all succeeded.
>>
>> If the final operation fails after a pathname resolution, then
>> it's a real error. Is there a fixed and valid recovery script for
>> the client in this case that will allow the mkdir to proceed?
>>
>
> Why do you think that it is an error?

Because this is a problem that sometimes requires application-level
recovery. Can we guarantee that retrying the mkdir is the right
thing to do every time?

> It can easily occur if the directory in which the new directory
> is to be created disppears after it is looked up and before the
> MKDIR is issued.
>
> The recovery is to perform the lookup again.

Have you tried this client against a file server when you unexport
the filesystem under test? The server returns ESTALE no matter what
the client does. Should the client continue to retry the request if
the file system has been permanently taken offline?

>> Admittedly, the NFS client could recover more cleanly from some of
>> these problems, but given the architecture of the Linux VFS, it
>> will be difficult to address some of the corner cases.
>
> Could you outline some of these corner cases that this proposal
> would not address, please?

I think we have one right here: should the client retry a mkdir if
it gets an ESTALE?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-01-18 18:12:03

by Peter Staubach

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

Chuck Lever wrote:
> On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
>> Chuck Lever wrote:
>>> On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
>>>> Chuck Lever wrote:
>>>>> Hi Peter-
>>>>>
>>>>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>>>>> Hi.
>>>>>>
>>>>>> Here is a patch set which modifies the system to enhance the
>>>>>> ESTALE error handling for system calls which take pathnames
>>>>>> as arguments.
>>>>>
>>>>> The VFS already handles ESTALE.
>>>>>
>>>>> If a pathname resolution encounters an ESTALE at any point, the
>>>>> resolution is restarted exactly once, and an additional flag is
>>>>> passed to the file system during each lookup that forces each
>>>>> component in the path to be revalidated on the server. This has
>>>>> no possibility of causing an infinite loop.
>>>>>
>>>>> Is there some part of this logic that is no longer working?
>>>>
>>>> The VFS does not fully handle ESTALE. An ESTALE error can occur
>>>> during the second pathname resolution attempt.
>>>
>>> If an ESTALE occurs during the second resolution attempt, we should
>>> give up. When I addressed this issue two years ago, the two-try
>>> logic was the only acceptable solution because there's no way to
>>> guarantee the pathname resolution will ever finish unless we put a
>>> hard limit on it.
>>>
>>
>> I can probably imagine a situation where the pathname resolution
>> would never finish, but I am not sure that it could ever happen
>> in nature.
>
> Unless someone is doing something malicious. Or if the server is
> repeatedly returning ESTALE for some reason.
>

If the server is repeatedly returning ESTALE, then the pathname
resolution will fail to make progress and give up, returning ENOENT
to the user level.

A malicious user on the network can cause many problems other
than just something like this. But, in this case, the user
would have to predict why and when the client was issuing a
specific operation and know whether or not to return ESTALE.
This seems quite far fetched and quite unlikely to me.

>>>> There are lots of
>>>> reasons, some of which are the 1 second resolution from some file
>>>> systems on the server
>>>
>>> Which is a server bug, AFAICS. It's simply impossible to close all
>>> the windows that result from sloppy file time stamps without
>>> completely disabling client-side caching. The NFS protocol relies
>>> on file time stamps to manage cache coherence. If the server is
>>> lying about time stamps, there's no way the client can cache
>>> coherently.
>>>
>>
>> Server bug or not, it is something that the client has to live
>> with. We can't get the server file system fixed, so it is
>> something that we should find a way to live with. This support
>> can help.
>
> We haven't identified a server-side solution yet, but that doesn't
> mean it doesn't exist.
>

No, it doesn't and I, and most everyone else, would also like to
see such a solution. That said, I am pretty sure that we are not
going to get a fix for ext3 and forcing everyone to move away from
ext3 is not a good solution either.

> If we address the time stamp problem in the client, should we also go
> to lengths to address it in every other corner of the NFS client?
> Should we also address every other server bug we discover with a
> client side fix?
>

These aren't asked seriously, are they?

When possible, we get the server bug fixed. When not possible,
such as the time stamp issue with ext3, we attempt to work around
it as best as possible.


>>>> Also, there was no support for ESTALE errors which occur during
>>>> subsequent operations to the pathname resolution process. For
>>>> example, during a mkdir(2) operation, the ESTALE can occur from
>>>> the over the wire MKDIR operation after the LOOKUP operations
>>>> have all succeeded.
>>>
>>> If the final operation fails after a pathname resolution, then it's
>>> a real error. Is there a fixed and valid recovery script for the
>>> client in this case that will allow the mkdir to proceed?
>>>
>>
>> Why do you think that it is an error?
>
> Because this is a problem that sometimes requires application-level
> recovery. Can we guarantee that retrying the mkdir is the right thing
> to do every time?
>

When would not retrying the MKDIR be the right thing to do?
When doing a mkdir("a/b"), the user can not tell nor cares
which instance of directory "a" is the one that gets "b" created
in it.

Which cases are the ones that you see that require user
level recovery?

>> It can easily occur if the directory in which the new directory
>> is to be created disppears after it is looked up and before the
>> MKDIR is issued.
>>
>> The recovery is to perform the lookup again.
>
> Have you tried this client against a file server when you unexport the
> filesystem under test? The server returns ESTALE no matter what the
> client does. Should the client continue to retry the request if the
> file system has been permanently taken offline?
>

Since the NFS client supports "intr", then why not continue to
retry the request? It certainly won't hurt the network, trying
at most once every acdirmin timeout seconds. This, by default,
would be once every 30 seconds.

This would alleviate a long-standing complaint that users become
completely hosed when an admin uses a poor administrative
procedure.

>>> Admittedly, the NFS client could recover more cleanly from some of
>>> these problems, but given the architecture of the Linux VFS, it will
>>> be difficult to address some of the corner cases.
>>
>> Could you outline some of these corner cases that this proposal
>> would not address, please?
>
> I think we have one right here: should the client retry a mkdir if
> gets an ESTALE?

Yes. Why not? Please describe more specifically why you think
that it should not.

Thanx...

ps

2008-01-18 18:17:24

by Chuck Lever III

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

Hi Peter-

On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
> Chuck Lever wrote:
>> On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
>>> Chuck Lever wrote:
>>>> Hi Peter-
>>>>
>>>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>> and the window in between the revalidation
>>> and the actual use of the file handle associated with each
>>> dentry/inode pair.
>>
>> A use case or two would be useful to explore (on linux-nfs or
>> linux-fsdevel, rather than lkml).
>
> I created a bunch of use cases in the gensyscall.c program that
> I attached to the original description of the problem and my
> proposed solution. It was very useful in generating many, many
> ESTALE errors over the wire from a variety of different over the
> wire operations, which were originally getting returned to the
> user level.

The gensyscall.c program is what I would call a set of unit tests, btw.

This is not the same as a use case, which would include information
about the application environment, its users, a detailed description
of current system behavior, and some discussion of alternatives for
improving it (including doing nothing).

A test case is written in a programming language; a use case is
written in a natural language.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-01-18 18:38:32

by J. Bruce Fields

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

On Fri, Jan 18, 2008 at 01:12:03PM -0500, Peter Staubach wrote:
> Chuck Lever wrote:
>> On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
>>> I can probably imagine a situation where the pathname resolution
>>> would never finish, but I am not sure that it could ever happen
>>> in nature.
>>
>> Unless someone is doing something malicious. Or if the server is
>> repeatedly returning ESTALE for some reason.
>>
>
> If the server is repeatedly returning ESTALE, then the pathname
> resolution will fail to make progress and give up, return ENOENT
> to the user level.
>
> A malicious user on the network can cause so many other problems
> than just something like this too. But, in this case, the user
> would have to predict why and when the client was issuing a
> specific operation and know whether or not to return ESTALE.
> This seems quite far fetched and quite unlikely to me.

Any idea what the consequences would be in this case? It at least
shouldn't overflow the stack, or freeze the whole machine (because it
spins indefinitely under some crucial lock), or panic, etc. (If the one
filesystem just becomes unusable--well, fine, what better can you hope
for in the presence of a malicious server or network?)

--b.

2008-01-18 19:12:35

by Peter Staubach

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

J. Bruce Fields wrote:
> On Fri, Jan 18, 2008 at 01:12:03PM -0500, Peter Staubach wrote:
>
>> Chuck Lever wrote:
>>
>>> On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
>>>
>>>> I can probably imagine a situation where the pathname resolution
>>>> would never finish, but I am not sure that it could ever happen
>>>> in nature.
>>>>
>>> Unless someone is doing something malicious. Or if the server is
>>> repeatedly returning ESTALE for some reason.
>>>
>>>
>> If the server is repeatedly returning ESTALE, then the pathname
>> resolution will fail to make progress and give up, return ENOENT
>> to the user level.
>>
>> A malicious user on the network can cause so many other problems
>> than just something like this too. But, in this case, the user
>> would have to predict why and when the client was issuing a
>> specific operation and know whether or not to return ESTALE.
>> This seems quite far fetched and quite unlikely to me.
>>
>
> Any idea what the consequences would be in this case? It at least
> shouldn't overflow the stack, or freeze the whole machine (because it
> spins indefinitely under some crucial lock), or panic, etc. (If the one
> filesystem just becomes unusable--well, fine, what better can you hope
> for in the presence of a malicious server or network?)

Assuming that such a user could precisely and accurately predict
when to return ESTALE, the particular system call would just stay
in the kernel, sending out requests to the NFS server.

It wouldn't overflow the stack because the recovery is done by
looping and not by recursion and unless there is a bug that needs
to be fixed, all necessary resources are released before the
retries occur. The machine wouldn't freeze because as soon as
the request is sent, the process blocks and some other process
can be scheduled. The process should be interruptible, so it
could even be signaled to stop the activity.

It seems to me that mostly, the file system will become unusable,
but as Bruce points out, what do you expect in the presence of a
malicious entity? If such entities are a concern, then measures such as
stronger security can be employed to prevent them from wreaking
havoc.

Thanx...

ps

2008-01-18 15:46:35

by J. Bruce Fields

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

On Fri, Jan 18, 2008 at 10:35:50AM -0500, Peter Staubach wrote:
> Hi.
>
> Here is a patch set which modifies the system to enhance the
> ESTALE error handling for system calls which take pathnames
> as arguments.

I think your cover letter may be bigger than any of the actual
patches.... I'm not complaining! But would it be worth adding this
explanation and test code to Documentation/filesystems/ just to keep it
around?

--b.

>
> The error, ESTALE, was originally introduced to handle the
> situation where a file handle, which NFS uses to uniquely
> identify a file on the server, no longer refers to a valid file
> on the server. This can happen when the file is removed on the
> server, either by an application on the server, some other
> client accessing the server, or sometimes even by another
> mounted file system from the same client. It can also happen
> when the file resides upon a file system which is no longer
> exported.
>
> The error, ESTALE, is usually seen when cached directory
> information is used to convert a pathname to a dentry/inode pair.
> The information is discovered to be out of date or stale when a
> subsequent operation is sent to the NFS server. This can easily
> happen in system calls such as stat(2) when the pathname is
> converted a dentry/inode pair using cached information, but then
> a subsequent GETATTR call to the server discovers that the file
> handle is no longer valid.
>
> System calls which take pathnames as arguments should never see
> ESTALE errors from situations like this. These system calls
> should either fail with an ENOENT error if the pathname can not
> be successfully be translated to a dentry/inode pair or succeed
> or fail based on their own semantics.
>
> ESTALE errors which occur during the lookup process can be
> handled by dropping the dentry which refers to the non-existent
> file from the dcache and then restarting the lookup process.
> Care can be taken to ensure that forward progress is always
> being made in order to avoiding infinite loops.
>
> ESTALE errors which occur during operations subsequent to the
> lookup process can be handled by unwinding appropriately and
> then performing the lookup process again. Eventually, either
> the lookup process will succeed or fail correctly or the
> subsequent operation will succeed or fail on its own merits.
>
> This support is desired in order to tighten up recovery from
> discovering stale resources due to the loose cache consistency
> semantics that file systems such as NFS employ. In particular,
> there are several large Red Hat customers, converting from
> Solaris to Linux, who desire this support in order that their
> applications environments continue to work.
>
> Please note that system calls which do not take pathnames as
> arguments or perhaps use file descriptors to identify the
> file to be manipulated may still fail with ESTALE errors.
> There is no recovery possible with these systems calls like
> there is with system calls which take pathnames as arguments.
>
> This support was tested using the attached programs and
> running multiple copies on mounted file systems which do not
> share superblocks. When two or more copies of this program
> are running, many ESTALE errors can be seen over the network.
>
> Comments?
>
> Thanx...
>
> ps

> #
> #define _XOPEN_SOURCE 500
> #define _LARGEFILE64_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/statfs.h>
> #include <sys/inotify.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <signal.h>
>
> void mkdir_test(void);
> void link_test(void);
> void open_test(void);
> void access_test(void);
> void chmod_test(void);
> void chown_test(void);
> void readlink_test(void);
> void utimes_test(void);
> void chdir_test(void);
> void chroot_test(void);
> void rename_test(void);
> void exec_test(void);
> void mknod_test(void);
> void statfs_test(void);
> void truncate_test(void);
> void xattr_test(void);
> void inotify_test(void);
>
> struct tests {
>     void (*test)(void);
> };
>
> struct tests tests[] = {
>     mkdir_test,
>     link_test,
>     open_test,
>     access_test,
>     chmod_test,
>     chown_test,
>     readlink_test,
>     utimes_test,
>     chdir_test,
>     chroot_test,
>     rename_test,
>     exec_test,
>     mknod_test,
>     statfs_test,
>     truncate_test,
>     xattr_test,
>     inotify_test
> };
>
> pid_t test_pids[sizeof(tests) / sizeof(tests[0])];
>
> pid_t parent_pid;
>
> void kill_tests(int);
>
> int
> main(int argc, char *argv[])
> {
>     int i;
>
>     parent_pid = getpid();
>
>     sigset(SIGINT, kill_tests);
>
>     sighold(SIGINT);
>
>     for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
>         test_pids[i] = fork();
>         if (test_pids[i] == 0) {
>             for (;;)
>                 (*tests[i].test)();
>             /* NOTREACHED */
>         }
>     }
>
>     sigrelse(SIGINT);
>
>     pause();
> }
>
> void
> kill_tests(int sig)
> {
>     int i;
>
>     for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
>         if (test_pids[i] != -1) {
>             if (kill(test_pids[i], SIGTERM) < 0)
>                 perror("kill");
>         }
>     }
>
>     exit(0);
> }
>
> void
> check_error(int error, char *operation)
> {
>
>     if (error < 0 && errno == ESTALE) {
>         perror(operation);
>         kill(parent_pid, SIGINT);
>         pause();
>     }
> }
>
> void
> check_error_child(int error, char *operation)
> {
>
>     if (error < 0 && errno == ESTALE) {
>         perror(operation);
>         kill(parent_pid, SIGINT);
>         exit(1);
>     }
> }
>
> void
> do_stats(char *file)
> {
>     int error;
>     struct stat stbuf;
>     struct stat64 stbuf64;
>
>     error = stat(file, &stbuf);
>     check_error(error, "stat");
>
>     error = stat64(file, &stbuf64);
>     check_error(error, "stat64");
>
>     error = lstat(file, &stbuf);
>     check_error(error, "lstat");
>
>     error = lstat64(file, &stbuf64);
>     check_error(error, "lstat64");
> }
>
> void
> do_stats_child(char *file)
> {
>     int error;
>     struct stat stbuf;
>     struct stat64 stbuf64;
>
>     error = stat(file, &stbuf);
>     check_error_child(error, "stat");
>
>     error = stat64(file, &stbuf64);
>     check_error_child(error, "stat64");
>
>     error = lstat(file, &stbuf);
>     check_error_child(error, "lstat");
>
>     error = lstat64(file, &stbuf64);
>     check_error_child(error, "lstat64");
> }
>
> char *mkdir_dirs[] = {
> "mkdir/a",
> "mkdir/a/b",
> "mkdir/a/b/c",
> "mkdir/a/b/c/d",
> "mkdir/a/b/c/d/e",
> "mkdir/a/b/c/d/e/f",
> "mkdir/a/b/c/d/e/f/g",
> "mkdir/a/b/c/d/e/f/g/h",
> "mkdir/a/b/c/d/e/f/g/h/i",
> "mkdir/a/b/c/d/e/f/g/h/i/j",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y",
> "mkdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z",
> NULL
> };
>
> void
> mkdir_test()
> {
> int i;
> int error;
>
> error = mkdir("mkdir", 0755);
> check_error(error, "mkdir");
>
> for (i = 0; mkdir_dirs[i] != NULL; i++) {
> error = mkdir(mkdir_dirs[i], 0755);
> check_error(error, "mkdir");
> do_stats(mkdir_dirs[i]);
> }
>
> while (--i >= 0) {
> do_stats(mkdir_dirs[i]);
> error = rmdir(mkdir_dirs[i]);
> check_error(error, "rmdir");
> }
>
> error = rmdir("mkdir");
> check_error(error, "rmdir");
> }
>
> char *link_file_a = "link/a";
> char *link_file_b = "link/b";
>
> void
> link_test()
> {
> int error;
> int fd;
>
> error = mkdir("link", 0755);
> check_error(error, "mkdir");
>
> fd = open(link_file_a, O_CREAT, 0644);
> check_error(fd, "open");
>
> (void) close(fd);
>
> do_stats(link_file_a);
>
> error = link(link_file_a, link_file_b);
> check_error(error, "link");
> do_stats(link_file_a);
> do_stats(link_file_b);
>
> error = unlink(link_file_a);
> check_error(error, "unlink");
> do_stats(link_file_a);
> do_stats(link_file_b);
>
> error = link(link_file_b, link_file_a);
> check_error(error, "link");
> do_stats(link_file_a);
> do_stats(link_file_b);
>
> error = unlink(link_file_b);
> check_error(error, "unlink");
> do_stats(link_file_a);
> do_stats(link_file_b);
>
> error = unlink(link_file_a);
> check_error(error, "unlink");
> do_stats(link_file_a);
> do_stats(link_file_b);
>
> error = rmdir("link");
> check_error(error, "rmdir");
> }
>
> char *open_file = "open/a";
>
> void
> open_test()
> {
> int error;
> int fd;
>
> error = mkdir("open", 0755);
> check_error(error, "mkdir");
>
> fd = open(open_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(open_file);
>
> fd = open(open_file, O_RDWR);
> check_error(fd, "open: O_RDWR");
>
> (void) close(fd);
>
> do_stats(open_file);
>
> error = unlink(open_file);
> check_error(error, "unlink");
>
> error = rmdir("open");
> check_error(error, "rmdir");
> }
>
> char *access_file = "access/a";
>
> void
> access_test()
> {
> int error;
> int fd;
>
> error = mkdir("access", 0755);
> check_error(error, "mkdir");
>
> fd = open(access_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(access_file);
>
> error = access(access_file, F_OK);
> check_error(error, "access");
>
> do_stats(access_file);
>
> error = unlink(access_file);
> check_error(error, "unlink");
>
> error = rmdir("access");
> check_error(error, "rmdir");
> }
>
> char *chmod_file = "chmod/a";
>
> void
> chmod_test()
> {
> int error;
> int fd;
>
> error = mkdir("chmod", 0755);
> check_error(error, "mkdir");
>
> fd = open(chmod_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(chmod_file);
>
> error = chmod(chmod_file, 0600);
> check_error(error, "chmod");
>
> do_stats(chmod_file);
>
> error = unlink(chmod_file);
> check_error(error, "unlink");
>
> error = rmdir("chmod");
> check_error(error, "rmdir");
> }
>
> char *chown_file = "chown/a";
>
> void
> chown_test()
> {
> int error;
> int fd;
>
> error = mkdir("chown", 0755);
> check_error(error, "mkdir");
>
> fd = open(chown_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(chown_file);
>
> error = chown(chown_file, 4597, 4597);
> check_error(error, "chown");
>
> do_stats(chown_file);
>
> error = lchown(chown_file, 4596, 4596);
> check_error(error, "lchown");
>
> do_stats(chown_file);
>
> error = unlink(chown_file);
> check_error(error, "unlink");
>
> error = rmdir("chown");
> check_error(error, "rmdir");
> }
>
> char *readlink_file = "readlink/a";
>
> void
> readlink_test()
> {
> int error;
> char buf[BUFSIZ];
>
> error = mkdir("readlink", 0755);
> check_error(error, "mkdir");
>
> error = symlink("b", readlink_file);
> check_error(error, "symlink");
>
> do_stats(readlink_file);
>
> error = readlink(readlink_file, buf, sizeof(buf));
> check_error(error, "readlink");
>
> do_stats(readlink_file);
>
> error = unlink(readlink_file);
> check_error(error, "unlink");
>
> error = rmdir("readlink");
> check_error(error, "rmdir");
> }
>
> char *utimes_file = "utimes/a";
>
> void
> utimes_test()
> {
> int error;
> int fd;
>
> error = mkdir("utimes", 0755);
> check_error(error, "mkdir");
>
> fd = open(utimes_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(utimes_file);
>
> error = utime(utimes_file, NULL);
> check_error(error, "utime");
>
> do_stats(utimes_file);
>
> error = utimes(utimes_file, NULL);
> check_error(error, "utimes");
>
> do_stats(utimes_file);
>
> error = unlink(utimes_file);
> check_error(error, "unlink");
>
> error = rmdir("utimes");
> check_error(error, "rmdir");
> }
>
> char *chdir_dir = "chdir/dir";
>
> void
> chdir_test()
> {
> int error;
> int pid;
> int status;
>
> error = mkdir("chdir", 0755);
> check_error(error, "mkdir");
>
> pid = fork();
> if (pid == 0) {
> error = mkdir(chdir_dir, 0755);
> check_error_child(error, "mkdir");
>
> do_stats_child(chdir_dir);
>
> error = chdir(chdir_dir);
> check_error_child(error, "chdir");
>
> do_stats_child(chdir_dir);
>
> exit(0);
> }
>
> (void) wait(&status);
>
> do_stats(chdir_dir);
>
> error = rmdir(chdir_dir);
> check_error(error, "rmdir");
>
> error = rmdir("chdir");
> check_error(error, "rmdir");
> }
>
> char *chroot_dir = "chroot/dir";
>
> void
> chroot_test()
> {
> int error;
> int pid;
> int status;
>
> error = mkdir("chroot", 0755);
> check_error(error, "mkdir");
>
> pid = fork();
> if (pid == 0) {
> error = mkdir(chroot_dir, 0755);
> check_error_child(error, "mkdir");
>
> do_stats_child(chroot_dir);
>
> error = chroot(chroot_dir);
> check_error_child(error, "chroot");
>
> do_stats_child(chroot_dir);
>
> exit(0);
> }
>
> (void) wait(&status);
>
> do_stats(chroot_dir);
>
> error = rmdir(chroot_dir);
> check_error(error, "rmdir");
>
> error = rmdir("chroot");
> check_error(error, "rmdir");
> }
>
> char *rename_file_a = "rename/a";
> char *rename_file_b = "rename/b";
>
> void
> rename_test()
> {
> int error;
> int fd;
>
> error = mkdir("rename", 0755);
> check_error(error, "mkdir");
>
> fd = open(rename_file_a, O_CREAT, 0644);
> check_error(fd, "open");
>
> (void) close(fd);
>
> do_stats(rename_file_a);
>
> error = rename(rename_file_a, rename_file_b);
> check_error(error, "rename");
>
> do_stats(rename_file_a);
> do_stats(rename_file_b);
>
> error = rename(rename_file_b, rename_file_a);
> check_error(error, "rename");
>
> do_stats(rename_file_a);
> do_stats(rename_file_b);
>
> error = unlink(rename_file_a);
> check_error(error, "unlink");
>
> error = rmdir("rename");
> check_error(error, "rmdir");
> }
>
> char *exec_file = "exec/a";
> char *exec_source_file = "exec_test";
>
> void
> exec_test()
> {
> int error;
> int pid;
> int status;
>
> error = mkdir("exec", 0755);
> check_error(error, "mkdir");
>
> error = link(exec_source_file, exec_file);
> check_error(error, "link");
> do_stats(exec_file);
>
> pid = fork();
> if (pid == 0) {
> error = execl(exec_file, exec_file, (char *)NULL);
> check_error_child(error, "execl");
>
> exit(1);
> }
>
> wait(&status);
>
> do_stats(exec_file);
>
> error = unlink(exec_file);
> check_error(error, "unlink");
>
> error = rmdir("exec");
> check_error(error, "rmdir");
> }
>
> char *mknod_file = "mknod/a";
>
> void
> mknod_test()
> {
> int error;
>
> error = mkdir("mknod", 0755);
> check_error(error, "mkdir");
>
> error = mknod(mknod_file, S_IFCHR | 0644, 0);
> check_error(error, "mknod");
>
> do_stats(mknod_file);
>
> error = unlink(mknod_file);
> check_error(error, "unlink");
>
> error = rmdir("mknod");
> check_error(error, "rmdir");
> }
>
> void
> statfs_test()
> {
> int error;
> struct statfs stbuf;
> struct statfs64 stbuf64;
>
> error = mkdir("statfs", 0755);
> check_error(error, "mkdir");
>
> do_stats("statfs");
>
> error = statfs("statfs", &stbuf);
> check_error(error, "statfs");
>
> error = statfs64("statfs", &stbuf64);
> check_error(error, "statfs64");
>
> error = rmdir("statfs");
> check_error(error, "rmdir");
> }
>
> char *truncate_file = "truncate/a";
>
> void
> truncate_test()
> {
> int error;
> int fd;
>
> error = mkdir("truncate", 0755);
> check_error(error, "mkdir");
>
> fd = open(truncate_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(truncate_file);
>
> error = truncate(truncate_file, 1024);
> check_error(error, "truncate");
>
> do_stats(truncate_file);
>
> error = unlink(truncate_file);
> check_error(error, "unlink");
>
> error = rmdir("truncate");
> check_error(error, "rmdir");
> }
>
> char *xattr_file = "xattr/a";
>
> #define ACL_USER_OBJ (0x01)
> #define ACL_USER (0x02)
> #define ACL_GROUP_OBJ (0x04)
> #define ACL_MASK (0x10)
> #define ACL_OTHER (0x20)
>
> struct posix_acl_xattr_entry {
> unsigned short e_tag;
> unsigned short e_perm;
> unsigned int e_id;
> };
>
> #define POSIX_ACL_XATTR_VERSION 0x0002
>
> struct posix_acl_xattr_header {
> unsigned int a_version;
> struct posix_acl_xattr_entry a_entries[5];
> };
>
> void
> xattr_test()
> {
> int error;
> int fd;
> char buf[1024];
> struct posix_acl_xattr_header ents;
>
> error = mkdir("xattr", 0755);
> check_error(error, "mkdir");
>
> fd = open(xattr_file, O_CREAT | O_RDWR, 0444);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(xattr_file);
>
> error = getxattr(xattr_file, "system.posix_acl_access", buf,
> sizeof (buf));
> check_error(error, "getxattr");
> error = lgetxattr(xattr_file, "system.posix_acl_access", buf,
> sizeof (buf));
> check_error(error, "lgetxattr");
>
> ents.a_version = POSIX_ACL_XATTR_VERSION;
> ents.a_entries[0].e_tag = ACL_USER_OBJ;
> ents.a_entries[0].e_perm = 06;
> ents.a_entries[0].e_id = -1;
> ents.a_entries[1].e_tag = ACL_USER;
> ents.a_entries[1].e_perm = 06;
> ents.a_entries[1].e_id = 10;
> ents.a_entries[2].e_tag = ACL_GROUP_OBJ;
> ents.a_entries[2].e_perm = 06;
> ents.a_entries[2].e_id = -1;
> ents.a_entries[3].e_tag = ACL_MASK;
> ents.a_entries[3].e_perm = 06;
> ents.a_entries[3].e_id = -1;
> ents.a_entries[4].e_tag = ACL_OTHER;
> ents.a_entries[4].e_perm = 06;
> ents.a_entries[4].e_id = -1;
>
> error = setxattr(xattr_file, "system.posix_acl_access",
> &ents, sizeof (ents), 0);
> check_error(error, "setxattr");
>
> do_stats(xattr_file);
>
> error = lsetxattr(xattr_file, "system.posix_acl_access",
> &ents, sizeof (ents), 0);
> check_error(error, "lsetxattr");
>
> do_stats(xattr_file);
>
> error = getxattr(xattr_file, "system.posix_acl_access", buf,
> sizeof (buf));
> check_error(error, "getxattr");
> error = lgetxattr(xattr_file, "system.posix_acl_access", buf,
> sizeof (buf));
> check_error(error, "lgetxattr");
>
> error = listxattr(xattr_file, buf, sizeof (buf));
> check_error(error, "listxattr");
> error = llistxattr(xattr_file, buf, sizeof (buf));
> check_error(error, "llistxattr");
>
> error = removexattr(xattr_file, "system.posix_acl_access");
> check_error(error, "removexattr");
>
> do_stats(xattr_file);
>
> error = setxattr(xattr_file, "system.posix_acl_access",
> &ents, sizeof (ents), 0);
> check_error(error, "setxattr");
>
> do_stats(xattr_file);
>
> error = lremovexattr(xattr_file, "system.posix_acl_access");
> check_error(error, "lremovexattr");
>
> do_stats(xattr_file);
>
> error = unlink(xattr_file);
> check_error(error, "unlink");
>
> error = rmdir("xattr");
> check_error(error, "rmdir");
> }
>
> char *inotify_file = "inotify/a";
>
> void
> inotify_test()
> {
> int error;
> int fd;
> int wd;
>
> error = mkdir("inotify", 0755);
> check_error(error, "mkdir");
>
> fd = open(inotify_file, O_CREAT | O_RDWR, 0644);
> check_error(fd, "open: O_CREAT");
>
> (void) close(fd);
>
> do_stats(inotify_file);
>
> fd = inotify_init();
> check_error(fd, "inotify_init");
>
> do_stats(inotify_file);
>
> wd = inotify_add_watch(fd, inotify_file, IN_ALL_EVENTS);
> check_error(wd, "inotify_add_watch");
>
> do_stats(inotify_file);
>
> error = inotify_rm_watch(fd, wd);
> check_error(error, "inotify_rm_watch");
>
> (void) close(fd);
>
> do_stats(inotify_file);
>
> error = unlink(inotify_file);
> check_error(error, "unlink");
>
> error = rmdir("inotify");
> check_error(error, "rmdir");
> }

> #include <stdlib.h>
>
> int
> main(void)
> {
> exit(0);
> }


2008-01-18 16:44:12

by Chuck Lever III

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
> Hi.
>
> Here is a patch set which modifies the system to enhance the
> ESTALE error handling for system calls which take pathnames
> as arguments.

The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the
resolution is restarted exactly once, and an additional flag is
passed to the file system during each lookup that forces each
component in the path to be revalidated on the server. This has no
possibility of causing an infinite loop.
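
A minimal user-space sketch of the retry-once behaviour described above. All names here (SKETCH_REVAL, sketch_lookup_fn, sketch_path_lookup and the two toy lookup functions) are hypothetical and chosen for illustration; they are not VFS symbols, and the real code path is more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag: ask the file system to revalidate every component. */
#define SKETCH_REVAL 0x1

/* Hypothetical lookup callback: returns 0 or a negative errno value. */
typedef int (*sketch_lookup_fn)(const char *path, unsigned int flags,
    void *ctx);

/*
 * Resolve a path. On -ESTALE, retry exactly once with SKETCH_REVAL set.
 * A second -ESTALE is returned to the caller, so there is no loop.
 */
int
sketch_path_lookup(sketch_lookup_fn lookup, const char *path, void *ctx)
{
	int error;

	assert(lookup != NULL && path != NULL);

	error = lookup(path, 0, ctx);
	if (error == -ESTALE)
		error = lookup(path, SKETCH_REVAL, ctx);
	return error;
}

/* Toy lookup: cached answers are stale, revalidated answers succeed. */
int
sketch_stale_cache_lookup(const char *path, unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(void) path;
	(*calls)++;
	return (flags & SKETCH_REVAL) ? 0 : -ESTALE;
}

/* Toy lookup: the object is gone for good, so every attempt is stale. */
int
sketch_always_stale_lookup(const char *path, unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(void) path;
	(void) flags;
	(*calls)++;
	return -ESTALE;
}
```

With the first toy lookup the caller sees success after two attempts; with the second it sees -ESTALE after two attempts, never more.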

Is there some part of this logic that is no longer working?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-02-01 20:57:56

by Peter Staubach

Subject: [PATCH 0/3] enhanced ESTALE error handling (v2)

Hi.

Here is version 2 of a patch set which modifies the system to
enhance the ESTALE error handling for system calls which take
pathnames as arguments.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server. This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client. The NFS server also
returns this error when the file resides upon a file system
which is no longer exported. Additionally, some NFS servers
even change the file handle when a file is renamed, although
this practice is discouraged.

This error occurs even if a file or directory, with the same
name, is recreated on the server without the client being
aware of it. The file handle refers to a specific instance
of a file and deleting the file and then recreating it creates
a new instance of the file.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server. This can easily
happen in system calls such as stat(2) when the pathname is
converted to a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.

This error can also occur when a change is made on the server
between the lookups of different components of the pathname, or
between a successful lookup and a subsequent operation.

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this. These system calls
should either fail with an ENOENT error if the pathname cannot
be successfully translated to a dentry/inode pair or succeed
or fail based on their own semantics. In the above example,
stat(2), restarting at the pathname lookup will either cause the
system call to succeed or fail, depending upon whether the
file really exists or not.
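
As a rough user-space analogue of restarting at the pathname lookup, an application could repeat stat(2) itself while it fails with ESTALE. This is only a sketch under that assumption: stat_retry_estale and its max_tries bound are hypothetical and not part of the patch set, and whether a repeated call actually recovers depends on the state of the client's caches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: call stat(2) and repeat the call, which redoes
 * the pathname lookup, for as long as it fails with ESTALE.
 * max_tries bounds the number of attempts so the loop always ends.
 * Returns what the last stat(2) call returned, with errno left as set.
 */
int
stat_retry_estale(const char *path, struct stat *stbuf, int max_tries)
{
	int error;

	assert(path != NULL && stbuf != NULL && max_tries > 0);

	do {
		error = stat(path, stbuf);
	} while (error < 0 && errno == ESTALE && --max_tries > 0);

	return error;
}
```

Any error other than ESTALE, ENOENT included, is passed straight back to the caller on the first attempt.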

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care is taken to ensure that forward progress is always being
made in order to avoid infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again. Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ. In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
application environments continue to work.

The loose consistency model of file systems such as NFS is
exacerbated by the large granularity of timestamps available
for files on file systems such as ext3. The NFS client may not
be able to detect changes in directories due to multiple
changes occurring in the same second, for example.
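
The arithmetic behind the granularity problem can be sketched as follows. The two helper names are hypothetical, and this is not how the NFS client compares attributes; it only shows why a change within the same second is invisible to a one-second timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Compare two mtimes the way a one-second timestamp allows. */
bool
mtime_changed_seconds(const struct timespec *cached,
    const struct timespec *current)
{
	assert(cached != NULL && current != NULL);

	return cached->tv_sec != current->tv_sec;
}

/* Compare the same two mtimes at nanosecond granularity. */
bool
mtime_changed_nanoseconds(const struct timespec *cached,
    const struct timespec *current)
{
	assert(cached != NULL && current != NULL);

	return cached->tv_sec != current->tv_sec ||
	    cached->tv_nsec != current->tv_nsec;
}
```

Two directory updates 800 milliseconds apart within one second compare as unchanged with the first helper and as changed with the second.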

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these system calls like
there is with system calls which take pathnames as arguments.
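
For the file descriptor case, any recovery has to come from the application going back to a pathname. A minimal sketch, assuming the application still knows the path: fstat_reopen_on_estale is a hypothetical helper, not something these patches provide, and the reopened descriptor may refer to a different file instance with a reset file offset.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: fstat(2) a descriptor and, if it turns out to be
 * stale, replace it with a descriptor from a fresh open(2) of path.
 * Returns 0 on success or -1 with errno set.
 */
int
fstat_reopen_on_estale(int *fdp, const char *path, int oflags,
    struct stat *stbuf)
{
	int error;
	int newfd;

	assert(fdp != NULL && path != NULL && stbuf != NULL);

	error = fstat(*fdp, stbuf);
	if (error == 0 || errno != ESTALE)
		return error;

	/* The kernel has no pathname to retry with; the caller does. */
	newfd = open(path, oflags);
	if (newfd < 0)
		return -1;

	(void) close(*fdp);
	*fdp = newfd;

	return fstat(*fdp, stbuf);
}
```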

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks. When two or more copies of this program
are running, many ESTALE errors can be seen over the network.
Without these patches, the test program errors out almost
immediately. With these patches, the test program runs
for as long as one desires.

Comments?

Thanx...

ps


Attachments:
syscallgen.c (14.83 kB)
exec_test.c (42.00 B)

2008-03-10 20:23:25

by Peter Staubach

Subject: [PATCH 0/3] enhanced ESTALE error handling (v3)

Hi.

Here is version 3 of a patch set which modifies the system to
enhance the ESTALE error handling for system calls which take
pathnames as arguments. This patch set is essentially the
same as the v2 patches, but updated to reflect the current
state of the code around them.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server. This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client. The NFS server also
returns this error when the file resides upon a file system
which is no longer exported. Additionally, some NFS servers
even change the file handle when a file is renamed, although
this practice is discouraged.

This error occurs even if a file or directory, with the same
name, is recreated on the server without the client being
aware of it. The file handle refers to a specific instance
of a file and deleting the file and then recreating it creates
a new instance of the file.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server. This can easily
happen in system calls such as stat(2) when the pathname is
converted to a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.

This error can also occur when a change is made on the server
between the lookups of different components of the pathname, or
between a successful lookup and a subsequent operation.

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this. These system calls
should either fail with an ENOENT error if the pathname cannot
be successfully translated to a dentry/inode pair or succeed
or fail based on their own semantics. In the above example,
stat(2), restarting at the pathname lookup will either cause the
system call to succeed or fail, depending upon whether the
file really exists or not.

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care is taken to ensure that forward progress is always being
made in order to avoid infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again. Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ. In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
application environments continue to work.

The loose consistency model of file systems such as NFS is
exacerbated by the large granularity of timestamps available
for files on file systems such as ext3. The NFS client may not
be able to detect changes in directories due to multiple
changes occurring in the same second, for example.

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these system calls like
there is with system calls which take pathnames as arguments.

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks. When two or more copies of this program
are running, many ESTALE errors can be seen over the network.
Without these patches, the test program errors out almost
immediately. With these patches, the test program runs
for as long as one desires.

Comments?

Thanx...

ps


Attachments:
syscallgen.c (14.83 kB)
exec_test.c (42.00 B)

2008-03-10 22:42:32

by Andreas Dilger

Subject: Re: [PATCH 0/3] enhanced ESTALE error handling (v3)

On Mar 10, 2008 16:23 -0400, Peter Staubach wrote:
> Here is version 3 of a patch set which modifies the system to
> enhance the ESTALE error handling for system calls which take
> pathnames as arguments. This patch set is essentially the
> same as the v2 patches, but updated to reflect the current
> state of the code around them.

[snip long discussion of ESTALE causes]

> This support was tested using the attached programs and
> running multiple copies on mounted file systems which do not
> share superblocks. When two or more copies of this program
> are running, many ESTALE errors can be seen over the network.
> Without these patches, the test program errors out almost
> immediately. With these patches, the test program runs
> for as long as one desires.

Have you tried "racer.sh"? That is a very stressful metadata tester
that does random operations on a handful of file and directory names.
It can be run on a single client, or on multiple clients and needs no
coordination between the clients. I guess it won't tell you if you
are getting ESTALE back correctly or not, but it can quickly find if
there are any problems with the retrying code.

I've attached an updated tarball of the original scripts here.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Attachments:
(No filename) (1.30 kB)
racer-lustre.tar.gz (1.96 kB)