2005-03-23 00:18:47

by Filipe Brandenburger

[permalink] [raw]
Subject: Stale NFS file handle


Hello,

I have a problem where I'm getting "Stale NFS file handle" errors when a
file is updated. I can easily reproduce the problem by running a sequence
of commands on two different hosts.

My environment is:

1) Server: Netapp FAS940
2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
3) Client 2: exactly the same as client 1.

The file system is mounted on both clients with the options
rsize=8192,wsize=8192,timeo=28,intr. Additionally, it is mounted read-only
on client 2 (I also get the stale file handle error when it is mounted
read-write, so that doesn't really matter).

My test setup is:

On client 2, I set up a loop to read a file:

# while :; do cat test.txt; done >/dev/null

Then, on client 1, I create a new file and rename it over the original
file:

# date >new.txt; mv -f new.txt test.txt

Whenever I execute this on client 1, I get the following error message
on client 2:

cat: test.txt: Stale NFS file handle



Why is this happening? Is there a way to fix this problem? I tried the
mount options "noac" and "nocto" on client 2, applying them with
"mount -o remount". Afterwards the output of "mount" showed these options,
but that didn't solve the issue.

Is there a way to solve this issue without changing the applications that
access this file? Although my test setup consists only of "cat" and "mv",
my real production environment runs proprietary applications that are
harder to fix; "cat" and "mv" were just the way I reproduced the problem
in a controlled environment...

Thanks a lot,
Filipe



-------------------------------------------------------
This SF.net email is sponsored by: 2005 Windows Mobile Application Contest
Submit applications for Windows Mobile(tm)-based Pocket PCs or Smartphones
for the chance to win $25,000 and application distribution. Enter today at
http://ads.osdn.com/?ad_id=6882&alloc_id=15148&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-03-23 13:04:07

by Steve Dickson

[permalink] [raw]
Subject: Re: Stale NFS file handle

Filipe Brandenburger wrote:
> I have a problem where I'm getting "Stale NFS file handle" errors when a
> file is updated. I can easily reproduce the problem by running a sequence
> of commands on two different hosts.
>
> My environment is:
>
> 1) Server: Netapp FAS940
> 2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
> 3) Client 2: exactly the same as client 1.
Some time back there was a Netapp issue that was causing
ESTALEs with mostly 2.4 kernels... I upgraded the OS on
our toaster and the problem went away....

steved.



2005-03-23 13:57:36

by Trond Myklebust

[permalink] [raw]
Subject: Re: Stale NFS file handle

On Wed, 23.03.2005 at 08:12 (-0500), Steve Dickson wrote:
> Filipe Brandenburger wrote:
> > I have a problem where I'm getting "Stale NFS file handle" errors when a
> > file is updated. I can easily reproduce the problem by running a sequence
> > of commands on two different hosts.
> >
> > My environment is:
> >
> > 1) Server: Netapp FAS940
> > 2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
> > 3) Client 2: exactly the same as client 1.
> Some time back there was a Netapp issue that was causing
> ESTALEs with mostly 2.4 kernels... I upgraded the OS on
> our toaster and the problem went away....

... yeah, but this was a pretty obvious case of user error.

He was running

while :; do cat test.txt; done >/dev/null

on a client, then deleting the file on the server. Even if the call to
open() is successful, you both can and will get ESTALEs on the
subsequent call to read().

Cheers,
Trond

--
Trond Myklebust <[email protected]>




2005-03-23 14:42:51

by Lever, Charles

[permalink] [raw]
Subject: RE: Stale NFS file handle

i saw trond's post. he is correct, your use case is broken. the
patches will fix ESTALE for open(2), but not for read(2). i don't know
of any NFS implementation that will recover from an ESTALE on a read
operation.

you need to understand that NFS is not a cluster file system. it does
not provide single-system semantics. for a better understanding of the
limitations of NFS's caching model, take a look at Callaghan's "NFS
Illustrated."

> -----Original Message-----
> From: Filipe Brandenburger [mailto:[email protected]]
> Sent: Wednesday, March 23, 2005 9:35 AM
> To: Lever, Charles
> Subject: Re: [NFS] Stale NFS file handle
>
>
> Hi, there.
>
> Thanks for your answer. Do you know of such a patch that
> would solve this issue at the Linux Kernel level? I'm using
> kernel 2.4, do you know if 2.6 is any better on that? Do you
> know if other client implementations actually recover from
> these errors? I googled around and found out that this may be
> an issue on Solaris as well...
>
> Thanks,
> Filipe
>
>
>
> * Wed, 23 Mar 2005 05:53:39 -0800, "Lever, Charles"
> <[email protected]>:
> > when you replaced the file, client 2 still had the old file handle
> > cached. when it used that old file handle again, the server reported
> > the file no longer existed with an ESTALE error.
> >
> > the problem is that Linux NFS clients don't recover from ESTALE errors.
> > it's a deficiency in the client implementation that, at this point, is
> > fixed only by patches. at some point soon the patches will be
> > integrated into the mainline and distributions.
>
>
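
A note on the explanation quoted above: "mv -f new.txt test.txt" points the
name test.txt at a different file, and the old file is deleted once its
last link is gone. The following is a minimal local sketch in Python, not
something posted in the thread. The file names follow the original test;
the function name is made up for illustration. It only shows the inode
changing across the rename on a local file system and does not itself
produce ESTALE.

```python
import os
import tempfile


def inode_before_and_after_replace(directory):
    """Return (old_inode, new_inode) of test.txt around a rename-over."""
    target = os.path.join(directory, "test.txt")
    new = os.path.join(directory, "new.txt")

    with open(target, "w") as f:
        f.write("old contents\n")
    old_inode = os.stat(target).st_ino

    with open(new, "w") as f:
        f.write("new contents\n")

    # Same effect as: mv -f new.txt test.txt
    os.replace(new, target)
    new_inode = os.stat(target).st_ino
    return old_inode, new_inode


with tempfile.TemporaryDirectory() as d:
    old_inode, new_inode = inode_before_and_after_replace(d)
    print("inode changed:", old_inode != new_inode)
```

An NFS client that cached a file handle for the old file before the rename
is left holding a reference to a file the server has since removed, which
is the situation the quoted text describes.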



2005-03-23 17:13:35

by Filipe Brandenburger

[permalink] [raw]
Subject: Re: Stale NFS file handle


* Wed, 23 Mar 2005 08:57:15 -0500, Trond Myklebust <[email protected]>:
> He was running
>
> while :; do cat test.txt; done >/dev/null
>
> on a client, then deleting the file on the server. Even if the call to
> open() is successful, you both can and will get ESTALEs on the
> subsequent call to read().

Ok,

But then, how do you suggest I change the applications? The applications
that publish content to the NFS share run on one host and are based on
rsync; the applications that deliver content are web servers (Apache)
reading from this same NFS share on another pool of hosts (these are the
ones that get the ESTALE error).

Where is the problem? In the applications that publish? Should they open
the file and update it in-place instead of creating a new one and
renaming? I don't think so! This would lead to content that is a mix of
the old and the new, that is, corrupt.

Or is it the web server? Should the application protect itself from ESTALE
errors and retry? Somehow that seems wrong to me also. Then I would have
to change all applications that read this content to do it. Why doesn't
the NFS client recover from this kind of error?

If it's really not possible to change it on the NFS client (the kernel),
what workaround would you suggest I use?

Thanks,
Filipe



-------------------------------------------------------
This SF.net email is sponsored by Microsoft Mobile & Embedded DevCon 2005
Attend MEDC 2005 May 9-12 in Vegas. Learn more about the latest Windows
Embedded(r) & Windows Mobile(tm) platforms, applications & content. Register
by 3/29 & save $300 http://ads.osdn.com/?ad_id=6883&alloc_id=15149&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-03-23 17:27:13

by Trond Myklebust

[permalink] [raw]
Subject: Re: Stale NFS file handle

On Wed, 23.03.2005 at 14:15 (-0300), Filipe Brandenburger wrote:

> If it's really not possible to change it on the NFS client (the kernel),
> what workaround would you suggest me to use?

Rename the file, then delete it once you know that the clients no longer
have it open. That's the obvious and standard way of dealing with this
sort of problem over NFS.
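
A minimal sketch of how a publisher might follow this advice, in Python.
This is an illustration under stated assumptions, not code from the
thread: the function names, the "retired" directory and the grace period
are all hypothetical. It also uses a hard link rather than a plain rename
of the old file, so that the published path never disappears; that is a
variation on the advice above, not the advice itself. The retired
directory is assumed to be on the same file system as the target, and
"clients no longer have it open" is approximated by a fixed grace period.

```python
import os
import time


def publish(new_path, target_path, retired_dir):
    """Move new_path over target_path while keeping the old file alive."""
    os.makedirs(retired_dir, exist_ok=True)
    if os.path.exists(target_path):
        # Give the old file a second name first, so the rename below
        # does not remove its last link on the server.
        stamp = time.time_ns()
        name = "%s.%d" % (os.path.basename(target_path), stamp)
        os.link(target_path, os.path.join(retired_dir, name))
    # Same effect as: mv -f new.txt test.txt
    os.replace(new_path, target_path)


def purge(retired_dir, grace_seconds, now_ns=None):
    """Delete retired files older than grace_seconds; return their names.

    Assumes every name in retired_dir was created by publish() and
    therefore ends in ".<nanosecond timestamp>".
    """
    now_ns = time.time_ns() if now_ns is None else now_ns
    removed = []
    for name in sorted(os.listdir(retired_dir)):
        stamp = int(name.rsplit(".", 1)[1])
        if now_ns - stamp >= grace_seconds * 1_000_000_000:
            os.remove(os.path.join(retired_dir, name))
            removed.append(name)
    return removed
```

The publisher would call publish() for each update and run purge()
periodically with a grace period longer than the longest expected read;
choosing that period is left to the operator.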

Cheers,
Trond

--
Trond Myklebust <[email protected]>




2005-03-23 18:59:32

by Lever, Charles

[permalink] [raw]
Subject: RE: Stale NFS file handle

filipe-

in general the kernel patches i referred to earlier will prevent most
issues when using rsync and serving web pages. an occasional ESTALE is
unavoidable because no NFS client can recover from an ESTALE during a
read operation. however, the patches do allow a subsequent open(2)
operation on that pathname to find the new file.
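
As a rough illustration of what "a subsequent open(2)" could look like on
the application side, here is a minimal Python sketch. It is not from the
thread: the function name and its parameters (attempts, delay, opener) are
hypothetical, and the opener argument exists only so the logic can be
exercised without an NFS mount. It assumes that re-opening the pathname
finds the new file, which per the paragraph above is only expected on a
client with the patches applied.

```python
import errno
import time


def read_with_estale_retry(path, attempts=3, delay=0.1, opener=open):
    """Read a whole file, re-opening it if the handle turns out stale."""
    for attempt in range(attempts):
        try:
            # A fresh open() on each attempt looks the pathname up again
            # instead of reusing the handle that went stale.
            with opener(path, "rb") as f:
                return f.read()
        except OSError as e:
            # Give up on any other error, or when out of attempts.
            if e.errno != errno.ESTALE or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

Any data already delivered from a partial read is the caller's problem;
this sketch simply starts over from the beginning of the new file.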


> -----Original Message-----
> From: Filipe Brandenburger [mailto:[email protected]]
> Sent: Wednesday, March 23, 2005 12:15 PM
> To: Trond Myklebust
> Cc: Steve Dickson; [email protected]
> Subject: Re: [NFS] Stale NFS file handle
>
>
> * Wed, 23 Mar 2005 08:57:15 -0500, Trond Myklebust
> <[email protected]>:
> > He was running
> >
> > while :; do cat test.txt; done >/dev/null
> >
> > on a client, then deleting the file on the server. Even if the call to
> > open() is successful, you both can and will get ESTALEs on the
> > subsequent call to read().
>
> Ok,
>
> But then, how do you suggest I change the applications? The applications
> that publish content to the NFS share run on one host and are based on
> rsync; the applications that deliver content are web servers (Apache)
> reading from this same NFS share on another pool of hosts (these are the
> ones that get the ESTALE error).
>
> Where is the problem? In the applications that publish? Should they open
> the file and update it in-place instead of creating a new one and
> renaming? I don't think so! This would lead to content that is a mix of
> the old and the new, that is, corrupt.
>
> Or is it the web server? Should the application protect itself from ESTALE
> errors and retry? Somehow that seems wrong to me also. Then I would have
> to change all applications that read this content to do it. Why doesn't
> the NFS client recover from this kind of error?
>
> If it's really not possible to change it on the NFS client (the kernel),
> what workaround would you suggest I use?
>
> Thanks,
> Filipe

