2005-04-25 21:07:55

by Kris Vassallo

Subject: Stale File handles keep coming back

I am experiencing a problem with stale file handles, and I have not been
able to find an answer in the archives nor has doing anything in the
readme helped. I apologize if an answer has already been posted
regarding an issue such as this. After many frustrating hours of
troubleshooting I am hoping for some help. The mail is long but I am
hoping that it answers all the questions.

I am experiencing the following problem: I have a group of ~20 nfs
clients (2.4.22-1.2115.nptlsmp) which are all mounting a /home directory
off of 1 NFS server (2.6.11-1.14_FC3smp); the server was upgraded FROM
2.4.22-1.2115.nptlsmp last week. All servers are running NFS v3 over
UDP.
As soon as I did the upgrade people started getting stale file
handles, so I shut down all clients and the server, booted the server
back up and then brought the clients back up so that everything would be
in sync. However, this did not resolve the issue.
I now have people complaining that if they do an ls in a specific
directory within their home directory (this happens for all users in
their own private home directory, which I believe rules out the issue of
having someone else manipulating the file), they get an error that says:
ls: .: Stale NFS file handle

What I don't understand is that if you touch a file within
that directory, all of a sudden the contents of the directory magically
become visible. Removing the directory, copying it back,
and then using it for a while works, but then all of a sudden it goes stale
again!!

The following is the content of my /etc/exports
/home 10.113.1.0/26(sync,rw,no_wdelay,no_root_squash) \
10.113.1.64/26(sync,no_wdelay,rw) \
10.113.1.128/26(sync,no_wdelay,rw) \
10.113.1.192/26(sync,no_wdelay,rw)

Clients mount the share with defaults.

I am not changing anything in the /etc/exports file. The directory has
not actually been deleted, as I can go to the server and do an ls on the
directory and see the contents of it (strangely this makes the stale
file handle go away). I have just added no_subtree_check to the exports
file and have not yet tested this. We have not hot swapped any hard
disks in our RAID array. This is an ext3 file system which I believe
supports permanent inode numbers. This is really weird and I can't think
of anything that would cause this but something with nfs and the 2.6.11
kernel?

Please CC me on any responses.
Thank you very much in advance for any help.
-Kris


2005-04-26 12:27:48

by NeilBrown

Subject: Re: Stale File handles keep coming back

On Monday April 25, [email protected] wrote:
> I am experiencing a problem with stale file handles, and I have not been
> able to find an answer in the archives nor has doing anything in the
> readme helped. I apologize if an answer has already been posted
> regarding an issue such as this. After many frustrating hours of
> troubleshooting I am hoping for some help. The mail is long but I am
> hoping that it answers all the questions.

Very odd...

The output of:

echo 2048 > /proc/sys/sunrpc/rpc_debug
grep . /proc/net/rpc/*/content
ls -l /proc/fs/nfsd
cat /proc/fs/nfs/exports

might help.

Thanks,
NeilBrown


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-04-26 22:22:43

by Kris Vassallo

Subject: Re: Stale File handles keep coming back

On Tue, 2005-04-26 at 05:27, Neil Brown wrote:

> On Monday April 25, [email protected] wrote:
> > I am experiencing a problem with stale file handles, and I have not been
> > able to find an answer in the archives nor has doing anything in the
> > readme helped. I apologize if an answer has already been posted
> > regarding an issue such as this. After many frustrating hours of
> > troubleshooting I am hoping for some help. The mail is long but I am
> > hoping that it answers all the questions.
>
> Very odd...
>
> The output of:
>
> echo 2048 > /proc/sys/sunrpc/rpc_debug
> grep . /proc/net/rpc/*/content
> ls -l /proc/fs/nfsd
> cat /proc/fs/nfs/exports

As you wished:

[root@venus ~]# echo 2048 > /proc/sys/sunrpc/rpc_debug
[root@venus ~]# grep . /proc/net/rpc/*/content
/proc/net/rpc/auth.unix.ip/content:#class IP domain
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555250 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.35 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554736 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.31 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555091 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.32 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554725 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.33 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114553809 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.30 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554372 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.23 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554950 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.20 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554967 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.22 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555344 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.15 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554495 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.16 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555310 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.18 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555306 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.13 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555211 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.7 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114555134 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.8 10.113.1.0/26
/proc/net/rpc/auth.unix.ip/content:# expiry=1114554927 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 10.113.1.9 10.113.1.0/26
/proc/net/rpc/nfs4.idtoname/content:#domain type id [name]
/proc/net/rpc/nfs4.nametoid/content:#domain type name [id]
/proc/net/rpc/nfsd.export/content:#path domain(flags)
/proc/net/rpc/nfsd.export/content:# expiry=1114555041 refcnt=1
/proc/net/rpc/nfsd.export/content:/home 10.113.1.0/26(rw,no_root_squash,sync,no_wdelay,no_subtree_check)
/proc/net/rpc/nfsd.fh/content:#domain fsidtype fsid [path]
/proc/net/rpc/nfsd.fh/content:# expiry=1114555041 refcnt=0
/proc/net/rpc/nfsd.fh/content:10.113.1.0/26 0 0x1100080000000002 /home

[root@venus ~]# ls -l /proc/fs/nfsd/
total 0
-r--r--r-- 1 root root 0 Apr 22 19:14 exports
-rw------- 1 root root 0 Apr 22 19:14 filehandle
-rw------- 1 root root 0 Apr 22 19:14 nfsv4leasetime
-rw------- 1 root root 0 Apr 22 19:14 threads

[root@venus ~]# cat /proc/fs/nfs/exports
# Version 1.1
# Path Client(Flags) # IPs
/home 10.113.1.0/26(rw,no_root_squash,sync,no_wdelay,no_subtree_check)
[root@venus ~]#
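
A note for readers of the listing above: the expiry= values look like Unix epoch seconds. That is an assumption based on their magnitude, not something stated in the thread. Under that assumption, a small helper like the following (the name expiry_to_utc is made up for illustration, and it relies on GNU date's "-d @SECONDS" form) turns them into readable UTC times:

```shell
# Hypothetical helper: read cache listing lines on stdin and print each
# expiry value followed by its UTC time. Assumes the values are Unix
# epoch seconds and that GNU date is installed.
expiry_to_utc() {
    sed -n 's/.*expiry=\([0-9][0-9]*\).*/\1/p' | while read -r ts; do
        printf '%s %s\n' "$ts" "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')"
    done
}

# Example (on the server):
#   cat /proc/net/rpc/auth.unix.ip/content | expiry_to_utc
```

Read this way, 1114555250 is 2005-04-26 22:40:50 UTC, shortly after the listing was posted.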

Hope this helps!
Thanks for taking a look at this,
Kris



2005-04-27 02:42:42

by NeilBrown

Subject: Re: Stale File handles keep coming back

On Tuesday April 26, [email protected] wrote:
> On Tue, 2005-04-26 at 05:27, Neil Brown wrote:
>
> > On Monday April 25, [email protected] wrote:
> > > I am experiencing a problem with stale file handles, and I have not been
> > > able to find an answer in the archives nor has doing anything in the
> > > readme helped. I apologize if an answer has already been posted
> > > regarding an issue such as this. After many frustrating hours of
> > > troubleshooting I am hoping for some help. The mail is long but I am
> > > hoping that it answers all the questions.
> >
> > Very odd...
> >
> > The output of:
> >
> > echo 2048 > /proc/sys/sunrpc/rpc_debug
> > grep . /proc/net/rpc/*/content
> > ls -l /proc/fs/nfsd
> > cat /proc/fs/nfs/exports
>
> As you wished:

Thanks. Everything looks OK.

10.113.1.{7,8,9,13,15,16,18,20,22,23,30,31,32,35}

all seem to have authenticated OK and I suspect other clients would if
they tried.

Maybe we need to drill-down a bit more....

Once the client gets an ESTALE for a particular object, it won't try
the lookup again (because it already knows it is stale), so once the
problem is happening on the client, it might be too late to check what
is happening on the server...

Maybe if you cd up out of the directory and 'ls' again from above, it
might re-check with the server.

So, when it happens again, please check that the IP address of the
client really is in /proc/net/rpc/auth.unix.ip/content and then

echo 1023 > /proc/sys/sunrpc/nfsd_debug
on the server.
Then on the client,
cd $HOME
ls -l the/offending/directory

See if that works, and see what you get in the kernel logs.
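
The first check above (is the client's IP really in the auth cache?) can be sketched as a small function. The name client_in_auth_cache is hypothetical; the file path and the "nfsd IP domain" line layout are taken from the listings posted earlier in this thread. It matches on whole fields so that 10.113.1.3 does not accidentally match 10.113.1.35:

```shell
# Hypothetical helper: succeed if CLIENT_IP has an "nfsd" entry in the
# auth.unix.ip cache listing. The second argument overrides the path,
# which is handy for checking a saved copy of the file.
client_in_auth_cache() {
    awk -v ip="$1" '
        $1 == "nfsd" && $2 == ip { found = 1 }
        END { exit !found }
    ' "${2:-/proc/net/rpc/auth.unix.ip/content}"
}

# Example (on the server):
#   client_in_auth_cache 10.113.1.35 && echo "client is in the cache"
```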


NeilBrown



2005-04-29 07:01:14

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Hi all,

Just for the record: I'm experiencing the same problem after
I upgraded from the SuSE kernel 2.6.8-24.11 to 2.6.8-24.14
(official SuSE update). The PC server and the client are running
the 2.6.8-24.14 kernel, we have an additional NFS server running
the SLES 9 kernel-pseries64-2.6.5-7.151.

Our users have exactly the same symptoms, either on their homes
or on directories mounted from the pSeries. It happens when
they e.g. have an xterm with a shell and chdir to a directory,
do sth. in there, and then leave the shell untouched for say 30
minutes. After this time, they do a "ls" and get

ls: .: Stale NFS file handle

Calling "cd .." and then "cd <former_directory>", the ls works
again.

I'm currently trying to cut the environment down to a base
scenario by e.g. throwing out all other NFS mounts that we have
from other domains etc. I will try the things you proposed and
send the results here.

Just to report that... Maybe it makes it easier to find the bug when
it's known to happen in 2.4 and 2.6...

cu,
Frank


--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *



2005-04-29 08:00:15

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Hi,

While I was still trying to set up some test environments,
one of our users stepped again on two stale directories, so
I fetched all the info you mentioned. The stale NFS handle occurred
in the user's home (mounted from /export/home from the server).


Neil Brown wrote

>>> echo 2048 > /proc/sys/sunrpc/rpc_debug
>>> grep . /proc/net/rpc/*/content

This is a lot because we have 60 NFS clients. I just grepped the
lines for the client on which the stale NFS handle occurred at that moment,
which is still a lot due to the many mounts:

/proc/net/rpc/auth.unix.ip/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/auth.unix.ip/content:nfsd 141.84.1.156 cauchy.bio.ifi.lmu.de
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/ cauchy.bio.ifi.lmu.de(ro,no_root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/export/clientpass cauchy.bio.ifi.lmu.de(ro,no_root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/export/clientroot cauchy.bio.ifi.lmu.de(rw,no_root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/var cauchy.bio.ifi.lmu.de(ro,root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/export/diskless/141.84.1.156 cauchy.bio.ifi.lmu.de(rw,no_root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.export/content-# expiry=2147483647 refcnt=1
/proc/net/rpc/nfsd.export/content:/export/home cauchy.bio.ifi.lmu.de(rw,root_squash,sync,wdelay)
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0200080000000002 /var
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0100fc0000000004 /export/clientpass
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0100080000000002 /
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0100fc000000000b /export/clientroot
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0100fc0000002e4a /export/diskless/141.84.1.156
--
/proc/net/rpc/nfsd.fh/content-# expiry=2147483647 refcnt=0
/proc/net/rpc/nfsd.fh/content:cauchy.bio.ifi.lmu.de 0 0x0000fc0000000002 /export/home


>>> ls -l /proc/fs/nfsd

babbage /root/tmp# ls -l /proc/fs/nfsd
total 0
dr-xr-xr-x 2 root root 0 Apr 29 09:44 .
dr-xr-xr-x 4 root root 0 Apr 29 09:44 ..

? Should there be sth?


>>> cat /proc/fs/nfs/exports

babbage /root/tmp# cat /proc/fs/nfs/exports |grep cauchy
/ cauchy.bio.ifi.lmu.de(ro,no_root_squash,sync,wdelay)
/export/clientpass cauchy.bio.ifi.lmu.de(ro,no_root_squash,sync,wdelay)
/export/clientroot cauchy.bio.ifi.lmu.de(rw,no_root_squash,sync,wdelay)
/var cauchy.bio.ifi.lmu.de(ro,root_squash,sync,wdelay)
/export/diskless/141.84.1.156 cauchy.bio.ifi.lmu.de(rw,no_root_squash,sync,wdelay)
/export/home cauchy.bio.ifi.lmu.de(rw,root_squash,sync,wdelay)


Additionally, here are the relevant mount options on the client:

cauchy /root# mount | grep 141.84.1.131
/dev/root on / type nfs (rw,v3,rsize=16384,wsize=16384,hard,intr,tcp,nolock,addr=141.84.1.131)
141.84.1.131://export/diskless/141.84.1.156//etc/local on /etc/local type nfs (rw,v3,rsize=16384,wsize=16384,hard,intr,tcp,nolock,addr=141.84.1.131)
141.84.1.131://export/diskless/141.84.1.156//var on /var type nfs (rw,v3,rsize=16384,wsize=16384,hard,intr,tcp,nolock,addr=141.84.1.131)
141.84.1.131:/boot on /boot type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/var/adm on /var/adm type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/var/lib/texmf on /var/lib/texmf type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/var/lib/rpm on /var/lib/rpm type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/var/log/apache2 on /var/httpd type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/export/clientpass on /export type nfs (ro,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
141.84.1.131:/export/clientroot on /export/localhome/root type nfs (rw,tcp,hard,intr,rsize=16384,wsize=16384,addr=141.84.1.131)
babbage:/export/home on /home type nfs (rw,tcp,hard,rsize=16384,wsize=16384,addr=141.84.1.131)



> So, when it happens again, please check that the IP address of the
> client really is in /proc/net/rpc/auth.unix.ip/content and then

Yes, it was definitely there:
babbage /root/tmp# grep 141.84.1.156 /proc/net/rpc/auth.unix.ip/content
nfsd 141.84.1.156 cauchy.bio.ifi.lmu.de


>
> echo 1023 > /proc/sys/sunrpc/nfsd_debug
> on the server.
> Then on the client,
> cd $HOME
> ls -l the/offending/directory

This is amazing: this ls does work, while the one in the shell
where the stale handle occurred still doesn't work. So the directory
is stale in only one shell and not in the other.
I thought that the stale shell would have recovered, too...

> See if that works, and see what you get in the kernel logs.

We did the following:
1) echo 1023 > /proc/sys/sunrpc/nfsd_debug
2) the user did a ls in the stale directory, and got the stale message
again
3) the user did a "ls <the directory>" from another shell and got
the contents of the directory
4) the user did a ls in the stale shell and got the stale message
again
5) echo 0 > /proc/sys/sunrpc/nfsd_debug

So I hope that the log does not contain too much other, useless
information.
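
To keep the capture window in steps 1) to 5) as short as possible, the enable/disable pair can be wrapped around a single command. This is only a sketch: the function name with_nfsd_debug and the NFSD_DEBUG_CTL override variable are invented here, while the control file and the values 1023 and 0 are the ones used above.

```shell
# Hypothetical wrapper: turn nfsd debugging on, run the given command,
# then turn debugging off again and pass the command's exit status on.
# NFSD_DEBUG_CTL is an invented override so the function can be tried
# against an ordinary file instead of the real control file.
with_nfsd_debug() {
    ctl=${NFSD_DEBUG_CTL:-/proc/sys/sunrpc/nfsd_debug}
    echo 1023 > "$ctl" || return 1
    "$@"
    rc=$?
    echo 0 > "$ctl"
    return "$rc"
}

# Example (on the server, as root): keep debugging on for two minutes
# while the user repeats the failing and the working ls on the client.
#   with_nfsd_debug sleep 120
```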


Let me know if I can provide more information!

cu,
Frank



Attachments:
nfsd.debug.bz2 (4.01 kB)

2005-04-29 14:05:20

by Frank Steiner

Subject: Re: Stale File handles keep coming back

One more info:

It seems that we can easily reproduce the problem when
a) an application does some work in a directory and
b) a shell is running in the same directory and left unused for some
time (about 45 minutes seem to be enough).

Then the shell encounters the "stale NFS handle" when it tries to do
sth again.

Scenarios are e.g. an emacs running and writing a .tex file every
few minutes, and a shell is used to call "latex ...". 40 minutes of
editing with emacs without using the shell result in a stale NFS
handle when calling latex in the shell after that time.

Or running a firefox and having the shell chdir to
~/.mozilla/firefox/<the defaults user profile directory>
I guess firefox is writing the cache or sth. else. Again, the shell
will go stale when left untouched for about 40 minutes.

Maybe this can help to reproduce the bug.
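
For anyone who wants to try this without emacs or firefox, the "application doing some work" half of the scenario can be approximated with a loop that periodically rewrites a file in the directory. This is a rough stand-in, not what either application actually does: the function name churn_directory, the file names, and the write-then-rename pattern are all assumptions made for this sketch.

```shell
# Hypothetical stand-in for the busy application: rewrite a file in DIR
# every INTERVAL seconds, ROUNDS times. Each round writes a temporary
# file and renames it over the target (an assumed save pattern).
churn_directory() {
    dir=$1
    interval=$2
    rounds=$3
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo "round $i" > "$dir/.churn.tmp" && mv "$dir/.churn.tmp" "$dir/churn.txt"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example, on an NFS client (testdir is a placeholder directory name):
#   shell A:  churn_directory "$HOME/testdir" 120 30    # about an hour of writes
#   shell B:  cd "$HOME/testdir"; leave it idle for ~45 minutes, then run ls
```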

cu,
Frank



2005-04-29 14:08:47

by Trond Myklebust

Subject: Re: Stale File handles keep coming back

On Fri, 29.04.2005 at 09:01 (+0200), Frank Steiner wrote:

> Our users have exactly the same symptoms, either on their homes
> or on directories mounted from the pSeries. It happens when
> they e.g. have an xterm with a shell and chdir to a directory,
> do sth. in there, and then leave the shell untouched for say 30
> minutes. After this time, they do a "ls" and get
>
> ls: .: Stale NFS file handle
>
> Calling "cd .." and then "cd <former_directory>", the ls works
> again.

Which filesystem is this?

Cheers,
Trond

--
Trond Myklebust <[email protected]>




2005-04-30 13:15:31

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Trond Myklebust wrote

> On Fri, 29.04.2005 at 09:01 (+0200), Frank Steiner wrote:
>
>>Our users have exactly the same symptoms, either on their homes
>>or on directories mounted from the pSeries. It happens when
>>they e.g. have an xterm with a shell and chdir to a directory,
>>do sth. in there, and then leave the shell untouched for say 30
>>minutes. After this time, they do a "ls" and get
>>
>>ls: .: Stale NFS file handle
>>
>>Calling "cd .." and then "cd <former_directory>", the ls works
>>again.
>
> Which filesystem is this?

reiserfs v3 on both servers (PC and pSeries).



2005-04-30 16:29:58

by Trond Myklebust

Subject: Re: Stale File handles keep coming back

On Sat, 2005-04-30 at 15:15 +0200, Frank Steiner wrote:

> reiserfs v3 on both servers (PC and pSeries).

Is the superblock in reiserfs 3.6 format? What
does /proc/fs/reiserfs/<partition>/version say?

If the format is older than 3.6, then you should convert it if you want
to export it through NFS. To do so, you can use the "-oconv" option when
mounting the filesystem (see the reiserfs section in the mount manpage).

Cheers,
Trond




2005-05-12 01:01:50

by Kris Vassallo

Subject: Re: Stale File handles keep coming back

Ok, well I have found very good success in killing this stupid stale
file handle problem in the 2.6.11 kernel by switching from ext3 to ext2
and then rebooting. For the past 5 days not 1 of my developers has
complained about this issue.
Obviously if this machine goes down and a fsck starts running on the 750
GB of data, I am in some deep doo doo, but for now I am using this as a
way of getting around the problem until someone comes up with a nifty
patch.
-Kris

On Tue, 2005-05-03 at 22:44, Frank Steiner wrote:

> Kris Vassallo wrote
>
> > So in reference to this bug
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759
> > it seems as if ridding the system of the journal by instead using ext2
> > is fixing the problem? I can't tell if that bug has anything to do with
> > providing ESTALE errors but it seems to have the same effect where you
> > can't see files even though they are there.
>
> Yes, sounds very similar...
>
>
> >
> > Has anyone tried using the journal_data_ordered option? I am not sure
> > there is a way to do that in reiser but I know it can be done with ext3.
>
> According to the man page, "data=ordered" is the default. Have you
> explicitly changed it?
>
> I couldn't find any option to change this in reiserfs...

2005-05-12 01:14:19

by NeilBrown

Subject: Re: Stale File handles keep coming back

On Wednesday May 11, [email protected] wrote:
> Ok, well I have found very good success in killing this stupid stale
> file handle problem in the 2.6.11 kernel by switching from ext3 to ext2
> and then rebooting. For the past 5 days not 1 of my developers has
> complained about this issue.
> Obviously if this machine goes down and a fsck starts running on the 750
> GB of data, I am in some deep doo doo, but for now I am using this as a
> way of getting around the problem until someone comes up with a nifty
> patch.

It seems likely that this problem is related to the b-tree based
directory indexing used in ext3.

You could try using tune2fs to make sure this is turned off, and try
ext3 again.
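
A sketch of how that check might look. In tune2fs the hashed b-tree directory index is the "dir_index" feature, which shows up on the "Filesystem features" line of "tune2fs -l". The helper name has_dir_index and the device /dev/XXX are placeholders invented for this example.

```shell
# Hypothetical helper: read `tune2fs -l` output on stdin and succeed if
# the dir_index feature appears on the "Filesystem features" line.
has_dir_index() {
    awk -F: '
        /^Filesystem features/ {
            n = split($2, feat, " ")
            for (i = 1; i <= n; i++)
                if (feat[i] == "dir_index") found = 1
        }
        END { exit !found }
    '
}

# Example (as root; /dev/XXX is a placeholder for the exported filesystem):
#   tune2fs -l /dev/XXX | has_dir_index && echo "dir_index is enabled"
#   tune2fs -O ^dir_index /dev/XXX     # clears the feature flag
```

Whether clearing the flag is enough, or whether existing directories also need attention from e2fsck, is something to confirm in the tune2fs and e2fsck man pages before trying this on production data.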

NeilBrown



2005-05-02 06:24:34

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Trond Myklebust wrote

> On Sat, 2005-04-30 at 15:15 +0200, Frank Steiner wrote:
>
>>reiserfs v3 on both servers (PC and pSeries).
>
> Is the superblock in reiserfs 3.6 format? What
> does /proc/fs/reiserfs/<partition>/version say?

Hmm, there is no /proc/fs/reiserfs on my system (?), but

debugreiserfs /dev/mapper/exportraid-home (yes, LVM)

tells me:

Reiserfs super block in block 16 on 0xfc00 of format 3.6 with standard journal

So I guess that's ok. Also note that the problems were introduced only
after the latest kernel update while the filesystem was created a year
ago and running stable since that time!
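
For checking several volumes the same way, the format number can be pulled out of that superblock line with a one-line filter. The function name reiserfs_format is made up for this sketch, and it assumes the "of format X.Y" wording shown above; other debugreiserfs versions may phrase the line differently.

```shell
# Hypothetical helper: read debugreiserfs output on stdin and print the
# superblock format number from a line containing "of format X.Y".
reiserfs_format() {
    sed -n 's/.* of format \([0-9][0-9.]*\) .*/\1/p'
}

# Example:
#   debugreiserfs /dev/mapper/exportraid-home 2>/dev/null | reiserfs_format
```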

I'm currently testing the patch against the access/getattr flooding that
you sent here because a colleague who is running a few computer pools
observes a correlation between high nfs load and the stale NFS handles.

cu,
Frank



2005-05-03 10:45:56

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Frank Steiner wrote

> just for the records: I'm experiencing the same problem after
> I upgraded from the SuSE kernel 2.6.8-24.11 to 2.6.8-24.14

Ok, no longer true. Doing intensive tests I also ran into the problem
in a test suite where server and client were both running 2.6.8-24.11.
Looks like it takes a little longer to trigger on -24.11, so that's
why we might not have stepped on it earlier.

Just to keep Olaf from banging his head on the changeset between -24.11
and 24.14 :-)

I could not yet test if the access/getattr-patch helps because I can't
apply it to 2.6.8. Looks very different to 2.6.11, so I've no idea how
to hack that in 2.6.8.

Still hoping someone here finds the bug :-)

cu,
frank


2005-05-03 11:11:24

by Frank Steiner

Subject: Re: Stale File handles keep coming back

Frank Steiner wrote

> minutes. After this time, they do a "ls" and get
>
> ls: .: Stale NFS file handle
>
> Calling "cd .." and then "cd <former_directory>", the ls works
> again.

We just figured out that the files in the directory can be viewed when
doing "ls specific_file":

wirth [13:09] lx003umv.default 55) ls -la
ls: .: Stale NFS file handle
wirth [13:10] lx003umv.default 56) ls -la defaults.ini
-rw------- 1 tester users 24 Apr 29 13:10 defaults.ini
wirth [13:10] lx003umv.default 57) ls -la
ls: .: Stale NFS file handle

So just "." goes stale, everything below seems to be sane. Maybe that
info is helpful...
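
The distinction shown above (listing "." fails, looking up a named entry works) can be turned into a quick probe for collecting reports from users. The function name probe_dir and its three result strings are invented for this sketch; it only wraps the two ls calls from the transcript.

```shell
# Hypothetical probe: report whether listing DIR works, and if not,
# whether a named ENTRY inside it can still be looked up.
probe_dir() {
    if ls -a "$1" > /dev/null 2>&1; then
        echo "listing ok"
    elif ls -d "$1/$2" > /dev/null 2>&1; then
        echo "listing failed, entry reachable"
    else
        echo "listing and entry both failed"
    fi
}

# Example, run from the shell whose current directory has gone stale:
#   probe_dir . defaults.ini
```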

cu,
Frank


2005-05-03 21:21:29

by Kris Vassallo

Subject: Re: Stale File handles keep coming back

So in reference to this bug
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759
it seems as if ridding the system of the journal by instead using ext2
is fixing the problem? I can't tell if that bug has anything to do with
providing ESTALE errors but it seems to have the same effect where you
can't see files even though they are there.

Has anyone tried using the journal_data_ordered option? I am not sure
there is a way to do that in reiser but I know it can be done with ext3.
I am wondering if stuff is sitting in the journal and there is a lag
between writing in the journal and then having that data written to
disk. Using journal_data_ordered would theoretically fix that problem as
it would force the data to be written to the disk first.
This is just a shot in the dark and I will let you know if I have any
success with this. Does anyone have any ideas on this?

-Kris

On Tue, 2005-04-26 at 05:27, Neil Brown wrote:

> On Monday April 25, [email protected] wrote:
> > I am experiencing a problem with stale file handles, and I have not been
> > able to find an answer in the archives nor has doing anything in the
> > readme helped. I apologize if an answer has already been posted
> > regarding an issue such as this. After many frustrating hours of
> > troubleshooting I am hoping for some help. The mail is long but I am
> > hoping that it answers all the questions.
>
> Very odd...
>
> The output of:
>
> echo 2048 > /proc/sys/sunrpc/rpc_debug
> grep . /proc/net/rpc/*/content
> ls -l /proc/fs/nfsd
> cat /proc/fs/nfs/exports
>
> might help.
>
> Thanks,
> NeilBrown

2005-05-04 05:45:29

by Frank Steiner

[permalink] [raw]
Subject: Re: Stale File handles keep coming back

Kris Vassallo wrote

> So in reference to this bug
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759
> it seems as if ridding the system of the journal by instead using ext2
> is fixing the problem? I can't tell if that bug has anything to do with
> providing ESTALE errors but it seems to have the same effect where you
> can't see files even though they are there.

Yes, sounds very similar...


>
> Has anyone tried using the journal_data_ordered option? I am not sure
> there is a way to do that in reiser but I know it can be done with ext3.

According to the man page, "data=ordered" is the default. Have you
explicitly changed it?

I couldn't find any option to change this in reiserfs...
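
One way to check which options a filesystem is actually mounted with is to read its entry in /proc/mounts. The sketch below is only illustrative: mount_opts is a hypothetical helper name, and whether a data= option shows up at all depends on the kernel and filesystem, so a missing entry does not prove anything either way.

```shell
# mount_opts MOUNTPOINT [MOUNTS_FILE]
# Print the filesystem type and mount options recorded for MOUNTPOINT.
# MOUNTS_FILE defaults to /proc/mounts; the optional argument exists
# only so the helper can be tried against a saved copy.
mount_opts() {
    awk -v mnt="$1" '$2 == mnt { print $3, $4 }' "${2:-/proc/mounts}"
}
```

On the server this would be called as "mount_opts /home"; an empty result means no entry matched that mount point.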

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *



2005-05-04 22:48:23

by Kris Vassallo

[permalink] [raw]
Subject: Re: Stale File handles keep coming back

On Tue, 2005-05-03 at 22:44, Frank Steiner wrote:

> Kris Vassallo wrote
>
> > So in reference to this bug
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759
> > it seems as if ridding the system of the journal by instead using ext2
> > is fixing the problem? I can't tell if that bug has anything to do with
> > providing ESTALE errors but it seems to have the same effect where you
> > can't see files even though they are there.
>
> Yes, sounds very similar...
>
>
> >
> > Has anyone tried using the journal_data_ordered option? I am not sure
> > there is a way to do that in reiser but I know it can be done with ext3.
>
> According to the man page, "data=ordered" is the default. Have you
> explicitly changed it?

Bah! I changed it, no good. The file system was already mounted when I
made the change with tune2fs, but afterwards I did a mount -o remount
and then an exportfs -r, which I think should have made it take effect.
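
The sequence described above can be sketched roughly as follows. The function name, the DRY_RUN switch, and the device and mount point arguments are all placeholders, not anything from the actual server. It is also only an assumption that a remount is enough: a default mount option stored with tune2fs may need a full umount and mount before it applies.

```shell
# set_ordered_mode DEVICE MOUNTPOINT
# Hypothetical helper; DEVICE and MOUNTPOINT are placeholders.
# With DRY_RUN=1 the commands are printed instead of executed.
set_ordered_mode() {
    dev=$1
    mnt=$2
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    # Store data=ordered as a default mount option in the superblock,
    # remount the filesystem, then re-export everything in /etc/exports.
    run tune2fs -o journal_data_ordered "$dev" &&
    run mount -o remount "$mnt" &&
    run exportfs -r
}
```

Setting DRY_RUN=1 first shows the three commands without touching the filesystem.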

>
> I couldn't find any option to change this in reiserfs...

I don't remember who, but someone was using a SuSE kernel that was a
bit older than the Fedora kernel I am using (2.6.11-1.14_FC3smp). I
found something interesting in the changelog on kernel.org that makes
me think this issue was resolved in 2.6.11. Since I am already running
2.6.11, either it isn't fixed or the Fedora team patched something and
broke the NFS fix.
Take a look at the following. It would be interesting to see whether the
2.6.11 release on SuSE fixes this problem (although I don't know if it
has been released yet).
http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.11
And I quote "<[email protected]>

NFSv2/v3/v4: ESTALE should not be a permanent condition on directories.

Although it usually means that someone has deleted a file on the server,
the ESTALE error may also indicate that the sysadmin has used exportfs to
deny our client access to the server. Most NFS implementations therefore
consider it a non-permanent condition, and allow inodes to "recover" when
the sysadmin re-enables access.
If, however, you want to work with broken servers, like unfsd, that reuse
filehandles for new files after the original file gets deleted, then
"recovery" is impossible, since it may be that the filehandle now points
to a different file. Note that this is broken server behaviour that may
happen even without us ever seeing the ESTALE error.
In order to minimize (but we can never eliminate entirely) this race
condition on unfsd servers, Linux has traditionally made ESTALE a
permanent condition on all filehandles except the root filehandle.

The problem is that if we apply this strict staleness criterion to
directories (particularly so for the current directory), then all
processes will need to re-walk the path starting from the mount point,
in order to recover from the sysadmin intervention case. As this is not
usual on other *NIX implementations, and may in any case be undermined by
caching rules etc, this is being seen as a usability problem.

This patch makes ESTALE a non-permanent condition on directories, but
preserves the current behaviour for non-directories.

Signed-off-by: Trond Myklebust <[email protected]>
"

-Kris

2005-05-04 23:06:27

by Trond Myklebust

[permalink] [raw]
Subject: Re: Stale File handles keep coming back

On Wed, 04.05.2005 at 15:48 (-0700), Kris Vassallo wrote:

> Take a look at the following. It would be interesting to see if the
> 2.6.11 release on Suse fixes this problem (although I don't know if
> this has been released yet).
> http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.11
> And I quote "<[email protected]>
> NFSv2/v3/v4: ESTALE should not be a permanent condition on directories.
>

That client-side patch can do nothing to prevent errors from being
returned if a server is erroneously generating ESTALE.

It will just make it unnecessary to do the equivalent of "cd
`pwd`" (i.e. rewalk the path) on the client once the server manages to
sort itself out.
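
As a rough illustration of that manual workaround, a shell function could retry a listing after re-walking the path. This is a sketch only: ls_rewalk is a made-up name, it assumes $PWD still holds the path the shell was in, and it cannot help while the server keeps returning ESTALE.

```shell
# ls_rewalk [LS_OPTIONS...]
# List the current directory; if that fails, re-walk the path with
# cd "$PWD" (the same idea as cd `pwd`) and retry once.
ls_rewalk() {
    ls "$@" . 2>/dev/null && return 0
    # $PWD is the logical path the shell remembers, so cd-ing to it
    # looks up every path component again, starting from the root.
    cd "$PWD" || return 1
    ls "$@" .
}
```

It has to run as a function in the interactive shell, not as a separate script, because the cd must change the working directory of the shell that holds the stale handle.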

Cheers,
Trond




2005-05-05 00:15:48

by Kris Vassallo

[permalink] [raw]
Subject: Re: Stale File handles keep coming back


On Tue, 2005-05-03 at 04:11, Frank Steiner wrote:

> Frank Steiner wrote
>
> > minutes. After this time, they do a "ls" and get
> >
> > ls: .: Stale NFS file handle
> >
> > Calling "cd .." and then "cd <former_directory>", the ls works
> > again.

OK, a question for you. Let me explain what I am noticing: as soon as a
client gets the stale file handle error, I go to the actual NFS server
and do an ls in the directory in question. Then, without anything else
being done, the client does another ls and the contents can be seen. I
don't know what exactly an ls on the server does that would make the
clients suddenly work again.

Are you able to do something similar?

-Kris


>
> We just figured out that the files in the directory can be viewed when
> doing "ls specific_file":
>
> wirth [13:09] lx003umv.default 55) ls -la
> ls: .: Stale NFS file handle
> wirth [13:10] lx003umv.default 56) ls -la defaults.ini
> -rw------- 1 tester users 24 Apr 29 13:10 defaults.ini
> wirth [13:10] lx003umv.default 57) ls -la
> ls: .: Stale NFS file handle
>
> So just "." goes stale, everything below seems to be sane. Maybe that
> info is helpful...
>
> cu,
> Frank

2005-05-06 06:39:14

by Frank Steiner

[permalink] [raw]
Subject: Re: Stale File handles keep coming back

Kris Vassallo wrote

> Ok a question for you. Let me explain what I am noticing: As soon as a
> client gets the stale file error, I go to the actual NFS server, and do
> a ls in the directory in question. Then without doing anything else, the
> client does another ls and then the contents can be seen. I don't know
> exactly what an ls does that would make the clients suddenly work again.
>
> Are you able to do something similar?

No. The client still gets the stale handle. However, it works again
if I do "cd `pwd`" on the client. So, reading Trond's comment

> That client-side patch can do nothing to prevent errors from being
> returned if a server is erroneously generating ESTALE.
>
> It will just make it unnecessary to do the equivalent of "cd
> `pwd`" (i.e. rewalk the path) on the client once the server manages to
> sort itself out.

I'm not sure whether this patch would help or not. I will test it.



This might be interesting for Olaf: The problem does not exist in the
kotd sles9-i386/SLES9_SP2_BRANCH/

cu,
Frank




2005-05-09 06:04:20

by Frank Steiner

[permalink] [raw]
Subject: Re: Stale File handles keep coming back

Frank Steiner wrote

> > That client-side patch can do nothing to prevent errors from being
> > returned if a server is erroneously generating ESTALE.
> >
> > It will just make it unnecessary to do the equivalent of "cd
> > `pwd`" (i.e. rewalk the path) on the client once the server manages to
> > sort itself out.
>
> I'm not sure whether this patch would help or not. I will test it.

Applying that patch to the SuSE kernel 2.6.8-24.14 seems to solve the
stale handle problem for me. Well, reading Trond's explanation, I guess
it does not really "solve" the bug but maybe just hides it... If the
client does the equivalent of "cd `pwd`", it will simply recover from
the stale handle without the user noticing at all.

Anyway, with the normal SuSE 2.6.8-24.14 I get a stale "." whenever
I start firefox on the client and chdir to the firefox profile
directory. After 30 minutes the shell reliably has a stale directory.
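
To time how long it takes before the directory goes stale, a small polling loop like the one below could be left running on the client. It is a sketch under assumptions: watch_dir is a made-up name, the interval and number of tries are arbitrary defaults, and it reports any failed listing, not only ESTALE.

```shell
# watch_dir DIR [INTERVAL_SECONDS] [TRIES]
# Stay inside DIR and list "." every INTERVAL_SECONDS until the
# listing fails or TRIES attempts have passed.
# Returns 0 if a listing failed, 1 if the directory stayed readable.
watch_dir() {
    dir=$1
    interval=${2:-60}
    tries=${3:-60}
    (
        cd "$dir" || exit 2
        i=0
        while [ "$i" -lt "$tries" ]; do
            # Capture only the error text; the listing itself is discarded.
            if ! err=$(ls . 2>&1 >/dev/null); then
                echo "listing failed after $((i * interval))s: $err"
                exit 0
            fi
            sleep "$interval"
            i=$((i + 1))
        done
        echo "still readable after $((tries * interval))s"
        exit 1
    )
}
```

With the defaults it polls once a minute for an hour, which would cover the 30 minutes mentioned above.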

With the patch applied and the client running the patched kernel
(the server doesn't need it), no stale handle has shown up in this
situation after more than one day, and that's all I need to make my
users happy.

People in our department running server and client with 2.6.11 report
they still have massive problems with stale NFS handles, so it looks
like this patch just helped for a certain problem in 2.6.8...

cu,
Frank






