2005-06-23 20:55:44

by Eiwe Lingefors

Subject: More Stale NFS handles

I have trawled the nfs mailing list archives a bit and, judging by
recent posts, there are undoubtedly several others with the same
problem. The problems started when I installed a new
Fedora Core 3 server and migrated home directories to it. As Jim
Farley mentioned in a recent thread on the list, our environment had
been stable for years prior.

The server:
Dell PowerEdge 2850
1 x PowerVault 220S 14x300GB SCSI
Fedora Core 3
Kernel 2.6.9-1.667smp
Several exported LVM volumes all formatted with reiserfs v3.6

The clients:
64 x IBM x335 cluster nodes
RedHat 7.3
Kernel 2.4.26

40 x Sun Blade 150
Solaris 9

Things I have done:
x Upgraded automount on all linux clients to 4.1.4
x Tuned rsize,wsize,timeo,retrans
x Increased number of nfsd processes to 128

So far nothing I have done has reduced the number of stale NFS
file handles.

I'm not sure what additional information might be helpful. The
problems I'm having essentially mirror those of others who have
posted regarding stale NFS handles in the past few months on this
list. I'm at my wits' end after having fought with this problem for
weeks. Any insight or pointers would be deeply appreciated. I'll be
happy to provide additional information if needed.

Thanks,
Eiwe Lingefors

PS. I have quoted Jim Farley's message below since the problems are
essentially identical to mine.

Jim Farley wrote:

> Hi,
>
> We have been running an nfs environment for a couple of years now without
> difficulties. Unfortunately when we upgraded the nfs server from RedHat9 to
> Fedora Core 3 (2.6.9-1.667smp), clients (10 RedHat 9 systems, 2.4.20-8smp)
> started getting stale NFS file handles. I used tcpdump and verified that the
> messages are exchanged between the server and client without delay which
> would indicate the network is not a problem. I have also verified that the
> server is not under heavy load (cpu, memory, network), nor are the clients.
> We basically take the defaults on the client and on the server:
>
> Client:
> cat /proc/mounts:
> 192.168.11.10:/home /home nfs rw,v3,rsize=4096,wsize=4096,hard,intr,udp,lock,addr=192.168.11.10 0 0
>
> Server:
> cat /etc/exports
> /home 192.168.11.0/255.255.255.0(rw,no_subtree_check)
>
> What I have done:
> - increase number of nfsd's to 32
> - disabled caching on one client (noac)
> - reverted the server from the ext3 filesystem to ext2
>
> None of this has helped. Is there anything else I can try?
>
> Thanks,
> Jimmy
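
For anyone comparing their own clients against the mount entry quoted
above, here is a minimal Python sketch that splits one /proc/mounts line
into its fields and options. The function name parse_nfs_mount is
hypothetical and only for illustration; the sample line is the one from
Jim's message.

```python
def parse_nfs_mount(line):
    """Split one /proc/mounts entry into its fields and an options dict."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("not a /proc/mounts entry: %r" % line)
    device, mount_point, fs_type, raw_opts = fields[:4]
    options = {}
    for opt in raw_opts.split(","):
        # "rsize=4096" becomes {"rsize": "4096"}; a bare flag such as
        # "hard" becomes {"hard": True}.
        key, sep, value = opt.partition("=")
        options[key] = value if sep else True
    return {
        "device": device,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options,
    }


# The client mount entry quoted in Jim's message.
entry = parse_nfs_mount(
    "192.168.11.10:/home /home nfs "
    "rw,v3,rsize=4096,wsize=4096,hard,intr,udp,lock,addr=192.168.11.10 0 0"
)
print(entry["options"]["rsize"], entry["options"]["wsize"],
      "udp" in entry["options"])
```

Running the same function over every nfs line in /proc/mounts on each
client makes it easier to spot a node whose options differ from the rest.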



-------------------------------------------------------
SF.Net email is sponsored by: Discover Easy Linux Migration Strategies
from IBM. Find simple to follow Roadmaps, straightforward articles,
informative Webcasts and more! Get everything you need to get up to
speed, fast. http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-06-29 20:28:15

by Kris Vassallo

Subject: Re: More Stale NFS handles

On Thu, 2005-06-23 at 13:55, Eiwe Lingefors wrote:

> The server:
> Dell PowerEdge 2850
> 1 x PowerVault 220S 14x300GB SCSI
> Fedora Core 3
> Kernel 2.6.9-1.667smp
> Several exported LVM volumes all formatted with reiserfs v3.6
> So far nothing I have done has helped reduce the number of stale NFS
> file handles.

Bah! I have had the same problem for months now. I was about to
build yet another server and use reiserfs, but thanks to you I am spared
the agony of finding out it won't work.

>
> I'm not sure what additional information might be helpful. The
> problems I'm having essentially mirror those of others who have
> posted regarding stale NFS handles in the past few months on this
> list. I'm at my wits' end after having fought with this problem for
> weeks.

YES YES YES!!! It almost brings a state of insanity with it!! Plus, atop
my insanity I have a group of 30 developers who are complaining about
not being able to do work because of stale file handles.

> Any insight or pointers would be deeply appreciated. I'll be
> happy to provide additional information if needed.

I was using ext3, upgraded from Fedora Core 1 to Fedora Core 3, and had
the exact same problems. What did it for me, until someone fixes this
problem, was to turn off journaling (so basically it's back to ext2).
Since reverting to ext2 the problem has gone away. Now this is going to
be a huge problem if the machine crashes, because fscking a 1.5 TB disk
array is going to suck! I experimented with the way the journaling gets
done (data goes to disk first or journal first), but that did not fix
the problem.

Aaaaaaargh!
I just thought I would share my ongoing battle story. Hopefully someone
will figure out what's causing this and come up with a fix.

-Kris


>
> Thanks,
> Eiwe Lingefors
>


2005-06-29 20:51:05

by Bill Rugolsky Jr.

Subject: Re: More Stale NFS handles

On Wed, Jun 29, 2005 at 01:28:07PM -0700, Kris Vassallo wrote:
> I was using ext3, upgraded to core 3 from core 1 and had the exact same
> problems. What did it for me, until someone fixes this problem, was to
> turn off journaling (so basically it's back to ext2). Since reverting
> back to ext2 the problem has gone away. Now this is going to be a huge
> problem if the machine crashes because fscking a 1.5 TB disk array is
> going to suck! I experimented with the way the journaling gets done
> (data goes to disk first or journal first) and I wasn't able to fix the
> problem.
>
> Aaaaaaargh!
> I just thought I would share my ongoing battle story, hopefully someone
> will figure out what's causing this and will come up with a fix.

Were you using ext3 directory indexing? [Look for feature "dir_index" on
the "Filesystem features:" line in "/sbin/tune2fs -l <device>" output.]

If so, was the following patch in the kernel that you were using?

http://marc.theaimsgroup.com/?l=ext2-devel&m=111568753527230&w=2

Regards,

Bill Rugolsky
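
The dir_index check Bill describes can be scripted. What follows is only
a sketch: the helper names parse_features and has_dir_index are
hypothetical, the sample text is an illustrative fragment and not output
from any server in this thread, and tune2fs normally needs root to read
the device.

```python
import subprocess


def parse_features(tune2fs_output):
    """Return the feature names from the 'Filesystem features:' line."""
    for line in tune2fs_output.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "Filesystem features":
            return rest.split()
    return []


def has_dir_index(device):
    """Run '/sbin/tune2fs -l <device>' and report whether dir_index is set."""
    result = subprocess.run(
        ["/sbin/tune2fs", "-l", device],
        capture_output=True, text=True, check=True,
    )
    return "dir_index" in parse_features(result.stdout)


# Illustrative fragment only (an assumption, not captured from a real host).
sample = (
    "Filesystem volume name:   <none>\n"
    "Filesystem features:      has_journal dir_index filetype sparse_super\n"
)
print("dir_index" in parse_features(sample))
```

On a real server one would call has_dir_index() with the block device
behind each exported ext3 filesystem; the device path depends on the
local LVM layout and is not given in this thread.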



2005-06-29 21:05:17

by Trond Myklebust

Subject: Re: More Stale NFS handles

On Wed, 29.06.2005 at 13:28 (-0700), Kris Vassallo wrote:

> > I'm not sure what additional information might be helpful. The
> > problems I'm having essentially mirror those of others who have
> > posted regarding stale NFS handles in the past few months on this
> > list. I'm at my wits' end after having fought with this problem for
> > weeks.
> YES YES YES!!! It almost brings a state of insanity with it!! Plus,
> atop my insanity I have a group of 30 developers who are complaining
> about not being able to do work because of stale file handles.

So have you all tried any of the newer server kernels?

There were some important fixes for the server ESTALE problems that went
into 2.6.10 IIRC

http://linux.bkbits.net:8080/linux-2.6/cset@415b3380pxf4sB97gM8ujLqDxi6GfQ

and

http://linux.bkbits.net:8080/linux-2.6/cset@41997cc5nwxi8SY_jN4yhIPUSu_yVA?nav=index.html|src/|src/fs|related/fs/dcache.c

Do the kernels you've tested have these two patches?

Trond
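
As a first, rough check against the 2.6.10 release Trond mentions, one
can compare the server's kernel release string with that version. This
is a sketch with hypothetical helper names, and it rests on an
assumption: it only looks at the upstream version number, so a
distribution kernel that carries backported fixes would still be
reported as older. It does not replace checking for the two changesets
themselves.

```python
import re


def kernel_version(release):
    """Extract the leading x.y.z numbers from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def predates(release, fixed_in=(2, 6, 10)):
    """True if the upstream version in 'release' is older than 'fixed_in'."""
    return kernel_version(release) < fixed_in


# Release strings that appear in this thread.
for release in ("2.6.9-1.667smp", "2.6.11.12", "2.6.12.4"):
    print(release, "predates 2.6.10:", predates(release))
```

On the server itself the release string can be obtained with
platform.release() from the Python standard library, or with uname -r.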




2005-06-24 05:48:22

by Frank Steiner

Subject: Re: More Stale NFS handles

Eiwe Lingefors wrote

> I have trawled the nfs mailing list archives a bit and judging by
> recent posts there are undoubtedly several others with the same
> problem. The problems started happening when I installed a new
> Fedora Core 3 server and migrated home directories to it. Like Jim
> Farley mentioned in a recent thread on the list, our environment has
> been stable for years prior.

There seem to be at least two problems that can cause these. One comes
from the server side and may be journaling related. Read the "Stale File
handles keep coming back" thread. The other one is caused by the client
and can be fixed by a patch Trond sent to this list. It's also
referenced in this thread. This one solved the problem at our site.

cu,
Frank


--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *



2005-06-24 06:16:37

by Trond Myklebust

Subject: Re: More Stale NFS handles

On Fri, 24.06.2005 at 07:48 (+0200), Frank Steiner wrote:

> There seem to be at least two problems that can cause these. One comes
> from the server side and may be journaling related. Read the "Stale File
> handles keep coming back" thread. The other one is caused by the client
> and can be fixed by a patch Trond sent to this list. It's also
> referenced in this thread. This one solved the problem at our site.


The client patch you are talking about just improves the way we recover
from ESTALE errors. It does nothing to fix the actual cause of the
error.

The most common actual causes of ESTALE are listed on

http://nfs.sourceforge.net/#faq_a10

Cheers,
Trond




2005-06-24 06:47:15

by Frank Steiner

Subject: Re: More Stale NFS handles

Trond Myklebust wrote

> On Fri, 24.06.2005 at 07:48 (+0200), Frank Steiner wrote:
>
>>There seem to be at least two problems that can cause these. One comes
>>from the server side and may be journaling related. Read the "Stale File
>>handles keep coming back" thread. The other one is caused by the client
>>and can be fixed by a patch Trond sent to this list. It's also
>>referenced in this thread. This one solved the problem at our site.
>
>
> The client patch you are talking about just improves the way we recover
> from ESTALE errors. It does nothing to fix the actual cause of the
> error.

Ok, that's right. But with this patch we don't see any stale NFS
handles anymore, so we are happy for now :-)


--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *



2005-08-13 11:38:15

by Tobias Diedrich

Subject: Re: NFSv3, 2.6, ext3 and dir_index

Followup for people looking at this thread in the archive:

The patch mentioned in Bill's message seems to be in the 2.6.13 release
candidates. However, my server is still running 2.6.11.12 (without the patch),
but I'm about to upgrade to 2.6.12.4 (with the patch applied).
Hopefully this will fix both errors I was seeing. :-)

Tobias Diedrich wrote on 20 Mar 2005 in
<[email protected]>:
> Sorry for the late reply.
>
> Chip Salzenberg wrote:
>
> > Did you ever get any replies on dir_index vs. nfs?
>
> Not really.
>
> > I'm just setting up a new server and I'm wondering if there's
> > something still out there that might make me sorry to use dir_index.
>
> For now I've just disabled dir_index and I haven't had any problems
> since, but I'm not completely sure if that really was the cause of
> the problem I was seeing. At least disabling dir_index is rather
> easy (remove dir_index feature flag with tune2fs and run e2fsck to
> get rid of the remaining hashtrees).
> Maybe I'll try again to get a proper traffic trace, but I guess I'd
> have to set up a test client first to separate the unrelated nfs traffic.
>
> --
> Tobias PGP: http://9ac7e0bc.uguu.de

Bill Rugolsky Jr. wrote on 29 Jun 2005 in
<[email protected]>:
> On Wed, Jun 29, 2005 at 01:28:07PM -0700, Kris Vassallo wrote:
> > I was using ext3, upgraded to core 3 from core 1 and had the exact same
> > problems. What did it for me, until someone fixes this problem, was to
> > turn off journaling (so basically it's back to ext2). Since reverting
> > back to ext2 the problem has gone away. Now this is going to be a huge
> > problem if the machine crashes because fscking a 1.5 TB disk array is
> > going to suck! I experimented with the way the journaling gets done
> > (data goes to disk first or journal first) and I wasn't able to fix the
> > problem.
> >
> > Aaaaaaargh!
> > I just thought I would share my ongoing battle story, hopefully someone
> > will figure out what's causing this and will come up with a fix.
>
> Were you using ext3 directory indexing? [Look for feature "dir_index" on
> the "Filesystem features:" line in "/sbin/tune2fs -l <device>" output.]
>
> If so, was the following patch in the kernel that you were using?
>
> http://marc.theaimsgroup.com/?l=ext2-devel&m=111568753527230&w=2
>
> Regards,
>
> Bill Rugolsky

--
Tobias PGP: http://9ac7e0bc.uguu.de

