2004-08-20 12:12:21

by Frank Steiner

Subject: client apps not surviving nfsd restart

Hi,

We are running a diskless client system. For a long time, we used kernel
2.4 (.16-.21, SuSE versions) and nfs-over-udp. We never encountered any
problems when the server had to reboot. All of the clients (about 40, of
which about 20 are workstations running KDE or GNOME with several apps)
survived the server reboot without problems, and in particular without stale
NFS handles (at least no one has complained about them for 1.5 years :-))

Now we have some problems:

1) After installing SuSE 9.0, the default was set to nfs-over-tcp. I didn't
know that, but suddenly after every server reboot I had at least 4-5
of the workstation users complaining about stale NFS handles, e.g.,
/usr was stale and so java didn't start anymore etc. It's not really
reproducible by some certain sequence of starting apps and rebooting
the server etc., but it happens with a few clients every time the
server has to reboot.

In the NFS howto I read that the disadvantage of nfs-over-tcp is that
"If your server crashes in the middle of a packet transmission, the
client will hang and any shares will need to be unmounted and remounted."

But I thought a clean reboot with a clean stop and later start of the
nfsserver shouldn't cause a problem.

Is there any way to work around these problems? Any known reasons
why with nfs-over-tcp we have these stales which we did not experience
with udp before?
Note that we do not change export options or anything on the file
system during these server reboots.

While debugging, I was also able to make /usr stale just by restarting
the nfsd a few times (with sleeps in between, so as not to run into the
race condition Neil solved with the recent patch :-)). Again, not reliably
reproducible...

2) We are currently testing kernel 2.6.8.1. The nfs behaviour seems
to have changed in some ways. Running e.g. "find /" on a diskless
client with kernel 2.4 would just hang when the server rebooted
and later go on when the server was back.
With 2.6.8.1, the find command will immediately abort and report
some stale nfs handles.
This causes many more client applications to abort when the server
reboots (we have many programs doing a lot of I/O: creating files,
deleting, traversing directories etc., and they are good
candidates for such failures).

This happens the same way with udp/tcp, with intr/nointr. Is there
a way to make the clients behave like with the 2.4 kernel? So they
just get stuck when the server shuts down and wait and continue when
the server is back?

3) This is a general problem with 2.6 and 2.4:
When e.g. copying a large file from an nfs-mounted directory to a
local partition and the nfs server goes down, the cp immediately
aborts with
cp: reading `SLES-8-SP-3a-ppc-RC3-CD1.iso': Stale NFS file handle

This is also independent of udp/tcp or intr/nointr. Should that
happen? Is there a way to make cp just hang until the server comes
back?

Similar things happen with other apps. With rsync instead of cp, the
copy will indeed just stop and wait until the server is back, but when
it has finished copying the file it will complain

read errors mapping "SLES-8-SP-3a-ppc-RC3-CD1.iso": (5) Input/output error


So I end up with three questions:

- Is it possible to make all applications just hang and wait until the
nfs server comes back instead of aborting their work (like find/cp)?
- Are there some general hints on how to avoid stale nfs handles during a
server reboot? I didn't find much about this on Google.
- How do they happen at all when the file system on the server is not
changed during the reboot? Things like /usr in particular don't move
away or anything, so how can a stale nfs handle for "/usr" happen?


Thanks for any help!

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049



-------------------------------------------------------
SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media
100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33
Save 50% off Retail on Ink & Toner - Free Shipping and Free Gift.
http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-08-20 13:06:34

by Olaf Kirch

Subject: Re: client apps not surviving nfsd restart

On Fri, Aug 20, 2004 at 02:12:15PM +0200, Frank Steiner wrote:
> In the NFS howto I read that the disadvantage of nfs-over-tcp is that
> "If your server crashes in the middle of a packet transmission, the
> client will hang and any shares will need to be unmounted and
> remounted."

I have never observed this behavior. TCP reconnect works just fine.

> 2) We are currently testing kernel 2.6.8.1. The nfs behaviour seems
> to have changed in some ways. Running e.g. "find /" on a diskless
> client with kernel 2.4 would just hang when the server rebooted
> and later go on when the server was back.

What does your exports table look like? Do you have network exports
or other kinds of wildcarding? I have a report of nfsd/mountd not
properly reloading the kernel's export table after a nfsd restart.

> With 2.6.8.1, the find command will immediately abort and report
> some stale nfs handles.

Do you mean it will report ESTALE when the server goes down, or when
it comes back up?
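
One way to answer this question is to poll a path on the client with timestamps while the server is restarted. The following is only a minimal sketch: probe_path and watch_path are hypothetical helper names, and it assumes a stat(1) whose error text contains the word "stale" for ESTALE (as the C-locale message "Stale NFS file handle" seen elsewhere in this thread does). A stat may also be answered from the client's attribute cache, so a clean "ok" is weaker evidence than a "stale".

```shell
#!/bin/sh
# Hypothetical helper: classify one stat(1) attempt on a path.
# Prints "ok", "stale" (error text mentions a stale handle), or "error".
probe_path() {
    # Capture only stderr; stat's normal output is discarded.
    if err=$(LC_ALL=C stat "$1" 2>&1 >/dev/null); then
        echo ok
    else
        case $err in
            *[Ss]tale*) echo stale ;;
            *)          echo error ;;
        esac
    fi
}

# Hypothetical helper: probe a path once per second, COUNT times, and
# timestamp each result so the log can be lined up with the moment the
# server went down and the moment it came back.
watch_path() {
    path=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date '+%H:%M:%S') $(probe_path "$path")"
        i=$((i + 1))
        sleep 1
    done
}
```

Running something like `watch_path /usr 120` on a client during the restart would show either a gap in the timestamps (the probe blocked until the server returned) or a run of "stale" lines starting at a particular time.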

Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+



2004-08-20 13:22:02

by Olaf Kirch

Subject: Re: client apps not surviving nfsd restart

On Fri, Aug 20, 2004 at 02:12:15PM +0200, Frank Steiner wrote:
> 2) We are currently testing kernel 2.6.8.1. The nfs behaviour seems
> to have changed in some ways. Running e.g. "find /" on a diskless
> client with kernel 2.4 would just hang when the server rebooted
> and later go on when the server was back.
> With 2.6.8.1, the find command will immediately abort and report
> some stale nfs handles.

I reproduced this. Someone "improved" the init script to do this:

/usr/sbin/exportfs -au
killproc -n -KILL nfsd

so we first zap the exports tables, then stop all nfsds. Any NFS
requests received in the meanwhile will see ESTALE.

Try removing the exportfs call.

Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+



2004-08-20 13:25:46

by Bernd Schubert

Subject: Re: client apps not surviving nfsd restart

Hello Frank,


> 1) After installing SuSE 9.0, the default was set to nfs-over-tcp. I didn't
> know that, but suddenly after every server reboot I had at least 4-5
> of the workstation users complaining about stale NFS handles, e.g.,
> /usr was stale and so java didn't start anymore etc. It's not really
> reproducible by some certain sequence of starting apps and rebooting
> the server etc., but it happens every time the server has to reboot
> with a few clients.

Can you confirm this with vanilla client/server kernel versions? We have 45
diskless clients here and none of them has a problem on a server reboot.
This has worked for more than two years with vanilla 2.4.X kernel versions.
For about 4-5 months we have also been using nfs-over-tcp, and this has had
no negative effect.
We also tried using 2.6.7 on our server, but it always crashed every morning
with page allocation errors, so we had to give up with this version and
switched back to 2.4.X. However, although the server was rebooted every
morning (and/or the failover server took over), this caused no problems for
the clients (except for the directories mounted from ClusterNFS, but that's a
different, more complicated CNFSD-related story).

>
> In the NFS howto I read that the disadvantage of nfs-over-tcp is that
> "If your server crashes in the middle of a packet transmission, the
> client will hang and any shares will need to be unmounted and
> remounted."

This howto seems to be slightly outdated.

>
> But I thought a clean reboot with a clean stop and later start of the
> nfsserver shouldn't make a problem.

It doesn't cause problems with vanilla kernel versions.

[snip]

>
> 2) We are currently testing kernel 2.6.8.1. The nfs behaviour seems
> to have changed in some ways. Running e.g. "find /" on a diskless
> client with kernel 2.4 would just hang when the server rebooted
> and later go on when the server was back.
> With 2.6.8.1, the find command will immediately abort and report
> some stale nfs handles.

We only tested 2.6.7, and it caused no such problems. As we have failover, I
tested failover during file transfers, and this worked like a charm.


Cheers,
Bernd

PS: About 1.5 years ago we were forced to use a SuSE kernel on our server (due
to a closed-source, binary-only kernel module) and this kernel caused a lot
of NFS-related trouble for us. As the module also didn't work properly, we
finally decided not to buy this software ;)


--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [email protected]



2004-08-20 13:55:51

by Frank Steiner

Subject: Re: client apps not surviving nfsd restart

Olaf Kirch wrote:
> On Fri, Aug 20, 2004 at 02:12:15PM +0200, Frank Steiner wrote:
>
>> In the NFS howto I read that the disadvantage of nfs-over-tcp is that
>> "If your server crashes in the middle of a packet transmission, the
>> client will hang and any shares will need to be unmounted and
>> remounted."
>
>
> I have never observed this behavior. TCP reconnect works just fine.

Hmm, ok. So the stales we saw might have been caused by some other
difference between SuSE 7.3 and 9.0. Could the lock manager be a
cause for this? With SuSE 7.3 we had everything mounted with "nolock".
Now with SuSE 9.0 we use locks. Could that cause stales like we
experienced on /usr?

>
>
>>2) We are currently testing kernel 2.6.8.1. The nfs behaviour seems
>> to have changed in some ways. Running e.g. "find /" on a diskless
>> client with kernel 2.4 would just hang when the server rebooted
>> and later go on when the server was back.
>
>
> What does your exports table look like? Do you have network exports
> or other kinds of wildcarding? I have a report of nfsd/mountd not
> properly reloading the kernel's export table after a nfsd restart.

We are using netgroups. For testing I removed them and just left
entries for one diskless client, without any wildcards or groups,
but it does not make a difference (I rebooted client and server
with the new exports just to make sure...)

>
>
>> With 2.6.8.1, the find command will immediately abort and report
>> some stale nfs handles.
>
>
> Do you mean it will report ESTALE when the server goes down, or when
> it comes back up?

Immediately when the server goes down. "find" and "cp" immediately
abort and I'm back on the shell prompt.

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* Rekursion kann man erst verstehen, wenn man Rekursion verstanden hat. *



2004-08-20 14:26:45

by Frank Steiner

Subject: Re: client apps not surviving nfsd restart

Olaf Kirch wrote:

> I reproduced this. Someone "improved" the init script to do this:
>
> /usr/sbin/exportfs -au
> killproc -n -KILL nfsd
>
> so we first zap the exports tables, then stop all nfsds. Any NFS
> requests received in the meanwhile will see ESTALE.
>
> Try removing the exportfs call.

Yes, that makes the difference! The "find" now again just hangs and
after the server returns, continues without any problems. Same with
"cp". "rsync" now also doesn't complain at the end anymore!

I feel stupid about this: When reporting the race problem with restarting
the server on the lkml, Neil already told me that the "exportfs -au"
should not be called before killing the nfsd and so I changed the order.
But because this had no effect for the race bug, I forgot it and didn't
change the init script for all our hosts.
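
The reordered stop sequence can be sketched roughly as follows. This is not the actual SuSE init script, only an illustration of the ordering discussed in this thread: nfsd_stop is a hypothetical function name, EXPORTFS and KILLPROC are hypothetical override hooks added so the sequence can be dry-run, and the two commands themselves are taken from the snippet quoted above.

```shell
#!/bin/sh
# Sketch of the stop sequence with the order reversed: stop the nfsd
# threads first, then clear the export table.
# EXPORTFS and KILLPROC are hypothetical hooks (not part of any real
# init script) so that the order can be checked with "echo".
: "${EXPORTFS:=/usr/sbin/exportfs}"
: "${KILLPROC:=killproc}"

nfsd_stop() {
    # 1. Stop the nfsd threads, so no request is answered while the
    #    export table is about to be emptied.
    $KILLPROC -n -KILL nfsd
    # 2. Only now unexport everything. Dropping this step entirely,
    #    as suggested earlier in the thread, is the other option.
    $EXPORTFS -au
}
```

For a dry run, set `KILLPROC="echo killproc"` and `EXPORTFS="echo exportfs"` before calling nfsd_stop; it then only prints the two commands in the order they would run.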

So I guess that all the stales and problems we had since 9.0 were just
caused by the bug in the script... And I was already so disappointed
by nfs-over-tcp because I blamed it for the stales :-)

Thanks for your help! I will bugfix our scripts and then give 2.6 a
try with nfs-over-tcp :-)

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049




2004-08-20 14:34:03

by Frank Steiner

Subject: Re: client apps not surviving nfsd restart

Hi Bernd,

Bernd Schubert wrote:

> can you confirm this with vanilla client/server kernel version? We have 45

As Olaf figured out, it was indeed just a little bug in the init script...


> PS: About 1.5 years ago we were forced to use a Suse kernel on our server (due
> to a closed-sources binary only kernel module) and this kernel caused a lot
> of nfs related trouble for us. As the module also didn't work properly we
> finally decided not to buy this software ;)

Hmm, until now (when we need a more recent version of 2.6 than the
2.6.5 SuSE shipped with 9.1) I've always been using SuSE kernels, and
I really loved them for many reasons. They had always included a lot of
useful stuff which I'm now forced to add all by myself (things like the
bcm5700 driver in 2.4 when the tg3 crashed our server every night :-))

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049
* Rekursion kann man erst verstehen, wenn man Rekursion verstanden hat. *



2004-08-20 14:47:05

by Olaf Kirch

Subject: Re: client apps not surviving nfsd restart

> Hmm, until now (when we need a more recent version of 2.6 than the
> 2.6.5 SuSE shipped with 9.1) I've always been using SuSE kernels, and
> I really loved them for many reasons. They had always included a lot of
> useful stuff which I'm now forced to add all by myself (things like the
> bcm5700 driver in 2.4 when the tg3 crashed our server every night :-))

There is the "kernel of the day" at ftp://ftp.suse.com/pub/people/mantel/kotd

This is purely for testing, but people who, for some reason, need bleeding
edge kernels, seem to find it useful.

Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+



2004-08-20 15:31:33

by mehta kiran

Subject: Re: client apps not surviving nfsd restart

hi,

Client apps survive only if NFS is restarted on
the same machine (running SuSE 9).

My setup is as follows:

1. A shared disk with two partitions. One partition
   is mounted on /var/lib/nfs and the other contains
   the filesystem to be exported.

2. Two nodes in the cluster, with a virtual
   (floating) IP through which the client accesses
   the exported filesystem.

3. Only one client accessing the filesystem
   which the server exports.

I have written a program which opens a file on this
filesystem, locks it, sleeps for 10 seconds, and
then unlocks it.

I followed these steps and found that the client app hangs:

1. On the client node I started the program.
2. Once the lock was acquired, I killed
   nfsd running on the server (on node one).
3. When nfsd comes up on the second node in the
   cluster, the client is not able to communicate
   with the server and hangs for a long time.
   (It eventually gives a "permission denied" error;
   I execute step 4 before this error appears.)
4. If after some time nfsd is brought up again
   on the first node, the client program communicates
   with the NFS server and releases the lock.

There is no problem with the floating IP.
Also, both machines in the cluster have exactly
the same configuration and minimal security.

Using netstat and tcpdump, I found that the
floating IP works well but the client is not
able to communicate with the new NFS server (on
the second node). Netstat shows that the client
continuously sends SYN packets to the NFS server but
gets no ACK packet.
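
To narrow such a capture down to the connection setup, a tcpdump filter along the following lines might help. It is only a sketch: nfs_handshake_filter is a hypothetical helper name, and it assumes NFS over TCP on the standard port 2049. The filter matches every segment with SYN or RST set, so SYN, SYN-ACK and RST all show up, which makes it visible whether the second node answers at all.

```shell
#!/bin/sh
# Hypothetical helper: print a pcap filter expression that matches TCP
# segments to or from the given address on port 2049 (assumed NFS port)
# with the SYN or RST flag set.
nfs_handshake_filter() {
    printf 'host %s and tcp port 2049 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)\n' "$1"
}
```

It would be used as `tcpdump -n -i eth0 "$(nfs_handshake_filter 192.0.2.10)"`, where eth0 and 192.0.2.10 are placeholders for the real interface and the floating IP. Running the capture on both the client and the second node shows whether the SYNs arrive there at all.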

I have been debugging this for the past 2 days but
have found no solution.


thanks,
kiran

2004-08-20 18:28:46

by Hans-Peter Jansen

Subject: Re: client apps not surviving nfsd restart

On Friday 20 August 2004 16:46, Olaf Kirch wrote:
> > Hmm, until now (when we need a more recent version of 2.6 than
> > the 2.6.5 SuSE shipped with 9.1) I've always been using SuSE
> > kernels, and I really loved them for many reasons. They had
> > always included a lot of useful stuff which I'm now forced to add
> > all by myself (things like the bcm5700 driver in 2.4 when the tg3
> > crashed our server every night :-))

The SuSE kernel is much better than its reputation, which stems from a few
people taking a cursory look into patches.{common,i386}. I imagine this
scared the bones out of their bodies, since a single human won't get
a grip on it in a timely fashion. Others, who watched them, have propagated
this FUD ever since. OTOH, the changelog could be more specific.

I used to roll my own for diskless usage for a long time, but I'm
very happy to have learned to use the official ones out of the box
without hassles.

> There is the "kernel of the day" at
> ftp://ftp.suse.com/pub/people/mantel/kotd

The preferred path is ftp://ftp.suse.com/pub/projects/kernel/kotd
or better yet, because rsyncable:
ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/projects/kernel/kotd

> This is purely for testing, but people who, for some reason, need
> bleeding edge kernels, seem to find it useful.

It saved my life at least once ;-)

> Olaf


