2005-11-14 14:55:49

by Jeremy Sanders

Subject: Diskless boot problems


We have a set of clients which boot over the network, and mount their root
file systems over NFS from a server. All systems run linux-2.6.9-22
(Scientific Linux 4.1, a RHEL clone).

Occasionally the system breaks. For some reason some of the clients
produce errors in the logs like:

portmap: server localhost not responding, timed out
RPC: failed to contact portmap (errno -5).
portmap: server localhost not responding, timed out
RPC: failed to contact portmap (errno -5).
portmap: server localhost not responding, timed out
RPC: failed to contact portmap (errno -5).
lockd_up: makesock failed, error=-5
portmap: server localhost not responding, timed out
RPC: failed to contact portmap (errno -5).
lockd_up: no pid, 2 users??

This happens every few weeks, and appears random. The clients then get
into a funny state where some parts of the system appear to continue
working, but with lockups and hangs.

The systems boot using pxelinux. A busybox initrd sets the IP of the
client using dhcp, and mounts the root file system. This setup is so that
the client can use a standard kernel image.

The mount options the clients use are:

XXX:~> cat /proc/mounts
aaa.bbb.ccc.ddd:/xss_data1/foo / nfs rw,v3,rsize=32768,wsize=32768,hard,udp,nolock,addr=aaa.bbb.ccc.ddd 0 0
...

I'm not sure how the kernel uses rsize,wsize=32768 with udp, but that is
what /proc/mounts reports.
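
To check which options are actually in effect, the /proc/mounts line shown
above can be split into its fields. This is a minimal sketch in Python,
assuming the usual six whitespace-separated fields per line; the function
name parse_mount_line is made up for illustration.

```python
def parse_mount_line(line):
    """Split one /proc/mounts line into its fields and an options dict."""
    device, mountpoint, fstype, options, _dump, _passno = line.split()
    opts = {}
    for item in options.split(","):
        # "rsize=32768" becomes {"rsize": "32768"}; bare flags such as
        # "udp" or "nolock" are stored as True.
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return {"device": device, "mountpoint": mountpoint,
            "fstype": fstype, "options": opts}


line = ("aaa.bbb.ccc.ddd:/xss_data1/foo / nfs "
        "rw,v3,rsize=32768,wsize=32768,hard,udp,nolock,addr=aaa.bbb.ccc.ddd 0 0")
root = parse_mount_line(line)
opts = root["options"]
print(opts["rsize"], opts["wsize"], "udp" in opts, "nolock" in opts)
```

Run against every client, this makes it easy to confirm that the root mount
really is udp with nolock everywhere.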

Any ideas what could cause this? We experienced a very similar problem
with Fedora Core 2, and before that Red Hat 7.3.

Thanks

Jeremy

--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-11-14 15:48:58

by Trond Myklebust

Subject: Re: Diskless boot problems

On Mon, 2005-11-14 at 15:40 +0000, Jeremy Sanders wrote:

> I think it dies - but unfortunately I stupidly rebooted the system this
> time. I may have to wait a week to confirm this.
>
> I tried restarting the portmapper, but this didn't solve the problem.

No. Restarting the portmapper isn't sufficient because you will still
lose all the information that the old session held. If this sort of
crash happens very often, you could try using pmap_dump to save the
portmapper information at regular intervals, then use pmap_set to
restore it after a crash.
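
The save-and-restore idea above could be scripted roughly as follows. This
is only a sketch in Python: it assumes pmap_dump writes the registration
table to standard output, that pmap_set reads it back from standard input,
and that both are on the PATH. The function names and the dump_cmd/set_cmd
parameters are made up for illustration.

```python
import subprocess


def save_portmap_table(path, dump_cmd=("pmap_dump",)):
    """Run pmap_dump and store its output in path; return the bytes saved."""
    result = subprocess.run(list(dump_cmd), capture_output=True, check=True)
    if not result.stdout:
        # Do not overwrite a good snapshot with an empty one.
        return 0
    with open(path, "wb") as fh:
        fh.write(result.stdout)
    return len(result.stdout)


def restore_portmap_table(path, set_cmd=("pmap_set",)):
    """Feed a previously saved table to a freshly restarted portmapper."""
    with open(path, "rb") as fh:
        subprocess.run(list(set_cmd), stdin=fh, check=True)
```

save_portmap_table() would be called at regular intervals (from cron, say)
with a state file of your choosing, and restore_portmap_table() once after
the portmapper has been restarted.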

> There was a mount to another nfs server stuck in a "D" state (that server
> has the same OS but is mounted with locking and tcp).

lockd is a shared resource, so if it is failing to start up, then that
would affect all partitions that are being mounted with locking.

Cheers,
Trond




2005-11-14 15:57:44

by Trond Myklebust

Subject: Re: Diskless boot problems

On Mon, 2005-11-14 at 15:54 +0000, Jeremy Sanders wrote:
> On Mon, 14 Nov 2005, Trond Myklebust wrote:
>
> > On Mon, 2005-11-14 at 15:40 +0000, Jeremy Sanders wrote:
> >
> >> I think it dies - but unfortunately I stupidly rebooted the system this
> >> time. I may have to wait a week to confirm this.
> >>
> >> I tried restarting the portmapper, but this didn't solve the problem.
> >
> > No. Restarting the portmapper isn't sufficient because you will still
> > lose all the information that the old session held. If this sort of
> > crash happens very often, you could try using pmap_dump to save the
> > portmapper information at regular intervals, then use pmap_set to
> > restore it after a crash.
>
> Is there any way to debug it next time it crashes? I suppose the problem
> is that unless I monitor all the traffic, I won't be able to work out how
> it crashed.
>

Portmap should normally send all its output to the syslogger, but if you
want extra debugging info, there is a '-v' flag that sets it in verbose
mode.
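
Once the messages are in syslog, a small script can pick out when each kind
of error first appeared. The sketch below is in Python and only matches the
kernel messages quoted at the start of this thread; the pattern names and
the function first_occurrences are made up for illustration.

```python
import re

# Patterns for the messages reported in the original post.
PATTERNS = {
    "portmap_timeout": re.compile(r"portmap: server \S+ not responding, timed out"),
    "rpc_portmap_fail": re.compile(r"RPC: failed to contact portmap \(errno (-?\d+)\)"),
    "lockd_up": re.compile(r"lockd_up: "),
}


def first_occurrences(lines):
    """Return {pattern name: first matching line} for the given log lines."""
    found = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if name not in found and pattern.search(line):
                found[name] = line.rstrip("\n")
    return found
```

It accepts any iterable of lines, so an open log file can be passed in
directly; which file syslog writes to depends on the local configuration.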

Cheers,
Trond




2005-11-14 15:54:42

by Jeremy Sanders

Subject: Re: Diskless boot problems

On Mon, 14 Nov 2005, Trond Myklebust wrote:

> On Mon, 2005-11-14 at 15:40 +0000, Jeremy Sanders wrote:
>
>> I think it dies - but unfortunately I stupidly rebooted the system this
>> time. I may have to wait a week to confirm this.
>>
>> I tried restarting the portmapper, but this didn't solve the problem.
>
> No. Restarting the portmapper isn't sufficient because you will still
> lose all the information that the old session held. If this sort of
> crash happens very often, you could try using pmap_dump to save the
> portmapper information at regular intervals, then use pmap_set to
> restore it after a crash.

Is there any way to debug it next time it crashes? I suppose the problem
is that unless I monitor all the traffic, I won't be able to work out how
it crashed.

Jeremy

--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



2005-11-14 15:32:32

by Trond Myklebust

Subject: Re: Diskless boot problems

On Mon, 2005-11-14 at 14:55 +0000, Jeremy Sanders wrote:
> We have a set of clients which boot over the network, and mount their root
> file systems over NFS from a server. All systems run linux-2.6.9-22
> (Scientific Linux 4.1, a RHEL clone).
>
> Occasionally the system breaks. For some reason some of the clients
> produce errors in the logs like:
>
> portmap: server localhost not responding, timed out
> RPC: failed to contact portmap (errno -5).
> portmap: server localhost not responding, timed out
> RPC: failed to contact portmap (errno -5).
> portmap: server localhost not responding, timed out
> RPC: failed to contact portmap (errno -5).
> lockd_up: makesock failed, error=-5
> portmap: server localhost not responding, timed out
> RPC: failed to contact portmap (errno -5).
> lockd_up: no pid, 2 users??

So what is the portmap daemon doing when this happens? Is it running? Is
it dead? Is it hanging?
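
One way to tell these three cases apart from the client itself is to send
the portmapper an RPC NULL call over UDP and see what comes back. The
following is a rough sketch in Python, not a tested diagnostic tool. It
assumes portmap version 2 on port 111, and that a Linux connected UDP socket
reports "connection refused" when nothing is bound to the port. The function
names are made up for illustration.

```python
import socket
import struct

PMAP_PROG = 100000  # RPC program number of the portmapper
PMAP_VERS = 2
PMAP_PORT = 111


def build_null_call(xid):
    """ONC RPC call for procedure 0 (NULL) with AUTH_NULL credentials."""
    return struct.pack(">10I",
                       xid, 0,                   # xid, message type CALL
                       2,                        # RPC protocol version
                       PMAP_PROG, PMAP_VERS, 0,  # program, version, procedure
                       0, 0,                     # credentials: AUTH_NULL, length 0
                       0, 0)                     # verifier: AUTH_NULL, length 0


def is_success_reply(data, xid):
    """True if data is an accepted, successful reply to the call xid."""
    if len(data) < 20:
        return False
    rxid, msg_type, reply_stat, _flavor, vlen = struct.unpack(">5I", data[:20])
    offset = 20 + ((vlen + 3) // 4) * 4  # skip the padded verifier body
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return rxid == xid and msg_type == 1 and reply_stat == 0 and accept_stat == 0


def probe_portmap(host="127.0.0.1", port=PMAP_PORT, timeout=5.0, xid=0x12345678):
    """Classify the portmapper as 'running', 'dead' or 'hanging'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))
        try:
            sock.send(build_null_call(xid))
            data = sock.recv(1024)
        except ConnectionRefusedError:
            return "dead"      # nothing is bound to the port
        except socket.timeout:
            return "hanging"   # port is bound (or packets are lost), no reply
    return "running" if is_success_reply(data, xid) else "unexpected reply"
```

A "hanging" result is only a hint, since a dropped UDP packet looks the
same; repeating the probe a few times makes the answer more trustworthy.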

Cheers,
Trond




2005-11-14 15:43:41

by Jeremy Sanders

Subject: Re: Diskless boot problems

On Mon, 14 Nov 2005, Trond Myklebust wrote:

>> portmap: server localhost not responding, timed out
>> RPC: failed to contact portmap (errno -5).
>> portmap: server localhost not responding, timed out
>> RPC: failed to contact portmap (errno -5).
>> portmap: server localhost not responding, timed out
>> RPC: failed to contact portmap (errno -5).
>> lockd_up: makesock failed, error=-5
>> portmap: server localhost not responding, timed out
>> RPC: failed to contact portmap (errno -5).
>> lockd_up: no pid, 2 users??
>
> So what is the portmap daemon doing when this happens? Is it running? Is
> it dead? Is it hanging?

I think it dies - but unfortunately I stupidly rebooted the system this
time. I may have to wait a week to confirm this.

I tried restarting the portmapper, but this didn't solve the problem.

There was a mount to another nfs server stuck in a "D" state (that server
has the same OS but is mounted with locking and tcp).

Jeremy

--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

