Hi,
My setup is diskless nodes with an NFSv3 nfsroot (plus some extra NFSv3 mounts) and NFSv4 automounts for /home/*.
CentOS 5.4, kernel 2.6.18-164.11.1.el5.
Periodically my nodes hang, and nothing appears in the logs (remote syslog + netconsole).
A hung node is still partly alive: it answers ping, and some daemons (for example pbs_mom) still report that they are alive.
But anything that requires filesystem access is frozen.
Another symptom: it looks like portmap doesn't answer. At least, if I try "rpcinfo -p node_name", it ends with
"rpcinfo: can't contact portmapper: rpcinfo: RPC: Timed out"
In principle, this could be related to locking.
At least, I had to mount all my NFSv3 mounts with nolock to reduce the frequency of the problem (the nfsroot was already nolock, obviously, but there are a couple of extra v3 mounts, like /opt with extra software and a RW directory for torque).
What could be the problem here?
What kind of information should I collect from the system to figure out what the real problem is?
Anton.
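
The rpcinfo timeout described above can also be checked from a monitoring host without rpcinfo itself. The following is a minimal sketch, not a tested diagnostic tool: it sends an ONC RPC NULL call (program 100000, version 2, procedure 0) to UDP port 111 and reports whether the portmapper answers. The function names, the fixed xid, and the 5-second timeout are assumptions made for illustration.

```python
import socket
import struct

PMAP_PROG = 100000   # portmapper RPC program number
PMAP_VERS = 2        # portmap protocol version 2
PMAP_PORT = 111      # well-known portmapper port


def build_null_call(xid):
    """Build an RPC CALL for procedure 0 (NULL) with AUTH_NONE credentials."""
    # Fields: xid, msg_type=CALL(0), rpcvers=2, prog, vers, proc=0,
    # cred flavor/length = 0/0, verifier flavor/length = 0/0.
    return struct.pack(">10I", xid, 0, 2, PMAP_PROG, PMAP_VERS, 0, 0, 0, 0, 0)


def parse_reply(data, xid):
    """Return True if data is an accepted, successful reply to our xid."""
    if len(data) < 24:
        return False
    rxid, mtype, reply_stat, _flavor, verf_len = struct.unpack(">5I", data[:20])
    # msg_type must be REPLY(1) and reply_stat must be MSG_ACCEPTED(0).
    if rxid != xid or mtype != 1 or reply_stat != 0:
        return False
    # Skip the verifier body, which is padded to a multiple of 4 bytes.
    offset = 20 + ((verf_len + 3) & ~3)
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return accept_stat == 0  # SUCCESS


def probe_portmapper(host, timeout=5.0):
    """Return 'ok', 'timeout', 'bad reply' or 'error: ...' for the given host."""
    xid = 0x4E465331  # arbitrary transaction id (assumption)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_null_call(xid), (host, PMAP_PORT))
            data, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return "error: %s" % exc
    return "ok" if parse_reply(data, xid) else "bad reply"
```

Calling probe_portmapper("node_name") in a loop from a healthy machine and logging the first "timeout" would give a rough timestamp for the hang, which can then be matched against server-side logs.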
On Mon, Mar 01, 2010 at 04:01:42PM +0100, Anton Starikov wrote:
> Hi,
>
>
> My setup is diskless nodes with an NFSv3 nfsroot (plus some extra NFSv3 mounts) and NFSv4 automounts for /home/*.
> CentOS 5.4, kernel 2.6.18-164.11.1.el5.
That's the client? What's the server?
That's a pretty old kernel; I'd file a bug with CentOS.
> Periodically my nodes hang, and nothing appears in the logs (remote syslog + netconsole).
> A hung node is still partly alive: it answers ping, and some daemons (for example pbs_mom) still report that they are alive.
> But anything that requires filesystem access is frozen.
>
> Another symptom: it looks like portmap doesn't answer. At least, if I try "rpcinfo -p node_name", it ends with
> "rpcinfo: can't contact portmapper: rpcinfo: RPC: Timed out"
>
> In principle, this could be related to locking.
> At least, I had to mount all my NFSv3 mounts with nolock to reduce the frequency of the problem (the nfsroot was already nolock, obviously, but there are a couple of extra v3 mounts, like /opt with extra software and a RW directory for torque).
>
> What could be the problem here?
>
> What kind of information should I collect from the system to figure out what the real problem is?
Is there any server-side logging?
Can you see any interesting network traffic after the hang?
--b.
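
On the client side, one piece of information that can be collected during a hang is which tasks are stuck in uninterruptible sleep and where in the kernel they are waiting. The following is a minimal sketch under stated assumptions: it reads /proc/<pid>/stat and /proc/<pid>/wchan, the function names are hypothetical, and the interpreter must already be running or live on a local tmpfs, because starting anything from the frozen nfsroot would hang as well.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting blindly.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm, wchan) for every task in state D (uninterruptible)."""
    found = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue  # skip non-process entries such as 'self' or 'net'
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, name, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan may be unreadable or absent
        found.append((pid, comm, wchan))
    return sorted(found)


def format_report(tasks):
    """Render the task list as plain text, one task per line."""
    return "\n".join("%6d  %-20s %s" % t for t in tasks)
```

Printing format_report(blocked_tasks()) every few seconds to the serial console or netconsole would show whether the stuck tasks are waiting in RPC, lockd, or idmap-related code, which is the kind of detail worth attaching to a bug report.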
On Tue, Mar 02, 2010 at 06:20:43PM +0100, Anton Starikov wrote:
>
> On Mar 2, 2010, at 6:16 PM, J. Bruce Fields wrote:
>
> > On Mon, Mar 01, 2010 at 04:01:42PM +0100, Anton Starikov wrote:
> >> Hi,
> >>
> >>
> >> My setup is diskless nodes with an NFSv3 nfsroot (plus some extra NFSv3 mounts) and NFSv4 automounts for /home/*.
> >> CentOS 5.4, kernel 2.6.18-164.11.1.el5.
> >
> > That's the client? What's the server?
>
> The server is OpenSolaris.
>
> > That's a pretty old kernel; I'd file a bug with CentOS.
>
> Unfortunately, with newer kernels this setup is even more problematic. :)
Any details?
As a rule, this list is probably going to be a better place to handle
bugs with the latest upstream kernels, and your distributor is more
likely to be useful for their kernels.
--b.
>
> >>
> >> What kind of information should I collect from the system to figure out what the real problem is?
> >
> > Is there any server-side logging?
> > Can you see any interesting network traffic after the hang?
>
> As always happens, for the last couple of days I can't get a hang :) Nothing has changed in the setup, though, so it will happen again.
>
>
> Anton.
>
On Tue, Mar 02, 2010 at 07:05:08PM +0100, Anton Starikov wrote:
>
> On Mar 2, 2010, at 6:52 PM, J. Bruce Fields wrote:
>
> >>
> >>> That's a pretty old kernel; I'd file a bug with CentOS.
> >>
> >> Unfortunately, with newer kernels this setup is even more problematic. :)
> >
> > Any details?
> >
> > As a rule, this list is probably going to be a better place to handle
> > bugs with the latest upstream kernels, and your distributor is more
> > likely to be useful for their kernels.
>
> I submitted that to this list about a year ago. It seems that one of the
> biggest issues is that with an NFSv3 root and NFSv4 /home, idmapd gets
> deadlocked. To resolve NFSv4 credentials it needs to access NFSv3, which
> is blocked waiting for the NFSv4 operation to finish. I tried to move a lot
> of stuff to tmpfs, but that didn't resolve the situation as long as root is
> still NFSv3.
Do you have a pointer to the previous discussion?
--b.
>
> My general observation is that there is a trend: the newer the kernel, the
> faster you get a deadlock with this setup :)
>
> Anton.
>
>
>
>
What can I do to debug this problem?
This issue is killing me!
BTW, I have also filed a CentOS bug report.
Anton.