2008-12-07 14:49:22

by mike

Subject: [NFS] Help! NFS broken

I upgraded my Ubuntu Hardy server to Intrepid the night before last.
When I woke up yesterday morning, my main server I use as my ssh
gateway into the others was totally messed up.

My server is FreeBSD. I haven't had to touch it since I set it up.

My clients are 6 Ubuntu servers. All are identical in packages,
configs (I diff'ed /etc), kernel versions, network setup, etc.

Only *one* of the machines is suffering (and of course, one of the
most important ones) - and it isn't even one of the busiest.

I've tried downgrading the kernel on the affected box from 2.6.27-10
to 2.6.27-7. My next attempt will be a kernel .deb from the previous
Ubuntu release...

What is odd is that it works great after a reboot and lasts for a
couple of hours, then stops working. I can umount -l /home and then
try to remount it (see below), but the mount never gets anywhere and
eventually dies with a generic message. I tried to strace -f it, and
that gave me nothing to work with. The FreeBSD server doesn't log
anything useful either. At this point I can still ping and ssh between
the two machines with no problem; it's just NFS that is broken.

Also, while restarting services manually to debug them, I noticed that
portmap seemed to throw a kernel error in my logs once in a while. But
I don't see a connection to portmap when I run the mount command, and
I would assume that if portmap is required for mounting NFS shares,
mount would need to contact it. That could be totally irrelevant,
though.
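
To rule out plain reachability problems with the RPC services (as
opposed to ping and ssh, which already work), a quick check along these
lines could be run from the client. This is only a sketch: the host
name raid01 is taken from the mount output, the port numbers 111
(portmap) and 2049 (nfs) are the conventional ones and are assumed
here, and the function names are made up for illustration. It only
tests that TCP connections are accepted, not that the services answer
RPC calls. mountd typically gets its port assigned through portmap, so
it is not covered.

```python
import socket

# Assumed values: conventional TCP ports for portmap/rpcbind and nfs.
DEFAULT_PORTS = {"portmap": 111, "nfs": 2049}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: tcp_port_open(host, port, timeout)
            for name, port in ports.items()}


# Example (hypothetical usage): print(check_services("raid01"))
```

Running check_services("raid01") on the broken client and on one of the
working ones, and comparing the two results, might show whether the
stall happens before or after the portmap step.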

Any help or insight or request for additional information is
appreciated. On-list or off-list is fine. I will pay someone via
Paypal who can help me resolve this quickly...

[root@lvs01 ~]# mount -vvvv /home
mount: fstab path: "/etc/fstab"
mount: mtab path: "/etc/mtab"
mount: lock path: "/etc/mtab~"
mount: temp path: "/etc/mtab.tmp"
mount: spec: "raid01:/home"
mount: node: "/home"
mount: types: "nfs"
mount: opts: "rsize=8192,rsize=8192,tcp,rw,acregmin=30"
mount: external mount: argv[0] = "/sbin/mount.nfs"
mount: external mount: argv[1] = "raid01:/home"
mount: external mount: argv[2] = "/home"
mount: external mount: argv[3] = "-v"
mount: external mount: argv[4] = "-o"
mount: external mount: argv[5] = "rw,rsize=8192,rsize=8192,tcp,acregmin=30"
mount.nfs: timeout set for Sun Dec 7 06:36:39 2008
mount.nfs: text-based options:
'rsize=8192,rsize=8192,tcp,acregmin=30,addr=10.13.220.94'
(It just stalls here; normally the connection is near instant.
Eventually it dies with a generic error message. I can Ctrl-C out of
it too, so it's not frozen completely.)
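
One detail visible in the option string above: rsize=8192 appears twice
and there is no wsize. Whether that is a typo for wsize in fstab is an
assumption, and it may well be unrelated to the hang, but it is easy to
check for mechanically. A small sketch (the helper names are made up
for illustration):

```python
from collections import Counter


def option_names(opts):
    """Split a comma-separated mount option string into option names."""
    return [item.split("=", 1)[0] for item in opts.split(",") if item]


def duplicated_options(opts):
    """Return option names that appear more than once, in first-seen order."""
    counts = Counter(option_names(opts))
    return [name for name in counts if counts[name] > 1]


# Option string copied from the "mount: opts:" line above.
opts = "rsize=8192,rsize=8192,tcp,rw,acregmin=30"
print(duplicated_options(opts))  # ['rsize']
```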



thanks...

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that [email protected] is being discontinued.
Please subscribe to [email protected] instead.
http://vger.kernel.org/vger-lists.html#linux-nfs



2008-12-08 23:39:05

by J. Bruce Fields

Subject: Re: [NFS] Help! NFS broken

On Sun, Dec 07, 2008 at 06:48:52AM -0800, mike wrote:
> I upgraded my Ubuntu Hardy server to Intrepid the night before last.
> When I woke up yesterday morning, my main server I use as my ssh
> gateway into the others was totally messed up.
>
> My server is FreeBSD. I haven't had to touch it since I set it up.
>
> My clients are 6 Ubuntu servers. All are identical in packages,
> configs (I diff'ed /etc), kernel versions, network setup, etc.

So the "server" in the first paragraph is an NFS client, and its NFS
server is the FreeBSD machine?

And what are the first symptoms? Any threads accessing the NFS
filesystem just hang? A sysrq-T trace on the client showing where
they're hanging might be helpful.
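
For reference, one way to request that trace without console access is
to write the letter t to /proc/sysrq-trigger as root; the kernel then
dumps the state of every task to the kernel log. A minimal Python
sketch follows, assuming root privileges and a kernel with the sysrq
interface available. The helper name is made up for illustration, and
the path argument exists only so the function can be tried against a
scratch file first.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux path; writing needs root


def trigger_sysrq(command="t", path=SYSRQ_TRIGGER):
    """Write one sysrq command letter; 't' requests a dump of all task states."""
    if len(command) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(path, "w") as f:
        f.write(command)


# Example (as root, on the hung client): trigger_sysrq("t")
```

The resulting trace should then show up in dmesg or the kernel log
file. On a busy machine the dump can be long.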

--b.



2008-12-09 00:13:09

by mike

Subject: Re: [NFS] Help! NFS broken

On Mon, Dec 8, 2008 at 3:38 PM, J. Bruce Fields wrote:

> So the "server" in the first paragraph is an NFS client, and its NFS
> server is the FreeBSD machine?

Yes

> And what are the first symptoms? Any threads accessing the NFS
> filesystem just hang? A sysrq-T trace on the client showing where
> they're hanging might be helpful.

Honestly, these are production machines. I looked everywhere I could
think of for hints and found nothing, and I can't really use this box
for testing either. What is odd is that identically configured
machines (down to the same files in /etc, the same packages from
dpkg -l, etc.) have no issue.
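
If the hang does come back, something lighter than a full sysrq-T dump
might still give a hint: listing the processes stuck in uninterruptible
sleep (state D) together with the kernel function they are waiting in.
The sketch below reads /proc/<pid>/stat and /proc/<pid>/wchan. It is
only an illustration with made-up helper names, it looks at processes
rather than individual threads, and wchan may be empty or 0 depending
on the kernel configuration.

```python
import os


def parse_stat(text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm, wchan) tuples for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""
        found.append((pid, comm, wchan))
    return sorted(found)


# Example: for pid, comm, wchan in blocked_tasks(): print(pid, comm, wchan)
```

Processes touching /home that sit in state D for a long time would be
the ones worth reporting.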

As a last resort, I tried different kernel versions (from Ubuntu):

linux-image-2.6.27-10-server - broken
linux-image-2.6.27-7-server - broken
linux-image-2.6.28-2-server - broken, I think, though I only ran it
briefly and might be wrong (I wound up deciding to go back instead of
forward)
linux-image-2.6.24-16-server - working for 2 days now, so sticking with it

Note that the other nodes (5 of the 6 identical nodes; this was the
one bad one) are running the default Intrepid kernel at the moment,
linux-image-2.6.27-10-server, and don't seem to be suffering from any
issues.

So it seems to be a combination of those kernels + that machine. The
problem is that the machine's configuration is identical: same
nfs-utils, portmap, etc. From my rsync scan, even the majority of
files in /etc (and anything that should be relevant) are identical
too.

This might be an Ubuntu bug, or something flaky with 2.6.27 (maybe
2.6.28 too) and NFS in general, but I don't know how I can produce any
worthwhile debugging output, especially considering this is in
production. As of this writing I haven't seen a fix; the kernel
downgrade appears to be the workaround for now.

Sorry I can't be more help. At the time of my first email I had the
luxury of a broken setup to debug, but now that I've stabilized it I
have to keep it this way :)
