2005-04-21 05:10:42

by Kenneth Sumrall

Subject: NFS server hang, looking for suggestions

At work, we have a very large (5.6 TB) SCSI RAID unit, which is formatted
as a single XFS filesystem. It is connected to a SuperMicro 6012P-6 dual-CPU
Pentium-4 server. The server is running SUSE 9.2, but we've upgraded
the kernel from the 2.6.8 that shipped with it to 2.6.11.7 from kernel.org.
The server exports the XFS filesystem using the kernel NFSD, version 3.

The machine has recently been hanging on a regular basis. We think it's
related to NFS as the hangs often occur during a time in our nightly builds
when a bunch of machines are all writing data to the server at the same time.
However, sometimes the hangs occur when the write load is not as heavy.

The things we've tried so far:

- Swapped the server box with a spare, just to make sure it's not a
  hardware problem.

- Booted with "nosmp noapic" in case SMP was causing us problems.

- Updated to 2.6.11.7, because I read about a problem exporting XFS over
  NFS in 2.6.8. One thing I'm not clear on: could the 2.6.8 XFS-over-NFS
  bug have caused XFS filesystem corruption? Should I run xfs_check on my
  XFS filesystem?

We recently re-cabled a bunch of the clients for this machine and, in the
process, removed a choke point where 13 of our clients were funnelled through
a 100 Mbit/s Ethernet switch. That could have caused major fragmentation issues,
which I've read are a bad thing. It's only been one day since we did that, so
there's no data yet on whether things are better.
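
One way to get data on this sooner is to watch the kernel's IP reassembly
counters on the server. The sketch below assumes the fragmentation in
question is IP fragmentation of large NFS-over-UDP packets (the post does
not say so explicitly) and that the counters are exposed in /proc/net/snmp
in the usual paired header/value layout. The function names are
illustrative only.

```python
# Sketch: report IP reassembly counters from /proc/net/snmp.
# Assumption: "fragmentation" here means IP fragments of NFS-over-UDP traffic.

def parse_snmp(text):
    """Parse /proc/net/snmp text into {protocol: {counter_name: value}}."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The file comes as pairs of lines: a header line with counter names,
    # then a line with the matching values, both prefixed by "Proto:".
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            continue  # unexpected layout, skip this pair
        stats[proto] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return stats


def reassembly_summary(stats):
    """Pick out the IP reassembly counters (0 if a counter is missing)."""
    ip = stats.get("Ip", {})
    return {name: ip.get(name, 0)
            for name in ("ReasmReqds", "ReasmOKs", "ReasmFails")}


if __name__ == "__main__":
    try:
        with open("/proc/net/snmp") as f:
            print(reassembly_summary(parse_snmp(f.read())))
    except OSError as exc:
        print(f"could not read /proc/net/snmp: {exc}")
```

If ReasmFails keeps climbing between two runs taken during the nightly
builds, fragments are probably being lost somewhere on the path. A flat
counter would suggest looking elsewhere.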

Other things to note. Because the RAID is so big, we are running XFS directly
on the raw disk device, not a partition. The partition format seems to have
problems with sizes over 2 terabytes. Of course, I had to turn on CONFIG_LBD
in order to access such a large block device.
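
As a rough illustration of where a 2-terabyte limit can come from: this
assumes the partition format in question is the classic MS-DOS (MBR)
table, which stores sector numbers in 32-bit fields. The post does not
name the format, so treat this as a sketch of the arithmetic, not a
diagnosis.

```python
# Sketch: largest size addressable with 32-bit sector numbers.
# Assumption: the partition table is the classic MS-DOS/MBR layout.

TIB = 2 ** 40  # one tebibyte in bytes


def max_addressable_bytes(sector_size, lba_bits=32):
    """Largest byte count reachable with lba_bits-wide sector numbers."""
    return (2 ** lba_bits) * sector_size


for sector_size in (512, 2048):
    limit = max_addressable_bytes(sector_size)
    print(f"{sector_size}-byte sectors: {limit // TIB} TiB")
```

With 512-byte sectors this works out to 2 TiB. Larger sectors raise the
figure on paper, but whether a given kernel or partitioning tool honours
that is a separate question.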

The ethernet interface is an e1000 gigabit interface. It plugs directly into
our main Foundry ethernet switch. The clients all have 100 Mbit interfaces, but
there's a bunch of them.

Also, the RAID uses a sector size of 2048 bytes, not the typical 512 bytes.
The SCSI controller in the server is an Adaptec Ultra160 chip, and we're using
the aic7xxx driver.

Does anyone have any suggestions on how to further diagnose our problem? I've
not used magic sysrq before, but I'm thinking that dumping the list of
current tasks and the registers might be useful, to see if it hangs in the
same place every time. Or I could apply the KGDB patch, and try using that.
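
A minimal sketch of the sysrq idea, for the case where the box is still
responsive enough to run a command as root: writing a command letter to
/proc/sysrq-trigger asks the kernel to dump the requested information to
the kernel log. The helper name and the table of letters are illustrative,
and this assumes the kernel was built with CONFIG_MAGIC_SYSRQ.

```python
# Sketch: trigger a magic-sysrq dump from userspace (needs root).
# Assumption: CONFIG_MAGIC_SYSRQ is enabled and /proc/sysrq-trigger exists.

SYSRQ_COMMANDS = {
    "t": "dump the list of current tasks",
    "p": "dump the current registers and flags",
}


def trigger_sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Write a single sysrq command letter to the trigger file."""
    if command not in SYSRQ_COMMANDS:
        raise ValueError(f"unsupported sysrq command: {command!r}")
    with open(trigger_path, "w") as f:
        f.write(command)
    return SYSRQ_COMMANDS[command]
```

Calling trigger_sysrq("t") and then reading the kernel log (for example
with dmesg) should show the task dump. If the machine is hung hard enough
that nothing in userspace runs, the same letters would have to be sent
from the console keyboard or a serial console instead.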

Does anyone have any other ideas on how to diagnose this? Any known problems
I'm not aware of? I'd really like to make this server rock solid.

Thanks.

Ken Sumrall
[email protected]



-------------------------------------------------------
This SF.Net email is sponsored by: New Crystal Reports XI.
Version 11 adds new functionality designed to reduce time involved in
creating, integrating, and deploying reporting solutions. Free runtime info,
new features, or free trial, at: http://www.businessobjects.com/devxi/728
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-04-21 18:09:13

by Dan Stromberg

Subject: Re: NFS server hang, looking for suggestions


You may find that if you back it up and repartition into a bunch of
2-terabyte partitions, you'll get something more stable. If that
works, then you can be pretty much assured that the problem is somehow
related to the 2T limit. If you cannot give up your single 2T+ slice,
you might try going to x86-64 or PowerPC, or even sparcv9.

You probably should check your logs, if you haven't already.

Some URLs that you might find (somewhat :) relevant:

http://dcs.nac.uci.edu/~strombrg/Problem-solving-on-unix-linux-systems.html

http://dcs.nac.uci.edu/~strombrg/NFS-troubleshooting-2.html

http://dcs.nac.uci.edu/~strombrg/crashy-system.html

http://dcs.nac.uci.edu/~strombrg/RAID-notes.html

FWIW, we had some very bad experiences with both Lustre and GFS, in
combination with NFS, on some x86 (32 bit) systems with SuperMicro
motherboards. We never figured out definitively if it was related to
(both of) the distributed filesystems, the 3Ware RAID controllers, the
Maxtor disks, the SuperMicro motherboards, or some combination. Anyway,
IBM is going to take their hardware back, and give us our money back in
return...


