From: Dan Stromberg
Subject: Re: NFS server hang, looking for suggestions
Date: Thu, 21 Apr 2005 11:09:10 -0700
Message-ID: <1114106950.27207.130.camel@seki.nac.uci.edu>
References: <426735CC.9020900@pacbell.net>
In-Reply-To: <426735CC.9020900@pacbell.net>
To: Kenneth Sumrall
Cc: strombrg@dcs.nac.uci.edu, nfs@lists.sourceforge.net

You may find that if you back it up and repartition into a bunch of
2 terabyte partitions, you'll get something more stable.  If that works,
then you can be pretty much assured that the problem is somehow related
to the 2T limit.  If you cannot give up your single 2T+ slice, you might
try going to x86-64 or PowerPC, or even sparcv9.

You probably should check your logs, if you haven't already.

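For a rough sense of where a "2T limit" like the one mentioned above can
come from, here is a minimal sketch of the arithmetic.  It assumes a
32-bit sector count and the typical 512-byte sector; both are
illustrative assumptions, not a diagnosis of this particular server, and
the helper names are hypothetical.

```python
# Sketch only: a 32-bit sector count times a 512-byte sector.
# Both constants are assumptions made for illustration.
SECTOR_BYTES = 512       # assumed: the typical sector size
MAX_SECTORS = 2 ** 32    # assumed: sector numbers stored in 32 bits


def max_addressable_bytes(sector_bytes=SECTOR_BYTES, max_sectors=MAX_SECTORS):
    """Largest device size reachable with the given sector count."""
    return sector_bytes * max_sectors


def partitions_needed(total_bytes, limit_bytes):
    """How many partitions of at most limit_bytes cover total_bytes."""
    return -(-total_bytes // limit_bytes)  # ceiling division


limit = max_addressable_bytes()
raid_bytes = 5_600_000_000_000  # 5.6 TB, read as decimal terabytes (assumed)

print(limit)                                 # 2199023255552 bytes, i.e. 2 TiB
print(partitions_needed(raid_bytes, limit))  # 3
```

Under those assumptions, the 5.6 TB unit would need at least three
partitions to stay under the limit.
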
Some URLs that you might find (somewhat :) relevant:

http://dcs.nac.uci.edu/~strombrg/Problem-solving-on-unix-linux-systems.html

http://dcs.nac.uci.edu/~strombrg/NFS-troubleshooting-2.html

http://dcs.nac.uci.edu/~strombrg/crashy-system.html

http://dcs.nac.uci.edu/~strombrg/RAID-notes.html

FWIW, we had some very bad experiences with both Lustre and GFS, in
combination with NFS, on some x86 (32 bit) systems with SuperMicro
motherboards.  We never figured out definitively if it was related to
(both of) the distributed filesystems, the 3Ware RAID controllers, the
Maxtor disks, the SuperMicro motherboards, or some combination.  Anyway,
IBM is going to take their hardware back, and give us our money back in
return...

On Wed, 2005-04-20 at 22:10 -0700, Kenneth Sumrall wrote:
> At work, we have a very large (5.6 TB) SCSI raid unit, which is formatted
> as 1 XFS filesystem.  It is connected to a SuperMicro 6012P-6 dual CPU
> Pentium-4 server.  The server is running on Suse 9.2, but we've upgraded
> the kernel from the 2.6.8 that shipped with it to 2.6.11.7 from kernel.org.
> The server exports the XFS filesystem using the kernel NFSD Version 3.
>
> The machine has recently been hanging on a regular basis.  We think it's
> related to NFS as the hangs often occur during a time in our nightly builds
> when a bunch of machines are all writing data to the server at the same time.
> However, sometimes the hangs occur when the write load is not as heavy.
>
> The things we've tried are:
>    Swapped the server box with a spare, just to make sure it's not a
>    hardware problem.
>
>    Tried booting with "nosmp noapic" in case SMP was causing us problems.
>
>    Updated to 2.6.11.7, because I read about a problem exporting XFS over
>    NFS in 2.6.8.  One thing I'm not clear on: could the 2.6.8 XFS over NFS
>    bug cause XFS filesystem corruption?  Should I run xfs_check on
>    my XFS filesystem?
>
>    We recently re-cabled a bunch of the clients for this machine, and in the
>    process, removed a choke point where 13 of our clients were funnelled
>    through a 100 Mb/s ethernet switch.  That could have caused major
>    fragmentation issues, which I've read are a bad thing.  It's only been
>    1 day since we did that, so no data yet on whether things are better.
>
> Other things to note.  Because the RAID is so big, we are running XFS directly
> on the raw disk device, not a partition.  The partition format seems to have
> problems with sizes over 2 terabytes.  Of course, I had to turn on CONFIG_LBD
> in order to access such a large block device.
>
> The ethernet interface is an e1000 gigabit interface.  It plugs directly into
> our main Foundry ethernet switch.  The clients all have 100 Mbit interfaces,
> but there's a bunch of them.
>
> Also, the RAID uses a sector size of 2048 bytes, not the typical 512 bytes.
> The SCSI controller in the server is an Adaptec Ultra160 chip, and we're using
> the aic7xxx driver.
>
> Does anyone have any suggestions on how to further diagnose our problem?  I've
> not used magic sysrq before, but I'm thinking maybe trying to dump a list of
> current tasks, and the registers, might be useful to see if it hangs in the
> same place every time.  Or I could apply the KGDB patch, and try using that.
>
> Does anyone have any other ideas on how to diagnose this?  Any known problems
> I'm not aware of?  I'd really like to make this server rock solid.
>
> Thanks.
>
> Ken Sumrall
> ksumrall@pacbell.net
>
> _______________________________________________
> NFS maillist  -  NFS@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nfs
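
On the magic SysRq idea in the quoted message: a minimal sketch of
requesting the task list ('t') and register dump ('p') through
/proc/sysrq-trigger.  It assumes a kernel built with CONFIG_MAGIC_SYSRQ
and a root shell that still responds; the function names are
hypothetical, and the output lands in the kernel log, not on stdout.

```python
# Sketch only: ask the kernel for SysRq dumps via /proc/sysrq-trigger.
# Assumes CONFIG_MAGIC_SYSRQ is enabled and the caller is root.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write one SysRq command character to the trigger file."""
    if len(key) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


def dump_tasks_and_registers(trigger_path=SYSRQ_TRIGGER):
    """Request the task list ('t'), then the register dump ('p')."""
    for key in ("t", "p"):
        trigger_sysrq(key, trigger_path)
```

If the machine is hung hard enough that no shell can run, this route is
not available and the keyboard combination (Alt-SysRq-T, Alt-SysRq-P) on
the console is the fallback.  Comparing the dumps from several hangs
should show whether it stops in the same place each time.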