2002-06-11 03:48:24

by Simon Matthews

Subject: NFS Client mis-behaviour?

I have seen some strange behaviour from the NFS client.

We installed a new machine with 2 x 2.2GHz Xeon processors, based on the Intel
E7500 chipset. The Ethernet interface is an on-board Intel 8255x-based device.
The kernel is 2.4.18 with the 3.5GB VM patch installed.

When running a large job on this machine, the job would start and after a
short time, the CPU utilization would drop and the large job would clearly
hang. Also, "df" would hang and any attempts to list the directory that
contained the data would hang. Eventually, after about 10-15 minutes, it
would go back to normal (until the large job was started again).

Clearly the NFS client was unable to access the data directory.

Solution: the Ethernet interface was connected to a switch that only
supports half-duplex. Connecting to a full-duplex switch solved the
problem. However, it does seem that the NFS client was not handling the
situation well.
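
A duplex mismatch like the one described can be checked from the client
side. The following is a minimal sketch, assuming a kernel recent enough to
expose the "speed" and "duplex" attributes under /sys/class/net (the 2.4.18
kernel in this report predates sysfs, so this would not apply there). The
function names link_settings and is_half_duplex are hypothetical, chosen
only for illustration.

```python
import os


def link_settings(iface, sysfs_root="/sys/class/net"):
    """Read the negotiated speed and duplex of a network interface.

    Assumes a Linux kernel that exposes these attributes in sysfs.
    Returns None for an attribute that cannot be read (for example
    when the link is down or the file does not exist).
    """
    base = os.path.join(sysfs_root, iface)
    result = {}
    for attr in ("speed", "duplex"):
        try:
            with open(os.path.join(base, attr)) as f:
                result[attr] = f.read().strip()
        except OSError:
            result[attr] = None
    return result


def is_half_duplex(settings):
    """True when the interface reports a half-duplex link."""
    return settings.get("duplex") == "half"


# Example use on a live system (the interface name is an assumption):
#   print(link_settings("eth0"))
```

Comparing this against the switch port configuration may show whether both
ends agree on the duplex setting.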

Another question: this system has 2 CPUs, yet the kernel detects 4. Any
ideas why?

Simon



2002-06-11 05:16:34

by Andreas Dilger

Subject: Re: NFS Client mis-behaviour?

On Jun 10, 2002 20:48 -0700, Simon Matthews wrote:
> I have seen some strange behaviour from the NFS client.

Sorry, can't help with the NFS problem.

> We installed a new machine, 2 x 2.2GHz Xeon processors
>
> Another question: this system has 2 CPUs, yet the kernel detects 4. Any
> ideas why?

Probably because they are "Hyper-Threaded" (HT) CPUs, which means each chip
presents (almost) 2 CPU cores to the kernel. The second core is not a full CPU,
so it is not as fast as running a 4 CPU system, but reports suggest it is about
30% faster than just 2 CPUs.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-06-11 14:31:29

by Trond Myklebust

Subject: Re: NFS Client mis-behaviour?

>>>>> " " == Simon Matthews writes:

> Solution: the Ethernet interface was connected to a switch that
> only supports half-duplex. Connecting to a full-duplex switch
> solved the problem. However, it does seem that the NFS client
> was not handling the situation well.

The NFS client neither knows nor cares what is going on down in the
ethernet layer. As far as it is concerned, you might as well be using
semaphore to pass messages between the computers.

All the NFS client needs to know is that it should retry the socket
sendmsg() operation when a certain (user defined) timeout value is
reached.

Cheers,
Trond
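
The retry-on-timeout behaviour described above can be sketched roughly as
follows. This is an illustration only, not the kernel's RPC code. It assumes
the UDP behaviour documented for the "timeo" mount option (initial timeout
in tenths of a second, doubled after each unanswered request up to a cap of
60 seconds) and a hard mount that never gives up. The names
call_with_retry, send and recv are hypothetical.

```python
def call_with_retry(send, recv, timeo=7, cap=600):
    """Retransmit a request until a reply arrives.

    send()      transmits the request once.
    recv(wait)  waits up to `wait` tenths of a second and returns the
                reply, or None if nothing arrived in time.

    The timeout doubles after every miss, up to `cap` (an assumption
    modelled on the documented UDP behaviour of the timeo option).
    Like a hard mount, the loop never gives up on its own.
    """
    wait = timeo
    attempts = 0
    while True:
        send()
        attempts += 1
        reply = recv(wait)
        if reply is not None:
            return reply, attempts
        wait = min(wait * 2, cap)
```

Under these assumptions, a link that keeps dropping requests pushes the wait
towards the cap, so each further attempt can take up to a minute.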

2002-06-11 16:36:54

by Simon Matthews

Subject: Re: NFS Client mis-behaviour?

Trond,

I realize that the transport is hidden from NFS. However, in the situation I
described, the NFS client did not behave well: it seemed to lock up totally
for a period of 10-15 minutes.

Other packets were able to make it into and out of the machine: I could
telnet/ssh/rlogin. The user could not interrupt the process, despite the
fact that the mount options included "intr".

My point is that the use of half-duplex may prevent the NFS client from
sending or receiving (probably sending) some packets. But, since the
processes that caused the load had stopped doing anything and other packets
were passing in and out, the NFS client should have been able to recover
earlier.

Simon


At 04:31 PM 6/11/02 +0200, Trond Myklebust wrote:
> >>>>> " " == Simon Matthews writes:
>
> > Solution: the Ethernet interface was connected to a switch that
> > only supports half-duplex. Connecting to a full-duplex switch
> > solved the problem. However, it does seem that the NFS client
> > was not handling the situation well.
>
>The NFS client neither knows nor cares what is going on down in the
>ethernet layer. As far as it is concerned, you might as well be using
>semaphore to pass messages between the computers.
>
>All the NFS client needs to know is that it should retry the socket
>sendmsg() operation when a certain (user defined) timeout value is
>reached.
>
>Cheers,
> Trond

2002-06-12 07:55:12

by Alan

Subject: Re: NFS Client mis-behaviour?

> Another question: this system has 2 CPUs, yet the kernel detects 4. Any
> ideas why?

Hyperthreading - each new Xeon appears as 2 CPUs, and one execution path can
make use of execution resources when the other stalls.
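
One way to see the difference between logical and physical CPUs is to count
both in /proc/cpuinfo. The sketch below assumes the kernel reports a
"physical id" field for each "processor" entry, which is an assumption that
may not hold on older kernels such as the 2.4.18 discussed here. The
function name count_cpus is hypothetical.

```python
def count_cpus(cpuinfo_text):
    """Count logical CPUs and physical packages in /proc/cpuinfo text.

    Returns (logical, physical). `physical` is None when the kernel
    does not report any "physical id" field.
    """
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value.strip())
    return logical, (len(packages) or None)


# Example use on a live Linux system:
#   with open("/proc/cpuinfo") as f:
#       print(count_cpus(f.read()))
```

On a machine like the one described, this would be expected to report 4
logical CPUs in 2 physical packages, if the field is present.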

2002-06-12 11:58:00

by Trond Myklebust

Subject: Re: NFS Client mis-behaviour?

On Tuesday 11 June 2002 18:28, Simon Matthews wrote:

> Other packets were able to make it into and out of the machine: I could
> telnet/ssh/rlogin. The user could not interrupt the process, despite the
> fact that the mount options included "intr".

There is a well known problem with 'intr': if one process is waiting on the
page lock, then there is no provision for interrupting (that's a known
weakness with the MM layer).
Since taking the page lock is usually done by some process that wants to read
from a page, the usual cause of such a hangup is the fact that some other
process is in the middle of an NFS READ. For this reason, if you kill *all*
READ operations (by doing 'killall -9 rpciod') then you can usually recover.
That's something that is only possible for 'root' though...

> My point is that the use of half-duplex may prevent the NFS client from
> sending or receiving (probably sending) some packets. But, since the
> processes that caused the load had stopped doing anything and other packets
> were passing in and out, the NFS client should have been able to recover
> earlier.

As I said, all the client is required to do is to retry (unless it gets
interrupted). I'm not sure what else you mean by 'recover' in the above
sentence.

Cheers,
Trond