Hi all, I am having trouble determining whether my NFS servers or my clients are the bottleneck. I have 10 web servers hooked up to 2 NFS servers.
The clients all run Red Hat 7.2 with kernel 2.4.18 plus the NFS_ALL patch applied; the servers are Red Hat 7.2 with the 2.4.18 kernel. I am using NFS over UDP.
My problem is that "vmstat 1" on the web servers shows many waiting processes. See below.
I can't get any kind of good reading on whether or not an NFS server is overloaded. Besides iostat and nfsstat, are there any good tools?
I am going crazy trying to test. Please help. If you are an NFS guru, feel free to give me a call at (801) 361-1177 (cell phone). I am happy to pay!
PS - The numbers below were taken at 10:30 PM, so traffic is NOT at its peak; you can expect a 50% traffic increase at midday.
Any help would really be appreciated! Thanks!
r b w swpd free buff cache si so bi bo in cs us sy id
8 1 0 9668 103728 28624 456756 0 0 0 38 588 286 29 3 68
11 0 0 9668 103196 28624 457068 0 0 0 20 617 221 14 5 81
9 2 0 9668 103172 28624 456988 0 0 0 12 267 104 13 2 85
The 'r' (run queue) column at the top left is the problem; it spikes whenever NFS traffic goes way up (a crude logging loop follows the nfsstat output below). The nfsstat -c output for a client looks like this:
Client nfs v3:
null getattr setattr lookup access readlink
19 0% 29862505 67% 5171 0% 6976413 15% 16064 0% 0 0%
read write create mkdir symlink mknod
7697391 17% 2287 0% 10 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 53 0%
fsstat fsinfo pathconf commit
5 0% 5 0% 0 0% 2287 0%
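Here is the crude loop I mentioned (an untested sketch; the interval and log path are arbitrary). It records the run queue next to the client RPC totals:

while : ; do
    date
    vmstat 1 2 | tail -1    # last line is the current one-second sample
    nfsstat -c | head -4    # leading lines hold total RPC calls/retransmits
    sleep 10
done >> /tmp/nfswatch.log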
All the clients look pretty much the same. The nfsstat -s output for the two NFS servers is below.
Server nfs v2:
null getattr setattr root lookup readlink
0 0% 4818 0% 457163 19% 0 0% 374516 16% 0 0%
read wrcache write create remove rename
1006 0% 0 0% 1338999 57% 147627 6% 550 0% 0 0%
link symlink mkdir rmdir readdir fsstat
0 0% 0 0% 6923 0% 73 0% 4795 0% 5 0%
Server nfs v3:
null getattr setattr lookup access readlink
12 0% 2487236 38% 22870 0% 2451861 38% 19225 0% 0 0%
read write create mkdir symlink mknod
971932 15% 274775 4% 31984 0% 1212 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
14422 0% 391 0% 10524 0% 0 0% 24138 0% 0 0%
fsstat fsinfo pathconf commit
8 0% 7 0% 0 0% 95129 1%
And the other server:
Server nfs v2:
null getattr setattr root lookup readlink
0 0% 54088 0% 453322 1% 0 0% 23253032 55% 0 0%
read wrcache write create remove rename
1964341 4% 0 0% 13001439 31% 449191 1% 676267 1% 21247 0%
link symlink mkdir rmdir readdir fsstat
0 0% 0 0% 15479 0% 65541 0% 1647071 3% 51 0%
Server nfs v3:
null getattr setattr lookup access readlink
0 0% 416318245 2% 2678145 0% 461314618 2% 127706435 3% 0 0%
read write create mkdir symlink mknod
184416498 1% 22576823 1% 2622952 0% 92490 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
1550770 0% 61184 0% 1673675 0% 0 0% 5333994 0% 180 0%
fsstat fsinfo pathconf commit
187 0% 129 0% 0 0% 10053536 0%
iostat 5 for the first NFS server:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dev3-0 1.00 0.00 30.40 0 152
dev8-0 48.60 273.60 516.80 1368 2584
iostat 5 for the second NFS server:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dev3-0 0.40 0.00 17.60 0 88
dev8-0 70.60 752.00 132.80 3760 664
On Tuesday June 4, [email protected] wrote:
> I can't get any kind of good reading on whether or not an NFS server
> is overloaded. Besides iostat and nfsstat, are there any good tools?
> I am going crazy trying to test. Please help.
Try:
# cat /proc/net/rpc/nfsd
I get:
rc 19779 29251958 135512291
fh 2376 163053154 6932683 1437 0
io 3259963919 553169354
th 94 21542 3554.490 1121.770 915.500 831.020 341.850 210.850 141.590 52.500 56.460 564.230
ra 256 4059797 19063 10367 7587 5209 3618 2408 1820 1396 734 498054
net 164789163 164789562 0 0
rpc 164785911 2835 2835 0 0
proc2 18 1057931 115172462 24952465 0 13386753 143713 5627467 0 3166880 338554 282079 89827 33325 9448 8796 5118 384934 122775
proc3 22 1 0 47 1271 1315 0 657 169 36 0 0 0 10 0 32 4 79 0 19 0 0 30
The "th" line might be interesting. It shows how many total seconds
that 0-10%, or 10-20% or ... of the threads were busy.
Note that the early numbers are orders of magnitude larger than the
later numbers. This is good. What do you get?
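To watch that histogram move over time, a simple loop over the same
proc file does it:

# print the nfsd thread-busy histogram once a second
while : ; do grep '^th' /proc/net/rpc/nfsd ; sleep 1 ; done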
Try:
while : ; do netstat -n -u -a | grep :2049 ; sleep 1; done
I get:
udp 348 0 0.0.0.0:2049 0.0.0.0:*
udp 1740 0 0.0.0.0:2049 0.0.0.0:*
udp 0 0 0.0.0.0:2049 0.0.0.0:*
udp 0 0 0.0.0.0:2049 0.0.0.0:*
udp 2088 0 0.0.0.0:2049 0.0.0.0:*
udp 1044 0 0.0.0.0:2049 0.0.0.0:*
udp 0 0 0.0.0.0:2049 0.0.0.0:*
udp 2436 0 0.0.0.0:2049 0.0.0.0:*
^^^^^^
Note this column of numbers (the Recv-Q, i.e. bytes waiting in the UDP
receive queue). If it hits a ceiling at 65536, you have a problem that
can be fixed. See the NFS FAQ, or show me the numbers and I will explain.
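If you would rather catch the peak than eyeball it, feed the same loop
through a single awk process so the maximum persists across samples
(a rough, untested sketch):

while : ; do netstat -n -u -a | grep :2049 ; sleep 1 ; done |
    awk '$2 > max { max = $2; print "new peak Recv-Q:", max }'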
NeilBrown
> while : ; do netstat -n -u -a | grep :2049 ; sleep 1; done
>
> I get:
> udp 348 0 0.0.0.0:2049 0.0.0.0:*
>
>
> udp 2436 0 0.0.0.0:2049 0.0.0.0:*
>
> ^^^^^^
>
> Note this column of numbers (the Recv-Q, i.e. bytes waiting in the UDP
> receive queue). If it hits a ceiling at 65536, you have a problem that
> can be fixed. See the NFS FAQ, or show me the numbers and I will explain.
Interesting - In the half hour or so that I've watched our NFS
server, this number has hit 65406. It isn't clear to me what section of the
NFS FAQ (http://nfs.sourceforge.net/) is relevant to this test.
The other thing in this post that caught my eye was the mention of the
2.4.18 NFS_ALL patch. I presume this is the patch located at
http://www.cse.unsw.edu.au/~neilb/patches/linux-stable/2.4.18/patch-Bd-NfsdAll.gz
How does this relate, if at all, to the post from Hirokazu Takahashi:
http://www.geocrawler.com/archives/3/789/2002/4/0/8479569/ which also
seems to be related to BKL removal? Of particular interest to me in
Hirokazu's post is the mention of fixes for some kernel Oopses that I
think we might be seeing.
We are running a Red Hat 7.2 box with kernel 2.4.18, XFS 1.1, and
nfs-utils-0.3.3, and we have had all sorts of strange Oopses and hangs.
Are there other patches we should consider applying to a 2.4.18 NFS
server? Does the NFS_ALL patch address any stability issues, or is it
mostly to add V3TCP?
Unfortunately, we are stuck at 2.4.18 until we can get rid of XFS. (I
think XFS is the source of most of our problems, but we are starting to
look at NFS as well.)
Thanks for any insight...
-poul
>>>>> " " == Poul Petersen <[email protected]> writes:
> Does the NFS_ALL patch address any stability issues, or
> is it mostly to add V3TCP?
Nope. NFS_ALL has bugger all to do with adding TCP... The NFS client
in the standard kernel already has full support for NFS over TCP.
For info on what is in the NFS_ALL patches see the file HEADER.html in
the same directory you found the patch...
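(Assuming the patch came from the 2.4.18 directory Poul cited above,
something like

wget http://www.cse.unsw.edu.au/~neilb/patches/linux-stable/2.4.18/HEADER.html

will pull that summary down.)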
Cheers,
Trond
On Wednesday June 5, [email protected] wrote:
> Interesting - In the half hour or so that I've watched our NFS
> server, this number has hit 65406. It isn't clear to me what section of the
> NFS FAQ (http://nfs.sourceforge.net/) is relevant to this test.
5.4. Memory Limits on the Input Queue
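The workaround described there is, roughly, to enlarge the default
socket receive buffer before (re)starting nfsd and then restore it.
A sketch (the 256k figure is illustrative; note your existing settings
first so you can put them back):

cat /proc/sys/net/core/rmem_default /proc/sys/net/core/rmem_max  # save these
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/rmem_max
/etc/rc.d/init.d/nfs restart    # Red Hat init script path; adjust to taste
# ... then echo the saved values back into rmem_default and rmem_max

nfsd sizes its socket buffer when it starts, so restoring the defaults
afterwards does not shrink the queue it has already allocated.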
NeilBrown