I've been trying to troubleshoot/tune around a problem I have been
experiencing where if the NFS server is under heavy write load (the
disks are saturated), client side NFS performance drops to nearly 0.
As soon as the load is lifted so that there is no significant IO wait
time on the server, the clients start acting responsively again.
Server setup:
Fedora 9 kernel 2.6.27.15-78.2.23.fc9.x86_64
NFS tuning - none
Disk system - 230GB SATA RAID1 array on a basic AACRAID adapter
Dual XEONs, 8GB RAM
Network GigE
Client setup:
Fedora 10 kernel 2.6.27.19-170.2.35.fc10.x86_64
NFS tuning - none
Network GigE
Things I have tried:
I have tried playing with the disk scheduler, switching to deadline
from cfq = no difference.
I have tried playing with rsize/wsize settings on the client = no difference.
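For reference, those experiments were along these lines (the scheduler
change on the server, the mount options on the client) - the device name,
server name and rsize/wsize values are just examples, not my exact settings:
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler
umount /opt/export
mount -o rsize=32768,wsize=32768 server:/opt/export /opt/export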
Steps to reproduce:
1. Write a big file to the same partition that is exported on the server:
dd if=/dev/zero of=/opt/export/bigfile bs=1M count=5000 conv=fdatasync
2. Write a small file to the same partition from the client:
dd if=/dev/zero of=/opt/export/bigfile bs=16k count=8 conv=fdatasync
I am seeing slightly less than 2kBps (yes, 1000-2000 bytes per second)
from the client while this is happening.
It really looks like the nfs daemons on the server just get no
priority when compared to the local process.
Any ideas? This is something I have noticed for quite some time.
Only thing I can think of trying is to upgrade the disk array so that
the disks are no longer a bottleneck.
-Dave
On Fri, Mar 13, 2009 at 3:29 PM, Trond Myklebust
<[email protected]> wrote:
> On Fri, 2009-03-13 at 14:59 -0700, David Rees wrote:
>> And the activity around the time I am reproducing the slowdown:
>>
>> Server nfs v3:
>> null getattr setattr lookup access readlink
>> 0 0% 3503 75% 7 0% 31 0% 1027 22% 0 0%
>> read write create mkdir symlink mknod
>> 9 0% 50 1% 6 0% 0 0% 0 0% 0 0%
>> remove rmdir rename link readdir readdirplus
>> 2 0% 0 0% 2 0% 0 0% 0 0% 0 0%
>> fsstat fsinfo pathconf commit
>> 0 0% 0 0% 0 0% 13 0%
>
> Is this the result of only doing 2 'dd' copies? That's a lot of GETATTR
> calls for that kind of workload...
No - the client I have been reproducing this from is also my desktop,
the NFS server hosts my home directory, and that mount was in active
use during the test.
That's where the extreme slowdown in NFS performance affects me the
most. When the heavy IO on the server is going on (even just a
single process writing as fast as it can), my applications (Firefox,
Thunderbird, Gnome Terminals, just about anything that accesses the
NFS mount) will basically lock up and go totally unresponsive while
they wait for the NFS server to respond. They will sit unresponsive
for minutes at a time and are unusable until the heavy IO stops on the
server.
I do software development from this machine and I have timed one of my
project builds with and without the heavy IO on the NFS server - a
build that normally takes about 20 seconds will take 5 minutes to
complete (it does read/write a lot of small files).
-Dave
On Fri, 2009-03-13 at 14:32 -0700, David Rees wrote:
> On Fri, Mar 13, 2009 at 2:10 PM, Trond Myklebust
> <[email protected]> wrote:
> > On Fri, 2009-03-13 at 13:36 -0700, David Rees wrote:
> >> Steps to reproduce:
> >>
> >> 1. Write a big file to the same partition that is exported on the server:
> >> dd if=/dev/zero of=/opt/export/bigfile bs=1M count=5000 conv=fdatasync
> >> 2. Write a small file to the same partition from the client:
> >> dd if=/dev/zero of=/opt/export/bigfile bs=16k count=8 conv=fdatasync
> >
> > You don't need conv=fdatasync when writing to NFS. The close-to-open
> > cache consistency automatically guarantees fdatasync on close().
>
> Yeah, I didn't think I did, but I added it for good measure since I
> needed it to simulate load directly on the server. It doesn't seem to
> affect the numbers significantly either way.
>
> >> I am seeing slightly less than 2kBps (yes, 1000-2000 bytes per second)
> >> from the client while this is happening.
> >
> > UDP transport, or TCP? If the former, then definitely switch to the
> > latter, since you're probably pounding the server with unnecessary RPC
> > retries while it is busy with the I/O.
> >
> > For the same reason, also make sure that -otimeo=600 (the default for
> > TCP).
>
> Yes, both set that way by default. From /proc/mounts on the client:
>
> rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.2.1.13,mountvers=3,mountproto=tcp,addr=10.2.1.13
That all looks correct.
Just a few more random thoughts:
Do you perhaps see a lot of sockets connected to port 2049 and that are
not in the TCP_ESTABLISHED state when you do 'netstat -nt' on the
server?
What is the ratio of 'write' to 'commit' operations when you look at the
'nfsstat -s' output (also on the server)?
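(That is, something along the lines of:
netstat -nt | grep 2049
nfsstat -s
on the server, and then compare the 'write' and 'commit' counters in the
nfs v3 table.)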
Trond
On Fri, Mar 13, 2009 at 2:10 PM, Trond Myklebust
<[email protected]> wrote:
> On Fri, 2009-03-13 at 13:36 -0700, David Rees wrote:
>> Steps to reproduce:
>>
>> 1. Write a big file to the same partition that is exported on the server:
>> dd if=/dev/zero of=/opt/export/bigfile bs=1M count=5000 conv=fdatasync
>> 2. Write a small file to the same partition from the client:
>> dd if=/dev/zero of=/opt/export/bigfile bs=16k count=8 conv=fdatasync
>
> You don't need conv=fdatasync when writing to NFS. The close-to-open
> cache consistency automatically guarantees fdatasync on close().
Yeah, I didn't think I did, but I added it for good measure since I
needed it to simulate load directly on the server. It doesn't seem to
affect the numbers significantly either way.
>> I am seeing slightly less than 2kBps (yes, 1000-2000 bytes per second)
>> from the client while this is happening.
>
> UDP transport, or TCP? If the former, then definitely switch to the
> latter, since you're probably pounding the server with unnecessary RPC
> retries while it is busy with the I/O.
>
> For the same reason, also make sure that -otimeo=600 (the default for
> TCP).
Yes, both set that way by default. From /proc/mounts on the client:
rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.2.1.13,mountvers=3,mountproto=tcp,addr=10.2.1.13
-Dave
On Fri, 2009-03-13 at 13:36 -0700, David Rees wrote:
> I've been trying to troubleshoot/tune around a problem I have been
> experiencing where if the NFS server is under heavy write load (the
> disks are saturated), client side NFS performance drops to nearly 0.
> As soon as the load is lifted so that there is no significant IO wait
> time on the server, the clients start acting responsively again.
>
> Server setup:
> Fedora 9 kernel 2.6.27.15-78.2.23.fc9.x86_64
> NFS tuning - none
> Disk system - 230GB SATA RAID1 array on a basic AACRAID adapter
> Dual XEONs, 8GB RAM
> Network GigE
>
> Client setup:
> Fedora 10 kernel 2.6.27.19-170.2.35.fc10.x86_64
> NFS tuning - none
> Network GigE
>
> Things I have tried:
>
> I have tried playing with the disk scheduler, switching to deadline
> from cfq = no difference.
> I have tried playing with rsize/wsize settings on the client = no difference.
>
> Steps to reproduce:
>
> 1. Write a big file to the same partition that is exported on the server:
> dd if=/dev/zero of=/opt/export/bigfile bs=1M count=5000 conv=fdatasync
> 2. Write a small file to the same partition from the client:
> dd if=/dev/zero of=/opt/export/bigfile bs=16k count=8 conv=fdatasync
You don't need conv=fdatasync when writing to NFS. The close-to-open
cache consistency automatically guarantees fdatasync on close().
> I am seeing slightly less than 2kBps (yes, 1000-2000 bytes per second)
> from the client while this is happening.
UDP transport, or TCP? If the former, then definitely switch to the
latter, since you're probably pounding the server with unnecessary RPC
retries while it is busy with the I/O.
For the same reason, also make sure that -otimeo=600 (the default for
TCP).
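(For example - the server name and mount point here are just placeholders -
an explicit mount would look something like:
mount -o proto=tcp,timeo=600 server:/export /mnt/export
or put the equivalent options in /etc/fstab.)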
Trond
On Fri, 2009-03-13 at 14:59 -0700, David Rees wrote:
> And the activity around the time I am reproducing the slowdown:
>
> Server nfs v3:
> null getattr setattr lookup access readlink
> 0 0% 3503 75% 7 0% 31 0% 1027 22% 0 0%
> read write create mkdir symlink mknod
> 9 0% 50 1% 6 0% 0 0% 0 0% 0 0%
> remove rmdir rename link readdir readdirplus
> 2 0% 0 0% 2 0% 0 0% 0 0% 0 0%
> fsstat fsinfo pathconf commit
> 0 0% 0 0% 0 0% 13 0%
Is this the result of only doing 2 'dd' copies? That's a lot of GETATTR
calls for that kind of workload...
Trond
On Mon, Mar 16, 2009 at 9:54 PM, Greg Banks <[email protected]> wrote:
> David, could you try your test case with this command running on the server?
>
> tethereal -i eth0 -f 'port 2049' -z rpc,rtt,100003,3 -w /dev/null
>
> and ^C it when you've done. You should see a table by RPC call with
> minimum, maximum, and average "roundtrip times" (actually as they're
> measured on the server, they should be server response times). This
> should tell you which calls are slow and whether it's all those calls or
> only a few outliers.
OK, filtering out the empty lines, I ran the test (note that I
modified it slightly, details later) with and without the server side
large write running:
Without:
Procedure Calls Min RTT Max RTT Avg RTT
GETATTR 3 0.00006 0.00133 0.00048
SETATTR 1 0.03878 0.03878 0.03878
ACCESS 9 0.00006 0.00232 0.00049
COMMIT 1 1.13381 1.13381 1.13381
With:
Procedure Calls Min RTT Max RTT Avg RTT
GETATTR 1 0.00016 0.00016 0.00016
SETATTR 1 30.14662 30.14662 30.14662
ACCESS 8 0.00005 0.00544 0.00154
COMMIT 1 0.34472 0.34472 0.34472
> Another experiment worth trying is a local filesystem load fairness test
> on the server (no NFS client involved). This will tell you whether the
> nfsd's IO priority is an issue, or if you're just seeing an IO
> scheduling issue.
>
> 1. echo 3 > /proc/sys/vm/drop_caches
> 2. time find /your/home/some/dir -ls > /dev/null (choose a directory
> tree with a few hundred files)
> 3. echo 3 > /proc/sys/vm/drop_caches
> 4. start your large writes
> 5. time find /your/home/some/dir -ls > /dev/null
This test actually returned similar response times with or without the
big server write running, which led me to think that this isn't
even an NFS problem - it actually appears to be a VM or
filesystem (ext3) problem.
Here's the test case again:
1. On the server, write out a large file:
dd if=/dev/zero of=bigfile bs=1M count=5000
2. On the client, write out a small file:
dd if=/dev/zero of=smallfile bs=4k count=1
Now, I just ran #2 both on a NFS client and on the local server -
response time was the same, about 45 seconds.
So it appears that my server setup (older dual 3GHz Xeons, 8GB
RAM, SATA RAID1) lets too much dirty data accumulate, creating a huge
backlog of data to be written to the journal and letting small,
latency-sensitive writes go very slowly. I seem to remember reading
about this problem before - oh yeah:
http://lwn.net/Articles/216853/
on top of this newer bug:
http://bugzilla.kernel.org/show_bug.cgi?id=12309
http://lkml.org/lkml/2009/1/16/487
I'm off to go hack dirty_ratio and dirty_background_ratio for now
until this bug gets fixed and ext4 appears to fully stabilize. :-)
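Something like this, I think - the actual values are just a first guess on
my part:
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
(or the equivalent vm.dirty_background_ratio / vm.dirty_ratio entries in
/etc/sysctl.conf to make it persistent).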
Thanks for your help, guys.
-Dave
David Rees wrote:
> On Fri, Mar 13, 2009 at 3:29 PM, Trond Myklebust
> <[email protected]> wrote:
>
>> On Fri, 2009-03-13 at 14:59 -0700, David Rees wrote:
>>
>>> And the activity around the time I am reproducing the slowdown:
>>>
>>> Server nfs v3:
>>> null getattr setattr lookup access readlink
>>> 0 0% 3503 75% 7 0% 31 0% 1027 22% 0 0%
>>> read write create mkdir symlink mknod
>>> 9 0% 50 1% 6 0% 0 0% 0 0% 0 0%
>>> remove rmdir rename link readdir readdirplus
>>> 2 0% 0 0% 2 0% 0 0% 0 0% 0 0%
>>> fsstat fsinfo pathconf commit
>>> 0 0% 0 0% 0 0% 13 0%
>>>
>> Is this the result of only doing 2 'dd' copies? That's a lot of GETATTR
>> calls for that kind of workload...
>>
>
> No - the client I have been reproducing this from is also my desktop,
> the NFS server hosts my home directory, and that mount was in active
> use during the test.
>
> That's where the extreme slowdown in NFS performance affects me the
> most. When the heavy IO on the server is going on (even just a
> single process writing as fast as it can), my applications (Firefox,
> Thunderbird, Gnome Terminals, just about anything that accesses the
> NFS mount) will basically lock up and go totally unresponsive while
> they wait for the NFS server to respond. They will sit unresponsive
> for minutes at a time and are unusable until the heavy IO stops on the
> server.
>
> I do software development from this machine and I have timed one of my
> project builds with and without the heavy IO on the NFS server - a
> build that normally takes about 20 seconds will take 5 minutes to
> complete (it does read/write a lot of small files).
>
David, could you try your test case with this command running on the server?
tethereal -i eth0 -f 'port 2049' -z rpc,rtt,100003,3 -w /dev/null
and ^C it when you've done. You should see a table by RPC call with
minimum, maximum, and average "roundtrip times" (actually as they're
measured on the server, they should be server response times). This
should tell you which calls are slow and whether it's all those calls or
only a few outliers.
Another experiment worth trying is a local filesystem load fairness test
on the server (no NFS client involved). This will tell you whether the
nfsd's IO priority is an issue, or if you're just seeing an IO
scheduling issue.
1. echo 3 > /proc/sys/vm/drop_caches
2. time find /your/home/some/dir -ls > /dev/null (choose a directory
tree with a few hundred files)
3. echo 3 > /proc/sys/vm/drop_caches
4. start your large writes
5. time find /your/home/some/dir -ls > /dev/null
--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.
On Fri, Mar 13, 2009 at 2:40 PM, Trond Myklebust
<[email protected]> wrote:
> Do you perhaps see a lot of sockets connected to port 2049 and that are
> not in the TCP_ESTABLISHED state when you do 'netstat -nt' on the
> server?
Only one tcp socket from the client to the server (no other clients
need to be active when running the testcase).
> What is the ratio of 'write' to 'commit' operations when you look at the
> 'nfsstat -s' output (also on the server)?
It appears to be about 3-4 writes per commit. By far the
heaviest activity is getattr calls, at 60-75% of the total.
# nfsstat -s
Server rpc stats:
calls badcalls badauth badclnt xdrcall
21044536 0 0 0 0
Server nfs v3:
null getattr setattr lookup access readlink
30 0% 12977249 61% 267095 1% 1201529 5% 3030825 14% 4816 0%
read write create mkdir symlink mknod
722160 3% 1369511 6% 242484 1% 2852 0% 103 0% 1 0%
remove rmdir rename link readdir readdirplus
218097 1% 4354 0% 189125 0% 284 0% 1569 0% 139267 0%
fsstat fsinfo pathconf commit
3930 0% 554 0% 272 0% 610240 2%
And the activity around the time I am reproducing the slowdown:
Server nfs v3:
null getattr setattr lookup access readlink
0 0% 3503 75% 7 0% 31 0% 1027 22% 0 0%
read write create mkdir symlink mknod
9 0% 50 1% 6 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
2 0% 0 0% 2 0% 0 0% 0 0% 0 0%
fsstat fsinfo pathconf commit
0 0% 0 0% 0 0% 13 0%
-Dave
Trond Myklebust wrote:
> On Fri, 2009-03-13 at 13:36 -0700, David Rees wrote:
>
>>
>
> You don't need conv=fdatasync when writing to NFS. The close-to-open
> cache consistency automatically guarantees fdatasync on close().
>
Yes...but at least some versions of dd will print a performance number
at the end, where the time delta used in the calculation includes the
fdatasync() time but not the close() time. So conv=fdatasync is a
useful trick to get accurate performance numbers from dd. Personally I
don't use that metric from dd because it's a simple overall average and
I like to watch the time behaviour of the transfer rate.
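A minimal way to do that, assuming the usual sysstat/procps tools are on
the server, is to leave the dd running and watch the per-second write rate
in another terminal with something like:
iostat -x 1
or
vmstat 1
rather than relying on dd's single end-of-run average.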
--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.