2009-02-16 08:29:57

by Carsten Aulbert

[permalink] [raw]
Subject: Tracing down 250ms open/chdir calls

Hi all,

sorry in advance for this vague subject and also the vague email, I'm
trying my best to summarize the problem:

On our large cluster we sometimes encounter the problem that our main
scheduling processes are often in state D and eventually become unable
to push work out to the cluster.

The head nodes are 8-core Xeon boxes equipped with 16 GB of memory.
When certain types of jobs are running we see system loads of about
20-30, which may go up to 80-100 from time to time. Looking at the
individual cores, they are mostly busy with system tasks (e.g. htop
shows 'red' bars).

strace -tt -c showed that several system calls of the scheduler take a
long time to complete, most notably open and chdir which took between
180 and 230ms to complete (during our testing). Since most of these open
and chdir are via NFSv3 I'm including that list as well. The NFS servers
are Sun Fire X4500 boxes running Solaris 10u5 right now.

A standard output line looks like:
93.37 38.997264 230753 169 78 open

i.e. 93.37% of the system call time was spent in 169 open calls (78 of
them failing), at 230753 us/call on average, thus 39 wall clock seconds
were spent in a single minute just doing open.
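
(The columns are the usual strace -c summary: % time, seconds,
usecs/call, calls, errors, syscall. As a sanity check, 169 calls *
230753 us/call = 38997257 us, i.e. about 39.0 s, which matches the
seconds column.)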

We have tried several things to understand the problem, but apart from
moving more files (mostly log files of currently running jobs) off NFS
we have not made much progress so far. On
https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
we have summarized some things.

With the help of 'stress' and a tiny program just doing open/putc/close
on a single file, I've tried to get a feeling for how good or bad things
are when compared to other head nodes with different tasks/loads:

https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest

(this test may or may not help in the long run; I'm just poking around
in the dark).
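
For reference, the core of that tiny test program boils down to
something like the following sketch (a simplified reconstruction, not
the exact source; the roughly one-minute run time and the error
handling are approximations):

/*
 * Sketch of the open/putc/close micro-test (a reconstruction, not the
 * original source): open the given file, write a single character,
 * close it again, repeat for about one minute, then report how many
 * cycles per second were achieved.
 */
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const double duration = 60.0;   /* assumed ~1 minute run time */
    double start, elapsed = 0.0;
    long count = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    start = now();
    while ((elapsed = now() - start) < duration) {
        FILE *f = fopen(argv[1], "w");  /* open ... */
        if (!f) {
            perror("fopen");
            return 1;
        }
        putc('x', f);                   /* ... putc ... */
        fclose(f);                      /* ... close */
        count++;
    }

    printf("Wrote %ld times into %s, speed is %.2f/s, time elapsed: %.2fs\n",
           count, argv[1], count / elapsed, elapsed);
    return 0;
}

Compiled and run with a single file argument, it prints the number of
open/putc/close cycles it managed and the resulting rate per second.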

Now my questions:

* Do you have any suggestions on how to continue debugging this problem?
* Does anyone know how to improve the situation? Next on my agenda would
be to try different I/O scheduling algorithms; any hints on which ones
should be good for such boxes?
* I have probably missed vital information; please let me know if you
need more details about the system.

Please Cc me from linux-kernel, I'm only on the other two addressed lists.

Cheers and a lot of TIA

Carsten

--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31


2009-02-16 13:34:02

by Joe Landman

[permalink] [raw]
Subject: Re: Tracing down 250ms open/chdir calls

Carsten Aulbert wrote:
> Hi all,
>
> sorry in advance for this vague subject and also the vague email, I'm
> trying my best to summarize the problem:
>
> On our large cluster we sometimes encounter the problem that our main
> scheduling processes are often in state D and eventually become unable
> to push work out to the cluster.

Hi Carsten

Are you using a "standard" cluster scheduler (SGE, PBS, ...) or a
locally written one?

> The head nodes are 8-core Xeon boxes equipped with 16 GB of memory.
> When certain types of jobs are running we see system loads of about
> 20-30, which may go up to 80-100 from time to time. Looking at the
> individual cores, they are mostly busy with system tasks (e.g. htop
> shows 'red' bars).

Hmmm... These are your head nodes? Not your NFS server nodes? Sounds
like there are a large number of blocked IO processes ... try a

vmstat 1

and look at the "b" column (usually second from left ... or nearly
there). Lots of blocked IO processes can have the effect of introducing
significant latency into all system calls.

> strace -tt -c showed that several system calls of the scheduler take a
> long time to complete, most notably open and chdir which took between
> 180 and 230ms to complete (during our testing). Since most of these open
> and chdir are via NFSv3 I'm including that list as well. The NFS servers
> are Sun Fire X4500 boxes running Solaris 10u5 right now.

Hmmm.... What happens if you make these local to each box? What are the
mount options for the mount points? We have spoken to other users with
performance problems on such servers. The NFS server/OS combination you
indicate above isn't known to be very fast. This isn't your problem
(see later), but your own data suggests you are giving up
nearly an order of magnitude performance using this NFS server/OS
combination, likely at a premium price as well.

> A standard output line looks like:
> 93.37 38.997264 230753 169 78 open
>
> i.e. 93.37% of the system call time was spent in 169 open calls (78 of
> them failing), at 230753 us/call on average, thus 39 wall clock seconds
> were spent in a single minute just doing open.
>
> We have tried several things to understand the problem, but apart from
> moving more files (mostly log files of currently running jobs) off NFS
> we have not made much progress so far. On
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
> we have summarized some things.

I didn't catch if these are tcp/udp mounts. What are the mount options?

> With the help of 'stress' and a tiny program just doing open/putc/close
> on a single file, I've tried to get a feeling for how good or bad things
> are when compared to other head nodes with different tasks/loads:
>
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest

... and these results do in fact confirm my earlier comment about the
NFS server/OS combo not being known to be fast.

From my laptop running an Ubuntu 8.10 variant over gigabit to a known
very slow NFS server ...

./measure.exe /misc/crunch/home/landman/test2
Wrote 24905 times into /misc/crunch/home/landman/test2, speed is
421.14/s, time elapsed: 59.14s


From a different server in a different location to a JackRabbit-S
server over NFS on a single gigabit with default mounts (this server
was running a long IOzone)

./measure.exe /mnt/d/1
Wrote 124312 times into /mnt/d/1, speed is 2072.99/s, time elapsed: 59.97s

And from the same starting point but to a JackRabbit-S server over NFS
on a single gigabit with default mounts (this server idle)

./measure.exe /mnt/d/x
Wrote 134426 times into /mnt/d/x, speed is 2270.79/s, time elapsed: 59.20s

This system would be roughly similar to your d02 machine.

Assuming you aren't using mount options like noac, sync, ... could you
enlighten us as to what the mount options are on the head nodes?

Also, the way the code is written, you are doing quite a few calls to
gettimeofday ... you could probably avoid this with a little re-writing
of the main loop.
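
Something along these lines would cut the number of gettimeofday calls
down by a large factor (just a sketch of the idea with an arbitrary
check interval and made-up helper names, not your actual code):

#include <stdio.h>
#include <sys/time.h>

#define CHECK_EVERY 1000    /* arbitrary example value */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Same open/putc/close cycle, but the clock is only consulted once
 * every CHECK_EVERY iterations instead of on every single pass. */
long run_test(const char *path, double duration)
{
    double start = now();
    long count = 0;

    for (;;) {
        FILE *f = fopen(path, "w");
        if (!f)
            break;
        putc('x', f);
        fclose(f);
        count++;

        if (count % CHECK_EVERY == 0 && now() - start >= duration)
            break;
    }
    return count;
}

The check interval is a trade-off, of course: make it too large and a
slow mount (say 100 opens/s) will overshoot the intended 60 s window by
several seconds, so it may need tuning to the expected rate.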

If you are using noac or sync on your NFS mounts, then this could
explain some of the differences you are seeing (certainly the 100/s vs
800/s ... but not likely the 4/s)

However, you'll notice that h2 in your table is an apparent outlier,
so there may be something more systematic going on there. Since you
indicate there is a high load while you are testing, this is likely
what you need to explore.

Grab the atop program. It is pretty good about letting you explore
what is causing load. That is, despite your X4500/Solaris combo showing
itself not to be a fast NFS server, the problem appears more likely to
be localized on the h2 machine than on the NFS machines.

http://freshmeat.net/projects/atop/

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [email protected]
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

2009-02-16 12:56:43

by Trond Myklebust

[permalink] [raw]
Subject: Re: Tracing down 250ms open/chdir calls

On Mon, 2009-02-16 at 08:55 +0100, Carsten Aulbert wrote:
> [...]
>
> strace -tt -c showed that several system calls of the scheduler take a
> long time to complete, most notably open and chdir which took between
> 180 and 230ms to complete (during our testing). Since most of these open
> and chdir are via NFSv3 I'm including that list as well. The NFS servers
> are Sun Fire X4500 boxes running Solaris 10u5 right now.
>
> [...]

2.6.27.7 has a known NFS client performance bug due to a change in the
authentication code. The fix was merged in 2.6.27.9: see the commit
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git&a=commitdiff&h=a0f04d0096bd7edb543576c55f7a0993628f924a

Cheers
Trond


2009-02-16 13:01:12

by Carsten Aulbert

[permalink] [raw]
Subject: Re: Tracing down 250ms open/chdir calls

Hi Trond,

Trond Myklebust schrieb:
> 2.6.27.7 has a known NFS client performance bug due to a change in the
> authentication code. The fix was merged in 2.6.27.9: see the commit
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git&a=commitdiff&h=a0f04d0096bd7edb543576c55f7a0993628f924a

Maybe that's one reason why we don't see it that often anymore,
although the user load has shifted a bit as well. We are running
2.6.27.14 (and .15 since Friday).

Thanks for letting me know; I'll probably re-run the tests with a
newer kernel.

Cheers

Carsten

2009-02-16 13:44:13

by Carsten Aulbert

[permalink] [raw]
Subject: Re: Tracing down 250ms open/chdir calls

Hi Joe,

(keeping all lists cross-posted; please give me a brief shout if I
cross the line into being rude):

Joe Landman schrieb:
>
> Are you using a "standard" cluster scheduler (SGE, PBS, ...) or a
> locally written one?
>

We use Condor (http://www.cs.wisc.edu/condor/).
>
> Hmmm... These are your head nodes? Not your NFS server nodes? Sounds
> like there are a large number of blocked IO processes ... try a
>

Yes, these are the head nodes and not the NFS servers.

> vmstat 1
>
> and look at the "b" column (usually second from left ... or nearly
> there). Lots of blocked IO processes can have the effect of introducing
> significant latency into all system calls.
>

Right now (system load of 40, but the box is still quite responsive):

vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff    cache   si   so    bi    bo    in    cs us sy id wa
 1  0     60 1748932  49204  7181132    0    0   393   730    15    12  8  4 76 12
 1  0     60 1738804  49204  7187768    0    0     0   254  3340 11635 15  8 77  0
 1  0     60 1693220  49216  7193328    0    0     0   306  3118  9296 16  7 77  0
 1  0     60 1751184  49224  7192924    0    0     0   343  3450 10866 12  9 79  0
 0  0     60 1754924  49224  7192936    0    0     0   127  2744  6750  5  7 87  2
 0  0     60 1756932  49240  7187552    0    0     0   532  3289  9673  3  6 91  0
 0  0     60 1752664  49240  7193776    0    0     0    77  2835 10075  2  7 92  0
 2  0     60 1754956  49244  7193820    0    0     0   553  3976 14870  4 12 84  0
 1  0     60 1742032  49252  7206288    0    0     0   193  3588  9133  5  6 89  0
 1  0     60 1736920  49252  7206316    0    0     0   284  3821  9402  7  7 86  0
 4  0     60 1742292  49260  7193964    0    0     0   514  4545 12428 17 10 74  0

> Hmmm.... What happens if you make these local to each box? What are the
> mount options for the mount points? We have spoken to other users with
> performance problems on such servers. The NFS server/OS combination you
> indicate above isn't known to be very fast. This isn't your problem
> (see later), but your own data suggests you are giving up
> nearly an order of magnitude performance using this NFS server/OS
> combination, likely at a premium price as well.
>

The mount options are pretty standard NFSv3 via tcp, ...:

s02:/atlashome/USER on /home/USER type nfs (rw,vers=3,rsize=32768,wsize=32768,namlen=255,soft,nointr,nolock,noacl,proto=tcp,
timeo=600,retrans=2,sec=sys,mountaddr=10.20.20.2,mountvers=3,mountproto=tcp,addr=10.20.20.2)

> Assuming you aren't using mount options like noac, sync, ... could you
> enlighten us as to what the mount options are on the head nodes?

The Linux data servers are exporting NFS with async; the X4500 should
behave about the same, i.e. we set nocacheflushing for the zpool.

>
> Also, the way the code is written, you are doing quite a few calls to
> gettimeofday ... you could probably avoid this with a little re-writing
> of the main loop.
>

Well, that proves that I should never be let near any serious
programming, at least not in time-critical parts of the code ;)

> If you are using noac or sync on your NFS mounts, then this could
> explain some of the differences you are seeing (certainly the 100/s vs
> 800/s ... but not likely the 4/s)
>
> However, you'll notice that h2 in your table is an apparent outlier,
> so there may be something more systematic going on there. Since you
> indicate there is a high load while you are testing, this is likely
> what you need to explore.
>

That was my idea as well.

> Grab the atop program. It is pretty good about letting you explore
> what is causing load. That is, despite your X4500/Solaris combo
> showing itself not to be a fast NFS server, the problem appears more
> likely to be localized on the h2 machine than on the NFS machines.
>
> http://freshmeat.net/projects/atop/

I'll try that and dive into iotop again as well.

Cheers

Carsten