2002-01-02 01:01:00

by Steinar Hauan

Subject: smp cputime issues

hello,

we are encountering some weird timing behaviour on our linux cluster.

specifically: when running 2 copies of selected programs on a
dual-cpu system, the cputime reported for each process is up to 25%
higher than when the processes are run on their own. however, if running
two different jobs on the same machine, both complete with a cputime
equal to when run individually. sample timing output attached.

profiling confirms that everything slows down by roughly the same factor.
the results reproduce on a range of different machines (see below).

additional specifications:
- kernel version 2.4.16 (with apic enabled)
- chipsets: apollo pro 133, apollo pro 266,
intel i860, serverworks LE
- all jobs require less than 1/10 of physical memory
- no significant disk i/o takes place
- timing with dtime(), /usr/bin/time and the shell built-in time
  (a stand-alone measurement sketch is included below)
- this behavior is NOT seen for all applications. the worst
"offender" spends most of its time doing linear algebra.

ideas or info-pointers appreciated. more specs available on request.

regards,
--
Steinar Hauan, dept of ChemE -- [email protected]
Carnegie Mellon University, Pittsburgh PA, USA


Attachments:
one.txt (936.00 B)
two.txt (940.00 B)

2002-01-02 01:31:41

by M. Edward Borasky

Subject: RE: smp cputime issues

The obvious question is: how do the printed *elapsed* (wall clock) times
compare with a stopwatch timing of the same run??

--
M. Edward Borasky

[email protected]
http://www.borasky-research.net

2002-01-02 13:20:34

by Martin Knoblauch

Subject: Re: smp cputime issues

> smp cputime issues
>
>
> hello,
>
> we are encountering some weird timing behaviour on our linux cluster.
>
> specifically: when running 2 copies of selected programs on a
> dual-cpu system, the cputime reported for each process is up to 25%
> higher than when the processes are run on their own. however, if running
> two different jobs on the same machine, both complete with a cputime
> equal to when run individually. sample timing output attached.
>
> profiling confirms that everything slows down by roughly the same factor.
> the results reproduce on a range of different machines (see below).
>
> additional specifications:
> - kernel version 2.4.16 (with apic enabled)
> - chipsets: apollo pro 133, apollo pro 266,
> intel i860, serverworks LE
> - all jobs require less than 1/10 of physical memory
> - no significant disk i/o takes place
> - timing with dtime(), /usr/bin/time and shell built-in time
> - this behavior is NOT seen for all applications. the worst
> "offender" spends most of its time doing linear algebra.
>
> ideas or info-pointers appreciated. more specs available on request.
>

two points. First, for clarification - do you see the effect also on
elapsed time? Or are you saying that the CPU time reporting is screwed?

Second - you mention that you see the effect mainly on linear algebra
stuff. Could it be that you are memory bandwidth limited if you run two
of them together? Are you using Intel CPUs (my guess) which have the FSB
concept that may make memory bandwidth scaling a problem, or AMD Athlons
which use the Alpha/EV6 bus and should be a bit more friendly.

Finally, how big is "1/10th of physical" memory? And what kind of memory is it?
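
If you want to test the bandwidth theory directly: time a crude triad
loop once on its own and then as two simultaneous copies. The code below
is only a rough sketch (not the real STREAM benchmark); if the per-copy
MB/s drops noticeably with two copies running, the bus is the limit.

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define N (4 * 1024 * 1024)          /* 3 arrays x 32 MB = ~96 MB */
  #define REPS 20

  /* crude "triad" loop a[i] = b[i] + s*c[i]; reports an effective MB/s
     figure so two simultaneous copies can be compared against one */
  int main(void)
  {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      struct timeval t0, t1;
      double secs, mbytes;
      long i;
      int rep;

      if (!a || !b || !c)
          return 1;
      for (i = 0; i < N; i++) {
          b[i] = 1.0;
          c[i] = 2.0;
      }

      gettimeofday(&t0, NULL);
      for (rep = 0; rep < REPS; rep++)
          for (i = 0; i < N; i++)
              a[i] = b[i] + 3.0 * c[i];
      gettimeofday(&t1, NULL);

      secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
      mbytes = (double)REPS * 3.0 * N * sizeof(double) / 1e6;   /* 2 loads + 1 store */
      printf("triad: %.1f MB/s\n", mbytes / secs);
      free(a); free(b); free(c);
      return 0;
  }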

Martin
--
+-----------------------------------------------------+
|Martin Knoblauch |
|-----------------------------------------------------|
|http://www.knobisoft.de/cats |
|-----------------------------------------------------|
|e-mail: [email protected] |
+-----------------------------------------------------+

2002-01-02 13:55:16

by Steinar Hauan

Subject: RE: smp cputime issues

On Tue, 1 Jan 2002, M. Edward Borasky wrote:
> The obvious question is: how do the printed *elapsed* (wall clock) times
> compare with a stopwatch timing of the same run??

sorry,

i should have included that all timings are consistent.
(usr/sys vs reported wall clock time vs external stopwatch time)

for reference: the effect arises for several different memory types
(pc133, pc133 ecc, pc133 reg ecc, pc2100) and the impact is similar.
thus if it were only a memory bandwidth issue, i would expect
the results to depend more on the memory/chipset in question.

regards,
--
Steinar Hauan, dept of ChemE -- [email protected]
Carnegie Mellon University, Pittsburgh PA, USA

2002-01-02 15:07:36

by M. Edward Borasky

Subject: RE: smp cputime issues

> Second - you mention that you see the effect mainly on linear algebra
> stuff. Could it be that you are memory bandwidth limited if you run two
> of them together? Are you using Intel CPUs (my guess) which have the FSB
> concept that may make memory bandwidth scaling a problem, or AMD Athlons
> which use the Alpha/EV6 bus and should be a bit more friendly.

Hmmm ... linear algebra ... are you by any chance using Atlas? Atlas is
highly optimized for the chip it runs on and for as many other
architectural features as it can discover, such as cache size. I'm sure a
well-tuned Atlas application is quite capable of bending a machine to its
own purposes, quite possibly to the discomfort of other users attempting
to use the system. If the issue is sharing of resources between the
linear algebra code and other users, perhaps the thing to do is to get
Atlas, if you're not currently using it, and then "nice" the linear
algebra code.

I run Atlas on my (UP) 1.333 GHz Athlon Thunderbird and it screams. I can
get 4+ GFLOPS in the 3DNOW 32-bit code and well over 1 GFLOP in 64 bits.
--
M. Edward Borasky

[email protected]
http://www.borasky-research.net

2002-01-02 17:53:18

by Martin Knoblauch

Subject: Re: smp cputime issues

Steinar Hauan wrote:
>
> On Wed, 2 Jan 2002, Martin Knoblauch wrote:
> > two points. First, for clarification - do you see the effect also on
> > elapsed time? Or are you saying that the CPU time reporting is screwed?
>
> wall clock time is consistent with (cpu time) x (%utilization)
>

OK, just asked to make sure I didn't misunderstand.

> > Second - you mention that you see the effect mainly on linear algebra
> > stuff. Could it be that you are memory bandwidth limited if you run two
> > of them together? Are you using Intel CPUs (my guess) which have the FSB
> > concept that may make memory bandwidth scaling a problem, or AMD Athlons
> > which use the Alpha/EV6 bus and should be a bit more friendly.
>
> these results are on Intel p3 and (p4) xeon cpu's, yes.
>

OK, that is what I almost guessed.

> > Finally, how big is "1/10th of physical" memory? And what kind of memory is it?
>
> the effects are reproducible with runs of size down to 40mb.
> (i've made a toy problem that runs in ~2 mins to isolate the effect)
>
> i've used 4 machine types
>
> p3 800mhz @ apollo pro 133 with 1gb pc133 ecc mem
> p3 1ghz @ apollo pro 266 with 1gb pc2100 ddr mem
> p3 1ghz @ serverworks LE with 2gb pc133 reg ecc mem
>
> for all of the above, the reported cpu usage is +25%. on the machine
>
> p4 xeon 1.7ghz @ intel i860 with 500mb pc800 reg ecc rdram
>
> the effect is less pronounced (5-6%), thus suggesting that memory
> bandwidth may be an issue. still, if that's the case, there's a
> significant difference in bandwidth between the other 3 machines.
> (the serverworks chipset has dual channels)
>

You are probably not bound by the bandwidth between memory and the
"chipset", but by the bandwidth on the FSB itself (i.e. between the CPUs
and the chipset). This would explain why the Serverworks LE doesn't give
you better scaling than the other P3 systems.

The P4 has a much higher FSB speed (400 MHz vs. 100/133 MHz). As a
result it has more headroom for scaling. You could look at the STREAM
results for an indicator:

http://www.cs.virginia.edu/stream/

The P4s definitely show the best numbers in the "PC" category, a LOT
better than any P3 result; the P3s seem to max out at about 450 MB/sec.
Unfortunately there are no dual-CPU entries.

                         ncpus    Copy   Scale     Add   Triad   (MB/s)
Dell_8100-1500               1  2106.0  2106.0  2144.0  2144.0
Intel_STL2-PIII-933          1   423.0   419.0   517.0   517.0
Intel_440BX-2_PIII-650       1   455.0   421.0   501.0   500.0

It would be interesting to see your test performed on a dual Athlon
(comparable speed to the P4). There seems to be evidence that they scale
better for scientific workloads, although the STREAM results below (same
columns as above) do not show very good scaling:

AMD_Athlon_1200              2   922.0   916.4  1051.7  1053.4
AMD_Athlon_1200              1   726.8   711.8   860.1   851.4

http://www.amdzone.com/releaseview.cfm?ReleaseID=764 (as a reference for
better Athlon scaling).

Martin
--
+-----------------------------------------------------+
|Martin Knoblauch |
|-----------------------------------------------------|
|http://www.knobisoft.de/cats |
|-----------------------------------------------------|
|e-mail: [email protected] |
+-----------------------------------------------------+

2002-01-03 01:24:24

by J.A. Magallon

Subject: Re: smp cputime issues (patch request ?)


On 20020102 Steinar Hauan wrote:
>hello,
>
> we are encountering some weird timing behaviour on our linux cluster.
>
> specifically: when running 2 copies of selected programs on a
> dual-cpu system, the cputime reported for each process is up to 25%
> higher than when the processes are run on their own. however, if running
> two different jobs on the same machine, both complete with a cputime
> equal to when run individually. sample timing output attached.
>

Cache pollution problems ?

As I understand, your job does not use too much memory, does no IO,
just linear algebra (ie, matrix-times-vector or vector-plus-vector
operations). That implies sequential access to matrix rows and vectors.

I will try to guess...

Problem with linux scheduler is that processes are bounced from one CPU
to the other, they are not tied to one, nor try to stay in the one they
start, even if there is no need for the cpu to do any other job.
On an UP box, the cache is useful to speed up your matrix-vector ops.
One process on a 2-way box, just bounces from one cpu to the other,
and both caches are filled with the same data. Two processes on two
cpus, and every time they 'swap' between cpus they trash the previous
cache for the other job, so when it returns it has no data cached.

Solutions:
 - cpu affinity patch: manually tie processes to cpus (see the
   sketch just below)
 - new scheduler: a patch for the scheduler that tries to keep
   processes on the cpu they start was talked about on the list.
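
For option 1, the user-space side is roughly this (just a sketch; it
assumes a kernel carrying the affinity patch, and it uses the
sched_setaffinity() call in its later glibc form, so the exact prototype
may differ from what a particular patch version provides):

  /* sketch only: pin the current process to cpu 0 before starting
     the real work */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sched.h>

  int main(void)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(0, &mask);                        /* allow cpu 0 only */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      /* ... run the linear-algebra job from here on ... */
      return 0;
  }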

I would prefer the second option. I think it is named something like
'multiqueue scheduler', and its 'father' could be (AFAIR) Davide Libenzi.
Look for that on the list archives. Problem: I think the patch only
exists for 2.5.

Request: a version for 2.4.17+ ?? (plz)

Disclaimer: of course, all the previous discussion can be crap.

Good luck. I am also interested in this problem.

--
J.A. Magallon # Let the source be with you...
mailto:[email protected]
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.18-pre1-beo #3 SMP Thu Dec 27 10:15:27 CET 2001 i686

2002-01-03 02:35:57

by Davide Libenzi

Subject: Re: smp cputime issues (patch request ?)

On Thu, 3 Jan 2002, J.A. Magallon wrote:

>
> On 20020102 Steinar Hauan wrote:
> >hello,
> >
> > we are encountering some weird timing behaviour on our linux cluster.
> >
> > specifically: when running 2 copies of selected programs on a
> > dual-cpu system, the cputime reported for each process is up to 25%
> > higher than when the processes are run on their own. however, if running
> > two different jobs on the same machine, both complete with a cputime
> > equal to when run individually. sample timing output attached.
> >
>
> Cache pollution problems ?
>
> As I understand, your job does not use too much memory, does no IO,
> just linear algebra (ie, matrix-times-vector or vector-plus-vector
> operations). That implies sequential access to matrix rows and vectors.
>
> I will try to guess...
>
> Problem with linux scheduler is that processes are bounced from one CPU
> to the other, they are not tied to one, nor try to stay in the one they
> start, even if there is no need for the cpu to do any other job.
> On an UP box, the cache is useful to speed up your matrix-vector ops.
> One process on a 2-way box, just bounces from one cpu to the other,
> and both caches are filled with the same data. Two processes on two
> cpus, and every time they 'swap' between cpus they trash the previous
> cache for the other job, so when it returns it has no data cached.
>
> Solutions:
> - cpu affinity patch: manually tie processes to cpus
> - new scheduler: a patch for the scheduler that tries to
> keep processes on the cpu they start was talked about on the list.
>
> I would prefer the second option. I think it is named something like
> 'multiqueue scheduler', and its 'father' could be (AFAIR) Davide Libenzi.
> Look for that on the list archives. Problem: I think the patch only
> exists for 2.5.

The patch is here :

http://www.xmailserver.org/linux-patches/xsched-2.5.2-pre4-0.58.diff

I did not read the whole thread but if your two tasks are strictly cpu
bound and you've two cpus, you should not have problems even with the
current scheduler.




- Davide


2002-01-05 05:22:18

by Steinar Hauan

Subject: Re: smp cputime issues (patch request ?)

On Thu, 3 Jan 2002, J.A. Magallon wrote:
> Cache pollution problems ?
>
> As I understand, your job does not use too much memory, does no IO,
> just linear algebra (ie, matrix-times-vector or vector-plus-vector
> operations). That implies sequential access to matrix rows and vectors.

very correct.

> Problem with linux scheduler is that processes are bounced from one CPU
> to the other, they are not tied to one, nor try to stay in the one they
> start, even if there is no need for the cpu to do any other job.

one of the tips received was to set the penalty for cpu switch, i.e. set

linux/include/asm/smp.h:#define PROC_CHANGE_PENALTY 15

to a much higher value (50). this had no effect on the results.

> On an UP box, the cache is useful to speed up your matrix-vector ops.
> One process on a 2-way box, just bounces from one cpu to the other,
> and both caches are filled with the same data. Two processes on two
> cpus, and every time they 'swap' between cpus they trash the previous
> cache for the other job, so when it returns it has no data cached.

this would be an issue, agreed, but cache invalidation by cpu bounces
should also affect one-cpu jobs? thus it does not explain why this
effect should be (much) worse with 2 jobs.

regards,
--
Steinar Hauan, dept of ChemE -- [email protected]
Carnegie Mellon University, Pittsburgh PA, USA

2002-01-05 05:51:33

by John Alvord

Subject: Re: smp cputime issues (patch request ?)

On Sat, 5 Jan 2002 00:21:44 -0500 (EST), Steinar Hauan <[email protected]>
wrote:

>On Thu, 3 Jan 2002, J.A. Magallon wrote:
>> Cache pollution problems ?
>>
>> As I understand, your job does not use too much memory, does no IO,
>> just linear algebra (ie, matrix-times-vector or vector-plus-vector
>> operations). That implies sequential access to matrix rows and vectors.
>
>very correct.
>
>> Problem with linux scheduler is that processes are bounced from one CPU
>> to the other, they are not tied to one, nor try to stay in the one they
>> start, even if there is no need for the cpu to do any other job.
>
>one of the tips received was to set the penalty for cpu switch, i.e. set
>
> linux/include/asm/smp.h:#define PROC_CHANGE_PENALTY 15
>
>to a much higher value (50). this had no effect on the results.
>
>> On an UP box, the cache is useful to speed up your matrix-vector ops.
>> One process on a 2-way box, just bounces from one cpu to the other,
>> and both caches are filled with the same data. Two processes on two
>> cpus, and every time they 'swap' between cpus they trash the previous
>> cache for the other job, so when it returns it has no data cached.
>
>this would be an issue, agreed, but cache invalidation by cpu bounces
>should also affect one-cpu jobs? thus it does not explain why this
>effect should be (much) worse with 2 jobs.

One factor to consider is that to see the bouncing, you need to be
running an observation process like top or, if it is a GUI display, two
processes (the application and X). Those observing processes will
continuously bump aside the calculation processes, causing a bouncing
effect.

john