Hello all you great Linux hackers,
in our fine physics group we recently bought a DUAL XEON P4 2666MHz, 2GB, with
hyper-threading support and I had the honour of making the thing work. In the
process I also did some benchmarking using two different kernels (stock
SuSE-8.2-Pro 2.4.20-64GB-SMP, and the latest and greatest vanilla
2.6.0-test4). I benchmarked
[1] kernel compiles (after 'cat'ting all source files to /dev/null to pull
them into the buffer cache), and
[2] the running time of a multi-threaded numerical simulation making
extensive use of FFTs via the fftw.org library.
To cut the detailed story (below) short, the results puzzle me to a certain
extent: the physical/logical CPU distinction, which 2.6.0 is supposed to make
and put to good use in the improved scheduler, does not seem to make a huge
difference; in fact, it looks worse than in 2.4. While 2.6 was approximately
14% faster than 2.4 for 'make' and 'make -j4', 'make -j2' was only 6.8%
faster on 2.6. (That is why the percentage gain from -j2 to -j4 is more
striking on 2.6 than on 2.4.) This suggests worse scheduling of two active
processes on four logical CPUs under 2.6 with 'make -j2'. For kernel
compiles, 2.6 is faster than 2.4 in general, but I should point out that the
stock SuSE-8.2 2.4 SMP kernel was not optimized for the specific P4 hardware.
The next strange thing is that using FFTW (in a single program) with two or
four simultaneous FFT threads on 2.6.0-test4 is significantly *slower* than
on 2.4, where hyperthreading/SMP support is said to be inferior to 2.6's.
The simulation took almost 50% longer on 2.6, all other things being equal.
Unfortunately, these results made me go back to 2.4 for the time being.
If somebody would like more or other details, please send me an email. Also
let me know if you want to test some patch on the machine. Since I am not
subscribed to the list, please 'cc' me in your
replies/requests/interpretations.
Another question: why is the physical id of the CPUs on this dual HT system
either 0 or 3? That seems counterintuitive to me. Does 'cat /proc/cpuinfo'
(shown below for 2.6) perhaps show a wrong physical/logical CPU
enumeration/distinction in 2.6, which could lead to the decreased 2.6
performance?
Thank you
Max Krueger
P.S. The HT references I found online do not compare HT between 2.4 and
2.6, but they all assume improvements in 2.6.
http://www.linuxworld.com/story/33885.htm
http://www-106.ibm.com/developerworks/linux/library/l-htl/
// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Benchmarking details:
[1]
// ***Kernel compiles***
// Same configuration. Files held in the buffer cache.
// Between two runs: make mrproper; cp ../.config .
// ******* 2.6.0-test4
time make -j4 bzImage modules
real 6m22.062s (26.1% faster than -j2 on 2.6)
(15.1% faster than -j4 on 2.4)
user 21m54.580s
sys 2m15.787s
time make -j2 bzImage modules
real 8m37.429s (ONLY 6.8% faster than -j2 on 2.4)
user 15m10.568s
sys 1m42.431s
time make bzImage modules
real 13m48.561s (14% faster than make on 2.4)
user 12m7.176s
sys 1m30.951s
// ******* 2.4.20
time make -j4 bzImage modules
real 7m30.777s (18.9% faster than -j2 on 2.4)
user 21m4.820s
sys 7m7.660s
time make -j2 bzImage modules
real 9m15.707s
user 16m18.740s
sys 2m10.620s
time make bzImage modules
real 15m59.935s
user 14m21.240s
sys 1m39.500s
[2]
// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Multithreaded simulation using N threads with the fftw.org library
// Time needed for 2000 timesteps, each doing various 3D FFTs
// of 2048x32x32 matrices (a sketch of the setup follows the timings)
// ******* 2.6.0-test4
four threads: 3:13 (three hours, thirteen minutes)
two threads: 3:26
// ******* 2.4.20
four threads: 2:16 (!!!)
two threads: 2:49
// For comparison:
// 2.4.20 on a Dual Athlon 2000+MP 1666MHz, 2GB
two threads: 3:19
// 2.4.22-rcX on a Dual Opteron 240 1349MHz, 2GB
two threads: 1:57
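// The simulation source is not shown; below is a minimal, illustrative
// sketch of the kind of threaded-FFTW setup described above, assuming the
// FFTW 3 API (the FFTW 2 threads API differs). Only the matrix dimensions
// and thread counts come from the report; everything else is hypothetical.

#include <string.h>
#include <fftw3.h>

int main(void)
{
    const int nx = 2048, ny = 32, nz = 32;

    if (!fftw_init_threads())       /* once, before any plan is created */
        return 1;
    fftw_plan_with_nthreads(4);     /* N = 4 (or 2) threads per transform */

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * nx * ny * nz);
    memset(data, 0, sizeof(fftw_complex) * nx * ny * nz);

    /* FFTW_MEASURE triggers the run-time self-benchmarking that the
       replies below discuss */
    fftw_plan p = fftw_plan_dft_3d(nx, ny, nz, data, data,
                                   FFTW_FORWARD, FFTW_MEASURE);

    for (int step = 0; step < 2000; step++) {
        /* ... update data for this timestep ... */
        fftw_execute(p);
    }

    fftw_destroy_plan(p);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}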
CPU:
linux-2.6.0-test4> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.657
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5259.26
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.657
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5324.80
processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.657
cache size : 512 KB
physical id : 3
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5324.80
processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.657
cache size : 512 KB
physical id : 3
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5324.80
linux-2.6.0-test4> cat /proc/modules
videodev 6272 0 (autoclean)
autofs 10644 2 (autoclean)
nfsd 85168 4 (autoclean)
ipv6 179508 -1 (autoclean)
isa-pnp 32520 0 (unused)
st 30708 0 (autoclean) (unused)
sr_mod 13624 0 (autoclean)
sg 29276 0 (autoclean)
mousedev 4536 0 (unused)
joydev 5984 0 (unused)
evdev 4352 0 (unused)
input 3456 0 [mousedev joydev evdev]
usb-ohci 22056 0 (unused)
usb-uhci 24976 0 (unused)
ehci-hcd 18700 0 (unused)
usbcore 66476 1 [usb-ohci usb-uhci ehci-hcd]
raw1394 16724 0 (unused)
ieee1394 38064 0 [raw1394]
e1000 46948 1
ip_conntrack_ftp 4112 0 (unused)
ipt_state 568 2
ip_conntrack 19384 2 [ip_conntrack_ftp ipt_state]
iptable_filter 1708 1
ip_tables 11808 2 [ipt_state iptable_filter]
ide-scsi 10608 0
ide-cd 32220 0
cdrom 30496 0 [sr_mod ide-cd]
ext3 90696 1
jbd 54292 1 [ext3]
[email protected] wrote:
> Hello all you great Linux hackers,
>
> in our fine physics group we recently bought a DUAL XEON P4 2666MHz, 2GB, with
> hyper-threading support and I had the honour of making the thing work. In the
> process I also did some benchmarking using two different kernels (stock
> SuSE-8.2-Pro 2.4.20-64GB-SMP, and the latest and greatest vanilla
> 2.6.0-test4). I benchmarked
>
> [1] kernel compiles (after 'cat'ting all source files to /dev/null to pull
> them into the buffer cache), and
>
> [2] the running time of a multi-threaded numerical simulation making
> extensive use of FFTs via the fftw.org library.
>
> To cut the detailed story (below) short, the results puzzle me to a certain
> extent: the physical/logical CPU distinction, which 2.6.0 is supposed to make
I'm no kernel developer, so take my opinion as worth less than
anyone else's here. The new scheduler in the 2.6
kernels is still being tweaked by Con and Ingo, et al. But beyond
that, there are several new ways to tweak the scheduler,
designed to handle different loads, amounts of memory, etc.
Skimming the past few months of the mailing list archives for
what to tweak and how may improve the workloads you are currently
testing. My $0.01 (I'm cheap like that).
-sb
On Tue, Aug 26, 2003, [email protected] wrote:
> in our fine physics group we recently bought a DUAL XEON P4 2666MHz, 2GB, with
> hyper-threading support and I had the honour of making the thing work. In the
> process I also did some benchmarking using two different kernels (stock
> SuSE-8.2-Pro 2.4.20-64GB-SMP, and the latest and greatest vanilla
> 2.6.0-test4). I benchmarked
>
> [2] the running time of a multi-threaded numerical simulation making
> extensive use of FFTs via the fftw.org library.
One thing to watch out for, with fftw: I believe it will benchmark
various kernels, and decide which one to use, at run-time. If the
scheduler fools it into thinking that a particular kernel is going to
perform better, it might do the wrong thing.
Does fftw have a switch to write a debug log?
("kernel" in this context means "the small section of code used to solve
the fft", not "the OS code running in privileged mode".)
-andy
On Tue, 26 Aug 2003, Andy Isaacson wrote:
> On Tue, Aug 26, 2003, [email protected] wrote:
> > in our fine physics group we recently bought a DUAL XEON P4 2666MHz, 2GB,
> > with hyper-threading support and I had the honour of making the thing work.
> > In the process I also did some benchmarking using two different kernels
> > (stock SuSE-8.2-Pro 2.4.20-64GB-SMP, and the latest and greatest vanilla
> > 2.6.0-test4). I benchmarked
> >
> > [2] the running time of a multi-threaded numerical simulation making
> > extensive use of FFTs via the fftw.org library.
>
> One thing to watch out for, with fftw: I believe it will benchmark
> various kernels, and decide which one to use, at run-time. If the
> scheduler fools it into thinking that a particular kernel is going to
> perform better, it might do the wrong thing.
>
> Does fftw have a switch to write a debug log?
>
> ("kernel" in this context means "the small section of code used to solve
> the fft", not "the OS code running in privileged mode".)
>
> -andy
The benchmarks in the fftw.org libraries are useful only as
time-sinks. At least on ix86, our tests show that a generic
FFT, perhaps 10 years old, with no special treatment except
using pointers instead of indexes, and with 'double' as the
float type, is, within the test noise, as fast as their
"self-adapting" FFT.
Also, the nature of an FFT reduces the influence of the kernel.
Even with some kind of parallelism, for which there is little
chance because of the serial nature of an FFT, you are testing
threads, not the kernel. To the kernel, an FFT is just some
CPU-bound math.
If you are going to do a lot of math simulation and are not
going to be creating a lot of separate tasks that communicate,
your choice of kernel (or operating system) is irrelevant.
This kind of math code just sits in user memory plugging
along until it writes its results to something, somewhere.
The kernel is not involved until the answers are available.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.
In article <[email protected]>,
<[email protected]> wrote:
| The next strange thing is that using FFTW (in a single program) with two or
| four simultaneous FFT threads on 2.6.0-test4 is significantly *slower* than
| on 2.4, where hyperthreading/SMP support is said to be inferior to 2.6's.
| The simulation took almost 50% longer on 2.6, all other things being equal.
| Unfortunately, these results made me go back to 2.4 for the time being.
You may want to run without HT at all for your particular problem, or limit
threads to the number of physical CPUs. Having multiple threads trying
to do FFTs is likely to thrash the cache, and will result in contention
for the (single) FPU.
The scheduler would seem to lack the information to determine the
magnitude of the FPU contention, and by making better use of HT
it makes the contention worse.
Perhaps there should be a "don't hyperthread" attribute one could set to
hint that running multiple threads on a single CPU is unlikely to work
well. Since there isn't, I don't know a way to avoid the problem unless
you can set maxthreads to Ncpu.
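For what it's worth, one can approximate that today with CPU affinity. A
minimal sketch, assuming a glibc that exposes the sched_setaffinity()
wrapper and the CPU_SET macros; the processor numbers follow the
/proc/cpuinfo above, where processors 0/1 and 2/3 are HT siblings:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* one logical CPU of physical package 0 */
    CPU_SET(2, &mask);   /* one logical CPU of physical package 3 */

    /* pid 0 = this process; threads created later inherit the mask */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... start the two FFT threads here ... */
    return 0;
}

That keeps the two compute threads on separate physical packages, so they
never share one core's FPU and cache.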
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Tuesday 26 August 2003 14:12, Richard B. Johnson wrote:
> On Tue, 26 Aug 2003, Andy Isaacson wrote:
> > On Tue, Aug 26, 2003, [email protected] wrote:
> > > in our fine physics group we recently bought a DUAL XEON P4 2666MHz,
> > > 2GB, with hyper-threading support and I had the honour of making the
> > > thing work. In the process I also did some benchmarking using two
> > > different kernels (stock SuSE-8.2-Pro 2.4.20-64GB-SMP, and the latest
> > > and greatest vanilla 2.6.0-test4). I benchmarked
Chances are -neither- of these kernels have HT enhancements for the scheduler.
I am positive the 2.6.0-test kernels do not have shared runqueues for HT
siblings, and the scheduler does not make use of the cpu_sibling_map. Test
make -j2 with HT disabled, and I bet you get better results than make -j2
with HT enabled....
> P.S. The HT references I found online do not compare HT between 2.4 and
> 2.6, but they all assume improvements in 2.6.
> http://www.linuxworld.com/story/33885.htm
This article is incorrect; the scheduler changes did not make it into 2.5.32.
> http://www-106.ibm.com/developerworks/linux/library/l-htl/
This article discusses the changes needed, but does not state that the changes
are in the 2.5 kernel.
This is still a problem; even the "What to Expect From 2.6" document is incorrect:
http://ftp.kernel.org/pub/linux/kernel/people/davej/misc/post-halloween-2.5.txt
Ingo's latest patch that fixes this is here:
http://people.redhat.com/mingo/O(1)-scheduler/sched-2.5.68-B2
And here for 2.4:
http://people.redhat.com/mingo/O(1)-scheduler/sched-HT-2.4.21-rc7-ac1-A1
Not sure why the FFT results are so much worse on 2.6, but I'm not sure
it has anything to do with HT; maybe it is something else. Can you try
turning off HT and see what happens?
-Andrew Theurer