Hi all,
I would like to be personally CC'ed the answers/comments posted to the
list in response to my posting.
I have the following CPU:
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 1
cpu MHz : 3000.000
cache size : 1024 KB
On 2.6.17 everything works as I expect. My Mandriva distribution ships
2.6.22.9 as its standard kernel, which performed badly (numbers
follow); a freshly installed 2.6.24.2 kernel also performed badly, so
I installed 2.6.17 and everything works OK.
I have some benchmarks from mplayer:
Kernel 2.6.22.9 smp hyperthreading
BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s
Kernel 2.6.22.9 nonsmp/hyperthreading
BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s
with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU
BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s
with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel
BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s
The 2.6.24.2 kernel gave results similar to the 2.6.22.9 version.
Regards Henk Schoneveld
Hello Henk,
On Fri, Feb 22, 2008 at 10:36:01AM +0100, belcampo wrote:
> Kernel 2.6.22.9 smp hyperthreading
> BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s
> Kernel 2.6.22.9 nonsmp/hyperthreading
> BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s
> with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU
> BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s
> with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel
> BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s
I'm not familiar with mplayer benchmarks; what do they actually
measure?
Regards,
Frederik
(I'm aware that this could be considered thread necromancy, but I
haven't yet seen any indication that that is considered a bad thing in
these here parts; if it is, then I apologize, and upon being informed of
the fact will undertake to not commit such again.)
On 2/22/2008 5:06 AM, Frederik Deweerdt wrote:
> Hello Henk,
>
> On Fri, Feb 22, 2008 at 10:36:01AM +0100, belcampo wrote:
>
>> Kernel 2.6.22.9 smp hyperthreading
>> BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s
>> Kernel 2.6.22.9 nonsmp/hyperthreading
>> BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s
>> with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU
>> BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s
>> with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel
>> BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s
>
> I'm not familiar with mplayer benchmarks, what do they actually
> measure?
I don't know if this discussion got continued privately, but on the
assumption that it didn't, I think I can give at least a basic answer to
this.
The VC: value is the amount of time spent in the video-codec code
during that run, and the VO: value is the amount of time spent in the
video-output code. The A: value is (ISTR) the amount of time spent in
audio processing, though whether that covers the codec, the audio
output, audio filters, etc. is unclear; I remember there being separate
values for those rather than their being lumped under one header. The
Sys: value is, I believe, the amount of time spent in system calls.
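Conceptually (and this is only my own sketch, not MPlayer's actual
code), that sort of accounting just wraps each stage in timestamps and
accumulates the differences, roughly like:

#include <stdio.h>
#include <sys/time.h>

/* hypothetical stand-ins for the real decode/output calls */
static void decode_frame(void) { /* ... */ }
static void display_frame(void) { /* ... */ }

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
        double vc_time = 0.0, vo_time = 0.0, t;
        int frame;

        for (frame = 0; frame < 1000; frame++) {
                t = now();
                decode_frame();          /* time billed to VC: */
                vc_time += now() - t;

                t = now();
                display_frame();         /* time billed to VO: */
                vo_time += now() - t;
        }
        printf("BENCHMARKs: VC: %.3fs VO: %.3fs\n", vc_time, vo_time);
        return 0;
}
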
(For the record: I'm a long-time lurker and occasional, largely
non-code, contributor on the MPlayer development lists, but I've never
had occasion to look at the code behind or the logic involved in the
-benchmark output.)
--
Andrew Buehler
Andrew Buehler wrote:
> (I'm aware that this could be considered thread necromancy, but I
> haven't yet seen any indication that that is considered a bad thing in
> these here parts; if it is, then I apologize, and upon being informed of
> the fact will undertake to not commit such again.)
>
> On 2/22/2008 5:06 AM, Frederik Deweerdt wrote:
>
>> Hello Henk,
>>
>> On Fri, Feb 22, 2008 at 10:36:01AM +0100, belcampo wrote:
>>
>>> Kernel 2.6.22.9 smp hyperthreading
>>> BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s
>>> Kernel 2.6.22.9 nonsmp/hyperthreading
>>> BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s
>>> with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU
>>> BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s
>>> with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel
>>> BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s
>>
>> I'm not familiar with mplayer benchmarks, what do they actually measure?
>
> I don't know if this discussion got continued privately, but on the
> assumption that it didn't, I think I can give at least a basic answer to
> this.
>
> The VC: value is the amount of time spent in the video-codec code during
> that run, the VO: value is the amount of time spent in the video-output
> code, the A: is the amount of time spent in (ISTR) audio processing -
> though whether codec or audio-output or audio filters etc. is unclear, I
> remember there being separate values for those rather than their being
> lumped under one header- and the Sys: value is I believe the amount of
> time spent in system calls.
>
> (For the record: I'm a long-time lurker and occasional, largely
> non-code, contributor on the MPlayer development lists, but I've never
> had occasion to look at the code behind or the logic involved in the
> -benchmark output.)
>
Turning on hyperthreading effectively halves the amount of cache
available for each logical CPU when both are doing work, which can do
more harm than good. Number-crunching applications that utilize the
cache effectively generally don't benefit from hyperthreading,
particularly floating-point-intensive ones.
On the other hand, hyperthreading is excellent for streaming integer
work, like compiling. Whether or not you should use it depends entirely
on your workload.
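A crude way to see the effect is a toy loop whose working set roughly
fits your L2 when run alone (the size below assumes a 1 MB cache and is
only an example):

#include <stdio.h>
#include <stdlib.h>

/* Working set sized to roughly fit a 1 MB L2 when running alone; two
 * copies on HT siblings have to share that cache and start thrashing. */
#define WORKING_SET (768 * 1024)
#define PASSES      2000

int main(void)
{
        size_t n = WORKING_SET / sizeof(int);
        int *buf = malloc(n * sizeof(int));
        size_t i;
        int pass;
        long sum = 0;

        if (!buf)
                return 1;
        for (i = 0; i < n; i++)
                buf[i] = i;
        for (pass = 0; pass < PASSES; pass++)
                for (i = 0; i < n; i++)
                        sum += buf[i];
        printf("sum = %ld\n", sum);
        free(buf);
        return 0;
}

Time one copy alone, then two copies pinned to the sibling logical
CPUs of one core (taskset -c does the pinning; check which CPU numbers
are actually siblings on your box), and compare the per-copy times.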
-- Chris
Chris Snook <[email protected]> writes:
>
> Turning on hyperthreading effectively halves the amount of cache
> available for each logical CPU when both are doing work, which can do
> more harm than good.
When the two cores are in the same address space (as in being two
threads of the same process), the L1 cache will be shared on P4. I
think for the other cases the cache management is also a little more
sophisticated than a simple split, depending on which HT generation
you're talking about (Intel had at least 4 generations out, each with
improvements over the earlier ones).
BTW, your argument would in theory also be true for multi-core with
shared L2 or L3, but even there the CPUs tend to be more sophisticated.
E.g. Core2 has a mechanism called "adaptive cache" which allows one
core to use significantly more of the L2 in some cases.
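If your kernel and CPU export the cache topology in sysfs (it relies
on the deterministic cache parameters CPUID leaf, so older P4s may not
have it), you can see exactly which logical CPUs share which cache
level. A minimal reader, just to show where the information lives:

#include <stdio.h>
#include <string.h>

/* Dump cache level and sharing mask for CPU 0 from sysfs, where
 * available, using the standard /sys/devices/system/cpu/.../cache layout. */
int main(void)
{
        char path[128], level[64], shared[256];
        int idx;

        for (idx = 0; ; idx++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu0/cache/index%d/level",
                         idx);
                f = fopen(path, "r");
                if (!f)
                        break;                  /* no more cache levels */
                if (!fgets(level, sizeof(level), f))
                        level[0] = '\0';
                fclose(f);
                level[strcspn(level, "\n")] = '\0';

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_map",
                         idx);
                f = fopen(path, "r");
                if (!f)
                        break;
                if (!fgets(shared, sizeof(shared), f))
                        shared[0] = '\0';
                fclose(f);
                shared[strcspn(shared, "\n")] = '\0';

                printf("index%d: L%s shared_cpu_map=%s\n", idx, level, shared);
        }
        return 0;
}
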
> Number-crunching applications that utilize the
> cache effectively generally don't benefit from hyperthreading,
> particularly floating-point-intensive ones.
That sounds like a far too broad overgeneralization to me.
-Andi (who personally always liked HT)
Hi all,
Back to basics:
Kernel 2.6.22.9 smp/hyperthreading needs 338.143s,
while the 2.6.17 kernel smp/hyperthreading needs 247.704s,
for exactly the same job on the same machine.
For me it's not about HT vs. non-HT.
Henk Schoneveld
Hi Andi,
On Fri, Mar 07, 2008 at 08:20:32PM +0100, Andi Kleen wrote:
> Chris Snook <[email protected]> writes:
> >
> > Turning on hyperthreading effectively halves the amount of cache
> > available for each logical CPU when both are doing work, which can do
> > more harm than good.
>
> When the two cores are in the same address space (as in being two
> threads of the same process) L1 cache will be shared on P4. I think
> for the other cases the cache management is also a little more
> sophisticated than a simple split, depending on which HT generation
> you're talking about (Intel had at least 4 generations out, each with
> improvements over the earlier ones)
Oh that's quite interesting to know.
> BTW your argument would be in theory true also for multi core with
> shared L2 or L3, but even there the CPUs tend to be more sophisticated.
> e.g. Core2 has a mechanism called "adaptive cache" which allows one
> Core to use significantly more of the L2 in some cases.
>
> > Number-crunching applications that utilize the
> > cache effectively generally don't benefit from hyperthreading,
> > particularly floating-point-intensive ones.
>
> That sounds like a far too broad over generalization to me.
>
> -Andi (who personally always liked HT)
Well, in my experience, except for compiling, HT has always caused
massive slowdowns, especially on network-intensive applications.
Basically, network perf took a 20-30% hit, while compiling took a
20-30% boost. But I must admit that I never tried HT on anything
more recent than a P4; maybe things have changed since.
regards,
willy
> Well, in my experience, except for compiling, HT has always caused
> massive slowdowns, especially on network-intensive applications.
> Basically, network perf took a 20-30% hit, while compiling took
What network workload? Networking tends to have a lot of cache misses,
and unless you're exceeding your memory bandwidth, HT normally does
well on such workloads because it can do other things while the
CPU is waiting for loads.
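The classic demonstration is a latency-bound pointer chase, where
nearly every access misses and a sibling thread gets plenty of stall
cycles to use. A rough toy version (sizes are arbitrary):

#include <stdio.h>
#include <stdlib.h>

/* Walk a full-cycle permutation of a ~64 MB array with a huge stride,
 * so essentially every dependent load misses the caches (and TLB). */
#define ELEMS (16u * 1024 * 1024)

int main(void)
{
        unsigned *next = malloc(ELEMS * sizeof(unsigned));
        unsigned i, pos = 0;
        unsigned long step;

        if (!next)
                return 1;
        for (i = 0; i < ELEMS; i++)
                next[i] = (i + 9999991u) % ELEMS;  /* odd offset => full cycle */
        for (step = 0; step < 20UL * 1000 * 1000; step++)
                pos = next[pos];                   /* stalls on memory */
        printf("final index: %u\n", pos);
        free(next);
        return 0;
}

Run one copy alone and then one per HT sibling; the combined rate
should drop far less than it would for a cache-resident compute loop.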
> 20-30% boost. But I must admit that I never tried HT on anything
> more recent than a P4, maybe things have changed since.
There's nothing more recent out yet (unless you're talking non x86),
but there were many different P4 generations. In particular Prescott
(90nm) was quite different from the earlier ones, but even before
and after there were some improvements and changes.
-Andi
On Sat, Mar 08, 2008 at 12:46:55PM +0100, Andi Kleen wrote:
> > Well, in my experience, except for compiling, HT has always caused
> > massive slowdowns, especially on network-intensive applications.
> > Basically, network perf took a 20-30% hit, while compiling took
>
> What network workload?
high session rate HTTP traffic. That means high packet rates, high
session lookup rates, etc...
> Networking tends to have a lot of cache misses
> and unless you're exceeding your memory bandwidth HT normally does
> well on such workloads because it can do other things while the
> CPU is waiting for loads.
On SMP, the load is generally divided with user-space on one CPU
and IRQs on the other, though not well balanced (the IRQ side gets
less), which means that SMP is rarely more than 50-60% faster than
UP. On HT, I normally observe lower performance than on UP.
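To be concrete, that split is just manual pinning: the userspace
process gets bound to one CPU (sched_setaffinity() or taskset) and the
NIC interrupt is steered to the other via /proc/irq/<n>/smp_affinity.
A bare-bones sketch of the userspace side (the IRQ number in the
comment is only an example):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the current process to CPU 0; the NIC interrupt can then be
 * steered to CPU 1 by writing a mask to /proc/irq/<n>/smp_affinity,
 * e.g. "echo 2 > /proc/irq/16/smp_affinity" (16 is just an example). */
int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }
        printf("pid %d pinned to CPU 0\n", getpid());
        /* ... run the network workload from here ... */
        return 0;
}
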
> > 20-30% boost. But I must admit that I never tried HT on anything
> > more recent than a P4, maybe things have changed since.
>
> There's nothing more recent out yet (unless you're talking non x86),
> but there were many different P4 generations. In particular Prescott
> (90nm) was quite different from the earlier ones, but even before
> and after there were some improvements and changes.
OK. Amusingly, the HT flag is present on my C2D E8200:
model name : Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
Cheers,
willy
> On HT, I normally observe lower performance than on UP.
Hmm weird. It might be interesting to investigate in detail
what is going on there.
> model name : Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
Dual core systems generally have it. It leads to better scheduling
on some older OSes because in many aspects dual core is nearer to HT
than to a true dual-socket system. There was no traditional way to
express "core siblings" in CPUID, so they just faked HT again, but
added some additional ways to detect real dual-coreness. AMD does it
similarly (but slightly differently). Of course modern kernels don't
need such hacks anymore.
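For the curious, both pieces of information are visible straight from
CPUID; a rough sketch with gcc on x86-64 (real code should check the
maximum supported leaf before touching leaf 4, and AMD encodes the
core count elsewhere):

#include <stdio.h>

/* Raw CPUID, good enough for a quick illustration with gcc on x86-64. */
static void cpuid(unsigned leaf, unsigned subleaf,
                  unsigned *a, unsigned *b, unsigned *c, unsigned *d)
{
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(subleaf));
}

int main(void)
{
        unsigned a, b, c, d, htt, logical, cores;

        cpuid(1, 0, &a, &b, &c, &d);
        htt     = (d >> 28) & 1;        /* the "ht" flag in /proc/cpuinfo */
        logical = (b >> 16) & 0xff;     /* logical processors per package */

        /* Intel: leaf 4 (deterministic cache parameters) also encodes
         * the number of cores per package. */
        cpuid(4, 0, &a, &b, &c, &d);
        cores = ((a >> 26) & 0x3f) + 1;

        printf("HTT flag %u, %u logical / %u cores per package => "
               "%u sibling(s) per core\n",
               htt, logical, cores, cores ? logical / cores : logical);
        return 0;
}
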
-Andi