Hi,
While investigating the inordinate performance impact one of my patches
seemed to be having, we tracked it down to two hlist_for_each_entry
loops, and finally to the prefetch instruction in the loop.
The machine I'm testing on has four 1.5GHz POWER5 CPUs and 16GB of RAM.
I was mostly using dbench (v3.03) in runs of 50 and 100 on an ext2
filesystem. The kernel was 2.6.11-rc5.
I've not had much of a chance to test on x86, but the few tests I've run
have shown that prefetch does improve performance there. From what I've
seen this seems to be a ppc (perhaps ppc64) specific symptom.
Following are two sets of interesting results on the ppc64 system. The
first is on a stock 2.6.11-rc5 kernel. The actual stock kernel gave the
following results for 100 runs of dbench:
# elements: 100, mean 862.580380, variance 5.973441, std dev 2.444062
When I patched fs/dcache.c to replace the three hlist_for_each_entry{,_rcu}
loops with manual loops, as shown in the attached file dcache-nohlist.patch,
I got:
# elements: 50, mean 881.804980, variance 10.695022, std dev 3.270325
The next set of results is based on 2.6.11-rc5 with the LSM stacking
patches (from http://www.sf.net/projects/lsm-stacker). I was understandably
alarmed to find the original patched version gave me:
# elements: 100, mean 797.654870, variance 7.503588, std dev 2.739268
The code which I determined to be responsible contained two
list_for_each_entry loops. Replacing one with a manual loop gave me:
# elements: 50, mean 835.859980, variance 81.901719, std dev 9.049957
and replacing the second gave me
# elements: 50, mean 846.541060, variance 17.095401, std dev 4.134658
Finally I followed Paul McKenney's suggestion and just commented out the
ppc definition of prefetch altogether, which gave me:
# elements: 50, mean 860.823880, variance 47.567428, std dev 6.896914
I am currently testing this same patch against a non-stacking kernel.
thanks,
-serge
Serge E. Hallyn writes:
> While investigating the inordinate performance impact one of my patches
> seemed to be having, we tracked it down to two hlist_for_each_entry
> loops, and finally to the prefetch instruction in the loop.
I would be interested to know what results you get if you leave the
loops using hlist_for_each_entry but change prefetch() and prefetchw()
to do the dcbt or dcbtst instruction only if the address is non-zero,
like this:
static inline void prefetch(const void *x)
{
	if (x)
		__asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
}

static inline void prefetchw(const void *x)
{
	if (x)
		__asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x));
}
It seems that doing a prefetch on a NULL pointer, while it doesn't
cause a fault, does waste time looking for a translation of the zero
address.
Paul.
On Wed, 30 Mar 2005 13:55:25 +1000, Paul Mackerras <[email protected]> wrote:
> Serge E. Hallyn writes:
>
> > While investigating the inordinate performance impact one of my patches
> > seemed to be having, we tracked it down to two hlist_for_each_entry
> > loops, and finally to the prefetch instruction in the loop.
>
> I would be interested to know what results you get if you leave the
> loops using hlist_for_each_entry but change prefetch() and prefetchw()
> to do the dcbt or dcbtst instruction only if the address is non-zero,
> like this:
>
> [code snipped]
>
> It seems that doing a prefetch on a NULL pointer, while it doesn't
> cause a fault, does waste time looking for a translation of the zero
> address.
>
> Paul.
I don't know exactly about POWER5, but the G5 processor is described in
IBM docs as doing automatic whole-page prefetch read-ahead when it
detects linear accesses.
--
Greetz, Antonio Vargas aka winden of network
http://wind.codepixel.com/
Things are not what they seem, except when they seem what they really are.
Antonio Vargas writes:
> I don't know exactly about POWER5, but the G5 processor is described in
> IBM docs as doing automatic whole-page prefetch read-ahead when it
> detects linear accesses.
Sure, but linked lists would rarely be laid out linearly in memory.
Paul.
Quoting Paul Mackerras ([email protected]):
> Serge E. Hallyn writes:
>
> > While investigating the inordinate performance impact one of my patches
> > seemed to be having, we tracked it down to two hlist_for_each_entry
> > loops, and finally to the prefetch instruction in the loop.
>
> I would be interested to know what results you get if you leave the
> loops using hlist_for_each_entry but change prefetch() and prefetchw()
> to do the dcbt or dcbtst instruction only if the address is non-zero,
> like this:
>
> [code snipped]
>
> It seems that doing a prefetch on a NULL pointer, while it doesn't
> cause a fault, does waste time looking for a translation of the zero
> address.
Hi,
Olof Johansson had suggested that earlier, except that his patch used

	if (unlikely(!x))
		return;
Performance was quite good, but not as good as having prefetch completely
disabled. I got
# elements: 50, mean 851.263680, variance 24.561146, std dev 4.955920
compared to 860.823880 stdev 6.896914 with prefetch disabled.
thanks,
-serge