First, a big thanks to everyone who ran the program. Some of you have
quite impressively diverse machines. I didn't expect results from a VAX! :)
The test was to establish two mappings of the same shared memory at
different virtual addresses, and see if both views were cache
coherent. That means that writes to one view are seen immediately
by reads from the equivalent address in the other view.
(Having multiple views of the same memory is called virtual aliasing.
This test covers only one kind of cache coherency: coherency between
virtual aliases. Other kinds, such as coherency between different
CPUs or with the i-cache, are not the subject of this test.)
The test also measured whether alternating accesses between two
different addresses which view the same memory location were
significantly slower than when the two addresses viewed
different memory. This was intended to detect kernels which implement
cache coherency by marking the pages uncacheable, or by using page faults.
The test also measured whether a tight write/write/read instruction
sequence was cache incoherent, even when a looser instruction sequence
was coherent. This was to detect CPUs which have incoherent write
buffer (aka. store buffer) pipelines in the presence of virtual
aliasing, despite having coherent L1 caches.
Observations
============
Virtual alias performance penalty
.................................
It was a surprise to me to learn of L1 caches which are coherent, but
have a performance penalty when there is virtual aliasing. In
retrospect this is obvious, but I didn't think of it at the time.
This penalty is present on all the AMD chips (x86 clones, x86_64), but
not the Intels (x86s, IA64). It's possible that there's a small
penalty on Intels, lower than the threshold of the test, but I did not
detect any penalty at all on my Intel Celeron.
The file <asm-ia64/shmparam.h> suggests that IA64 will see a
performance penalty for virtual aliases that aren't a multiple of 1MB
apart. None of the results I received indicate a significant penalty.
(The test deems a factor of 2 in access time to be significant).
Write buffers
.............
It was a surprise to discover CPUs which have incoherent write buffers
yet have coherent L1 caches. (This means that a write to a virtual
address which is read from within very few instructions returns the
written data, completely ignoring any intervening write instructions
which write to a different virtual address which maps to the same
memory location.) I didn't expect to find any of these.
CPUs with incoherent write buffers: PA-RISC 2.0, 68040 and ARMs.
Coherent and not coherent CPUs
..............................
As expected, some CPUs don't offer cache coherency between virtual
aliases at small address separations, and some do. Generally:
Virtual alias coherent: x86, IA64, x86_64, PPC, Alpha, VAX, S390
Virtual alias not coherent: Sparc, PA-RISC, ARM, MIPS, m68k, SH
Validity of SHMLBA value
........................
Many CPUs offer virtual cache coherency when the aliases are separated
by a certain CPU-dependent multiple. In principle, all
Linux-supported architectures _should_ have a multiple which makes
virtual aliases coherent, because it's defined in the API as "SHMLBA".
However, on some specific CPUs, no coherent multiple was found.
Valid kernel SHMLBA: Sparc, PA-RISC, MIPS
(plus all the coherent architectures)
SHMLBA not valid: ARM, m68k
SHMLBA not defined: SH
Note that "SHMLBA" is defined for some architectures on which it
doesn't actually provide coherent virtual aliases. On the ARM this is
believed to be due to a chip bug, and very recent kernels may contain
a workaround for it (disabling the write buffer for aliased pages).
No workaround for the m68k was determined during this test.
Note that Glibc's definition of SHMLBA may differ from the kernel's
definition. I'm looking at
glibc-2.3.1/sysdeps/unix/sysv/linux/sparc/bits/shm.h, which defines
SHMLBA to be the same as the page size. The Glibc MIPS definition is
similar.
Moral: you can't trust "SHMLBA" to indicate when virtual aliases are coherent.
Test results
============
Here are all the results that folk sent me.
- "all pass" means the cache is coherent and the timing acceptable
- "SLOW <16k" means the cache is coherent but there's a timing penalty
for alternating rapidly between the same location in two views.
- "PIPELINE FAIL!" means the L1 cache is coherent, but the CPU's
pipeline is not for instructions close together.
- "FAIL <16k" means the cache is not coherent below a certain size.
- "FAIL ALL" means the cache is not coherent at any size.
- "ALL MAPS FAIL" means it wasn't possible to do a MAP_SHARED
mapping of a file at two different addresses. This only happens
with some other operating systems, not GNU/Linux.
i386
====
all pass - x86, unknown type running NetBSD
all pass - x86, unknown type running OpenBSD
all pass - x86, unknown type running SCO
all pass - x86, Intel 90MHz Pentium
all pass - x86, Intel 133MHz Pentium
all pass - x86, Intel 200MHz Pentium MMX
all pass - x86, Intel 200MHz dual Pentium Pro
all pass - x86, Intel 233MHz P2 Klamath
all pass - x86, Intel 300MHz P2 Klamath
all pass - x86, Intel 300MHz P2 Deschutes
all pass - x86, Intel 350MHz P2 Deschutes
all pass - x86, Intel 366MHz Celeron Mendocino
all pass - x86, Intel 400MHz unknown 686 on FreeBSD
all pass - x86, Intel 400MHz P2 Deschutes
all pass - x86, Intel 400MHz dual P2 Deschutes
all pass - x86, Intel 400MHz dual P3 Katmai
all pass - x86, Intel 450MHz dual Xeon running SunOS 5.7
all pass - x86, Intel 466MHz Celeron Mendocino
all pass - x86, Intel 466MHz Celeron Mendocino
all pass - x86, Intel 466MHz dual Celeron Mendocino
all pass - x86, Intel 500MHz Celeron Mendocino
all pass - x86, Intel 500MHz dual P3 Katmai
all pass - x86, Intel 533MHz Celeron Mendocino
all pass - x86, Intel 550MHz dual P3 Katmai
all pass - x86, Intel 668MHz dual P3 Coppermine
all pass - x86, Intel 700MHz P3 Coppermine
all pass - x86, Intel 700MHz P3 Coppermine
all pass - x86, Intel 700MHz Celeron Coppermine
all pass - x86, Intel 800MHz P3 Coppermine
all pass - x86, Intel 850MHz P3 Coppermine +2HT
all pass - x86, Intel 900MHz Celeron Coppermine
all pass - x86, Intel 1.133GHz dual P3
all pass - x86, Intel 1.7GHz P4
all pass - x86, Intel 1.7GHz P4
all pass - x86, Intel 1.7GHz P4
all pass - x86, Intel 1.8GHz P4
all pass - x86, Intel 1.8GHz dual P4 Xeon
all pass - x86, Intel 2.0GHz mobile P4
all pass - x86, Intel 2.4GHz dual P4 Xeon +2HT - cpufreq set to minimum
unsure - x86, Intel 2.4GHz dual P4 Xeon +2HT - cpufreq set to maximum
SLOW <16k - x86, AMD 200MHz K6
SLOW <16k - x86, AMD 233MHz K6-2
SLOW <16k - x86, AMD 300MHz K6-2
SLOW <16k - x86, AMD 300MHz K6 3D
SLOW <16k - x86, AMD 350MHz K6 3D
SLOW <16k - x86, AMD 450MHz K6 3D
SLOW <32k - x86, AMD 750MHz Athlon
SLOW <32k - x86, AMD 800MHz Athlon
SLOW <32k - x86, AMD 900MHz Athlon
SLOW <32k - x86, AMD 900MHz Athlon
SLOW <32k - x86, AMD 1.2GHz Athlon
SLOW <32k - x86, AMD 1.3GHz Duron
SLOW <32k - x86, AMD 1.4GHz Athlon XP 1600+
SLOW <32k - x86, AMD 1.5GHz Athlon XP 1800+
SLOW <32k - x86, AMD 1.5GHz dual Athlon
SLOW <32k - x86, AMD 1.5GHz dual Athlon MP 1800+
SLOW <32k - x86, AMD 1.5GHz dual Athlon MP 1800+
SLOW <32k - x86, AMD 1.5GHz dual Athlon MP 1800+
SLOW <32k - x86, AMD 1.6GHz Athlon XP 1900+
SLOW <32k - x86, AMD 1.6GHz dual Athlon MP 1900+
SLOW <32k - x86, AMD 1.85GHz Athlon XP 2100+
SLOW <32k - x86, AMD 1.8GHz dual Athlon MP 2200+
SLOW <32k - x86, AMD 2.15GHz Athlon XP 2700+
SLOW <32k - x86, AMD 2.1GHz Athlon XP 2800+
IA64
====
all pass - IA64, 800MHz dual Itanium
all pass - IA64, 900MHz dual Itanium 2 in HP ZX6000
all pass - IA64, 900MHz quad Itanium 2
x86_64
======
SLOW <32k - x86_64, AMD 1.8GHz dual Opteron 244
Sparc
=====
all pass - Sparc, TI MicroSparc, 50 bogomips (test takes >1 second)
all pass - Sparc, TI dual SuperSparc II, 50 bogomips
all pass - Sparc, UltraSparc II 296MHz running SunOS 5.6
FAIL <16k - Sparc, TI UltraSparc IIi Spitfire, 539 bogomips
FAIL <16k - Sparc, TI UltraSparc II BlackBird, 600 bogomips
FAIL <16k - Sparc, TI UltraSparc II BlackBird, 600 bogomips
FAIL <16k - Sparc, TI UltraSparc IIi Spitfire, 719 bogomips
PowerPC
=======
all pass - PPC, unknown type, iMac running OS X 10.2, Darwin
all pass - PPC, 604r (PRep Utah Powerstack II Pro4000), 299 bogomips
all pass - PPC, 232MHz 604r (PReP IBM 43P-140 Tiger1)
all pass - PPC, 200MHz 405CR in Ericsson ELN 2XX
all pass - PPC, 333MHz 750 in PowerMac
all pass - PPC, 500MHz G4, 7400 + altivec in PowerMac
all pass - PPC, IBM 440GX Rev. A in Ocotea, 625 bogomips
all pass - PPC, 667MHz G4, 7455 + altivec in PowerBook Titanium 3
PA-RISC
=======
FAIL <128k - PA-RISC, 1.1d PA7100LC 80MHz
PIPELINE FAIL! - PA-RISC, 2.0 Crescendo 550 (9000/800/A500-5X)
PIPELINE FAIL! - PA-RISC, 2.0 Crescendo 550 (9000/800/A500-5X)
(In both of these, L1 FAILs <256k, stores fail <4M, and data cache size is 1M)
ARM
===
all pass - ARM, ARM720T rev 2 (v4l), Philips SAA7752, 47.8 bogomips
FAIL ALL - ARM, SA-110 rev 3 (v4l)
FAIL ALL - ARM, 275MHz SA-110 rev 3, 186 bogomips in Rebel Netwinder
FAIL ALL - ARM, SA-110 rev 3, 262 bogomips, in Rebel NetWinder
SLOW ALL - ARM, SA-1110 rev 8, 147 bogomips in Intel Assabet
SLOW ALL - ARM, Intel XScale-PXA250 rev 3, 397 bogomips
SLOW ALL - ARM, Intel XScale-PXA255 rev 6, 397 bogomips on Zaurus
MIPS
====
all pass - MIPS, R3400A V3.0 40 bogomips in Digital DECstation 5000/240
SLOW <16k - MIPS, R4400SC V4.0 60 bogomips in Digital DECstation 5000/260
SLOW <16k - MIPS, R4000SC V5.0, 75 bogomips, in SGI Indigo2
SLOW <16k - MIPS, R4000SC V6.0, 87 bogomips
FAIL <16k - MIPS, R5000 Nevada V10.0, 250 bogomips, in Cobalt Qube
FAIL <16k - MIPS, R5000 Nevada V10.0, 250 bogomips, in Cobalt Qube
all pass - MIPS, 5Kc V0.1, 160 bogomips in Malta
all pass - MIPS, R10000 V2.6, 195MHz running SGI IRIX64
Alpha
=====
all pass - Alpha, 166MHz LCA4 (1.4sec test)
all pass - Alpha, 500MHz EV56 in Digital AlphaPC 164
SLOW <32k - Alpha, 500MHz dual EV6 in AlphaServer DS20 500 (4sec test)
all pass - Alpha, a Tru64 running OSF1
m68k
====
all pass - m68k, 15.6MHz 68020 + 68851 MMU on Plessey PME 68-22
FAIL ALL - m68k, 19.6MHz 68030, 4.90 bogomips on Motorola MVME147
PIPELINE FAIL! - m68k, 24.8MHz 68040
(Stores fail at all sizes)
all pass - m68k, 50MHz 68060
VAX
===
all pass - VAX 8650
SH
==
FAIL <16k - SH4, 200MHz SH7750 on Sega Dreamcast
FAIL <8k - SH5, 315MHz SH5-101 on Hitachi Cayman
IBM S390
========
all pass - S390, 582 bogomips
all pass - S390, 613 bogomips
OSes that fail to map
=====================
ALL MAPS FAIL - PPC, 332MHz 604e, running AIX
ALL MAPS FAIL - PA-RISC, C360 on HPUX 10.20
ALL MAPS FAIL - PA-RISC, HPPA 9000/785 running HPUX 10.20, all compile flags
ALL MAPS FAIL - x86, unknown type running Windows XP
Enjoy, and thanks again,
-- Jamie
On Wed, Sep 10, 2003 at 10:04:16PM +0100, Jamie Lokier wrote:
> Write buffers
> .............
>
> It was a surprise to discover CPUs which have incoherent write buffers
> yet have coherent L1 caches. (This means that a write to a virtual
> address which is read from within very few instructions returns the
> written data, completely ignoring any intervening write instructions
> which write to a different virtual address which maps to the same
> memory location.) I didn't expect to find any of these.
>
> CPUs with incoherent write buffers: PA-RISC 2.0, 68040 and ARMs.
Some StrongARM CPUs seem to exhibit non-coherence in their write buffers.
I don't think we've done enough testing to make any statement like "ARMs
have incoherent write buffers."
Tests need to be run on a larger proportion of the following:
  * ARM720
    ARM920
    ARM922
    ARM925
    ARM926
    ARM1020
    ARM1022
    ARM1026
  * StrongARM-110 (DEC/Intel)
    StrongARM-1100 (DEC/Intel)
  * StrongARM-1110 (Intel)
  * Xscale (Intel)
    PXA (Intel)
And so far, there are only results for 4 of these devices, with some
revisions of StrongARM-110's passing and others failing.
> Validity of SHMLBA value
> ........................
>
> SHMLBA not valid: ARM, m68k
>
> Note that "SHMLBA" is defined for some architectures on which it
> doesn't actually provide coherent virtual aliases. On the ARM this is
> believed to be due to a chip bug, and very recent kernels may contain
> a workaround for it (disabling the write buffer for aliased pages).
Not correct. Because of the fundamental nature of VIVT caches, there
is no "SHMLBA" value which prevents aliases occurring. Think carefully
about the structure of a VIVT cache and how it would be searched. This
isn't a chip bug.
The kernel works around this, but, due to some bugs on StrongARM chips
in the write buffer, it appears that we need further work-arounds, which
are already implemented.
--
Russell King ([email protected]) http://www.arm.linux.org.uk/personal/
Linux kernel maintainer of:
2.6 ARM Linux - http://www.arm.linux.org.uk/
2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Russell King wrote:
> > CPUs with incoherent write buffers: PA-RISC 2.0, 68040 and ARMs.
>
> Some StrongARM CPUs seem to exhibit non-coherence in their write buffers.
> I don't think we've done enough testing to make any statement like "ARMs
> have incoherent write buffers."
It should be read as "some ARMs", as in you can't rely on an ARM
userspace having coherent write buffers unless you know you're using
the right chip _or_ the right kernel.
Similarly we don't know that all PA-RISCs or all 68040s have
incoherent write buffers, we just know that some of them do.
> > SHMLBA not valid: ARM, m68k
> >
> > Note that "SHMLBA" is defined for some architectures on which it
> > doesn't actually provide coherent virtual aliases. On the ARM this is
> > believed to be due to a chip bug, and very recent kernels may contain
> > a workaround for it (disabling the write buffer for aliased pages).
>
> Not correct. Because of the fundamental nature of VIVT caches, there
> is no "SHMLBA" value which prevents aliases occurring. Think carefully
> about the structure of a VIVT cache and how it would be searched. This
> isn't a chip bug.
I am describing coherence seen by the application. That is a function
of the kernel, not the chip. I'll make that clearer in any next
revision of the document.
The kernel can make virtual aliases coherent on _any_ CPU, using
software techniques if necessary.
Therefore the VIVT cache is not something the application cares about.
It's irrelevant, an implementation detail for the kernel to handle.
In this case, by making aliased pages uncacheable. The application
only cares whether the pages appear coherent, and how fast that is.
The unexpected chip behaviour is the reason why the (old) ARM kernel
doesn't successfully make aliases appear coherent.
> The kernel works around this, but, due to some bugs on StrongARM chips
> in the write buffer, it appears that we need further work-arounds, which
> are already implemented.
In other words, there _is_ an SHMLBA value which prevents alias
incoherence on the ARM, because the kernel implements a workaround, if
it's a recent kernel. That value is one page.
I put ARM in the "don't rely on this" category simply because there
are older kernels in use which don't have the correct workaround.
Otherwise, I would have said ARM has a reliable SHMLBA, of one page.
To the application, this is correct.
There is one thing which puzzles me. The ARM test program outputs
I've been sent say that the cache is incoherent, _not_ just the write
buffers. You said you have results which report write buffers, but I
don't have those in detail, only descriptions.
Does your fix, which makes pages uncacheable and disables write
combining (correct?) only fix your test results which intermittently
reported write buffer problems, or does it fix _all_ the ARM test
results I received, including those which don't report write buffer
problems?
If the former, then I have to say that ARM Linux _in general_ still
doesn't have coherent virtual aliases. If they are all fixed, then
there's a flaw in the test program because it should have been
reporting write buffer problems, not general cache incoherence.
Thanks,
-- Jamie
On Thu, Sep 11, 2003 at 12:37:20AM +0100, Jamie Lokier wrote:
> > The kernel works around this, but, due to some bugs on StrongARM chips
> > in the write buffer, it appears that we need further work-arounds, which
> > are already implemented.
>
> There is one thing which puzzles me. The ARM test program outputs
> I've been sent say that the cache is incoherent, _not_ just the write
> buffers. You said you have results which report write buffers, but I
> don't have those in detail, only descriptions.
Well, this machine, on which my test passes (this was run on a kernel
with the work-around in, but since it detected that the write buffer
works, it does not activate it):
Processor : StrongARM-110 rev 2 (v4l)
never reports problems with the write buffer, and produces a consistent
set of results:
(128) [93,20,4] Test separation: 4096 bytes: FAIL - too slow
(128) [94,20,4] Test separation: 8192 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 16384 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 32768 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 65536 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 131072 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 262144 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 524288 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 1048576 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 2097152 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 4194304 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 8388608 bytes: FAIL - too slow
(128) [93,20,4] Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead
And then there's ARM920T, and this is run on a kernel without
the work-around:
# cat /proc/cpuinfo
Processor : Arm920Tid(wb) rev 0 (v4l)
BogoMIPS : 29.90
Features : swp half thumb
CPU implementer : 0x41
CPU architecture: 4T
CPU variant : 0x1
CPU part : 0x920
CPU revision : 0
Cache type : write-back
Cache clean : cp15 c7 ops
Cache lockdown : format A
Cache format : Harvard
I size : 16384
I assoc : 64
I line length : 32
I sets : 8
D size : 16384
D assoc : 64
D line length : 32
D sets : 8
Hardware : ARM-Integrator
Revision : 0000
Serial : 0000000000000000
# /mnt/src/tests/general/pagealias-v3
(64) [83,29,7] Test separation: 4096 bytes: FAIL - too slow
(64) [83,29,7] Test separation: 8192 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 16384 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 32768 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 65536 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 131072 bytes: FAIL - too slow
(64) [83,29,7] Test separation: 262144 bytes: FAIL - too slow
(64) [83,29,7] Test separation: 524288 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 1048576 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 2097152 bytes: FAIL - too slow
(64) [83,29,7] Test separation: 4194304 bytes: FAIL - too slow
(64) [83,29,7] Test separation: 8388608 bytes: FAIL - too slow
(64) [84,29,7] Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead
> Does your fix, which makes pages uncacheable and disables write
> combining (correct?) only fix your test results which intermittently
> reported write buffer problems, or does it fix _all_ the ARM test
> results I received, including those which don't report write buffer
> problems?
It's relatively simple, and I'm not sure why it's causing such
misunderstanding. Let me try one more time:
ARM caches are VIVT. VIVT caches have inherent aliasing issues. The
kernel works around these issues by marking memory uncacheable where
appropriate, and will continue to do so for VIVT cached ARM CPUs.
Separate issue: Some StrongARM-110 devices contain buggy write buffers.
For these devices, we need to do a little more than we currently
do to ensure that they are coherent, by disabling the write buffer as
well as the cache for these regions. Current kernels do not work
around this issue.
--
Russell King ([email protected]) http://www.arm.linux.org.uk/personal/
Linux kernel maintainer of:
2.6 ARM Linux - http://www.arm.linux.org.uk/
2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Hi Jamie,
Thanks for posting the summary of your experiment - very useful!
* Jamie Lokier <[email protected]> [2003-09-10]:
>
> Validity of SHMLBA value
> ........................
>
> Many CPUs offer virtual cache coherency when the aliases are separated
> by a certain CPU-dependent multiple. In principle, all
> Linux-supported architectures _should_ have a multiple which makes
> virtual aliases coherent, because it's defined in the API as "SHMLBA".
> However, on some specific CPUs, no coherent multiple was found.
>
> Valid kernel SHMLBA: Sparc, PA-RISC, MIPS
> (plus all the coherent architectures)
> SHMLBA not valid: ARM, m68k
> SHMLBA not defined: SH
What's the basis for deciding whether SHMLBA is defined or not? There are
definitions of SHMLBA in include/asm-sh/shmparam.h and
include/asm-sh64/shmparam.h for the kernel. The sh64 /usr/include/asm
headers have effectively the same thing (not identical because the copy
I'm looking at hasn't been synced with the latest kernel sources), and I
assume the sh userland is OK too (haven't checked though).
In both cases the kernel headers are showing correct and useful values :
16k for SH-4 in the sh file (16k direct-mapped, or 32k 2-way associative
on latest devices), 8k for SH-5 in the sh64 file (32k 4-way
associative).
Cheers
Richard
--
Richard \\\ SuperH Core+Debug Architect /// .. At home ..
P. /// [email protected] /// [email protected]
Curnow \\\ http://www.superh.com/ /// http://www.rc0.org.uk
Russell King wrote:
> > Does your fix, which makes pages uncacheable and disables write
> > combining (correct?) only fix your test results which intermittently
> > reported write buffer problems, or does it fix _all_ the ARM test
> > results I received, including those which don't report write buffer
> > problems?
>
> It's relatively simple, and I'm not sure why it's causing such
> misunderstanding. Let me try one more time:
>
> ARM caches are VIVT. VIVT caches have inherent aliasing issues. The
> kernel works around these issues by marking memory uncacheable where
> appropriate, and will continue to do so for VIVT cached ARM CPUs.
That I understand fully.
My question arises because I have 3 SA-110 results which report "cache
not coherent". They do not report "store buffer not coherent". All 3
are Rebel Netwinders, of different bogomips ratings.
The point is: those results _don't_ indicate write buffer problems.
It means that your VIVT explanation and workaround does not explain
those results, so I cannot have confidence that your workaround fixes
those particular ARM devices.
Now, if you can assure me that those results are _definitely_ due to
using very old kernels which don't even mark pages uncacheable, and
with newer kernels those Netwinders would exhibit coherent virtual
aliases, that's great.
Then I'll say that ARM Linux offers coherent virtual aliases on all
known ARM systems, provided they're running a sufficiently new kernel.
Otherwise, I can't say that.
-- Jamie
On Thu, Sep 11, 2003 at 01:35:35PM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > > Does your fix, which makes pages uncacheable and disables write
> > > combining (correct?) only fix your test results which intermittently
> > > reported write buffer problems, or does it fix _all_ the ARM test
> > > results I received, including those which don't report write buffer
> > > problems?
> >
> > > It's relatively simple, and I'm not sure why it's causing such
> > misunderstanding. Let me try one more time:
> >
> > ARM caches are VIVT. VIVT caches have inherent aliasing issues. The
> > kernel works around these issues by marking memory uncacheable where
> > appropriate, and will continue to do so for VIVT cached ARM CPUs.
>
> That I understand fully.
I don't think you do.
> My question arises because I have 3 SA-110 results which report "cache
> not coherent". They do not report "store buffer not coherent". All 3
> are Rebel Netwinders, of different bogomips ratings.
>
> The point is: those results _don't_ indicate write buffer problems.
Maybe those StrongARM chips don't exhibit the write buffer bug? Remember,
I said _SOME_ StrongARM-110 chips exhibit the problem. I did not say
_ALL_ StrongARM-110 chips exhibit the problem.
> It means that your VIVT explanation and workaround does not explain
> those results, so I cannot have confidence that your workaround fixes
> those particular ARM devices.
Well, as far as I'm concerned, I completely believe that I have explained
it entirely, and I still don't know why you're trying to make this more
difficult than it factually is.
> Now, if you can assure me that those results are _definitely_ due to
> using very old kernels which don't even mark pages uncacheable, and
> with newer kernels those Netwinders would exhibit coherent virtual
> aliases, that's great.
Well, once you collect the kernel information and forward it to me, I
can have a look.
--
Russell King ([email protected]) http://www.arm.linux.org.uk/personal/
Linux kernel maintainer of:
2.6 ARM Linux - http://www.arm.linux.org.uk/
2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Richard Curnow wrote:
> > SHMLBA not defined: SH
>
> What's the basis for deciding whether SHMLBA is defined or not? There are
> definitions of SHMLBA in include/asm-sh/shmparam.h and
> include/asm-sh64/shmparam.h for the kernel. The sh64 /usr/include/asm
> headers have effectively the same thing (not identical because the copy
> I'm looking at hasn't been synced with the latest kernel sources), and I
> assume the sh userland is OK too (haven't checked though).
That's a mistake of mine: it's defined in
include/asm-sh/sh*/shmparam.h and I didn't look there.
Userland: I'm reading the Glibc 2.3.1 source and I'm not sure if it
defines suitable SHMLBA for SH4. There's no file in the Glibc tree,
but maybe the SH installation of Glibc includes the kernel header?
-- Jamie
Russell King wrote:
> Maybe those StrongARM chips don't exhibit the write buffer bug? Remember,
> I said _SOME_ StrongARM-110 chips exhibit the problem. I did not say
> _ALL_ StrongARM-110 chips exhibit the problem.
I never assumed they all have the bug. Credit me with at least
reading what you wrote before! :)
The results indicate some StrongARM-110 systems which _don't_ exhibit
the write buffer bug _do_ exhibit some _other_ cause of non-coherence.
> > It means that your VIVT explanation and workaround does not explain
> > those results, so I cannot have confidence that your workaround fixes
> > those particular ARM devices.
>
> Well, as far as I'm concerned, I completely believe that I have explained
> it entirely, and I still don't know why you're trying to make this more
> difficult than it factually is.
I'm thinking the same of you! :)
All I asked is whether _all_ ARMs appear coherent to userspace now, and
you replied with:
> It's relatively simple, and I'm not sure why it's causing such
> misunderstanding. Let me try one more time:
and proceeded to answer a different question from the one I asked.
So, neither of us knows if all ARMs appear coherent to userspace, with
the latest kernel, ...
> Well, once you collect the kernel information and forward it to me, I
> can have a look.
...until we learn what kernel versions the Netwinder folks are
running, or they kindly run the test on a new kernel.
Thanks,
-- Jamie
On Thu, Sep 11, 2003 at 05:25:10PM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > Maybe those StrongARM chips don't exhibit the write buffer bug? Remember,
> > I said _SOME_ StrongARM-110 chips exhibit the problem. I did not say
> > _ALL_ StrongARM-110 chips exhibit the problem.
>
> I never assumed they all have the bug. Credit me with at least
> reading what you wrote before! :)
>
> The results indicate some StrongARM-110 systems which _don't_ exhibit
> the write buffer bug _do_ exhibit some _other_ cause of non-coherence.
Sigh. Let me re-state one more time. If you don't get it this time
around, I can't help you to understand, and I ask that you drop all
information concerning ARM from your document in case you mislead other
parties who may think you're stating definitive information.
It would appear that you've completely forgotten about my previous
statements:
| ARM caches are VIVT. VIVT caches have inherent aliasing issues. The
| kernel works around these issues by marking memory uncacheable where
| appropriate, and will continue to do so for VIVT cached ARM CPUs.
On 1st September, I wrote:
| This looks like an old kernel on your NetWinder. Later 2.4 kernels
| should get this right (by marking the pages uncacheable in user space.)
So this says that there _are_ old kernels which didn't do any fixup
_and_ I pointed out that you were receiving reports from such kernels.
> ...until we learn what kernel versions the Netwinder folks are
> running, or they kindly run the test on a new kernel.
Absolutely - so what _you_ need to do now is to go off to each person
who responded (only _you_ have those details and therefore only _you_
can do this) and _ask_ them the question.
Now, let's rewind back to the original mail. You said:
|> CPUs with incoherent write buffers: PA-RISC 2.0, 68040 and ARMs.
I still claim this is an inaccurate summary, and is misleading - this
says "All ARMs have incoherent write buffers" which is many times removed
from reality.
Continuing:
|> SHMLBA not valid: ARM, m68k
|> On the ARM this is
|> believed to be due to a chip bug, and very recent kernels may contain
|> a workaround for it (disabling the write buffer for aliased pages).
I still claim this description is wrong. You are claiming that all ARMs
contain this bug and the kernel needs to work around it for all ARMs.
This is clearly not the case. In addition, the fact that a previously
undiscovered bug exists does not determine whether SHMLBA is valid or
not. The fact that SHMLBA _must_ be defined (it is not optional) _and_
there exists _no_ value for SHMLBA on the buggy _StrongARM_s does not
mean it is invalid as you are claiming.
--
Russell King ([email protected]) http://www.arm.linux.org.uk/personal/
Linux kernel maintainer of:
2.6 ARM Linux - http://www.arm.linux.org.uk/
2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Russell King wrote:
> On 1st September, I wrote:
> | This looks like an old kernel on your NetWinder. Later 2.4 kernels
> | should get this right (by marking the pages uncacheable in user space.)
>
> So this says that there _are_ old kernels which didn't do any fixup
> _and_ I pointed out that you were receiving reports from such kernels.
I hadn't forgotten what you wrote. What you wrote says I _might_ be
receiving reports from such kernels. Until we get feedback, neither
of us knows for sure.
> Now, let's rewind back to the original mail. You said:
>
> |> CPUs with incoherent write buffers: PA-RISC 2.0, 68040 and ARMs.
>
> I still claim this is an inaccurate summary, and is misleading - this
> says "All ARMs have incoherent write buffers" which is many times removed
> from reality.
I agree with you, it's unhelpfully worded and I will change it.
(Though I don't think that it means what you think it does, which just goes
to show how badly worded it is).
> Continuing:
>
> |> SHMLBA not valid: ARM, m68k
> |> On the ARM this is
> |> believed to be due to a chip bug, and very recent kernels may contain
> |> a workaround for it (disabling the write buffer for aliased pages).
>
> I still claim this description is wrong.
I agree with you, the description is quite misleading.
> You are claiming that all ARMs
> contain this bug and the kernel needs to work around it for all ARMs.
> This is clearly not the case.
I'm not claiming that at all, and never have, but I will change
the wording to be clear. Tell me if this works for you:
An application writer might ask if the value of SHMLBA means
"virtual alias mappings separated by a multiple of
SHMLBA will be coherent with each other".
Let's examine whether that's a reliable assumption.
To avoid any misunderstanding, because people do misunderstand
this, let me be clear: this question has _nothing_ to do with
whether the CPU implements coherence in hardware.
It is possible to implement virtual alias coherence _entirely_
in software, on any CPU with an MMU. Conversely, it's
possible for software to make virtual aliases non-coherent,
even if the CPU architecture is itself coherent. (Some
distributed cluster kernels might do that).
To put it starkly: virtual alias coherence is a function of
the kernel, not the CPU.
Returning to the question: does SHMLBA actually mean that an
application will see virtual alias coherence at that
separation? That can depend on the kernel version, on which
libc the application is compiled with (because they don't all
define a good SHMLBA value), and on exactly which revision of
the CPU is being used. We observe these Linux platforms:
    x86:      SHMLBA ok, all separations coherent
    IA64:     SHMLBA ok, all separations coherent
    x86_64:   SHMLBA ok, all separations coherent
    PPC:      SHMLBA ok, all separations coherent
    Alpha:    SHMLBA ok, all separations coherent
    VAX:      SHMLBA ok, all separations coherent
    S390:     SHMLBA ok, all separations coherent
    Sparc:    SHMLBA ok, with suitable definition
    PA-RISC:  SHMLBA ok, if using kernel headers
    MIPS:     SHMLBA ok, if using kernel headers
    SH:       SHMLBA ok, if using kernel headers
    m68k:     SHMLBA not ok if it's a 68030 or 68040
              else SHMLBA seems ok if it's a 68020 or 68060
    ARM:      SHMLBA ok, if running kernel >= 2.6.0
              else SHMLBA not ok, if running kernel < 2.4.0
              else SHMLBA may be ok depending on chip rev
Note that the kernel headers tend to define good values of
SHMLBA for each architecture where it needs to be different
from the page size. The table above assumes the values from
kernel headers are being used.
Applications are compiled with libc headers, not kernel headers.
It is not clear that Glibc defines good values of SHMLBA for the
Sparc, MIPS or SH architectures. Other libcs may be different
again. I haven't investigated this.
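To see which value an application actually gets from its libc, something
like the following could be compiled on the target. This is only a
minimal sketch: the helper name shmlba_in_pages() is made up for
illustration, and it assumes a POSIX-like system where <sys/shm.h>
defines SHMLBA and sysconf(_SC_PAGESIZE) reports the page size.

```c
#include <stdio.h>
#include <sys/shm.h>
#include <unistd.h>

/*
 * Hypothetical helper: SHMLBA as seen through the libc headers this file
 * was compiled against, expressed in pages. Returns 0 if the page size
 * is unavailable or SHMLBA is not a whole number of pages.
 */
static unsigned long shmlba_in_pages(void)
{
    unsigned long lba = (unsigned long)SHMLBA;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || lba % (unsigned long)page != 0)
        return 0;
    return lba / (unsigned long)page;
}

/* Print the raw values so they can be compared with the kernel headers. */
static void report_shmlba(void)
{
    printf("SHMLBA = %lu bytes, page size = %ld bytes, ratio = %lu\n",
           (unsigned long)SHMLBA, sysconf(_SC_PAGESIZE), shmlba_in_pages());
}
```

A ratio of 1 on an architecture whose kernel headers define a larger
SHMLBA would suggest that the libc value is not the one the table
assumes.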
Moral (pessimistic): Unfortunately, on Linux, SHMLBA is not a fully
reliable guarantee that an application can depend on virtual alias
coherence. It depends on the architecture, and on some
architectures it depends on libc, exact CPU subtype, and/or the
kernel version.
Moral (optimistic): Starting with Linux 2.6.0, all architectures
offer virtual alias coherence for _some_ value of SHMLBA, except
some subtypes of m68k. The m68k port could be fixed with
appropriate kernel code, but that is not implemented at this time.
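Given the pessimistic moral, an application that really depends on
alias coherence could check at run time rather than trust SHMLBA. The
following is a rough sketch of such a check, not the test program from
this thread: the helper name alias_looks_coherent() is hypothetical, it
uses only a loose write/read sequence (so it will not catch the tight
write/write/read store buffer case), and the way it finds a free address
range is racy in multi-threaded programs.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach one SysV segment twice, multiple * SHMLBA
 * apart, then compare writes through one view with reads through the other.
 * Returns 1 if the views always agreed, 0 if a mismatch was seen,
 * -1 if the mappings could not be set up.
 */
static int alias_looks_coherent(size_t multiple)
{
    size_t lba = (size_t)SHMLBA;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || multiple == 0)
        return -1;

    size_t seg = (size_t)page;        /* one page of shared memory */
    size_t gap = multiple * lba;      /* distance between the two views */
    size_t span = gap + seg + lba;    /* room for both views plus alignment */
    int result = -1;

    int id = shmget(IPC_PRIVATE, seg, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    /* Find a free address range, then release it so shmat() can use it. */
    void *hole = mmap(NULL, span, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (hole != MAP_FAILED) {
        munmap(hole, span);
        /* Round up to an SHMLBA boundary, as shmat() requires without SHM_RND. */
        uintptr_t base = ((uintptr_t)hole + lba - 1) / lba * lba;
        void *va = shmat(id, (void *)base, 0);
        void *vb = shmat(id, (void *)(base + gap), 0);

        if (va != (void *)-1 && vb != (void *)-1) {
            volatile unsigned char *a = va, *b = vb;
            result = 1;
            for (unsigned i = 0; i < 1000 && result == 1; i++) {
                unsigned char x = (unsigned char)i, y = (unsigned char)~i;
                a[0] = x;                 /* write via first view */
                if (b[0] != x) result = 0;
                b[0] = y;                 /* write via second view */
                if (a[0] != y) result = 0;
            }
        }
        if (va != (void *)-1) shmdt(va);
        if (vb != (void *)-1) shmdt(vb);
    }
    shmctl(id, IPC_RMID, NULL);       /* segment goes away once detached */
    return result;
}
```

A return value of 1 only means that no incoherence was observed in this
run; a return value of 0 means the kernel's SHMLBA guarantee did not
hold for that separation.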
Russell, feel free to correct the kernel version numbers, or anything
else in the above text.
> In addition, the fact that a previously undiscovered bug exists does
> not determine whether SHMLBA is valid or not.
No, but whether you are running a kernel with the workaround _does_
determine whether SHMLBA is valid or not.
Remember: SHMLBA is not a description of the CPU, it is a description
of the guarantees made by the kernel to userspace.
> The fact that SHMLBA _must_ be defined (it is not optional) _and_
> there exists _no_ value for SHMLBA on the buggy _StrongARM_s does
> not mean it is invalid as you are claiming.
Here I will respectfully disagree. SHMLBA does not represent any
property of the CPU. It has everything to do with whatever illusions
the kernel presents to userspace.
With your fix, _all_ values of SHMLBA are valid on the buggy StrongARMs.
On a similar note, I don't see a problem that SHMLBA must be defined, as
it's always possible for the kernel to make it work. On the ARM you've
done it through cacheability attributes. It can be made to work on the
m68k through changes to page fault handling, if cacheability attributes
aren't available on the m68k.
> > ...until we learn what kernel versions the Netwinder folks are
> > running, or they kindly run the test on a new kernel.
>
> Absolutely - so what _you_ need to do now is to go off to each person
> who responded (only _you_ have those details and therefore only _you_
> can do this) and _ask_ them the question.
That's what I'll do.
Please tell me if the indented earlier text is clear enough for you.
-- Jamie
Hello
On Wed, 10 Sep 2003, Russell King wrote:
> Tests need to be run on a larger proportion of the following:
>
> * ARM720
> ARM920
> ARM922
> ARM925
> ARM926
> ARM1020
> ARM1022
> ARM1026
> * StrongARM-110 (DEC/Intel)
> StrongARM-1100 (DEC/Intel)
> * StrongARM-1110 (Intel)
> * Xscale (Intel)
> PXA (Intel)
>
> And so far, there are only results for 4 of these devices, with some
> revisions of StrongARM-110's passing and others failing.
If I understand correctly, you are saying that you only have results for
the CPUs marked with stars in the above list. In that case, I'll _repeat_
what I've already sent to the thread:
<quote>
Date: Mon, 1 Sep 2003 21:09:57 +0200 (CEST)
From: Guennadi Liakhovetski
Reply-To: Guennadi Liakhovetski <[email protected]>
To: Russell King <[email protected]>
Cc: [email protected], Jamie Lokier <[email protected]>,
Paul J.Y. Lahaie <[email protected]>
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
On
Processor : Intel XScale-PXA250 rev 3 (v5l)
BogoMIPS : 397.31
Features : swp half thumb fastmult edsp
CPU implementor : 0x69
CPU architecture: 5TE
CPU variant : 0x0
CPU part : 0x290
CPU revision : 3
Cache type : undefined 5
Cache clean : undefined 5
Cache lockdown : undefined 5
Cache unified : Harvard
I size : 32768
I assoc : 32
I line length : 32
I sets : 32
D size : 32768
D assoc : 32
D line length : 32
D sets : 32
and
Processor : StrongARM-1100 rev 9 (v4l)
BogoMIPS : 127.38
Features : swp half 26bit fastmult
version 3 of the test consistently reports "Too slow".
</quote>
Both with 2.4.19-rmk7 (pxa with -pxa1) kernels. SA is a Shannon, PXA is a
Triton.
Guennadi
---
Guennadi Liakhovetski
> ...until we learn what kernel versions the Netwinder folks are
> running, or they kindly run the test on a new kernel.
Two of the Netwinders are running 2.4.19-rmk7-nw1, and one is running
2.2.12-19991020.
Are both of these prior to when alias pages were made uncacheable?
Thanks,
-- Jamie
On Fri, Sep 12, 2003 at 01:45:46AM +0100, Jamie Lokier wrote:
> > ...until we learn what kernel versions the Netwinder folks are
> > running, or they kindly run the test on a new kernel.
>
> Two of the Netwinders are running 2.4.19-rmk7-nw1, and one is running
> 2.2.12-19991020.
>
> Are both of these prior to when alias pages were made uncacheable?
2.2.12 is certainly too old for the fixup. 2.4.19-rmk7-based kernels
have the fixup.
--
Russell King ([email protected]) http://www.arm.linux.org.uk/personal/
Linux kernel maintainer of:
2.6 ARM Linux - http://www.arm.linux.org.uk/
2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core