2003-08-29 05:35:53

by Jamie Lokier

Subject: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Dear All,

I'd appreciate it if folks would run the program below on various
machines, especially those whose caches aren't automatically coherent
at the hardware level.

It searches for the address multiple that an application can use to
get coherent multiple mappings of shared memory with good performance.
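
For anyone who wants the idea without reading the whole program, here is a minimal sketch of a double mapping. It is not the code used by the test: the names `map_twice' and `alias_visible' are made up for illustration, it assumes a writable /tmp, and it uses `mkstemp' instead of the randomised file name the real program generates.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size' bytes of one unlinked temporary file
   at two addresses `separation' bytes apart.  Both arguments are
   assumed to be multiples of the page size, with size <= separation.
   Returns the lower address, or NULL on failure.  */
static char *
map_twice (size_t size, size_t separation)
{
  char name[] = "/tmp/alias-XXXXXX";
  char *base;
  int fd = mkstemp (name);

  if (fd == -1)
    return NULL;
  unlink (name);                /* The file lives on until it is unmapped.  */
  if (ftruncate (fd, (off_t) size) != 0)
    {
      close (fd);
      return NULL;
    }

  /* Reserve contiguous address space, then map over both ends of it.  */
  base = mmap (NULL, separation + size, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    {
      close (fd);
      return NULL;
    }
  if (mmap (base, size, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED
      || mmap (base + separation, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
    {
      munmap (base, separation + size);
      close (fd);
      return NULL;
    }
  close (fd);
  return base;
}

/* Write 1 through the first alias, 0 through the second, then read the
   first again.  Returns non-zero if the second write was seen, which is
   what coherent aliases should give.  */
static int
alias_visible (volatile char *first, volatile char *second)
{
  *first = 1;
  *second = 0;
  return *first == 0;
}
```

Calling `map_twice (page, 8 * page)' and then `alias_visible (base, base + 8 * page)' does one write/write/read round, the same check the full program repeats and times for each separation.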

I want this information for three reasons:

1. To check it correctly detects archs which page fault for
coherency or aren't coherent.
2. To check the timing test is robust, both for 1. and for
detecting archs where the hardware is coherent but slows
down (see Athlon below).
3. To check this is reliable enough to use at run time in an app.

I already got a surprise (to me): my Athlon MP is much slower
accessing multiple mappings which are within 32k of each other, than
mappings which are further apart, although it is coherent. The L1
data cache is 64k. (The explanation is easy: virtually indexed,
physically tagged cache moves data among cache lines, possibly via L2).
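
To put a number on that, here is a rough sketch of the cache colour arithmetic. It assumes the 64k L1 data cache is 2-way set associative, which is not stated above, and the function names are only for illustration. With those figures one way spans 32k, which would match the 32k boundary I see.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes indexed by one way of a set-associative cache.  Two virtual
   addresses pick the same cache set when they are congruent modulo
   this value.  */
static size_t
cache_way_size (size_t cache_size, unsigned ways)
{
  assert (ways > 0);
  return cache_size / ways;
}

/* Number of page colours: how many pages fit in one way.  When a way
   is no larger than a page, every page has the same colour.  */
static size_t
cache_colours (size_t cache_size, unsigned ways, size_t page_size)
{
  size_t way = cache_way_size (cache_size, ways);

  assert (page_size > 0);
  return way > page_size ? way / page_size : 1;
}
```

With the assumed figures, `cache_way_size (65536, 2)' is 32768 and `cache_colours (65536, 2, 4096)' is 8, so aliases would need to be a multiple of 8 pages apart to index the same cache lines.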

This suggests scope for improving x86 kernel performance in the areas
of kmap() and shared library / executable mappings, by good choice of
_virtual_ addresses. This doesn't require a cache colouring
page allocator, so maybe it's a new avenue?
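
As a sketch of what a "good choice" could mean, the helper below picks the lowest address at or above a hint that has the same colour as an existing mapping. The name `same_colour_at_or_above' and the idea of passing the alias span as a parameter are my own illustration, not anything the kernel does today.

```c
#include <assert.h>
#include <stdint.h>

/* Return the lowest address >= `hint' that is congruent to `target'
   modulo `span'.  `span' must be a power of two, for example the
   minimum fast spacing reported by the test program.  Overflow near
   the top of the address space is not handled.  */
static uintptr_t
same_colour_at_or_above (uintptr_t hint, uintptr_t target, uintptr_t span)
{
  uintptr_t mask = span - 1;
  uintptr_t candidate;

  assert (span != 0 && (span & mask) == 0);

  /* Keep the upper bits of the hint, take the colour bits of the target.  */
  candidate = (hint & ~mask) | (target & mask);
  if (candidate < hint)
    candidate += span;
  return candidate;
}
```

For example, with a span of 0x8000, a hint of 0x40001000 and an existing mapping at 0x08005000, the result is 0x40005000: both addresses have the same low 15 bits.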

Anyway, could lots of people please run the program and post the output +
/proc/cpuinfo. Compile with optimisation; -O or -O2 is fine. (You
can add -DHAVE_SYSV_SHM too if you like):

gcc -o test test.c -O2
time ./test
cat /proc/cpuinfo

Thanks a lot :)
-- Jamie

==============================================================================

/* This code maps shared memory to multiple addresses and tests it
for cache coherency and performance.

Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals. These are used for when a
race condition might leave a temporary file that should have been
deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
sigset_t all_signals;
sigfillset (&all_signals);
sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
function, or using a regular temporary file in /tmp. Immediately
after opening the file, it is unlinked from the global namespace
using `shm_unlink' or `unlink'.

On success, the value returned is a file descriptor. Otherwise, -1
is returned and `errno' is set.

The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
On GNU/Linux with Glibc, it requires `-lrt'. Unfortunately, Glibc's
-lrt insists on linking to pthreads, which we may not want to use
because that enables thread locking overhead in other functions. So
we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/" /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
char * ptr, buffer [19];
int fd, i;
unsigned long number;
sigset_t save_signals;
struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
struct statfs sfs;
if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
{
errno = ENOSYS;
return -1;
}
#endif

loop:
/* Print a randomised path name into `buffer'. The string depends on
the directory and whether we are using POSIX.4 shared memory or a
regular temporary file. RANDOM is a 5-digit, base-62
representation of a pseudo-random number. The string is used as a
candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
strcpy (buffer, "/tmp/shm-");
#endif
ptr = buffer + strlen (buffer);
gettimeofday (&tv, (struct timezone *) 0);
number = (unsigned long) random ();
number += (unsigned long) getpid ();
number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
for (i = 0; i < 5; i++)
{
/* Don't use character arithmetic, as not all systems are ASCII. */
*ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
number /= 62;
}
*ptr = '\0';

/* Block signals between the open and unlink, to really minimise
the chance of accidentally leaving an unwanted file around. */
block_signals (&save_signals);
#if HAVE_SHM_OPEN
if (!use_tmp_file)
{
fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
shm_unlink (buffer);
}
else
#endif /* HAVE_SHM_OPEN */
{
fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
unlink (buffer);
}
unblock_signals (&save_signals);

/* If we failed due to a name collision or a signal, try again. */
if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
goto loop;

return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
region will not be allocated for any other purpose. It is freed with
`munmap'.

Returns the mapped base address on success. Otherwise, MAP_FAILED is
returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0
#endif
#ifndef MAP_FILE
#define MAP_FILE 0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE 0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED ((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
void * addr;
#ifdef MAP_ANONYMOUS
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_ANONYMOUS
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else /* not defined MAP_ANONYMOUS */
int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
if (zero_fd == -1)
return MAP_FAILED;
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_FILE
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
save_errno = errno;
close (zero_fd);
errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
a temporary regular file.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
void * base_addr, * addr;
int fd, i, save_errno;
struct stat st;

fd = open_shared_memory_file (use_tmp_file);
if (fd == -1)
goto fail;

/* First, resize the shared memory file to the desired size. */
if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
goto close_fail;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto close_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
addr = mmap ((char *) base_addr + (i ? separation : 0), size,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
fd, (off_t) 0);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}
if (close (fd) != 0)
goto unmap_fail;

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
close_fail:
save_errno = errno;
close (fd);
errno = save_errno;
fail:
return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
void * base_addr, * addr;
sigset_t save_signals;
int shmid, i, save_errno;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Block signals between the shmget() and IPC_RMID, to minimise the chance
of accidentally leaving an unwanted shared segment around. */
block_signals (&save_signals);

shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
if (shmid == -1)
goto unmap_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
/* `shmat' is tried twice. The first time it can fail if the local
implementation of `shmat' refuses to map over a region mapped
with `mmap'. In that case, we punch a hole using `munmap' and
do it again.

If the local `shmat' has this property, the `shmat' calls
to fixed addresses might collide with a concurrent thread
which is also doing mappings, and will fail. At least it
is a safe failure.

On the other hand, if the local `shmat' can map over
already-mapped regions (in the same way that `mmap' does), it
is essential that we do actually use an already-mapped region,
so that collisions with a concurrent thread can't possibly
result in both of us grabbing the same address range with no
indication of error. */
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
if (addr == (void *) -1 && errno == EINVAL)
{
munmap ((char *) base_addr + (i ? separation : 0), size);
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
}

/* Check for errors. */
if (addr == (void *) -1)
{
save_errno = errno;
if (i == 1)
shmdt (base_addr);
goto remove_shm_fail_se;
}
else if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `shmat' ignored the requested address! */
if (i == 1)
shmdt (base_addr);
save_errno = EINVAL;
goto remove_shm_fail_se;
}
}

if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
goto remove_shm_fail;
unblock_signals (&save_signals);

/* Success! */
return base_addr;

/* Failure. */
remove_shm_fail:
save_errno = errno;
remove_shm_fail_se:
while (--i >= 0)
shmdt ((char *) base_addr + (i ? separation : 0));
shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
errno = save_errno;
unmap_fail:
save_errno = errno;
unblock_signals (&save_signals);
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer. Shared memory of size `size' is
mapped twice, with the difference between the two addresses being
`separation', which must be at least `size'. The total address range
used is `separation + size' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
void * addr;
if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
{
errno = EINVAL;
return 0;
}

/* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
*method = 0;
if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
return addr;
#endif
#if HAVE_SYSV_SHM
*method = 1;
if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
return addr;
#endif
*method = 2;
return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
`page_alias_map'. `address' is the base address, and `size' and
`separation' are the arguments previously passed to
`__page_alias_map'. `method' is the value previously stored in *METHOD.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
if (method == 1)
{
shmdt (address);
shmdt ((char *) address + separation);
if (separation > size)
munmap ((char *) address + size, separation - size);
return 0;
}
#endif

return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer. `size' is the size of the buffer to
create; it will be mapped twice to cover a total address range
`size * 2' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
`page_alias_map'. `address' is the base address, and `size' is the
size of the buffer (which is half of the total mapped address range).
`method' is a value previously stored in *METHOD by `page_alias_map'.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
aliased pages. We use a combination of mappings similar to
page_alias_*(), in case there are resource limitations which would
prevent malloc() or a single mmap() working for the larger address
range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
void * base_addr, * addr;
int i, save_errno;

if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Map anonymous memory at the different addresses. */
for (i = 0; i < 2; i++)
{
addr = map_address_space ((char *) base_addr + (i ? separation : 0),
size, 1);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

/* This should be a word size that the architecture can read and write
fast in a single instruction. In principle, C's `int' is the natural
word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
also act as compiler memory barriers. These are used to force a
group of write/write/read instructions as close together as possible,
to maximise the detection of store buffer conditions. Despite being
asm statements, these will work with any of GCC's target architectures,
provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
__asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
__asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
: "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
reads within a few instructions, and ignores virtual to physical
address translations when doing that. In principle a CPU might do
this even if its L1 cache is physically tagged or indexed, although
I have not seen such a system. (A CPU which uses store buffer
snooping and with an off-board MMU, which the CPU is unaware of,
could have this property).

It isn't possible to do this test perfectly; we do our best. The
`force_into_register' macros ensure that the write/write/read
sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
register WORD __reg1 = 1, __reg2 = 0;
force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
*__regptr1 = __reg1;
*__regptr2 = __reg2;
__reg1 = *__regptr1;
force_into_register (__reg1);
return __reg1;
}

/* This function tests whether writes to one page are seen in another
page at a different virtual address, and whether they are nearly as
fast as normal writes.

The accesses are timed by the caller of this function.
Alternate writes go to alternate pages, so that if aliasing is
implemented using page faults, it will clearly show up in the
timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
WORD fail = 0;
while (--timing_loops >= 0)
fail |= test_store_buffer_snoop (ptr1, ptr2);
return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
buffer snoop coherency. To do this, we add delays after each store
to allow the store buffer to drain. The result of this function is
not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
static volatile WORD dummy;
int i, j;
WORD fail = 0;
for (i = 0; i < 10; i++)
{
*ptr1 = 1;
for (j = 0; j < 1000; j++) /* Dummy volatile writes for delay. */
dummy = 0;
*ptr2 = 0;
for (j = 0; j < 1000; j++) /* Dummy volatile writes for delay. */
dummy = 0;
fail |= *ptr1;
}
return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
separation, to see if they really behave like memory appearing at two
locations, and efficiently. We search through different values of
`separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
void * buffers [2];
long timings [3];
int i, method, timing_loops = 64;

/* We measure timings of 3 different tests, each 128 times to find the
minimum. 0: Writes and reads to aliased pages. 1: Writes and
reads to non-aliased pages, to compare with 0. 2: Doing nothing,
to measure the time for `gettimeofday' itself.

The measurements are done in a mixed up order. If we did 128
measurements of type 0, then 128 of type 1, then 128 of type 2, the
results could be misleading due to synchronisation with other
processes occurring on the machine. */

/* A previously generated random shuffle of bit-pairs. Each pair is a
number from the set {0,1,2}. Each number occurs exactly 128 times. */
static const unsigned char pattern [96] =
{
0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
};

buffers [0] = __page_alias_map (system_page_size, separation, &method);
if (buffers [0] == 0)
return "alias map failed";
buffers [1] = page_no_alias (system_page_size, separation);
if (buffers [1] == 0)
{
__page_alias_unmap (buffers [0], system_page_size, separation, method);
return "non-alias map failed";
}

retry:
timings [2] = timings [1] = timings [0] = LONG_MAX;
for (i = 0; i < 384; i++)
{
struct timeval time_before, time_after;
long time_delta;
int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

/* Test whether writes to one page appear immediately in the other,
and time how long the memory accesses take. */
gettimeofday (&time_before, (struct timezone *) 0);
if (which_test < 2)
fail = test_page_alias (ptr1, ptr2, timing_loops);
gettimeofday (&time_after, (struct timezone *) 0);

if (fail && which_test == 0)
{
/* Test whether the failure is due to a store buffer bypass
which ignores virtual address translation. */
int l1_fail = test_l1_only (ptr1, ptr2);
__page_alias_unmap (buffers [0], system_page_size, separation,
method);
munmap (buffers [1], separation + system_page_size);
return l1_fail ? "cache not coherent" : "store buffer not coherent";
}

time_delta = ((time_after.tv_usec - time_before.tv_usec)
+ 1000000 * (time_after.tv_sec - time_before.tv_sec));

/* Find the smallest time taken for each test. Ignore negative
glitches due to Linux' tendency to jump the clock backwards. */
if (time_delta >= 0 && time_delta < timings [which_test])
timings [which_test] = time_delta;
}

/* Remove the cost of `gettimeofday()' itself from measurements. */
timings [0] -= timings [2];
timings [1] -= timings [2];

/* Keep looping until at least one measurement becomes significant. A
very fast CPU will show measurements of zero microseconds for
smaller values of `timing_loops'. Also loop until the cost of
`gettimeofday()' becomes insignificant. When the program is run
under `strace', the latter is large, and this is needed to stabilise
the results. */
if (timings [0] <= 10 * (1 + timings [2])
&& timings [1] <= 10 * (1 + timings [2]))
{
timing_loops <<= 1;
goto retry;
}

__page_alias_unmap (buffers [0], system_page_size, separation, method);
munmap (buffers [1], separation + system_page_size);

/* Reject page aliasing if it is much slower than accessing a single,
definitely cached page directly. */
if (timings [0] > 2 * timings [1])
return "too slow";

/* Success! Passed all tests for these parameters. */
return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
size_t size;

#ifdef _SC_PAGESIZE
system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
system_page_size = sysconf (_SC_PAGE_SIZE);
#else
system_page_size = getpagesize ();
#endif

for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
{
const char * reason = test_one_separation (size);

printf ("Test separation: %lu bytes: %s%s\n",
(unsigned long) size, reason ? "FAIL - " : "pass",
reason ? reason : "");

/* This logic searches for the smallest _contiguous_ range
of page sizes for which `test_one_separation' passes. */
if (reason == 0 && page_alias_smallest_size == 0)
page_alias_smallest_size = size;
else if (reason != 0 && page_alias_smallest_size != 0)
{
/* Fail, indicating that page-aliasing is not reliable,
because there's a maximum size. We don't support that as
it seems quite unlikely given our model of cache colouring. */
page_alias_smallest_size = 0;
break;
}
}

printf ("VM page alias coherency test: ");

if (page_alias_smallest_size == 0)
printf ("failed; will use copy buffers instead\n");
else if (page_alias_smallest_size == system_page_size)
printf ("all sizes passed\n");
else
printf ("minimum fast spacing: %lu (%lu page%s)\n",
(unsigned long) page_alias_smallest_size,
(unsigned long) (page_alias_smallest_size / system_page_size),
(page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
page_alias_init ();
return 0;
}
//#endif


2003-08-29 10:04:17

by Sergey S. Kostyliov

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi Jamie,

On Friday 29 August 2003 09:35, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
rathamahata@test rathamahata $ gcc -march=athlon-xp -mcpu=athlon-xp -fomit-frame-pointer -O2 -o test test.c
rathamahata@test rathamahata $ time ./test
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real 0m0.097s
user 0m0.091s
sys 0m0.006s
rathamahata@test rathamahata $ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) MP 2200+
stepping : 0
cpu MHz : 1800.967
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
bogomips : 3538.94

processor : 1
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) Processor
stepping : 0
cpu MHz : 1800.967
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
bogomips : 3596.28


--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc

2003-08-29 10:03:52

by J.A. Magallon

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this


On 08.29, Jamie Lokier wrote:
> Dear All,
[...]
>
> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent. The L1
> data cache is 64k. (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).
>

Sorry if this is a stupid question, but have you heard about 64K-aliasing?
We have seen it in P3/P4; I do not know whether Athlons also suffer from it.
In short, x86 is crap. It slows to a crawl when accessing two memory
positions separated by 2^n (the address decoder has two 16-bit adders instead
of one 32-bit adder, the cache is 16-bit tagged, etc.)

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))

2003-08-29 10:21:10

by J.A. Magallon

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this


On 08.29, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>

Dual P4 Xeon

annwn:~> gcc -march=pentium4 -O2 -fomit-frame-pointer -o vm-test vm-test.c
annwn:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
annwn:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
annwn:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
annwn:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1784.328
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3552.05

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1784.328
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3565.15

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))

2003-08-29 10:15:28

by J.A. Magallon

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this


On 08.29, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>

Uh? Are my PIIs really that good?

werewolf:~> gcc -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
werewolf:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

werewolf:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
werewolf:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

werewolf:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 400.915
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 799.53

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 400.915
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 801.17


--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))

2003-08-29 10:37:34

by CaT

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> gcc -o test test.c -O2
> time ./test
> cat /proc/cpuinfo

Forgot about this one. :/

$ time ./coherencytest
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

real 0m0.543s
user 0m0.230s
sys 0m0.020s
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 300.691
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr
bogomips : 599.65
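
The "minimum fast spacing" reported in the run above (16384 bytes on this K6) is the address multiple an application would use when placing its aliases. The following is only a minimal sketch of that idea, not code from the test program: `open_scratch_file` and `map_alias_pair` are hypothetical helper names, the `/tmp` path is an assumption, and the approach (reserve one address range, then place two `MAP_FIXED` shared views inside it) is one of several possible.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for mkstemp() and MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked scratch file of `len` bytes to back the shared pages.
   The /tmp location is an assumption; any writable directory would do. */
static int open_scratch_file(size_t len)
{
    char path[] = "/tmp/aliasXXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                        /* the file lives on until fd is closed */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the first `len` bytes of `fd` twice, exactly `spacing` bytes apart.
   `spacing` should be a multiple of the page size and at least `len`.
   Returns 0 on success, -1 on failure. */
static int map_alias_pair(int fd, size_t len, size_t spacing,
                          char **first, char **second)
{
    size_t span = spacing + len;
    char *base, *v1, *v2;

    assert(first != NULL && second != NULL);
    if (len == 0 || spacing < len)
        return -1;

    /* Reserve one contiguous range so the two fixed mappings cannot
       clobber anything else in the address space. */
    base = mmap(NULL, span, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    v1 = mmap(base, len, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0);
    v2 = mmap(base + spacing, len, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0);
    if (v1 == MAP_FAILED || v2 == MAP_FAILED) {
        munmap(base, span);
        return -1;
    }
    *first = v1;
    *second = v2;
    return 0;
}
```

A caller would pass the spacing printed by the test, for example `map_alias_pair(fd, 4096, 16384, &a, &b)` on this machine. Only the distance between the two views matters here; the sketch does not align `base` itself to the spacing.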

--
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
- http://tinyurl.com/h6fo

2003-08-29 10:34:35

by CaT

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> gcc -o test test.c -O2
> time ./test
> cat /proc/cpuinfo

16 [20:33:33] hogarth@theirongiant:/home/hogarth>> time ./coherencytest
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.206s
user 0m0.135s
sys 0m0.027s
16 [20:33:44] hogarth@theirongiant:/home/hogarth>> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 701.641
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1388.54


--
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
- http://tinyurl.com/h6fo

2003-08-29 10:37:58

by Alan

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Gwe, 2003-08-29 at 11:03, J.A. Magallon wrote:
> Sorry if this is a stupid question, but have you heard about 64K-aliasing?
> We have seen it in P3/P4, do not know if Athlons also suffer it.
> In short, x86 is crap. It slows like a dog when accessing two memory
> positions separated by 2^n (address decoder has two 16-bit adders instead
> of one 32-bit adder..., cache is 16-bit tagged, etc...)

Pretty much all processors are bad at handling memory accesses on the
same alignment within powers of two. That's one of the reasons for slab
and for things like the old kernel code putting skb structs at the end
of the skbuff data.
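
To make the same-alignment point concrete, here is a small sketch in the spirit of slab colouring. It is not the kernel's code: `colour_offset` is a hypothetical helper, and the parameters (`line_size` for the cache line size, `slack` for the unused bytes available in each buffer) are assumptions chosen for illustration. The idea is to start the payload of successive buffers at staggered, line-sized offsets so they do not all compete for the same cache index.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset at which the n-th buffer should start its payload.
   Offsets step by one cache line and wrap once they would exceed the
   slack, so consecutive buffers map to different cache indices. */
static size_t colour_offset(size_t n, size_t line_size, size_t slack)
{
    size_t colours;

    assert(line_size > 0);
    /* Number of distinct line-aligned starting offsets that fit in the
       slack, counting offset 0. */
    colours = slack / line_size + 1;
    return (n % colours) * line_size;
}
```

With 32-byte lines and 100 bytes of slack there are four colours, so buffers 0 to 3 start at offsets 0, 32, 64 and 96, and buffer 4 wraps back to 0.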

Grab a copy of "UNIX Systems for Modern Architectures".


2003-08-29 10:49:31

by Mikael Pettersson

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Jamie Lokier writes:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

From a dual Opteron 244 box:

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)
0.08user 0.01system 0:00.08elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (131major+38minor)pagefaults 0swaps

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 244
stepping : 1
cpu MHz : 1791.569
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3565.15
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 244
stepping : 1
cpu MHz : 1791.569
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3578.26
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts ttp

2003-08-29 11:08:56

by Andi Kleen

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Jamie Lokier <[email protected]> writes:

> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent. The L1

Most x86 and probably most other modern CPUs have virtually addressed L1 caches.
An L1 access is really critical, and waiting for the MMU first is just too slow.

So such artifacts are expected.

> data cache is 64k. (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).

On x86 L2 is usually physically tagged.

Mostly only ARM, MIPS et al. have virtually tagged L2.
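
A rough way to relate these cache organisations to the "minimum fast spacing" figures in the thread: in a virtually indexed, physically tagged cache, two mappings of the same page land in the same set only when their virtual addresses agree modulo the size of one cache way. The sketch below computes that span. `alias_span` is a hypothetical helper, and the associativity values in the usage note are assumptions rather than numbers taken from the posts (only the 64k Athlon L1 size is stated in the thread).

```c
#include <assert.h>
#include <stddef.h>

/* Smallest separation (in bytes) at which two virtual aliases of one
   physical page index the same cache set: the size of one way, but
   never less than a page, since mappings are page-granular. */
static size_t alias_span(size_t cache_bytes, size_t ways, size_t page_bytes)
{
    size_t way_bytes;

    assert(ways > 0 && page_bytes > 0);
    way_bytes = cache_bytes / ways;
    return way_bytes > page_bytes ? way_bytes : page_bytes;
}
```

Assuming a 64k, 2-way L1 data cache and 4k pages, `alias_span(65536, 2, 4096)` gives 32768, which lines up with the 32768-byte spacing reported for the Athlon and Opteron runs. When one way is no larger than a page, the span collapses to the page size and every separation should behave the same.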

-Andi

2003-08-29 11:17:54

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 01:08:51PM +0200, Andi Kleen wrote:
> Jamie Lokier <[email protected]> writes:
> > data cache is 64k. (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> On x86 L2 is usually physically tagged.
>
> Mostly only ARM, MIPS et al. have virtually tagged L2.

Correction: ARM L1 is mostly VIVT. L2 cache isn't mandated by the
architecture, and therefore generally doesn't exist.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-08-29 11:51:58

by James Morris

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Here's the result for sparc64 (Ultrasparc II):

$ gcc -o test test.c -O2
$ time ./test
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

real 0m0.194s
user 0m0.160s
sys 0m0.040s
$ gcc -o test test.c -O2 -DHAVE_SYSV_SHM
$ time ./test
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

real 0m0.162s
user 0m0.140s
sys 0m0.020s

$ cat /proc/cpuinfo

cpu : TI UltraSparc II (BlackBird)
fpu : UltraSparc II integrated FPU
promlib : Version 3 Revision 23
prom : 3.23.1
type : sun4u
ncpus probed : 2
ncpus active : 2
Cpu0Bogo : 591.46
Cpu0ClkTck : 0000000011a4f2ed
Cpu2Bogo : 591.46
Cpu2ClkTck : 0000000011a4f2ed
MMU Type : Spitfire
State:
CPU0: online
CPU2: online



--
James Morris
<[email protected]>

2003-08-29 11:42:25

by Gianni Tedesco

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, 2003-08-29 at 06:35, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

PPC (G4).

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

cpu : 7455, altivec supported
clock : 667MHz
revision : 2.1 (pvr 8001 0201)
bogomips : 665.19
machine : PowerBook3,4
motherboard : PowerBook3,4 MacRISC2 MacRISC Power Macintosh
board revision : 00000000
detected as : 73 (PowerBook Titanium III)
pmac flags : 0000000b
L2 cache : 256K unified
memory : 512MB
pmac-generation : NewWorld

--
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source http://www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D



2003-08-29 15:42:39

by Larry McVoy

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC

If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all
bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86,
Irix, MacOS X, AIX, Tru64 and probably some others.

====== alpha.bitmover.com ======
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown
cpu : Alpha
cpu model : EV56
cpu variation : 7
cpu revision : 0
cpu serial number :
system type : EB164
system variation : PC164
system revision : 0
system serial number :
cycle frequency [Hz] : 500000000
timer frequency [Hz] : 1024.00
page size [bytes] : 8192
phys. address bits : 40
max. addr. space # : 127
BogoMIPS : 992.88
kernel unaligned acc : 0 (pc=0,va=0)
user unaligned acc : 0 (pc=0,va=0)
platform string : Digital AlphaPC 164 500 MHz
cpus detected : 1

====== ia64.bitmover.com ======
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium
model : 0
revision : 7
archrev : 0
features : standard
cpu number : 0
cpu regs : 4
cpu MHz : 799.486992
itc MHz : 799.486992
BogoMIPS : 796.91

processor : 1
vendor : GenuineIntel
arch : IA-64
family : Itanium
model : 0
revision : 7
archrev : 0
features : standard
cpu number : 0
cpu regs : 4
cpu MHz : 799.486992
itc MHz : 799.486992
BogoMIPS : 796.91


====== mips.bitmover.com ======
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 15:30:50 CEST 2002 mips unknown
system type : SGI Indy
processor : 0
cpu model : R4000SC V6.0 FPU V0.0
BogoMIPS : 86.83
byteorder : big endian
wait instruction : no
microsecond timers : yes
tlb_entries : 48
extra interrupt vector : no
hardware watchpoint : yes
VCED exceptions : 2955114
VCEI exceptions : 0

====== netwinder.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead
Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown
Processor : Intel sa110 rev 3
BogoMips : 262.14
Hardware : Rebel-NetWinder
Serial # : 3464
Revision : 52ff

====== parisc.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages)
Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 parisc64 unknown
processor : 0
cpu family : PA-RISC 2.0
cpu : PA8600 (PCX-W+)
cpu MHz : 550.000000
model : 9000/800/A500-5X
model name : Crescendo 550
hversion : 0x00005d50
sversion : 0x00000491
I-cache : 512 KB
D-cache : 1024 KB (WB)
ITLB entries : 160
DTLB entries : 160 - shared with ITLB
bogomips : 1097.72
software id : 580790518


====== ppc.bitmover.com ======
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown
processor : 0
cpu : 750
temperature : 0 C
clock : 333MHz
revision : 2.2
bogomips : 665.69
zero pages : total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%)
machine : iMac,1
motherboard : iMac MacRISC Power Macintosh
L2 cache : 512K unified
memory : 160MB
pmac-generation : NewWorld

====== qube.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
0.31user 0.10system 0:00.40elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (116major+34minor)pagefaults 0swaps
Linux qube.bitmover.com 2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown
cpu : MIPS
cpu model : Nevada V10.0
system type : Cobalt Microserver 27
BogoMIPS : 249.86
byteorder : little endian
unaligned accesses : 16
wait instruction : yes
microsecond timers : yes
extra interrupt vector : yes
hardware watchpoint : no

====== redhat71.bitmover.com ======
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 6
model name : Celeron (Mendocino)
stepping : 5
cpu MHz : 467.739
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 933.88


====== sparc.bitmover.com ======
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)
0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (107major+36minor)pagefaults 0swaps
Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
cpu : TI UltraSparc IIi
fpu : UltraSparc IIi integrated FPU
promlib : Version 3 Revision 11
prom : 3.11.12
type : sun4u
ncpus probed : 1
ncpus active : 1
BogoMips : 539.03
MMU Type : Spitfire

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-29 15:47:46

by Herbert Poetzl

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this


# gcc -o test test.c -O2
# ./test
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) MP 1800+
stepping : 2
cpu MHz : 1533.425
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3060.53

processor : 1
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1533.425
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3060.53

2003-08-29 16:33:00

by Brian Jackson

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Friday 29 August 2003 12:35 am, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>
<snip>

Didn't see a 512k cache athlon-xp yet

skyline:/share/linux/projects/cachetest # sh go
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm) XP 2800+
stepping : 0
cpu MHz : 2088.111
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 4168.08

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real 0m0.110s
user 0m0.070s
sys 0m0.030s

--Brian Jackson

--
OpenGFS -- http://opengfs.sourceforge.net
Gentoo -- http://gentoo.brianandsara.net
Home -- http://www.brianandsara.net

2003-08-29 16:27:59

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, 29 Aug 2003, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Are you also interested in m68k? ;-)

cassandra:/tmp# time ./test
Test separation: 4096 bytes: FAIL - store buffer not coherent
Test separation: 8192 bytes: FAIL - store buffer not coherent
Test separation: 16384 bytes: FAIL - store buffer not coherent
Test separation: 32768 bytes: FAIL - store buffer not coherent
Test separation: 65536 bytes: FAIL - store buffer not coherent
Test separation: 131072 bytes: FAIL - store buffer not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: FAIL - store buffer not coherent
Test separation: 8388608 bytes: FAIL - store buffer not coherent
Test separation: 16777216 bytes: FAIL - store buffer not coherent
VM page alias coherency test: failed; will use copy buffers instead

real 0m0.478s
user 0m0.110s
sys 0m0.190s
cassandra:/tmp# cat /proc/cpuinfo
CPU: 68040
MMU: 68040
FPU: 68040
Clocking: 24.8MHz
BogoMips: 16.53
Calibration: 82688 loops
cassandra:/tmp#


callisto$ time ./test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.329s
user 0m0.270s
sys 0m0.050s
callisto$ cat /proc/cpuinfo
cpu : 604r
clock : 200MHz
revision : 18.3 (pvr 0009 1203)
bogomips : 398.13
machine : CHRP IBM,LongTrail-2
memory bank 0 : 32 MB SDRAM
memory bank 1 : 32 MB SDRAM
memory bank 2 : 32 MB SDRAM
memory bank 3 : 32 MB SDRAM
board l2 : 512 KB Pipelined Synchronous (Write-Through)
callisto$

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-08-29 17:39:48

by Matt Porter

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> Anyway, please lots of people run the program and post the output +
> /proc/cpuinfo. Compile with optimisation, -O or -O2 is fine. (You
> can add -DHAVE_SYSV_SHM too if you like):
>
> gcc -o test test.c -O2
> time ./test
> cat /proc/cpuinfo

PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI

-----

440gx-1:~/cachetest# gcc -o test test.c -O2
440gx-1:~/cachetest# time ./test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.193s
user 0m0.140s
sys 0m0.010s
440gx-1:~/cachetest# cat /proc/cpuinfo
cpu : 440GX Rev. A
revision : 24.80 (pvr 51b2 1850)
bogomips : 624.23
vendor : IBM
machine : PPC440GX EVB (Ocotea)
440gx-1:~/cachetest#

--
Matt Porter
[email protected]

2003-08-29 19:37:28

by Thorsten Kranzkowski

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.


Dual Alpha ev6:


ds20:~/src/cachetest$ ./doit
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (4 pages)

real 0m4.148s
user 0m4.029s
sys 0m0.075s
cpu : Alpha
cpu model : EV6
cpu variation : 7
cpu revision : 0
cpu serial number :
system type : Tsunami
system variation : Goldrush
system revision : 0
system serial number : ay91560403
cycle frequency [Hz] : 500000000
timer frequency [Hz] : 1024.00
page size [bytes] : 8192
phys. address bits : 44
max. addr. space # : 255
BogoMIPS : 998.56
kernel unaligned acc : 0 (pc=0,va=0)
user unaligned acc : 0 (pc=0,va=0)
platform string : AlphaServer DS20 500 MHz
cpus detected : 2
cpus active : 2
cpu active mask : 0000000000000003



Single Alpha ev4 (AXPpci33):

Marvin:~/src/cachetest$ ./doit
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m1.442s
user 0m0.853s
sys 0m0.471s
cpu : Alpha
cpu model : LCA4
cpu variation : -4294967301
cpu revision : 0
cpu serial number : Linux_is_Great!
system type : Noname
system variation : 0
system revision : 0
system serial number : MILO-2.2-17
cycle frequency [Hz] : 166868457
timer frequency [Hz] : 1024.00
page size [bytes] : 8192
phys. address bits : 34
max. addr. space # : 63
BogoMIPS : 320.40
kernel unaligned acc : 56014443 (pc=fffffc0000ab65a4,va=fffffc0000b99105)
user unaligned acc : 2695 (pc=2000031ff90,va=11fffef26)
platform string : N/A
cpus detected : 0




ordinary Pentium II:


bash-2.03$ ./doit
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.342s
user 0m0.290s
sys 0m0.030s
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 3
model name : Pentium II (Klamath)
stepping : 4
cpu MHz : 300.691
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov mmx
bogomips : 599.65




bye,
Thorsten

--
| Thorsten Kranzkowski Internet: [email protected] |
| Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, [email protected] [44.130.8.19] |

2003-08-29 20:27:46

by Iulian Musat

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this



Jamie Lokier wrote:
> Anyway, please lots of people run the program and post the output +
> /proc/cpuinfo. Compile with optimisation, -O or -O2 is fine. (You
> can add -DHAVE_SYSV_SHM too if you like):
>
> gcc -o test test.c -O2
> time ./test
> cat /proc/cpuinfo

2 AMD Athlon
4 Itanium II (on an altix machine)
2 Pentium III
1 AMD XP
1 Pentium IV


2 AMD Athlon :
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real 0m0.088s
user 0m0.080s
sys 0m0.004s

cat /proc/cpuinfo

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1526.385
cache size : 256 KB
Physical processor ID : -2084402944
Number of siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3038.00

processor : 1
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1526.385
cache size : 256 KB
Physical processor ID : 410321912
Number of siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3046.10

~~~~~~~~~~~~~~~~~~~~~~~~

4 Itanium II (on an altix machine)
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.095s
user 0m0.065s
sys 0m0.028s

cat /proc/cpuinfo

processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1346.37

processor : 1
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1346.37

processor : 2
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1342.17

processor : 3
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1342.17

~~~~~~~~~~~~~~~~~~~~~~~~

2 Pentium III
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.154s
user 0m0.109s
sys 0m0.020s

cat /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 846.353
cache size : 256 KB
Physical processor ID : 0
Number of siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips : 1682.99

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 846.353
cache size : 256 KB
Physical processor ID : 0
Number of siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips : 1691.09

~~~~~~~~~~~~~~~~~~~~~~~~

1 AMD XP
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real 0m0.077s
user 0m0.060s
sys 0m0.010s

cat /proc/cpuinfo

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) XP 2100+
stepping : 2
cpu MHz : 1746.168
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3486.51


~~~~~~~~~~~~~~~~~~~~~~~~

1 Pentium IV
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.221s
user 0m0.180s
sys 0m0.025s


cat /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 0
model name : Intel(R) Pentium(R) 4 CPU 1700MHz
stepping : 10
cpu MHz : 1694.928
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3365.99

~~~~~~~~~~~~~~~~~~~~~~~~



-iulian

2003-08-29 20:28:31

by Paul J.Y. Lahaie

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Ran it on a few systems here.

Corel NetWinder (275MHz StrongARM)
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead

cat /proc/cpuinfo
Processor : StrongARM-110 rev 3 (v4l)
BogoMIPS : 185.95
Features : swp half 26bit fastmult

Hardware : Rebel-NetWinder
Revision : 52ff
Serial : 00000000000008bf



HP zx6000 (2xItanium 2)
time ./test
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m7.455s
user 0m7.412s
sys 0m0.040s

cat /proc/cpuinfo
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1346.37






2003-08-29 20:08:12

by Sean Neakums

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Jamie Lokier <[email protected]> writes:

> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

2-way Pentium III:

$ time ./va
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.096s
user 0m0.073s
sys 0m0.023s
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 11
model name : Intel(R) Pentium(R) III CPU family 1133MHz
stepping : 1
cpu MHz : 1129.879
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 2220.03

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 11
model name : Intel(R) Pentium(R) III CPU family 1133MHz
stepping : 1
cpu MHz : 1129.879
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 2252.80

2003-08-29 22:35:18

by Kenneth Johansson

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, 2003-08-29 at 07:35, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.473s
user 0m0.280s
sys 0m0.100s

>cat /proc/cpuinfo
cpu : 405CR
clock : 200MHz
revision : 1.69 (pvr 4011 0145)
bogomips : 199.88
machine : Ericsson ELN 2XX
plb bus clock : 100MHz




2003-08-29 23:05:25

by Mike Fedyk

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 08:41:01AM -0700, Larry McVoy wrote:

> ====== sparc.bitmover.com ======
> Test separation: 8192 bytes: FAIL - cache not coherent

> VM page alias coherency test: minimum fast spacing: 16384 (2 pages)
> 0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (107major+36minor)pagefaults 0swaps
> Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
> cpu : TI UltraSparc IIi
> fpu : UltraSparc IIi integrated FPU
> promlib : Version 3 Revision 11
> prom : 3.11.12
> type : sun4u
> ncpus probed : 1
> ncpus active : 1
> BogoMips : 539.03
> MMU Type : Spitfire

Does this mean that userspace has to take into consideration that the cache
isn't coherent for adjacent small memory accesses on sparc? What could happen
if it doesn't, or does it need to at all?

2003-08-29 23:48:21

by Kurt Wall

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Quoth Jamie Lokier:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

[snip]

----- system one ---
$ time ./mmap
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.475s
user 0m0.250s
sys 0m0.020s
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 349.200
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 696.32
-----

----- system two ---
[kwall]$ time ./mmap
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.134s
user 0m0.120s
sys 0m0.010s
[kwall]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 801.830
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1599.07
-----

---- system three -----
$ time ./mmap
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real 0m0.101s
user 0m0.090s
sys 0m0.010s
root@advent:~# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1210.825
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2418.27
-----

Now, that was interesting. The AMD is my fastest machine...

Kurt
--
"I have the world's largest collection of seashells. I keep it
scattered around the beaches of the world ... Perhaps you've seen it."
-- Steven Wright

2003-08-30 01:49:04

by Stuart Longland

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

I've thrown this at a Gateway Microserver (aka. Sun Cobalt Qube) which
runs an r5k little endian MIPS. I'd also throw this at a Silicon
Graphics Indy, but I don't feel energetic enough right now to go and
drag the beast out.

Also attached are the results from my laptop (Toshiba Portege 7010CT)
and web server (Generic Dual P-Pro).


- --
+-------------------------------------------------------------+
| Stuart Longland stuartl at longlandclan.hopto.org |
| Brisbane Mesh Node: 719 http://stuartl.cjb.net/ |
| I haven't lost my mind - it's backed up on a tape somewhere |
| Griffith Student No: Course: Bachelor/IT (Nathan) |
+-------------------------------------------------------------+


- -------------------< From the qube >-----------------------
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

real 0m0.276s
user 0m0.140s
sys 0m0.120s

system type : MIPS Cobalt
processor : 0
cpu model : Nevada V10.0 FPU V10.0
BogoMIPS : 249.85
wait instruction : yes
microsecond timers : yes
tlb_entries : 48
extra interrupt vector : yes
hardware watchpoint : no
VCED exceptions : not available
VCEI exceptions : not available
- -------------------< From the qube >-----------------------

- ------------------< From the laptop >----------------------
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.195s
user 0m0.142s
sys 0m0.052s

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 300.026
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat
pse36 mmx fxsr
bogomips : 591.87
- ------------------< From the laptop >----------------------

- ----------------< From the web server >--------------------
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 0m0.279s
user 0m0.210s
sys 0m0.060s

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 1
model name : Pentium Pro
stepping : 9
cpu MHz : 199.434
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips : 398.13

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 1
model name : Pentium Pro
stepping : 9
cpu MHz : 199.434
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips : 398.13
- ----------------< From the web server >--------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE/UAKFIGJk7gLSDPcRAif8AJ9WKjTGIGYJdHgME/Fkac4cNZKUkACdHwA5
yHQlu/O96H4IUHKGflJncmI=
=yAoq
-----END PGP SIGNATURE-----

2003-08-31 05:10:39

by David Miller

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, 29 Aug 2003 16:05:21 -0700
Mike Fedyk <[email protected]> wrote:

> Does this mean that userspace has to take into consideration that the cache
> isn't coherent for adjacent small memory accesses on sparc? What could happen
> if it doesn't, or does it need to at all?

For shared memory, we enforce the correct mapping alignment
so that coherency issues don't crop up.

How does this program work? I haven't taken a close look
at it. Does it use MAP_SHARED or IPC shm?

2003-08-31 22:51:24

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

David S. Miller wrote:
> On Fri, 29 Aug 2003 16:05:21 -0700
> Mike Fedyk <[email protected]> wrote:
>
> > Does this mean that userspace has to take into consideration that the cache
> > isn't coherent for adjacent small memory accesses on sparc? What could happen
> > if it doesn't, or does it need to at all?
>
> For shared memory, we enforce the correct mapping alignment
> so that coherency issues don't crop up.
>
> How does this program work? I haven't taken a close look
> at it. Does it use MAP_SHARED or IPC shm?

It uses POSIX shared memory and (necessarily) MAP_SHARED, which
doesn't constrain the mapping alignment.

I had wondered if some kernels used page faults to maintain coherence
between multiple shared mappings of the same file. It's one of the
things the program checks, and I have seen it mentioned on l-k, which
made me think it might be implemented. None of the results for any
architecture show it, though.

If userspace does create multiple shared mappings at non-coherent
offsets, what is the recommended method for switching between
accessing one page (or page cluster?) and accessing the other? Is it
msync(), a special system call to flush parts of the data cache, a
machine instruction, or something else?

Thanks,
-- Jamie



ps. The program has code to try IPC shm instead. Change "#ifdef
SHM_DIR_PREFIX" in __page_alias_map to "#if 0", and add
-DHAVE_SYSV_SHM to the GCC command line. It should fail the same test
sizes with a different message.
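
As a rough idea of what the System V variant involves, here is a minimal
sketch that attaches one segment twice. It is an assumption-laden
illustration, not the code in the test program: attach_twice(),
aligned_to_shmlba() and struct shm_views are made-up names, and both
addresses are left for the kernel to choose, which as far as I know is the
point where an architecture with an alignment rule (SHMLBA) gets to apply it.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* The two addresses at which one segment ended up attached. */
struct shm_views {
    void *first;
    void *second;
};

/* Create a private segment of `len` bytes and attach it twice, letting
   the kernel pick both addresses.  Returns 0 on success, -1 on failure. */
static int attach_twice(size_t len, struct shm_views *out)
{
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    out->first = shmat(id, NULL, 0);
    out->second = shmat(id, NULL, 0);

    /* Mark for removal now; the segment lives until the last detach. */
    shmctl(id, IPC_RMID, NULL);

    if (out->first == (void *)-1 || out->second == (void *)-1) {
        if (out->first != (void *)-1)
            shmdt(out->first);
        if (out->second != (void *)-1)
            shmdt(out->second);
        return -1;
    }
    return 0;
}

/* True if `p` is a multiple of SHMLBA, the attach alignment that
   <sys/shm.h> advertises for this architecture. */
static int aligned_to_shmlba(const void *p)
{
    return (uintptr_t)p % (uintptr_t)SHMLBA == 0;
}
```

Attaching at an explicit address (second argument of shmat()) is what would
be needed to force a particular separation; that is left out here.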

2003-09-01 00:24:44

by Paul Mundt

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>
sh (VIPT cache):

Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

$ cat /proc/cpuinfo
machine : Sega Dreamcast
processor : 0
cpu family : sh4
cpu type : SH7750
cache size : 8K-bytes/16K-bytes
bogomips : 199.06
cpu clock : 199.49MHz
bus clock : 99.74MHz
module clock : 49.87MHz

and on sh64 (which is sort of VIPT/VIVT, as it looks at physical tags if
there's no match on virtual):

Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 8192 (2 pages)

-sh-2.05b$ cat /proc/cpuinfo
machine : Hitachi Cayman
processor : 0
cpu family : SH-5
cpu type : SH5-101
icache size : 32K-bytes
dcache size : 32K-bytes
itlb entries : 64
dtlb entries : 64
cpu clock : 314.73MHz
bus clock : 157.36MHz
module clock : 26.22MHz
bogomips : 313.75



2003-09-01 00:37:52

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Paul Mundt wrote:
> sh (VIPT cache):
>
> Test separation: 4096 bytes: FAIL - cache not coherent
> Test separation: 8192 bytes: FAIL - cache not coherent
> Test separation: 16384 bytes: pass

A VIVT cache can do that, but I think a VIPT cache should always be coherent.
Do I misunderstand?

-- Jamie

2003-09-01 01:00:09

by Paul Mundt

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote:
> > sh (VIPT cache):
> >
> > Test separation: 4096 bytes: FAIL - cache not coherent
> > Test separation: 8192 bytes: FAIL - cache not coherent
> > Test separation: 16384 bytes: pass
>
> A VIVT cache can do that, but I think a VIPT cache should always be coherent.
> Do I misunderstand?
>
There's nothing stating that VIPT == automatic coherency. sh is the obvious
counter-example: we are completely VIPT, but also non-coherent.



2003-09-01 01:14:38

by dean gaudet

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Fri, 29 Aug 2003, Jamie Lokier wrote:

> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent. The L1
> data cache is 64k. (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).

opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which
totally predicts the 32KiB spacing i saw someone else post about.
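
The arithmetic behind that, spelled out as a small sketch. The assumptions
are that the cache index comes straight from the virtual address, that all
sizes are powers of two, and that pages are 4KiB as on x86; the function
names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes covered by one way of a set-associative cache.  The cache index
   (set number plus offset within the line) is the address modulo this. */
static size_t way_size(size_t cache_bytes, unsigned ways)
{
    return cache_bytes / ways;
}

/* Number of low address bits used to index into one way
   (log2 of the way size, which is assumed to be a power of two). */
static unsigned index_bits(size_t way_bytes)
{
    unsigned bits = 0;
    while (way_bytes > 1) {
        way_bytes >>= 1;
        bits++;
    }
    return bits;
}

/* How many distinct "colours" a page can have: the number of pages that
   fit in one way.  1 means the index lies entirely in the page offset. */
static size_t page_colours(size_t way_bytes, size_t page_bytes)
{
    return way_bytes > page_bytes ? way_bytes / page_bytes : 1;
}

/* Two virtual addresses land on the same cache index exactly when they
   agree in all index bits, i.e. are equal modulo the way size. */
static int same_cache_index(uintptr_t a, uintptr_t b, size_t way_bytes)
{
    return ((a ^ b) & (uintptr_t)(way_bytes - 1)) == 0;
}
```

With 64KiB and 2 ways, way_size() gives 32768, index_bits() gives 15 and
page_colours() with 4096-byte pages gives 8, which lines up with the
"minimum fast spacing: 32768 (8 pages)" lines in the Athlon results.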

tm8000 also has some virtual aliasing and your test detects it properly...
but i'm probably not supposed to say anything about that :)

there's a real oddity i found on p4 just yesterday. i was doing some
pointer-chasing experiments, and i set up two 8192B shared mappings to the
same file, for example:

0x50000000 => /var/tmp/foo offset 0
0x50002000 => /var/tmp/foo offset 0

then i set up a 4 element cycle:

0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000

when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles
per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon. the
crazy thing is that small variations in the experiment (such as longer
cycles) make the oddity go away!

i've placed my hack here <http://arctic.org/~dean/noah/chase.c>.


> This suggests scope for improving x86 kernel performance in the areas
> of kmap() and shared library / executable mappings, by good choice of
> _virtual_ addresses. This doesn't require a cache colouring
> page allocator, so maybe it's a new avenue?

i was trying to use wli's pgcl patch to test out larger clustering, but it
still has some perf problems which i never got enough time to dig into
further :) this approach might be better than just colouring.

here's what i've found tripping up virtual aliasing on processors which
have this "feature":

- shared use of empty_zero_page trips up virtual aliasing for things like BSS
-- especially if the program for some reason doesn't typically have to
write before reading. this is pretty easy to fix (there's even an
example fix in the mips architecture, i believe R4000 or something)

- kernel and user mappings differ in the virtual index bits. this means
CoW will trip up virtual aliases amongst other things. i imagine it
means network checksum calculation on write(2) data will trip up virtual
aliases. this is more of a pain to fix in a way which is nice on SMP.

- physical pages change their virtual index bits each alloc/free.

mind you overall i'm not sure that i'm seeing any perf loss due to this
sort of thing...

-dean

2003-09-01 01:58:25

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Paul Mundt wrote:
> On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote:
> > > sh (VIPT cache):
> > >
> > > Test separation: 4096 bytes: FAIL - cache not coherent
> > > Test separation: 8192 bytes: FAIL - cache not coherent
> > > Test separation: 16384 bytes: pass
> >
> > A VIVT cache can do that, but I think a VIPT cache should always be coherent.
> > Do I misunderstand?
> >
> There's nothing stating that VIPT == automatic coherency,
> as is obviously the case for sh, where we are completely VIPT, but
> are also non coherent.

Ah. A VIPT cache needn't be coherent with itself if it isn't coherent
w.r.t. external devices. Thanks.

-- Jamie


2003-09-01 04:29:19

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

dean gaudet wrote:
> On Fri, 29 Aug 2003, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent. The L1
> > data cache is 64k. (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which
> totally predicts the 32KiB spacing i saw someone else post about.

Aha, thanks! All Athlons are the same with 64KiB L1 and 32KiB
threshold, and K6 is the same but with 16KiB threshold instead.

> there's a real oddity i found on p4 just yesterday. i was doing some
> pointer-chasing experiments, and i set up two 8192B shared mappings to the
> same file, for example:
>
> 0x50000000 => /var/tmp/foo offset 0
> 0x50002000 => /var/tmp/foo offset 0
>
> then i set up a 4 element cycle:
>
> 0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000
>
> when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles
> per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon. the
> crazy thing is that small variations in the experiment (such as longer
> cycles) make the oddity go away!

I have no idea of the explanation, unless P4 is doing the same as the
Athlon, 3000 cycles is the cost of an L1/L2 miss, and P4 has virtual
aliasing in both L1 and L2. Hmm.

I would certainly like to detect that if it occurs with typical
instruction streams, otherwise it'll clobber my application's
performance on a P4. I don't have a P4 to test on, btw. If you can
investigate further that would be very good.
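
For anyone who wants to reproduce dean's setup, here is a rough sketch of it, without the timing. The helper names are invented, it assumes 4KiB pages, it uses an unlinked mkstemp() file rather than /var/tmp/foo, it lets the kernel choose the base address instead of 0x50000000, and it scales the slot offsets by sizeof (void *) so the pointers stay naturally aligned on 64-bit machines (on a 32-bit machine the offsets match the post exactly).

```c
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS and mkstemp under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN 8192

/* Map the same unlinked temporary file twice, back to back.  Returns the
   base of the 2 * MAP_LEN window, or NULL on failure. */
static char *
map_file_twice (void)
{
  char path[] = "/tmp/chase-XXXXXX";
  char *base;
  int fd = mkstemp (path);
  if (fd == -1)
    return NULL;
  unlink (path);
  if (ftruncate (fd, MAP_LEN) != 0)
    goto fail;
  /* Reserve address space first so MAP_FIXED cannot clobber anything. */
  base = mmap (NULL, 2 * MAP_LEN, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    goto fail;
  for (int i = 0; i < 2; i++)
    if (mmap (base + i * MAP_LEN, MAP_LEN, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
      {
        munmap (base, 2 * MAP_LEN);
        goto fail;
      }
  close (fd);
  return base;
fail:
  close (fd);
  return NULL;
}

/* Link four slots, one per 4KiB page of the window, into a cycle. */
static void **
build_cycle (char *base)
{
  void **slot[4];
  for (int i = 0; i < 4; i++)
    slot[i] = (void **) (base + i * 0x1000 + i * sizeof (void *));
  for (int i = 0; i < 4; i++)
    *(void * volatile *) slot[i] = slot[(i + 1) % 4];
  return slot[0];
}

/* Follow the chain for `steps' dependent loads. */
static void **
chase (void **p, unsigned long steps)
{
  while (steps--)
    p = (void **) *(void * volatile *) p;
  return p;
}
```

Timing a long chase () call (with whatever cycle counter is to hand) and dividing by the step count would give the cycles-per-load figure quoted above.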

-- Jamie

2003-09-01 04:49:26

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

J.A. Magallon wrote:
> On 08.29, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent. The L1
> > data cache is 64k. (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> >
>
> Sorry if this is a stupid question, but have you heard about 64K-aliasing ?
> We have seen it in P3/P4, do not know if Athlons also suffer it.
> In short, x86 is crap. It slows like a dog when accessing two memory
> positions separated by 2^n (address decoder has two 16 bits adders, instead
> of 1 32 bits..., cache is 16 bit tagged, etc...)

I don't know what you mean. This test doesn't observe any gross
timing effect at 64K. I have just tried it on a Celeron Coppermine
printing more detailed numbers, and I don't notice anything at all.

So, what exactly do you mean? What kind of code shows the effect you
are talking about?

Thanks,
-- Jamie

2003-09-01 05:03:09

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Andi Kleen wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent. The L1
>
> Most x86 and probably most other modern CPUs have virtually
> addressed L1 caches. It's just too slow to wait for the MMU for an
> L1 access which is really critical.
>
> So such artifacts are expected

I hadn't thought at first because there's no artefact at all (not even
a small one) on my Celeron, but you're right. They don't appear on
any Intels(*), but they do on all AMDs that I have results for.

(*) With the possible exception of one P4 that reports varying results.

>
> > data cache is 64k. (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
>
> On x86 L2 is usually physically tagged.

I'm speculating that L1 is physically tagged, and when there's a
virtual alias the CPU moves data from one L1 line to another. L2 only
comes into it because the line transfer is slow enough that a
MESI-style transfer through L2 (as if another CPU or device requested
the line) would account for the slowness.

-- Jamie

2003-09-01 05:41:36

by David Miller

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Sun, 31 Aug 2003 23:49:37 +0100
Jamie Lokier <[email protected]> wrote:

> It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> doesn't constrain the mapping alignment.

That's wrong. If a platform needs to, it should properly
align the mapping when MAP_SHARED is used on a file.

If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
that when we're mmap()'ing a file and MAP_SHARED is specified,
we align things to SHMLBA.

If userspace purposefully violates this alignment attempt,
then it's at its own peril to keep the mappings coherent,
there is simply nothing the kernel should be doing to help
out that case.

2003-09-01 05:44:20

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Larry McVoy wrote:
> On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> > I'd appreciate if folks would run the program below on various
> > machines, especially those whose caches aren't automatically coherent
> > at the hardware level.
>
> Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC

Thanks Larry. That's a great range you have!
Collected and will be posted shortly in a table with the others.

> If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all
> bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86,
> Irix, MacOS X, AIX, Tru64 and probably some others.

AIX would be interesting; I don't have an RS6000. The rest of the
CPUs I have results for, and it sounds like a lot of effort for what's
basically a compile/compatibility test.

However, if it's very little effort for you to run the test on them please do!

Thanks,
-- Jamie

2003-09-01 06:01:16

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Matt Porter wrote:
> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI

The cache looks very coherent to me.

-- Jamie

2003-09-01 05:58:12

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Geert Uytterhoeven wrote:
> Are you also interested in m68k? ;-)
>
> cassandra:/tmp# time ./test
> Test separation: 4096 bytes: FAIL - store buffer not coherent

Especially! I hadn't expected to see any machine that would print
"store buffer not coherent". It means that if there's an L1 cache, it
is coherent, but any store-then-load bypass in the CPU pipeline is
using the virtual address with no rollback after MMU translation.

I had thought it would only be the case with chips using an external
MMU, but now that I think about it, the older simpler chips aren't
going to bother with things like pipeline rollback wherever they can
get away without it!

(The other CPU that is reporting "store buffer not coherent" is
PA-RISC, which is even more of an eye opener. That has a big 1MiB
coherent L1 cache, and the pipeline bypass is coherent for very large
separations but not others!)

Thanks,
-- Jamie

2003-09-01 06:42:47

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

David S. Miller wrote:
> On Sun, 31 Aug 2003 23:49:37 +0100
> Jamie Lokier <[email protected]> wrote:
>
> > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > doesn't constrain the mapping alignment.
>
> That's wrong. If a platform needs to, it should properly
> align the mapping when MAP_SHARED is used on a file.
>
> If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> that when we're mmap()'ing a file and MAP_SHARED is specified,
> we align things to SHMLBA.

Then you have a bug in the Sparc code. It looks like it should return
-EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
but the test program is clearly getting mappings that aren't aligned
to SHMLBA.

> If userspace purposefully violates this alignment attempt,
> then it's at its own peril to keep the mappings coherent,
> there is simply nothing the kernel should be doing to help
> out that case.

I understand that userspace needs to keep it coherent, or map to a
multiple of SHMLBA. I don't mind whether the kernel constrains the
mapping address or not, with a slight preference for userspace
flexibility.

Thus I have three Sparc-specific questions:

1. How does userspace find out the value of SHMLBA?
On Sparc, it is not a compile-time constant.

2. Is flushing part of the data cache something I can do from
userspace? (I'll figure out the exact machine instructions
myself if I need to do this, but it'd be nice to know if
it's possible before I have a go).

3. Is there a kernel bug on Sparc, because the test program
is either getting mappings that aren't aligned to run time
SHMLBA, or the kernel's run time SHMLBA value is not correct.

Thanks,
-- Jamie

2003-09-01 07:15:35

by David Miller

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003 07:42:31 +0100
Jamie Lokier <[email protected]> wrote:

> David S. Miller wrote:
> > On Sun, 31 Aug 2003 23:49:37 +0100
> > Jamie Lokier <[email protected]> wrote:
> >
> > > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > > doesn't constrain the mapping alignment.
> >
> > That's wrong. If a platform needs to, it should properly
> > align the mapping when MAP_SHARED is used on a file.
> >
> > If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> > that when we're mmap()'ing a file and MAP_SHARED is specified,
> > we align things to SHMLBA.
>
> Then you have a bug in the Sparc code. It looks like it should return
> -EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
> but the test program is clearly getting mappings that aren't aligned
> to SHMLBA.

I disagree, MAP_FIXED means "I know what I am doing don't override
this unless the mapping area is not available in my address space."
You should never specify MAP_FIXED unless you _REALLY_ know what you
are doing.

> Thus I have three Sparc-specific questions:
>
> 1. How does userspace find out the value of SHMLBA?
> On Sparc, it is not a compile-time constant.

Don't specify MAP_FIXED for MAP_SHARED mapping if you want
proper coherency, that's my answer for this one.

> 2. Is flushing part of the data cache something I can do from
> userspace? (I'll figure out the exact machine instructions
> myself if I need to do this, but it'd be nice to know if
> it's possible before I have a go).

There is no efficient way to do this from userspace, only the
kernel has access to the more efficient cache flushing instructions.
You'd need to flush via loads to displace the aliasing cache lines.
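
A displacement flush of that kind might look roughly like the sketch below. The function name, the scratch buffer and the line size are all assumptions; whether the walk really evicts the aliased lines depends on the cache size, associativity and replacement policy of the machine in question.

```c
#include <assert.h>
#include <stddef.h>

/* Read one byte per cache line across a scratch buffer.  If the buffer is
   at least as large as the data cache, the loads should displace whatever
   lines were there before.  Returns the number of lines touched. */
static size_t
displace_cache (const volatile unsigned char *scratch, size_t scratch_bytes,
                size_t line_bytes)
{
  size_t touched = 0;
  unsigned char sink = 0;

  assert (line_bytes != 0);
  for (size_t off = 0; off < scratch_bytes; off += line_bytes)
    {
      sink ^= scratch[off];   /* volatile load, cannot be optimised away */
      touched++;
    }
  (void) sink;
  return touched;
}
```

With a 64KiB scratch buffer and 32-byte lines this performs 2048 loads per flush, which gives an idea of why it is not efficient.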

> 3. Is there a kernel bug on Sparc, because the test program
> is either getting mappings that aren't aligned to run time
> SHMLBA, or the kernel's run time SHMLBA value is not correct.

No, the user is allowed to hang himself with MAP_FIXED.

The bug is in your code :)

2003-09-01 08:15:33

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

This looks like an old kernel on your NetWinder. Later 2.4 kernels
should get this right (by marking the pages uncacheable in user space.)

However, when I tried this program, it seemed to have some unexpected
results, sometimes claiming that it's too slow, sometimes that the
store buffer isn't coherent, and sometimes saying that the cache
isn't coherent.

Oddly, davem's cache aliasing test program works every time.

It's something which I need to look into, but I don't know when I'm
going to find the time to delve into the memory management stuff.

On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote:
> Corel NetWinder (275MHz StrongARM)
> Test separation: 4096 bytes: FAIL - cache not coherent
> Test separation: 8192 bytes: FAIL - cache not coherent
> Test separation: 16384 bytes: FAIL - cache not coherent
> Test separation: 32768 bytes: FAIL - cache not coherent
> Test separation: 65536 bytes: FAIL - cache not coherent
> Test separation: 131072 bytes: FAIL - cache not coherent
> Test separation: 262144 bytes: FAIL - cache not coherent
> Test separation: 524288 bytes: FAIL - cache not coherent
> Test separation: 1048576 bytes: FAIL - cache not coherent
> Test separation: 2097152 bytes: FAIL - cache not coherent
> Test separation: 4194304 bytes: FAIL - cache not coherent
> Test separation: 8388608 bytes: FAIL - cache not coherent
> Test separation: 16777216 bytes: FAIL - cache not coherent
> VM page alias coherency test: failed; will use copy buffers instead
>
> cat /proc/cpuinfo
> Processor : StrongARM-110 rev 3 (v4l)
> BogoMIPS : 185.95
> Features : swp half 26bit fastmult
>
> Hardware : Rebel-NetWinder
> Revision : 52ff
> Serial : 00000000000008bf

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-01 08:35:28

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003, Jamie Lokier wrote:
> Geert Uytterhoeven wrote:
> > Are you also interested in m68k? ;-)
> >
> > cassandra:/tmp# time ./test
> > Test separation: 4096 bytes: FAIL - store buffer not coherent
>
> Especially! I hadn't expected to see any machine that would print
> "store buffer not coherent". It means that if there's an L1 cache, it
> is coherent, but any store-then-load bypass in the CPU pipeline is
> using the virtual address with no rollback after MMU translation.
>
> I had thought it would only be the case with chips using an external
> MMU, but now that I think about it, the older simpler chips aren't
> going to bother with things like pipeline rollback wherever they can
> get away without it!

As you probably know the 68020 had an external MMU (68851, or Sun-3 or Apollo
MMU). Probably Motorola didn't bother to change the behavior when the MMU got
integrated in later generations (68030 and up).

BTW, probably you want us to run your test program on other m68k boxes? Mine
got a 68040, that leaves us with:
- 68020+68851
- 68020+Sun-3 MMU
- 68030
- 68060

For linux-m68k: You can find the test program source in Jamie's original
posting on lkml. For your convenience, I put a binary for m68k at
http://home.tvd.be/cr26864/Linux/m68k/jamie_test.gz. Just tell us the
program's output and give us a copy of your /proc/cpuinfo. Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-01 08:29:29

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

David S. Miller wrote:
> I disagree, MAP_FIXED means "I know what I am doing don't override
> this unless the mapping area is not available in my address space."
> You should never specify MAP_FIXED unless you _REALLY_ know what you
> are doing.

So explain this from the Sparc architecture code:

    if (flags & MAP_FIXED) {
        /* We do not accept a shared mapping if it would violate
         * cache aliasing constraints.
         */
        if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
            return -EINVAL;
        return addr;
    }

Ok, I'll explain it :) At one time, the code did what the comment says,
but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
MAP_FIXED, so the above code is redundant and misleading. It already
misled me, so please remove it. sparc and sparc64 both have it.
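
The mask test in that snippet is easy to try from userspace. The sketch below uses invented function names and the C library's SHMLBA from <sys/shm.h>, which is an assumption in itself: it need not match the value the kernel uses at run time, and that mismatch is exactly what is in question here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/shm.h>

/* Nonzero if `addr' is a multiple of `granule', which must be a power
   of two.  This is the same mask test as in the kernel snippet. */
static int
is_granule_aligned (uintptr_t addr, uintptr_t granule)
{
  assert (granule != 0 && (granule & (granule - 1)) == 0);
  return (addr & (granule - 1)) == 0;
}

/* The same check against the C library's idea of SHMLBA. */
static int
is_shmlba_aligned (const void *addr)
{
  return is_granule_aligned ((uintptr_t) addr, (uintptr_t) SHMLBA);
}
```

Calling is_shmlba_aligned () on each address handed to mmap() with MAP_FIXED|MAP_SHARED would show whether a given mapping is one the comment says should be rejected.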

> > Thus I have three Sparc-specific questions:
> >
> > 1. How does userspace find out the value of SHMLBA?
> > On Sparc, it is not a compile-time constant.
>
> Don't specify MAP_FIXED for MAP_SHARED mapping if you want
> proper coherency, that's my answer for this one.

I can't safely set up this kind of mapping without MAP_FIXED, unless I
know SHMLBA.

This is my strategy:

mmap MAP_ANON without MAP_FIXED to find a free area
mmap MAP_FIXED over the anon area at same address
mmap MAP_FIXED over the anon area at larger address

I don't see any strategy that lets me establish this kind of circular
mapping on Sparc without either (a) knowing the value of SHMLBA, or
(b) risking clobbering another thread's mmap.
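
Spelled out, the three steps look roughly like this. It is a cut-down sketch, not the test program's page_alias_using_mmap(): the function name is invented, the backing object is an unlinked mkstemp() file rather than POSIX shared memory, and no attempt is made to block signals around the open/unlink pair.

```c
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS and mkstemp under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `size' bytes of one shared object at base and at base + separation.
   Both arguments must be multiples of the page size.  Returns base, or
   NULL on failure. */
static void *
map_aliased (size_t size, size_t separation)
{
  char path[] = "/tmp/alias-XXXXXX";
  char *base;
  int fd;

  assert (size != 0 && size <= separation);
  fd = mkstemp (path);
  if (fd == -1)
    return NULL;
  unlink (path);
  if (ftruncate (fd, (off_t) size) != 0)
    goto fail;

  /* Step 1: an anonymous mapping without MAP_FIXED finds a free area. */
  base = mmap (NULL, separation + size, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    goto fail;

  /* Steps 2 and 3: MAP_FIXED over the reserved area, first at the same
     address and then at the larger address.  Nothing outside the
     reservation is ever overwritten. */
  if (mmap (base, size, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED
      || mmap (base + separation, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
    {
      munmap (base, separation + size);
      goto fail;
    }
  close (fd);
  return base;

fail:
  close (fd);
  return NULL;
}
```

On a machine with coherent aliases, a byte stored at base[n] reads back at base[separation + n]; the whole range is released with munmap (base, separation + size).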

> > 3. Is there a kernel bug on Sparc, because the test program
> > is either getting mappings that aren't aligned to run time
> > SHMLBA, or the kernel's run time SHMLBA value is not correct.
>
> No, the user is allowed to hang himself with MAP_FIXED.
> The bug is in your code :)

Well, my code has no bug because I do run-time tests to see what
rubbish the architecture gave me. As we see, they work :)

I don't see any real alternative to doing that. But that's ok, it
seems robust and portable. It's a shame about the slow cache flush,
because I can sometimes use fast cache flushing to improve my DSP
buffering algorithms.

> > 2. Is flushing part of the data cache something I can do from
> > userspace? (I'll figure out the exact machine instructions
> > myself if I need to do this, but it'd be nice to know if
> > it's possible before I have a go).
>
> There is no efficient way to do this from userspace, only the
> kernel has access to the more efficient cache flushing instructions.
> You'd need to flush via loads to displace the aliasing cache lines.

Will msync() do it?

Thanks,
-- Jamie

2003-09-01 09:11:50

by David Miller

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003 09:29:11 +0100
Jamie Lokier <[email protected]> wrote:

> David S. Miller wrote:
> > I disagree, MAP_FIXED means "I know what I am doing don't override
> > this unless the mapping area is not available in my address space."
> > You should never specify MAP_FIXED unless you _REALLY_ know what you
> > are doing.
>
> So explain this from the Sparc architecture code:
>
>     if (flags & MAP_FIXED) {
>         /* We do not accept a shared mapping if it would violate
>          * cache aliasing constraints.
>          */
>         if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
>             return -EINVAL;
>         return addr;
>     }
>
> Ok, I'll explain it :) At one time, the code did what the comment says,
> but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
> MAP_FIXED, so the above code is redundant and misleading. It already
> misled me, so please remove it. sparc and sparc64 both have it.

I take back what I said, I think the -EINVAL behavior is better
and mmap.c should call into this code to verify the requested
mmap() parameters.

> This is my strategy:
>
> mmap MAP_ANON without MAP_FIXED to find a free area
> mmap MAP_FIXED over the anon area at same address
> mmap MAP_FIXED over the anon area at larger address
>
> I don't see any strategy that lets me establish this kind of circular
> mapping on Sparc without either (a) knowing the value of SHMLBA, or
> (b) risking clobbering another thread's mmap.

Why do you need the same piece of data mapped to multiple places
in the first place, and why at specific addresses? It's purely an
optimization of some sort, right?

> Well, my code has no bug because I do run-time tests to see what
> rubbish the architecture gave me. As we see, they work :)

It doesn't work in just the right set of circumstances, if interrupts
arrive at just the right moment it might flush the bad aliases out
of the cache via displacement during your 'check' phase.

Then during your actual computation you can hit the aliasing problem
silently.

That's just a bad way to do this.

> I don't see any real alternative to doing that.

I'd suggest instead to hardcode the SHMLBA stuff into your sources.

> But that's ok, it seems robust and portable.

Unfortunately, it is anything but robust.

> > There is no efficient way to do this from userspace, only the
> > kernel has access to the more efficient cache flushing instructions.
> > You'd need to flush via loads to displace the aliasing cache lines.
>
> Will msync() do it?

No.

2003-09-01 09:10:43

by Kars de Jong

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:

> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:
> - 68020+68851
> - 68060

I can run it on these boxes if no-one else has done it yet before I come
home tonight. I'm sure there are more people with a 68060 out there, not
too sure about the 68020+68851.


Regards,

Kars.

2003-09-01 10:05:20

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

David S. Miller wrote:
> Why do you need the same piece of data mapped to multiple places
> in the first place, and why at specific addresses? It's purely an
> optimization of some sort, right?

Right. It's a circular buffer for signal processing: DSP code sees
contiguous ascending addresses. The multiple maps mean we don't have
to copy the contents of the buffer back to the start periodically, nor
mask the offset into the array on each memory access, nor write
extra-complicated DSP code which can handle split regions.

It's an optimisation, it works well on some architectures and on
others it's not worth it. On those, I just copy - it keeps the DSP
code fast and simple.
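
To make the trade-off concrete, here is a sketch of the per-access masking that the double mapping avoids, next to the plain loop the DSP code can use instead. The function names and the use of short samples are illustrative assumptions only.

```c
#include <assert.h>
#include <stddef.h>

/* Sum `n' samples starting at ring position `pos', masking the index on
   every access.  `ring_len' must be a power of two. */
static long
sum_masked (const short *ring, size_t ring_len, size_t pos, size_t n)
{
  long total = 0;
  assert (ring_len != 0 && (ring_len & (ring_len - 1)) == 0);
  for (size_t i = 0; i < n; i++)
    total += ring[(pos + i) & (ring_len - 1)];
  return total;
}

/* The same sum over a contiguous run.  With the buffer mapped twice,
   `ring + pos' can be passed directly even when the run wraps. */
static long
sum_contiguous (const short *samples, size_t n)
{
  long total = 0;
  for (size_t i = 0; i < n; i++)
    total += samples[i];
  return total;
}
```

With a double-mapped buffer of ring_len samples, sum_contiguous (ring + pos, n) returns the same value as sum_masked (ring, ring_len, pos, n) for any n up to ring_len, without the mask in the inner loop.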

> > Well, my code has no bug because I do run-time tests to see what
> > rubbish the architecture gave me. As we see, they work :)
>
> It doesn't work in just the right set of circumstances, if interrupts
> arrive at just the right moment it might flush the bad aliases out
> of the cache via displacement during your 'check' phase.
>
> Then during your actual computation you can hit the aliasing problem
> silently.

To fool the coherence test, interrupts would need to arrive in a 2
instruction window, at least 8192 times. It is possible, but unlikely
except in pathological situations.
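
The shape of such a check is roughly as follows. This is only an illustration of the store-then-load idea with an invented function name; it is not the loop from the test program, and it leaves the setting up of the two aliases to the caller.

```c
#include <assert.h>

/* Store a fresh value through one alias and load it straight back through
   the other, `rounds' times.  Returns how many loads saw a stale value.
   Each stale read needs the store and the load to stay adjacent, which is
   the short window mentioned above. */
static unsigned
count_stale_reads (volatile unsigned *alias_a, volatile unsigned *alias_b,
                   unsigned rounds)
{
  unsigned stale = 0;
  for (unsigned i = 1; i <= rounds; i++)
    {
      *alias_a = i;          /* store through the first mapping */
      if (*alias_b != i)     /* load through the second mapping */
        stale++;
    }
  return stale;
}
```

Called with two addresses that alias the same shared word and rounds set to 8192, a result of zero means every load saw the preceding store. A fully incoherent pair would report close to 8192, which is why a handful of badly timed interrupts cannot hide the problem.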

Of course if you make mmap() return EINVAL then it cannot possibly fail :)

> I'd suggest instead to hardcode the SHMLBA stuff into your sources.

How? SHMLBA is a run time value on the Sparc; I have no idea how
to work it out.

-- Jamie

2003-09-01 10:11:45

by David Miller

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003 11:04:58 +0100
Jamie Lokier <[email protected]> wrote:

> Of course if you make mmap() return EINVAL then it cannot possibly fail :)

Right :-)

> > I'd suggest instead to hardcode the SHMLBA stuff into your sources.
>
> How? SHMLBA is a run time value on the Sparc; I have no idea how
> to work it out.

You're talking about 32-bit sparc, on sparc64 it's a constant
16K.

For sparc 32-bit, just use 4MB, that's the largest possible value.

And you have to check this with uname() results, not with ifdefs
as 32-bit Sparc binaries run on sparc64 systems just fine.

I also would not object at all to a kernel patch that exported the
SHMLBA value via some sysctl value.
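
A run-time selection along those lines could be sketched like this. The function names are invented, the machine strings "sparc" and "sparc64" are an assumption about what uname() reports, and the page-size fallback for every other machine is a placeholder rather than a claim about those architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Pick an alias granule from the uname() machine string. */
static size_t
granule_for_machine (const char *machine, size_t fallback)
{
  assert (machine != NULL);
  if (strcmp (machine, "sparc64") == 0)
    return 16 * 1024;             /* constant 16K on sparc64 */
  if (strcmp (machine, "sparc") == 0)
    return 4 * 1024 * 1024;       /* largest possible 32-bit sparc value */
  return fallback;
}

/* Decide at run time, since a 32-bit Sparc binary may be running on a
   sparc64 kernel and an ifdef would get that case wrong. */
static size_t
alias_granule (void)
{
  struct utsname u;
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  if (uname (&u) != 0)
    return page;
  return granule_for_machine (u.machine, page);
}
```

The result of alias_granule () would then be used as the minimum separation (and alignment) when setting up the two mappings.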

2003-09-01 10:09:26

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Kars de Jong wrote:
> On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > got a 68040, that leaves us with:
> > - 68020+68851
> > - 68060
>
> I can run it on these boxes if no-one else has done it yet before I come
> home tonight. I'm sure there are more people with a 68060 out there, not
> too sure about the 68020+68851.

I would prefer that you run the attached program. It fixes a bug in
the function which tests whether the problem is in the L1 cache or
store buffer. The bug probably didn't affect the test, but it might
have.

Ideally you could run the program Geert linked to as well?
Please remember to compile both with optimisation.

Thanks,
-- Jamie

/* This code maps shared memory to multiple addresses and tests it
for cache coherency and performance.

Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals. These are used for when a
race condition might leave a temporary file that should have been
deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
sigset_t all_signals;
sigfillset (&all_signals);
sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
function, or using a regular temporary file in /tmp. Immediately
after opening the file, it is unlinked from the global namespace
using `shm_unlink' or `unlink'.

On success, the value returned is a file descriptor. Otherwise, -1
is returned and `errno' is set.

The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
On GNU/Linux with Glibc, it requires `-lrt'. Unfortunately, Glibc's
-lrt insists on linking to pthreads, which we may not want to use
because that enables thread locking overhead in other functions. So
we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/" /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
char * ptr, buffer [19];
int fd, i;
unsigned long number;
sigset_t save_signals;
struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
struct statfs sfs;
if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
{
errno = ENOSYS;
return -1;
}
#endif

loop:
/* Print a randomised path name into `buffer'. The string depends on
the directory and whether we are using POSIX.4 shared memory or a
regular temporary file. RANDOM is a 5-digit, base-62
representation of a pseudo-random number. The string is used as a
candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
strcpy (buffer, "/tmp/shm-");
#endif
ptr = buffer + strlen (buffer);
gettimeofday (&tv, (struct timezone *) 0);
number = (unsigned long) random ();
number += (unsigned long) getpid ();
number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
for (i = 0; i < 5; i++)
{
/* Don't use character arithmetic, as not all systems are ASCII. */
*ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
number /= 62;
}
*ptr = '\0';

/* Block signals between the open and unlink, to really minimise
the chance of accidentally leaving an unwanted file around. */
block_signals (&save_signals);
#if HAVE_SHM_OPEN
if (!use_tmp_file)
{
fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
shm_unlink (buffer);
}
else
#endif /* HAVE_SHM_OPEN */
{
fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
unlink (buffer);
}
unblock_signals (&save_signals);

/* If we failed due to a name collision or a signal, try again. */
if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
goto loop;

return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
region will not be allocated for any other purpose. It is freed with
`munmap'.

Returns the mapped base address on success. Otherwise, MAP_FAILED is
returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0
#endif
#ifndef MAP_FILE
#define MAP_FILE 0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE 0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED ((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
void * addr;
#ifdef MAP_ANONYMOUS
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_ANONYMOUS
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else /* not defined MAP_ANONYMOUS */
int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
if (zero_fd == -1)
return MAP_FAILED;
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_FILE
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
save_errno = errno;
close (zero_fd);
errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
a temporary regular file.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
void * base_addr, * addr;
int fd, i, save_errno;
struct stat st;

fd = open_shared_memory_file (use_tmp_file);
if (fd == -1)
goto fail;

/* First, resize the shared memory file to the desired size. */
if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
goto close_fail;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto close_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
addr = mmap ((char *) base_addr + (i ? separation : 0), size,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
fd, (off_t) 0);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}
if (close (fd) != 0)
goto unmap_fail;

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
close_fail:
save_errno = errno;
close (fd);
errno = save_errno;
fail:
return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
void * base_addr, * addr;
sigset_t save_signals;
int shmid, i, save_errno;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Block signals between the shmget() and IPC_RMID, to minimise the chance
of accidentally leaving an unwanted shared segment around. */
block_signals (&save_signals);

shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
if (shmid == -1)
goto unmap_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
/* `shmat' is tried twice. The first time it can fail if the local
implementation of `shmat' refuses to map over a region mapped
with `mmap'. In that case, we punch a hole using `munmap' and
do it again.

If the local `shmat' has this property, the `shmat' calls
to fixed addresses might collide with a concurrent thread
which is also doing mappings, and will fail. At least it
is a safe failure.

On the other hand, if the local `shmat' can map over
already-mapped regions (in the same way that `mmap' does), it
is essential that we do actually use an already-mapped region,
so that collisions with a concurrent thread can't possibly
result in both of us grabbing the same address range with no
indication of error. */
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
if (addr == (void *) -1 && errno == EINVAL)
{
munmap ((char *) base_addr + (i ? separation : 0), size);
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
}

/* Check for errors. */
if (addr == (void *) -1)
{
/* Any segment attached on an earlier pass is detached by the
cleanup loop at `remove_shm_fail_se'. */
save_errno = errno;
goto remove_shm_fail_se;
}
else if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `shmat' ignored the requested address! Detach the stray
attachment; the cleanup loop detaches any earlier one. */
shmdt (addr);
save_errno = EINVAL;
goto remove_shm_fail_se;
}
}

if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
goto remove_shm_fail;
unblock_signals (&save_signals);

/* Success! */
return base_addr;

/* Failure. */
remove_shm_fail:
save_errno = errno;
remove_shm_fail_se:
while (--i >= 0)
shmdt ((char *) base_addr + (i ? separation : 0));
shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
errno = save_errno;
unmap_fail:
save_errno = errno;
unblock_signals (&save_signals);
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer. Shared memory of size `size' is
mapped twice, with the difference between the two addresses being
`separation', which must be at least `size'. The total address range
used is `separation + size' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
void * addr;
if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
{
errno = EINVAL;
return 0;
}

/* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
*method = 0;
if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
return addr;
#endif
#if HAVE_SYSV_SHM
*method = 1;
if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
return addr;
#endif
*method = 2;
return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
`__page_alias_map'. `address' is the base address, and `size' and
`separation' are the arguments previously passed to
`__page_alias_map'. `method' is the value previously stored in *METHOD.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
if (method == 1)
{
shmdt (address);
shmdt ((char *) address + separation);
if (separation > size)
munmap ((char *) address + size, separation - size);
return 0;
}
#endif

return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer. `size' is the size of the buffer to
create; it will be mapped twice to cover a total address range
`size * 2' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
`page_alias_map'. `address' is the base address, and `size' is the
size of the buffer (which is half of the total mapped address range).
`method' is a value previously stored in *METHOD by `page_alias_map'.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
aliased pages. We use a combination of mappings similar to
page_alias_*(), in case there are resource limitations which would
prevent malloc() or a single mmap() working for the larger address
range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
void * base_addr, * addr;
int i, save_errno;

if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Map anonymous memory at the different addresses. */
for (i = 0; i < 2; i++)
{
addr = map_address_space ((char *) base_addr + (i ? separation : 0),
size, 1);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

/* This should be a word size that the architecture can read and write
fast in a single instruction. In principle, C's `int' is the natural
word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
also act as compiler memory barriers. These are used to force a
group of write/write/read instructions as close together as possible,
to maximise the detection of store buffer conditions. Despite being
asm statements, these will work with any of GCC's target architectures,
provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
__asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
__asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
: "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
reads within a few instructions, and ignores virtual to physical
address translations when doing that. In principle a CPU might do
this even if its L1 cache is physically tagged or indexed, although
I have not seen such a system. (A CPU which uses store buffer
snooping and with an off-board MMU, which the CPU is unaware of,
could have this property).

It isn't possible to do this test perfectly; we do our best. The
`force_into_register' macros ensure that the write/write/read
sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
register WORD __reg1 = 1, __reg2 = 0;
force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
*__regptr1 = __reg1;
*__regptr2 = __reg2;
__reg1 = *__regptr1;
force_into_register (__reg1);
return __reg1;
}

/* This function tests whether writes to one page are seen in another
page at a different virtual address, and whether they are nearly as
fast as normal writes.

The accesses are timed by the caller of this function.
Alternate writes go to alternate pages, so that if aliasing is
implemented using page faults, it will clearly show up in the
timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
WORD fail = 0;
while (--timing_loops >= 0)
fail |= test_store_buffer_snoop (ptr1, ptr2);
return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
buffer snoop coherency. To do this, we add enough stores that the
writes to *PTR1 are flushed (or drain due to the time delay) from the
store buffer before we read from *PTR1. The result of this function
is not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
int i, j;
WORD fail = 0;
for (i = 0; i < 10; i++)
{
*ptr1 = 1;
/* This loop of volatile writes creates a short time delay. The
delay gives the store to *PTR1 time to flush from the store
buffer and/or the many writes flush the store buffer. The loop
writes to *PTR2 because if we pick another fixed address and
write to it, that would be testing 3 cache lines (PTR1, PTR2
and the fixed address) and the fixed address _might_ happen to
collide with PTR1 or PTR2 in the L1 cache. If the L1 cache is
2-way set-associative, that would flush it every time, possibly
making it appear coherent when it isn't. */
for (j = 0; j < 1000; j++)
*ptr2 = 0;
fail |= *ptr1;
}
return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
separation, to see if they really behave like memory appearing at two
locations, and efficiently. We search through different values of
`separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
void * buffers [2];
long timings [3];
int i, method, timing_loops = 64;

/* We measure timings of 3 different tests, each 128 times to find the
minimum. 0: Writes and reads to aliased pages. 1: Writes and
reads to non-aliased pages, to compare with 0. 2: Doing nothing,
to measure the time for `gettimeofday' itself.

The measurements are done in a mixed up order. If we did 128
measurements of type 0, then 128 of type 1, then 128 of type 2, the
results could be misleading due to synchronisation with other
processes occurring on the machine. */

/* A previously generated random shuffle of bit-pairs. Each pair is a
number from the set {0,1,2}. Each number occurs exactly 128 times. */
static const unsigned char pattern [96] =
{
0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
};

buffers [0] = __page_alias_map (system_page_size, separation, &method);
if (buffers [0] == 0)
return "alias map failed";
buffers [1] = page_no_alias (system_page_size, separation);
if (buffers [1] == 0)
{
__page_alias_unmap (buffers [0], system_page_size, separation, method);
return "non-alias map failed";
}

retry:
timings [2] = timings [1] = timings [0] = LONG_MAX;
for (i = 0; i < 384; i++)
{
struct timeval time_before, time_after;
long time_delta;
int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

/* Test whether writes to one page appear immediately in the other,
and time how long the memory accesses take. */
gettimeofday (&time_before, (struct timezone *) 0);
if (which_test < 2)
fail = test_page_alias (ptr1, ptr2, timing_loops);
gettimeofday (&time_after, (struct timezone *) 0);

if (fail && which_test == 0)
{
/* Test whether the failure is due to a store buffer bypass
which ignores virtual address translation. */
int l1_fail = test_l1_only (ptr1, ptr2);
__page_alias_unmap (buffers [0], system_page_size, separation,
method);
munmap (buffers [1], separation + system_page_size);
return l1_fail ? "cache not coherent" : "store buffer not coherent";
}

time_delta = ((time_after.tv_usec - time_before.tv_usec)
+ 1000000 * (time_after.tv_sec - time_before.tv_sec));

/* Find the smallest time taken for each test. Ignore negative
glitches due to Linux's tendency to jump the clock backwards. */
if (time_delta >= 0 && time_delta < timings [which_test])
timings [which_test] = time_delta;
}

/* Remove the cost of `gettimeofday()' itself from measurements. */
timings [0] -= timings [2];
timings [1] -= timings [2];

/* Keep looping until at least one measurement becomes significant. A
very fast CPU will show measurements of zero microseconds for
smaller values of `timing_loops'. Also loop until the cost of
`gettimeofday()' becomes insignificant. When the program is run
under `strace', the latter is big and this is needed to stabilise
the results. */
if (timings [0] <= 10 * (1 + timings [2])
&& timings [1] <= 10 * (1 + timings [2]))
{
timing_loops <<= 1;
goto retry;
}

__page_alias_unmap (buffers [0], system_page_size, separation, method);
munmap (buffers [1], separation + system_page_size);

printf ("(%d) [%ld,%ld,%ld] ",
timing_loops, timings [0], timings [1], timings [2]);

/* Reject page aliasing if it is much slower than accessing a single,
definitely cached page directly. */
if (timings [0] > 2 * timings [1])
return "too slow";

/* Success! Passed all tests for these parameters. */
return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
size_t size;

#ifdef _SC_PAGESIZE
system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
system_page_size = sysconf (_SC_PAGE_SIZE);
#else
system_page_size = getpagesize ();
#endif

for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
{
const char * reason = test_one_separation (size);

printf ("Test separation: %lu bytes: %s%s\n",
(unsigned long) size, reason ? "FAIL - " : "pass",
reason ? reason : "");

/* This logic searches for the smallest _contiguous_ range
of separations for which `test_one_separation' passes. */
if (reason == 0 && page_alias_smallest_size == 0)
page_alias_smallest_size = size;
else if (reason != 0 && page_alias_smallest_size != 0)
{
/* Fail, indicating that page-aliasing is not reliable,
because there's a maximum size. We don't support that as
it seems quite unlikely given our model of cache colouring. */
page_alias_smallest_size = 0;
break;
}
}

printf ("VM page alias coherency test: ");

if (page_alias_smallest_size == 0)
printf ("failed; will use copy buffers instead\n");
else if (page_alias_smallest_size == system_page_size)
printf ("all sizes passed\n");
else
printf ("minimum fast spacing: %lu (%lu page%s)\n",
(unsigned long) page_alias_smallest_size,
(unsigned long) (page_alias_smallest_size / system_page_size),
(page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
page_alias_init ();
return 0;
}
//#endif

2003-09-01 10:13:32

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> This looks like an old kernel on your NetWinder. Later 2.4 kernels
> should get this right (by marking the pages uncacheable in user space.)

How do they know which pages to mark uncacheable? Surely not all
MAP_SHARED|MAP_FIXED mappings are uncacheable?

> However, when I tried this program, it seemed to have some unexpected
> results, sometimes claiming that its too slow, sometimes that the
> store buffer isn't coherent, and sometimes saying that the cache
> isn't coherent.

If it says the store buffer isn't coherent, that means the main test
for coherence failed (test_page_alias), but a second test
(test_l1_only), which is designed to allow any CPU delayed stores to
drain, is showing the same memory to be coherent.
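
Put another way, the message is picked from the two test results roughly as in this sketch. The helper name `diagnose' is made up for illustration and is not in the program; the two strings are the ones the program prints.

```c
/* Hypothetical helper, not part of the test program.  Each argument
   is non-zero if that test read back stale data.  Returns 0 when the
   alias looks coherent, otherwise the diagnostic string. */
static const char *
diagnose (int alias_fail, int l1_fail)
{
  if (!alias_fail)
    return 0;                   /* Main test passed: coherent. */

  /* The main test failed.  If the follow-up test, which gives delayed
     stores time to drain, fails as well, blame the cache itself. */
  return l1_fail ? "cache not coherent" : "store buffer not coherent";
}
```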

There is a bug in test_l1_only which I just noticed. It's unlikely,
but if `dummy' happens to have the same L1 cache address as both words
being tested, and it's a 2-way (or less) set-associative cache, then
it will inadvertently flush the cache and say "store buffer not
coherent" when it means to say "cache not coherent".
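
To see how such a collision comes about, here is a sketch of the usual set-index arithmetic. The cache geometry used in the note below (16k, 32-byte lines, 2-way) and all the names are invented for illustration and don't describe any particular CPU. It also assumes the cache is indexed by whatever address you pass in.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative cache geometry; the values are assumptions supplied by
   the caller, not measurements of any real CPU. */
struct cache_geometry
{
  size_t size;        /* Total cache size in bytes. */
  size_t line_size;   /* Bytes per cache line. */
  size_t ways;        /* Associativity. */
};

/* Which set of the cache an address falls into. */
static size_t
cache_set_index (const struct cache_geometry * g, uintptr_t address)
{
  size_t sets = g->size / (g->line_size * g->ways);
  return (size_t) ((address / g->line_size) % sets);
}

/* Non-zero if all three addresses compete for the same set, and the
   cache has fewer than 3 ways, so it must evict one of them. */
static int
three_way_collision (const struct cache_geometry * g,
                     uintptr_t a, uintptr_t b, uintptr_t c)
{
  size_t set = cache_set_index (g, a);
  return g->ways < 3
         && cache_set_index (g, b) == set
         && cache_set_index (g, c) == set;
}
```

With those made-up numbers a set repeats every 8192 bytes, so 0x10000, 0x12000 and 0x14000 all land in set 0 and a 2-way cache can't hold all three lines at once.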

If the duplicate mapping is uncacheable, it should always say it's too
slow. However, if _all_ MAP_FIXED|MAP_SHARED mappings are uncacheable,
then the timing comparison will think there is no penalty for the
duplicate mapping.
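
The "too slow" verdict comes from a simple ratio test in the program, sketched on its own here. The helper name and the sample figures in the note below are made up; only the factor of 2 is taken from the program.

```c
/* The "too slow" rule: the aliased access time must not exceed twice
   the non-aliased time.  Both times are in microseconds, after the
   cost of gettimeofday() itself has been subtracted.  The function
   name is hypothetical. */
static int
alias_too_slow (long aliased_usec, long plain_usec)
{
  return aliased_usec > 2 * plain_usec;
}
```

For example, an aliased time of 900us against a plain time of 100us is rejected, while 900us against 880us passes, even though both are slow in absolute terms. The rule only sees the ratio.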

> On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote:
> > Corel NetWinder (275MHz StrongARM)
> > Test separation: 4096 bytes: FAIL - cache not coherent

All 3 results I have for ARM say the caches are incoherent.
Those results are all for SA-110s of different speeds.

Please try the program below, which is the same as before but with
test_l1_only hopefully improved, and it prints some more helpful
numbers.

Thanks,
-- Jamie

==========================================

/* Version 3! This code maps shared memory to multiple addresses and
tests it for cache coherency and performance.

Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals. These are used for when a
race condition might leave a temporary file that should have been
deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
sigset_t all_signals;
sigfillset (&all_signals);
sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
function, or using a regular temporary file in /tmp. Immediately
after opening the file, it is unlinked from the global namespace
using `shm_unlink' or `unlink'.

On success, the value returned is a file descriptor. Otherwise, -1
is returned and `errno' is set.

The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
On GNU/Linux with Glibc, it requires `-lrt'. Unfortunately, Glibc's
-lrt insists on linking to pthreads, which we may not want to use
because that enables thread locking overhead in other functions. So
we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/" /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
char * ptr, buffer [19];
int fd, i;
unsigned long number;
sigset_t save_signals;
struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
struct statfs sfs;
if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
{
errno = ENOSYS;
return -1;
}
#endif

loop:
/* Print a randomised path name into `buffer'. The string depends on
the directory and whether we are using POSIX.4 shared memory or a
regular temporary file. RANDOM is a 5-digit, base-62
representation of a pseudo-random number. The string is used as a
candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
strcpy (buffer, "/tmp/shm-");
#endif
ptr = buffer + strlen (buffer);
gettimeofday (&tv, (struct timezone *) 0);
number = (unsigned long) random ();
number += (unsigned long) getpid ();
number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
for (i = 0; i < 5; i++)
{
/* Don't use character arithmetic, as not all systems are ASCII. */
*ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
number /= 62;
}
*ptr = '\0';

/* Block signals between the open and unlink, to really minimise
the chance of accidentally leaving an unwanted file around. */
block_signals (&save_signals);
#if HAVE_SHM_OPEN
if (!use_tmp_file)
{
fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
shm_unlink (buffer);
}
else
#endif /* HAVE_SHM_OPEN */
{
fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd != -1)
unlink (buffer);
}
unblock_signals (&save_signals);

/* If we failed due to a name collision or a signal, try again. */
if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
goto loop;

return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
region will not be allocated for any other purpose. It is freed with
`munmap'.

Returns the mapped base address on success. Otherwise, MAP_FAILED is
returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0
#endif
#ifndef MAP_FILE
#define MAP_FILE 0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE 0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED ((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
void * addr;
#ifdef MAP_ANONYMOUS
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_ANONYMOUS
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else /* not defined MAP_ANONYMOUS */
int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
if (zero_fd == -1)
return MAP_FAILED;
addr = mmap (optional_address, size,
access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
(MAP_PRIVATE | MAP_FILE
| (optional_address ? MAP_FIXED : MAP_VARIABLE)
| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
save_errno = errno;
close (zero_fd);
errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
a temporary regular file.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
void * base_addr, * addr;
int fd, i, save_errno;
struct stat st;

fd = open_shared_memory_file (use_tmp_file);
if (fd == -1)
goto fail;

/* First, resize the shared memory file to the desired size. */
if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
goto close_fail;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto close_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
addr = mmap ((char *) base_addr + (i ? separation : 0), size,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
fd, (off_t) 0);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}
if (close (fd) != 0)
goto unmap_fail;

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
close_fail:
save_errno = errno;
close (fd);
errno = save_errno;
fail:
return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

Returns the mapped base address on success. Otherwise, 0 is returned
and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
void * base_addr, * addr;
sigset_t save_signals;
int shmid, i, save_errno;

/* Map an anonymous region `separation + size' bytes long. This is how
we allocate sufficient contiguous address space. We over-map
this with the aliased buffer. */
if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Block signals between the shmget() and IPC_RMID, to minimise the chance
of accidentally leaving an unwanted shared segment around. */
block_signals (&save_signals);

shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
if (shmid == -1)
goto unmap_fail;

/* Map the same shared memory repeatedly, at different addresses. */
for (i = 0; i < 2; i++)
{
/* `shmat' is tried twice. The first time it can fail if the local
implementation of `shmat' refuses to map over a region mapped
with `mmap'. In that case, we punch a hole using `munmap' and
do it again.

If the local `shmat' has this property, the `shmat' calls
to fixed addresses might collide with a concurrent thread
which is also doing mappings, and will fail. At least it
is a safe failure.

On the other hand, if the local `shmat' can map over
already-mapped regions (in the same way that `mmap' does), it
is essential that we do actually use an already-mapped region,
so that collisions with a concurrent thread can't possibly
result in both of us grabbing the same address range with no
indication of error. */
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
if (addr == (void *) -1 && errno == EINVAL)
{
munmap ((char *) base_addr + (i ? separation : 0), size);
addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
}

/* Check for errors. */
if (addr == (void *) -1)
{
/* Any segment attached on an earlier pass is detached by the
cleanup loop at `remove_shm_fail_se'. */
save_errno = errno;
goto remove_shm_fail_se;
}
else if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `shmat' ignored the requested address! Detach the stray
attachment; the cleanup loop detaches any earlier one. */
shmdt (addr);
save_errno = EINVAL;
goto remove_shm_fail_se;
}
}

if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
goto remove_shm_fail;
unblock_signals (&save_signals);

/* Success! */
return base_addr;

/* Failure. */
remove_shm_fail:
save_errno = errno;
remove_shm_fail_se:
while (--i >= 0)
shmdt ((char *) base_addr + (i ? separation : 0));
shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
errno = save_errno;
unmap_fail:
save_errno = errno;
unblock_signals (&save_signals);
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer. Shared memory of size `size' is
mapped twice, with the difference between the two addresses being
`separation', which must be at least `size'. The total address range
used is `separation + size' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
void * addr;
if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
{
errno = EINVAL;
return 0;
}

/* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
*method = 0;
if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
return addr;
#endif
#if HAVE_SYSV_SHM
*method = 1;
if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
return addr;
#endif
*method = 2;
return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
`__page_alias_map'. `address' is the base address, and `size' and
`separation' are the arguments previously passed to
`__page_alias_map'. `method' is the value previously stored in *METHOD.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
if (method == 1)
{
shmdt (address);
shmdt ((char *) address + separation);
if (separation > size)
munmap ((char *) address + size, separation - size);
return 0;
}
#endif

return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer. `size' is the size of the buffer to
create; it will be mapped twice to cover a total address range
`size * 2' bytes long.

On success, *METHOD is filled with a number which must be passed to
`page_alias_unmap', and the mapped base address is returned.
Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
`page_alias_map'. `address' is the base address, and `size' is the
size of the buffer (which is half of the total mapped address range).
`method' is a value previously stored in *METHOD by `page_alias_map'.

Returns 0 on success. Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
aliased pages. We use a combination of mappings similar to
page_alias_*(), in case there are resource limitations which would
prevent malloc() or a single mmap() working for the larger address
range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
void * base_addr, * addr;
int i, save_errno;

if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
goto fail;

/* Map anonymous memory at the different addresses. */
for (i = 0; i < 2; i++)
{
addr = map_address_space ((char *) base_addr + (i ? separation : 0),
size, 1);
if (addr == MAP_FAILED)
goto unmap_fail;
if (addr != (char *) base_addr + (i ? separation : 0))
{
/* `mmap' ignored MAP_FIXED! Should never happen. */
munmap (addr, size);
save_errno = EINVAL;
goto unmap_fail_se;
}
}

/* Success! */
return base_addr;

/* Failure. */
unmap_fail:
save_errno = errno;
unmap_fail_se:
munmap (base_addr, separation + size);
errno = save_errno;
fail:
return 0;
}

/* This should be a word size that the architecture can read and write
fast in a single instruction. In principle, C's `int' is the natural
word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
also act as compiler memory barriers. These are used to force a
group of write/write/read instructions as close together as possible,
to maximise the detection of store buffer conditions. Despite being
asm statements, these will work with any of GCC's target architectures,
provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
__asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
__asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
: "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
reads within a few instructions, and ignores virtual to physical
address translations when doing that. In principle a CPU might do
this even if its L1 cache is physically tagged or indexed, although
I have not seen such a system. (A CPU which uses store buffer
snooping and with an off-board MMU, which the CPU is unaware of,
could have this property).

It isn't possible to do this test perfectly; we do our best. The
`force_into_register' macros ensure that the write/write/read
sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
register WORD __reg1 = 1, __reg2 = 0;
force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
*__regptr1 = __reg1;
*__regptr2 = __reg2;
__reg1 = *__regptr1;
force_into_register (__reg1);
return __reg1;
}

/* This function tests whether writes to one page are seen in another
page at a different virtual address, and whether they are nearly as
fast as normal writes.

The accesses are timed by the caller of this function.
Alternate writes go to alternate pages, so that if aliasing is
implemented using page faults, it will clearly show up in the
timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
WORD fail = 0;
while (--timing_loops >= 0)
fail |= test_store_buffer_snoop (ptr1, ptr2);
return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
buffer snoop coherency. To do this, we add enough stores that the
writes to *PTR1 are flushed (or drain due to the time delay) from the
store buffer before we read from *PTR1. The result of this function
is not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
int i, j;
WORD fail = 0;
for (i = 0; i < 10; i++)
{
*ptr1 = 1;
/* This loop of volatile writes creates a short time delay. The
delay gives the store to *PTR1 time to flush from the store
buffer and/or the many writes flush the store buffer. The loop
writes to *PTR2 because if we pick another fixed address and
write to it, that would be testing 3 cache lines (PTR1, PTR2
and the fixed address) and the fixed address _might_ happen to
collide with PTR1 or PTR2 in the L1 cache. If the L1 cache is
2-way set-associative, that would flush it every time, possibly
making it appear coherent when it isn't. */
for (j = 0; j < 1000; j++)
*ptr2 = 0;
fail |= *ptr1;
}
return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
separation, to see if they really behave like memory appearing at two
locations, and efficiently. We search through different values of
`separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
void * buffers [2];
long timings [3];
int i, method, timing_loops = 64;

/* We measure timings of 3 different tests, each 128 times to find the
minimum. 0: Writes and reads to aliased pages. 1: Writes and
reads to non-aliased pages, to compare with 0. 2: Doing nothing,
to measure the time for `gettimeofday' itself.

The measurements are done in a mixed up order. If we did 128
measurements of type 0, then 128 of type 1, then 128 of type 2, the
results could be misleading due to synchronisation with other
processes occurring on the machine. */

/* A previously generated random shuffle of bit-pairs. Each pair is a
number from the set {0,1,2}. Each number occurs exactly 128 times. */
static const unsigned char pattern [96] =
{
0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
};

buffers [0] = __page_alias_map (system_page_size, separation, &method);
if (buffers [0] == 0)
return "alias map failed";
buffers [1] = page_no_alias (system_page_size, separation);
if (buffers [1] == 0)
{
__page_alias_unmap (buffers [0], system_page_size, separation, method);
return "non-alias map failed";
}

retry:
timings [2] = timings [1] = timings [0] = LONG_MAX;
for (i = 0; i < 384; i++)
{
struct timeval time_before, time_after;
long time_delta;
int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

/* Test whether writes to one page appear immediately in the other,
and time how long the memory accesses take. */
gettimeofday (&time_before, (struct timezone *) 0);
if (which_test < 2)
fail = test_page_alias (ptr1, ptr2, timing_loops);
gettimeofday (&time_after, (struct timezone *) 0);

if (fail && which_test == 0)
{
/* Test whether the failure is due to a store buffer bypass
which ignores virtual address translation. */
int l1_fail = test_l1_only (ptr1, ptr2);
__page_alias_unmap (buffers [0], system_page_size, separation,
method);
munmap (buffers [1], separation + system_page_size);
return l1_fail ? "cache not coherent" : "store buffer not coherent";
}

time_delta = ((time_after.tv_usec - time_before.tv_usec)
+ 1000000 * (time_after.tv_sec - time_before.tv_sec));

/* Find the smallest time taken for each test. Ignore negative
glitches due to Linux's tendency to jump the clock backwards. */
if (time_delta >= 0 && time_delta < timings [which_test])
timings [which_test] = time_delta;
}

/* Remove the cost of `gettimeofday()' itself from measurements. */
timings [0] -= timings [2];
timings [1] -= timings [2];

/* Keep looping until at least one measurement becomes significant. A
very fast CPU will show measurements of zero microseconds for
smaller values of `timing_loops'. Also loop until the cost of
`gettimeofday()' becomes insignificant. When the program is run
under `strace', the latter is big and this is needed to stabilise
the results. */
if (timings [0] <= 10 * (1 + timings [2])
&& timings [1] <= 10 * (1 + timings [2]))
{
timing_loops <<= 1;
goto retry;
}

__page_alias_unmap (buffers [0], system_page_size, separation, method);
munmap (buffers [1], separation + system_page_size);

printf ("(%d) [%ld,%ld,%ld] ",
timing_loops, timings [0], timings [1], timings [2]);

/* Reject page aliasing if it is much slower than accessing a single,
definitely cached page directly. */
if (timings [0] > 2 * timings [1])
return "too slow";

/* Success! Passed all tests for these parameters. */
return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
size_t size;

#ifdef _SC_PAGESIZE
system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
system_page_size = sysconf (_SC_PAGE_SIZE);
#else
system_page_size = getpagesize ();
#endif

for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
{
const char * reason = test_one_separation (size);

printf ("Test separation: %lu bytes: %s%s\n",
(unsigned long) size, reason ? "FAIL - " : "pass",
reason ? reason : "");

/* This logic searches for the smallest _contiguous_ range
of separations for which `test_one_separation' passes. */
if (reason == 0 && page_alias_smallest_size == 0)
page_alias_smallest_size = size;
else if (reason != 0 && page_alias_smallest_size != 0)
{
/* Fail, indicating that page-aliasing is not reliable,
because there's a maximum size. We don't support that as
it seems quite unlikely given our model of cache colouring. */
page_alias_smallest_size = 0;
break;
}
}

printf ("VM page alias coherency test: ");

if (page_alias_smallest_size == 0)
printf ("failed; will use copy buffers instead\n");
else if (page_alias_smallest_size == system_page_size)
printf ("all sizes passed\n");
else
printf ("minimum fast spacing: %lu (%lu page%s)\n",
(unsigned long) page_alias_smallest_size,
(unsigned long) (page_alias_smallest_size / system_page_size),
(page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
page_alias_init ();
return 0;
}
//#endif

2003-09-01 10:35:56

by Sam Creasey

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this



On Mon, 1 Sep 2003, Geert Uytterhoeven wrote:

> As you probably know the 68020 had an external MMU (68551, or Sun-3 or Apollo
> MMU). Probably Motorola didn't bother to change the behavior when the MMU got
> integrated in later generations (68030 and up).
>
> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:

> - 68020+Sun-3 MMU

68020+Sun-3 MMU results attached below. This is for a 3/60, and it's not
surprising that it passes, as there's no real cache in this configuration
(the sun3/2xx did have external cache, but the onboard ethernet in my
3/210 is on the fritz, and it's not booting at the moment). Note that
this is the newer version of the program which Jamie just posted.

bash-2.03# time ./jamie-test2
(2048) [10000,10000,0] Test separation: 8192 bytes: pass
(2048) [10000,10000,0] Test separation: 16384 bytes: pass
(2048) [10000,10000,0] Test separation: 32768 bytes: pass
(2048) [10000,10000,0] Test separation: 65536 bytes: pass
(2048) [10000,10000,0] Test separation: 131072 bytes: pass
(2048) [10000,10000,0] Test separation: 262144 bytes: pass
(2048) [10000,10000,0] Test separation: 524288 bytes: pass
(2048) [10000,10000,0] Test separation: 1048576 bytes: pass
(2048) [10000,10000,0] Test separation: 2097152 bytes: pass
(2048) [10000,10000,0] Test separation: 4194304 bytes: pass
(2048) [10000,10000,0] Test separation: 8388608 bytes: pass
(2048) [10000,10000,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 1m34.330s
user 1m30.030s
sys 0m4.070s
bash-2.03# cat /proc/cpuinfo
CPU: 68020
MMU: Sun-3
FPU: 68881
Clocking: 19.9MHz
BogoMips: 4.97
Calibration: 24896 loops


-- Sam




2003-09-01 10:49:18

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Sam Creasey wrote:
> 68020+Sun-3 MMU results attached below (this is for a 3/60, and it's not
> surprising that it passes, as there's no real cache in this configuration
> (the sun3/2xx did have external cache, but the onboard ethernet in my
> 3/210 is on the fritz, and it's not booting at the moment). Note that
> this is the newer version of the program which Jamie just posted.

Thanks.

> bash-2.03# time ./jamie-test2
> (2048) [10000,10000,0] Test separation: 8192 bytes: pass

Mighty suspicious gettimeofday() you have there.

> real 1m34.330s
> user 1m30.030s
> sys 0m4.070s

Indeed, on other systems the test completes in a few seconds at most,
not because of CPU speed, but because gettimeofday() returns high
resolution time on them.

Isn't there a way to read high resolution time on the 68020 Sun-3?

-- Jamie

2003-09-01 11:13:12

by Roman Zippel

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi,

On Mon, 1 Sep 2003, Jamie Lokier wrote:

> I would prefer that you run the attached program. It fixes a bug in
> the function which tests whether the problem is in the L1 cache or
> store buffer. The bug probably didn't affect the test, but it might
> have.

This is the result for a 060:

$ ./a.out
(256) [175,175,11] Test separation: 4096 bytes: pass
(256) [173,175,11] Test separation: 8192 bytes: pass
(256) [176,175,10] Test separation: 16384 bytes: pass
(256) [174,173,11] Test separation: 32768 bytes: pass
(256) [174,175,11] Test separation: 65536 bytes: pass
(256) [175,175,10] Test separation: 131072 bytes: pass
(256) [176,176,10] Test separation: 262144 bytes: pass
(256) [175,175,11] Test separation: 524288 bytes: pass
(256) [173,175,11] Test separation: 1048576 bytes: pass
(256) [174,174,11] Test separation: 2097152 bytes: pass
(256) [176,176,10] Test separation: 4194304 bytes: pass
(256) [177,177,9] Test separation: 8388608 bytes: pass
(256) [175,176,10] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
$ cat /proc/cpuinfo
CPU: 68060
MMU: 68060
FPU: 68060
Clocking: 49.7MHz
BogoMips: 99.53
Calibration: 497664 loops

bye, Roman

2003-09-01 11:19:07

by Alan

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 2003-09-01 at 07:00, Jamie Lokier wrote:
> Matt Porter wrote:
> > PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI
>
> The cache looks very coherent to me.

The only x86 which will show the user non cache coherent behaviour (and
then only in a really weird situation) is SMP Pentium Pro due to the
store fence errata.

The Winchip is non SMP so you won't see CPU<->CPU store ordering changes
although I guess mmap of mmio space might show you stuff if you really
tried hard.

The Geode has bus level magic so it's out of order but if you ask then
you get the right answer (kind of the zen question about falling trees
implemented in silicon).


2003-09-01 11:31:41

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003, Jamie Lokier wrote:
> There is a bug in test_l1_only which I just noticed. It's unlikely,
> but if `dummy' happens to have the same L1 cache address as both words
> being tested, and it's a 2-way (or less) set-associative cache, then
> it will inadvertently flush the cache and say "store buffer not
> coherent" when it means to say "cache not coherent".
>
> Please try the program below, which is the same as before but with
> test_l1_only hopefully improved, and it prints some more helpful
> numbers.

Results for 68040 with the new version:

cassandra:/tmp# time ./test2
Test separation: 4096 bytes: FAIL - store buffer not coherent
Test separation: 8192 bytes: FAIL - store buffer not coherent
Test separation: 16384 bytes: FAIL - store buffer not coherent
Test separation: 32768 bytes: FAIL - store buffer not coherent
Test separation: 65536 bytes: FAIL - store buffer not coherent
Test separation: 131072 bytes: FAIL - store buffer not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: FAIL - store buffer not coherent
Test separation: 8388608 bytes: FAIL - store buffer not coherent
Test separation: 16777216 bytes: FAIL - store buffer not coherent
VM page alias coherency test: failed; will use copy buffers instead

real 0m0.454s
user 0m0.090s
sys 0m0.210s
cassandra:/tmp# cat /proc/cpuinfo
CPU: 68040
MMU: 68040
FPU: 68040
Clocking: 24.8MHz
BogoMips: 16.53
Calibration: 82688 loops
cassandra:/tmp#

New m68k binary at http://home.tvd.be/cr26864/Linux/m68k/jamie_test2.gz

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


2003-09-01 12:24:14

by Sam Creasey

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this



On Mon, 1 Sep 2003, Jamie Lokier wrote:

> Sam Creasey wrote:
>
> > bash-2.03# time ./jamie-test2
> > (2048) [10000,10000,0] Test separation: 8192 bytes: pass
>
> Mighty suspicious gettimeofday() you have there.
>
> > real 1m34.330s
> > user 1m30.030s
> > sys 0m4.070s
>
> Indeed, on other systems the test completes in a few seconds at most,
> not because of CPU speed, but because gettimeofday() returns high
> resolution time on them.
>
> Isn't there a way to read high resolution time on the 68020 Sun-3?

AFAICT, no. I've dug through the datasheets for the Intersil RTC used, as
well as the NetBSD code, and SunOS headers, and it seems that we're stuck
with 1/100th second resolution. Bummer.

-- Sam

2003-09-01 14:17:24

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > This looks like an old kernel on your NetWinder. Later 2.4 kernels
> > should get this right (by marking the pages uncacheable in user space.)
>
> How do they know which pages to mark uncacheable? Surely not all
> MAP_SHARED|MAP_FIXED mappings are uncacheable?

By looking at the mappings present in the process. If a process maps the
same file using MAP_SHARED _and_ we fault the same page of data into two
or more mappings, we turn off the cache for those pages.

We actually only turn off the cache and leave the write buffer (aka your
store buffer) turned on for these regions, which should be sufficient for
it to remain coherent between different virtual addresses.

I've been doing some further investigation, and I'm now of the opinion
that "SA110" StrongARM chips have buggy write buffers, because:

- if I turn off the cache, leaving the write buffer on, this program
works on StrongARM-1110 CPUs but not some StrongARM-110 CPUs.
- if I turn off the cache and write buffer on these twice-mapped pages,
StrongARM-110 behaves as expected.

I've tested on several silicon revisions of StrongARM-110's:
- H appears buggy (reports as rev. 2)
- K appears fine (reports as rev. 2)
- S appears buggy (reports as rev. 3)

Unfortunately, the written documentation makes zero mention of the exact
write buffer behaviour. The best that I have to go on for the
StrongARM-110 is a block diagram which indicates that the write buffer
uses physical addresses, and that the D-cache contains the physical
address which the line was fetched from for writeback (via the write
buffer.)

So it seems your test program finds problems which DaveM's aliastest
program fails to detect... Gah. ;(

I guess it's time to devise a kernel test and alter our behaviour on ARM
accordingly.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-01 14:51:54

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Ok, here's the results for a SA1110 machine (ie, with non-broken
write buffer):

Linux assabet2 2.6.0-test4 #1313 Thu Aug 28 21:05:05 BST 2003 armv4l unknown

Processor : StrongARM-1110 rev 8 (v4l)
BogoMIPS : 147.04
Features : swp half 26bit fastmult
CPU implementer : 0x69
CPU architecture: 4
CPU variant : 0x0
CPU part : 0xb11
CPU revision : 8

Hardware : Intel-Assabet
Revision : 0000
Serial : 0000000000000000

(64) [21,6,1] Test separation: 4096 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 8192 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 16384 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 32768 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 65536 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 131072 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 262144 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 524288 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 1048576 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 2097152 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 4194304 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 8388608 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead


--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-01 14:43:30

by Larry McVoy

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390
on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86,
solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha.

This is most of our test machines; it doesn't include all the Windows
boxes but I figured you didn't care.

The version of test.c is the one you posted later. If I got it wrong
send me the latest.

work ~/jamie wc test.c
773 3726 25064 test.c
work ~/jamie md5sum test.c
1e7b9e6fa525c21211abbb8986d7b2e7 test.c

I'm a little concerned I have the wrong test: why would a 2.1GHz Athlon
say it is too slow?

Format:
==== host name ====
Notes (may be blank)

Results

uname -a output
/proc/cpuinfo (if there)

==== aix ====
332Mhz 604e 7043-150

Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

AIX aix 1 4 004376804C00

==== alpha ====
PC something-164, that really common cheapo motherboard/test kit.

(512) [14,14,0] Test separation: 8192 bytes: pass
(512) [14,14,0] Test separation: 16384 bytes: pass
(512) [14,14,0] Test separation: 32768 bytes: pass
(512) [14,14,0] Test separation: 65536 bytes: pass
(512) [14,14,0] Test separation: 131072 bytes: pass
(512) [14,14,0] Test separation: 262144 bytes: pass
(512) [14,14,0] Test separation: 524288 bytes: pass
(512) [14,14,0] Test separation: 1048576 bytes: pass
(512) [14,14,0] Test separation: 2097152 bytes: pass
(512) [14,14,0] Test separation: 4194304 bytes: pass
(512) [14,14,0] Test separation: 8388608 bytes: pass
(512) [14,14,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown
cpu : Alpha
cpu model : EV56
cpu variation : 7
cpu revision : 0
cpu serial number :
system type : EB164
system variation : PC164
system revision : 0
system serial number :
cycle frequency [Hz] : 500000000
timer frequency [Hz] : 1024.00
page size [bytes] : 8192
phys. address bits : 40
max. addr. space # : 127
BogoMIPS : 992.88
kernel unaligned acc : 0 (pc=0,va=0)
user unaligned acc : 0 (pc=0,va=0)
platform string : Digital AlphaPC 164 500 MHz
cpus detected : 1

==== disks ====
(128) [17,1,0] Test separation: 4096 bytes: FAIL - too slow
(128) [17,1,0] Test separation: 8192 bytes: FAIL - too slow
(128) [17,1,0] Test separation: 16384 bytes: FAIL - too slow
(1024) [10,13,0] Test separation: 32768 bytes: pass
(1024) [10,13,0] Test separation: 65536 bytes: pass
(1024) [10,13,0] Test separation: 131072 bytes: pass
(1024) [10,13,0] Test separation: 262144 bytes: pass
(1024) [10,13,0] Test separation: 524288 bytes: pass
(1024) [10,13,0] Test separation: 1048576 bytes: pass
(1024) [10,13,0] Test separation: 2097152 bytes: pass
(1024) [10,13,0] Test separation: 4194304 bytes: pass
(1024) [10,13,0] Test separation: 8388608 bytes: pass
(1024) [10,13,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

Linux disks.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) XP 1900+
stepping : 2
cpu MHz : 1593.143
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3172.64

==== freebsd ====
(512) [32,32,1] Test separation: 4096 bytes: pass
(512) [32,32,1] Test separation: 8192 bytes: pass
(512) [32,32,1] Test separation: 16384 bytes: pass
(512) [32,32,1] Test separation: 32768 bytes: pass
(512) [32,32,1] Test separation: 65536 bytes: pass
(512) [32,32,1] Test separation: 131072 bytes: pass
(512) [32,32,1] Test separation: 262144 bytes: pass
(512) [32,32,1] Test separation: 524288 bytes: pass
(512) [32,32,1] Test separation: 1048576 bytes: pass
(512) [32,32,1] Test separation: 2097152 bytes: pass
(512) [32,32,1] Test separation: 4194304 bytes: pass
(512) [32,32,1] Test separation: 8388608 bytes: pass
(512) [32,32,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998 [email protected]:/usr/src/sys/compile/GENERIC i386

==== freebsd3 ====
(64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow
(64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow
(512) [19,26,1] Test separation: 16384 bytes: pass
(512) [19,26,1] Test separation: 32768 bytes: pass
(512) [19,26,1] Test separation: 65536 bytes: pass
(512) [19,26,1] Test separation: 131072 bytes: pass
(512) [19,26,1] Test separation: 262144 bytes: pass
(512) [19,26,1] Test separation: 524288 bytes: pass
(512) [19,26,1] Test separation: 1048576 bytes: pass
(512) [19,26,1] Test separation: 2097152 bytes: pass
(512) [19,26,1] Test separation: 4194304 bytes: pass
(512) [19,26,1] Test separation: 8388608 bytes: pass
(512) [19,26,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun 2 11:34:52 PDT 2000 [email protected]:/usr/src/sys/compile/DAVICOM i386

==== freebsd4 ====
(256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow
(256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow
(1024) [75,101,5] Test separation: 16384 bytes: pass
(1024) [75,101,5] Test separation: 32768 bytes: pass
(1024) [75,101,5] Test separation: 65536 bytes: pass
(1024) [75,101,5] Test separation: 131072 bytes: pass
(1024) [75,101,5] Test separation: 262144 bytes: pass
(1024) [75,101,5] Test separation: 524288 bytes: pass
(1024) [75,101,5] Test separation: 1048576 bytes: pass
(1024) [75,101,5] Test separation: 2097152 bytes: pass
(1024) [75,101,5] Test separation: 4194304 bytes: pass
(1024) [75,101,5] Test separation: 8388608 bytes: pass
(1024) [75,101,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000 [email protected]:/usr/src/sys/compile/GENERIC i386

==== hp ====
C360, HPUX 10.20

Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

HP-UX hp B.10.20 A 9000/785 2004452144 two-user license

==== ia64 ====
(512) [17,17,0] Test separation: 16384 bytes: pass
(512) [17,17,0] Test separation: 32768 bytes: pass
(512) [17,17,0] Test separation: 65536 bytes: pass
(512) [17,17,0] Test separation: 131072 bytes: pass
(512) [17,17,0] Test separation: 262144 bytes: pass
(512) [17,17,0] Test separation: 524288 bytes: pass
(512) [17,17,0] Test separation: 1048576 bytes: pass
(512) [17,17,0] Test separation: 2097152 bytes: pass
(512) [17,17,0] Test separation: 4194304 bytes: pass
(512) [17,17,0] Test separation: 8388608 bytes: pass
(512) [17,17,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium
model : 0
revision : 7
archrev : 0
features : standard
cpu number : 0
cpu regs : 4
cpu MHz : 799.486992
itc MHz : 799.486992
BogoMIPS : 796.91

processor : 1
vendor : GenuineIntel
arch : IA-64
family : Itanium
model : 0
revision : 7
archrev : 0
features : standard
cpu number : 0
cpu regs : 4
cpu MHz : 799.486992
itc MHz : 799.486992
BogoMIPS : 796.91

==== macos ====
iMac, OS X 10.2

(2048) [67,67,3] Test separation: 4096 bytes: pass
(2048) [67,67,3] Test separation: 8192 bytes: pass
(2048) [67,67,3] Test separation: 16384 bytes: pass
(2048) [67,67,3] Test separation: 32768 bytes: pass
(2048) [67,67,3] Test separation: 65536 bytes: pass
(2048) [67,67,3] Test separation: 131072 bytes: pass
(2048) [67,67,3] Test separation: 262144 bytes: pass
(2048) [67,67,3] Test separation: 524288 bytes: pass
(2048) [67,67,3] Test separation: 1048576 bytes: pass
(2048) [67,67,3] Test separation: 2097152 bytes: pass
(2048) [67,67,3] Test separation: 4194304 bytes: pass
(2048) [67,67,3] Test separation: 8388608 bytes: pass
(2048) [67,67,3] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Darwin macos.bitmover.com 6.6 Darwin Kernel Version 6.6: Thu May 1 21:48:54 PDT 2003; root:xnu/xnu-344.34.obj~1/RELEASE_PPC Power Macintosh powerpc

==== mips ====
(64) [276,11,2] Test separation: 4096 bytes: FAIL - too slow
(64) [276,11,2] Test separation: 8192 bytes: FAIL - too slow
(128) [26,43,2] Test separation: 16384 bytes: pass
(128) [26,43,2] Test separation: 32768 bytes: pass
(128) [26,43,2] Test separation: 65536 bytes: pass
(128) [26,43,2] Test separation: 131072 bytes: pass
(128) [26,43,2] Test separation: 262144 bytes: pass
(128) [26,43,2] Test separation: 524288 bytes: pass
(128) [26,43,2] Test separation: 1048576 bytes: pass
(128) [26,43,2] Test separation: 2097152 bytes: pass
(128) [26,43,2] Test separation: 4194304 bytes: pass
(128) [26,43,2] Test separation: 8388608 bytes: pass
(128) [26,43,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 15:30:50 CEST 2002 mips unknown
system type : SGI Indy
processor : 0
cpu model : R4000SC V6.0 FPU V0.0
BogoMIPS : 86.83
byteorder : big endian
wait instruction : no
microsecond timers : yes
tlb_entries : 48
extra interrupt vector : no
hardware watchpoint : yes
VCED exceptions : 8055726
VCEI exceptions : 0

==== netbsd ====
(1024) [53,53,4] Test separation: 4096 bytes: pass
(2048) [106,106,4] Test separation: 8192 bytes: pass
(2048) [104,105,5] Test separation: 16384 bytes: pass
(2048) [105,104,5] Test separation: 32768 bytes: pass
(2048) [105,104,5] Test separation: 65536 bytes: pass
(2048) [104,104,5] Test separation: 131072 bytes: pass
(2048) [105,105,5] Test separation: 262144 bytes: pass
(2048) [105,105,5] Test separation: 524288 bytes: pass
(1024) [53,53,4] Test separation: 1048576 bytes: pass
(2048) [104,104,5] Test separation: 2097152 bytes: pass
(2048) [106,106,4] Test separation: 4194304 bytes: pass
(2048) [105,106,4] Test separation: 8388608 bytes: pass
(2048) [104,105,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

NetBSD netbsd.bitmover.com 1.5 NetBSD 1.5 (GENERIC) #1: Sun Nov 19 21:42:11 MET 2000 fvdl@sushi:/work/trees/netbsd-1-5/sys/arch/i386/compile/GENERIC i386

==== netwinder ====
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead

Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown
Processor : Intel sa110 rev 3
BogoMips : 262.14
Hardware : Rebel-NetWinder
Serial # : 3464
Revision : 52ff

==== openbsd ====
(512) [27,27,1] Test separation: 4096 bytes: pass
(512) [27,27,1] Test separation: 8192 bytes: pass
(512) [27,27,1] Test separation: 16384 bytes: pass
(512) [27,27,1] Test separation: 32768 bytes: pass
(512) [27,27,1] Test separation: 65536 bytes: pass
(512) [27,27,1] Test separation: 131072 bytes: pass
(512) [27,27,1] Test separation: 262144 bytes: pass
(512) [27,27,1] Test separation: 524288 bytes: pass
(512) [27,27,1] Test separation: 1048576 bytes: pass
(512) [27,27,1] Test separation: 2097152 bytes: pass
(512) [27,27,1] Test separation: 4194304 bytes: pass
(512) [27,27,1] Test separation: 8388608 bytes: pass
(512) [27,27,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

OpenBSD openbsd 3.0 GENERIC#94 i386

==== parisc ====
A500
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
(2048) [41,41,2] Test separation: 4194304 bytes: pass
(2048) [41,41,2] Test separation: 8388608 bytes: pass
(2048) [41,41,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages)

Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 parisc64 unknown
processor : 0
cpu family : PA-RISC 2.0
cpu : PA8600 (PCX-W+)
cpu MHz : 550.000000
model : 9000/800/A500-5X
model name : Crescendo 550
hversion : 0x00005d50
sversion : 0x00000491
I-cache : 512 KB
D-cache : 1024 KB (WB)
ITLB entries : 160
DTLB entries : 160 - shared with ITLB
bogomips : 1097.72
software id : 580790518

==== ppc ====
(1024) [40,40,1] Test separation: 4096 bytes: pass
(1024) [40,40,1] Test separation: 8192 bytes: pass
(1024) [40,40,1] Test separation: 16384 bytes: pass
(1024) [40,40,1] Test separation: 32768 bytes: pass
(1024) [40,40,1] Test separation: 65536 bytes: pass
(1024) [40,40,1] Test separation: 131072 bytes: pass
(1024) [40,40,1] Test separation: 262144 bytes: pass
(1024) [40,40,1] Test separation: 524288 bytes: pass
(1024) [40,40,1] Test separation: 1048576 bytes: pass
(1024) [40,40,1] Test separation: 2097152 bytes: pass
(1024) [40,40,1] Test separation: 4194304 bytes: pass
(1024) [40,40,1] Test separation: 8388608 bytes: pass
(1024) [40,40,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown
processor : 0
cpu : 750
temperature : 0 C
clock : 333MHz
revision : 2.2
bogomips : 665.69
zero pages : total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%)
machine : iMac,1
motherboard : iMac MacRISC Power Macintosh
L2 cache : 512K unified
memory : 160MB
pmac-generation : NewWorld

==== qube ====
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
(512) [47,47,2] Test separation: 16384 bytes: pass
(512) [47,47,2] Test separation: 32768 bytes: pass
(512) [47,47,2] Test separation: 65536 bytes: pass
(512) [47,47,2] Test separation: 131072 bytes: pass
(512) [47,47,2] Test separation: 262144 bytes: pass
(512) [47,47,2] Test separation: 524288 bytes: pass
(512) [47,47,2] Test separation: 1048576 bytes: pass
(512) [47,47,2] Test separation: 2097152 bytes: pass
(512) [47,47,2] Test separation: 4194304 bytes: pass
(512) [47,47,2] Test separation: 8388608 bytes: pass
(512) [47,47,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

Linux qube.bitmover.com 2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown
cpu : MIPS
cpu model : Nevada V10.0
system type : Cobalt Microserver 27
BogoMIPS : 249.86
byteorder : little endian
unaligned accesses : 16
wait instruction : yes
microsecond timers : yes
extra interrupt vector : yes
hardware watchpoint : no

==== redhat52 ====
(256) [12,12,0] Test separation: 4096 bytes: pass
(256) [12,12,0] Test separation: 8192 bytes: pass
(256) [12,12,0] Test separation: 16384 bytes: pass
(256) [12,12,0] Test separation: 32768 bytes: pass
(256) [12,12,0] Test separation: 65536 bytes: pass
(256) [12,12,0] Test separation: 131072 bytes: pass
(256) [12,12,0] Test separation: 262144 bytes: pass
(256) [12,12,0] Test separation: 524288 bytes: pass
(256) [12,12,0] Test separation: 1048576 bytes: pass
(256) [12,12,0] Test separation: 2097152 bytes: pass
(256) [12,12,0] Test separation: 4194304 bytes: pass
(256) [12,12,0] Test separation: 8388608 bytes: pass
(256) [12,12,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat52.bitmover.com 2.2.15pre9 #10 Sat Apr 8 17:59:35 PDT 2000 i686 unknown
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 6
model name : Celeron (Mendocino)
stepping : 5
cpu MHz : 534.561273
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 532.48

==== redhat62 ====
(256) [12,12,0] Test separation: 4096 bytes: pass
(256) [12,12,0] Test separation: 8192 bytes: pass
(256) [12,12,0] Test separation: 16384 bytes: pass
(256) [12,12,0] Test separation: 32768 bytes: pass
(256) [12,12,0] Test separation: 65536 bytes: pass
(256) [12,12,0] Test separation: 131072 bytes: pass
(256) [12,12,0] Test separation: 262144 bytes: pass
(256) [12,12,0] Test separation: 524288 bytes: pass
(256) [12,12,0] Test separation: 1048576 bytes: pass
(256) [12,12,0] Test separation: 2097152 bytes: pass
(256) [12,12,0] Test separation: 4194304 bytes: pass
(256) [12,12,0] Test separation: 8388608 bytes: pass
(256) [12,12,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat62.bitmover.com 2.2.14-5.0 #1 Tue Mar 7 21:07:39 EST 2000 i686 unknown
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 6
model name : Celeron (Mendocino)
stepping : 5
cpu MHz : 534.552424
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 532.48

==== redhat71 ====
(256) [14,14,0] Test separation: 4096 bytes: pass
(256) [14,14,0] Test separation: 8192 bytes: pass
(256) [14,14,0] Test separation: 16384 bytes: pass
(256) [14,14,0] Test separation: 32768 bytes: pass
(256) [14,14,0] Test separation: 65536 bytes: pass
(256) [14,14,0] Test separation: 131072 bytes: pass
(256) [14,14,0] Test separation: 262144 bytes: pass
(256) [14,14,0] Test separation: 524288 bytes: pass
(256) [14,14,0] Test separation: 1048576 bytes: pass
(256) [14,14,0] Test separation: 2097152 bytes: pass
(256) [14,14,0] Test separation: 4194304 bytes: pass
(256) [14,14,0] Test separation: 8388608 bytes: pass
(256) [14,14,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 6
model name : Celeron (Mendocino)
stepping : 5
cpu MHz : 467.739
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 933.88

==== sco ====
(1024) [48,48,2] Test separation: 4096 bytes: pass
(1024) [48,48,2] Test separation: 8192 bytes: pass
(1024) [48,48,2] Test separation: 16384 bytes: pass
(1024) [48,48,2] Test separation: 32768 bytes: pass
(1024) [48,48,2] Test separation: 65536 bytes: pass
(1024) [48,48,2] Test separation: 131072 bytes: pass
(1024) [48,48,1] Test separation: 262144 bytes: pass
(1024) [49,49,1] Test separation: 524288 bytes: pass
(1024) [48,48,2] Test separation: 1048576 bytes: pass
(1024) [48,48,2] Test separation: 2097152 bytes: pass
(1024) [48,48,2] Test separation: 4194304 bytes: pass
(1024) [48,48,2] Test separation: 8388608 bytes: pass
(1024) [48,48,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SCO_SV sco 3.2 5.0.7 i386

==== sgi ====
FPU: MIPS R10010 Floating Point Chip Revision: 0.0
CPU: MIPS R10000 Processor Chip Revision: 2.6
1 195 MHZ IP28 Processor
Main memory size: 192 Mbytes
Secondary unified instruction/data cache size: 1 Mbyte
Instruction cache size: 32 Kbytes
Data cache size: 32 Kbytes

(1024) [103,103,5] Test separation: 16384 bytes: pass
(1024) [103,103,5] Test separation: 32768 bytes: pass
(1024) [103,103,5] Test separation: 65536 bytes: pass
(1024) [103,103,5] Test separation: 131072 bytes: pass
(1024) [103,103,5] Test separation: 262144 bytes: pass
(1024) [103,103,5] Test separation: 524288 bytes: pass
(1024) [103,103,5] Test separation: 1048576 bytes: pass
(1024) [103,103,5] Test separation: 2097152 bytes: pass
(1024) [103,103,5] Test separation: 4194304 bytes: pass
(1024) [103,103,5] Test separation: 8388608 bytes: pass
(1024) [103,103,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

IRIX64 sgi 6.5 10120105 IP28

==== slovax ====
(128) [12,1,0] Test separation: 4096 bytes: FAIL - too slow
(128) [12,1,0] Test separation: 8192 bytes: FAIL - too slow
(128) [12,1,0] Test separation: 16384 bytes: FAIL - too slow
(2048) [15,16,0] Test separation: 32768 bytes: pass
(2048) [13,16,0] Test separation: 65536 bytes: pass
(2048) [13,16,0] Test separation: 131072 bytes: pass
(2048) [15,16,0] Test separation: 262144 bytes: pass
(2048) [15,16,0] Test separation: 524288 bytes: pass
(2048) [15,16,0] Test separation: 1048576 bytes: pass
(2048) [15,16,0] Test separation: 2097152 bytes: pass
(2048) [15,16,0] Test separation: 4194304 bytes: pass
(2048) [15,16,0] Test separation: 8388608 bytes: pass
(2048) [13,16,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

Linux slovax.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2700+
stepping : 1
cpu MHz : 2162.685
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 4297.33


==== sparc ====
Test separation: 8192 bytes: FAIL - cache not coherent
(1024) [65,71,2] Test separation: 16384 bytes: pass
(1024) [65,68,2] Test separation: 32768 bytes: pass
(512) [2,50,2] Test separation: 65536 bytes: pass
(512) [33,19,2] Test separation: 131072 bytes: pass
(512) [33,20,2] Test separation: 262144 bytes: pass
(512) [33,50,2] Test separation: 524288 bytes: pass
(512) [33,19,2] Test separation: 1048576 bytes: pass
(1024) [35,68,2] Test separation: 2097152 bytes: pass
(512) [33,42,2] Test separation: 4194304 bytes: pass
(512) [2,50,2] Test separation: 8388608 bytes: pass
(512) [5,50,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
cpu : TI UltraSparc IIi
fpu : UltraSparc IIi integrated FPU
promlib : Version 3 Revision 11
prom : 3.11.12
type : sun4u
ncpus probed : 1
ncpus active : 1
BogoMips : 539.03
MMU Type : Spitfire

==== sun ====
cpu0: SUNW,UltraSPARC-II (upaid 0 impl 0x11 ver 0x20 clock 296 MHz)
cpu1: SUNW,UltraSPARC-II (upaid 1 impl 0x11 ver 0x20 clock 296 MHz)
SunOS Release 5.6 Version Generic_105181-05 [UNIX(R) System V Release 4.0]

(128) [11,7,0] Test separation: 8192 bytes: pass
(256) [15,21,0] Test separation: 16384 bytes: pass
(256) [15,21,0] Test separation: 32768 bytes: pass
(256) [15,21,0] Test separation: 65536 bytes: pass
(256) [15,21,0] Test separation: 131072 bytes: pass
(256) [15,21,0] Test separation: 262144 bytes: pass
(256) [15,21,0] Test separation: 524288 bytes: pass
(256) [15,21,0] Test separation: 1048576 bytes: pass
(256) [15,21,0] Test separation: 2097152 bytes: pass
(256) [15,21,0] Test separation: 4194304 bytes: pass
(256) [15,21,0] Test separation: 8388608 bytes: pass
(256) [15,21,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SunOS sun 5.6 Generic_105181-05 sun4u sparc SUNW,Ultra-2

==== sunx86 ====
2x 450Mhz Xeons

(512) [29,29,1] Test separation: 4096 bytes: pass
(512) [29,29,1] Test separation: 8192 bytes: pass
(512) [29,29,1] Test separation: 16384 bytes: pass
(512) [29,29,1] Test separation: 32768 bytes: pass
(512) [29,29,1] Test separation: 65536 bytes: pass
(512) [29,29,1] Test separation: 131072 bytes: pass
(512) [29,29,1] Test separation: 262144 bytes: pass
(512) [29,29,1] Test separation: 524288 bytes: pass
(512) [29,29,1] Test separation: 1048576 bytes: pass
(512) [29,29,1] Test separation: 2097152 bytes: pass
(512) [29,29,1] Test separation: 4194304 bytes: pass
(512) [29,29,1] Test separation: 8388608 bytes: pass
(512) [29,29,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SunOS sunx86.bitmover.com 5.7 Generic_106542-18 i86pc i386 i86pc

==== tru64 ====
600AU (nicely made machine)

(65536) [976,976,0] Test separation: 8192 bytes: pass
(65536) [976,976,0] Test separation: 16384 bytes: pass
(65536) [976,976,0] Test separation: 32768 bytes: pass
(65536) [976,976,0] Test separation: 65536 bytes: pass
(65536) [976,976,0] Test separation: 131072 bytes: pass
(65536) [976,976,0] Test separation: 262144 bytes: pass
(65536) [976,976,0] Test separation: 524288 bytes: pass
(65536) [976,976,0] Test separation: 1048576 bytes: pass
(65536) [976,976,0] Test separation: 2097152 bytes: pass
(65536) [976,976,0] Test separation: 4194304 bytes: pass
(65536) [976,976,0] Test separation: 8388608 bytes: pass
(65536) [976,976,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

OSF1 tru64.bitmover.com V5.1 2650 alpha

==== winxp ====
I just did a gcc on this system, I have no idea what that did but it didn't
complain so it did something.

win32-xp /build/jamie ./a.exe
Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

==== zseries/RedHat ====
(256) [11,11,0] Test separation: 4096 bytes: pass
(256) [11,11,0] Test separation: 8192 bytes: pass
(256) [11,11,0] Test separation: 16384 bytes: pass
(256) [11,11,0] Test separation: 32768 bytes: pass
(256) [11,11,0] Test separation: 65536 bytes: pass
(256) [11,11,0] Test separation: 131072 bytes: pass
(256) [11,11,0] Test separation: 262144 bytes: pass
(256) [11,11,0] Test separation: 524288 bytes: pass
(256) [11,13,0] Test separation: 1048576 bytes: pass
(256) [11,13,0] Test separation: 2097152 bytes: pass
(256) [11,13,0] Test separation: 4194304 bytes: pass
(256) [11,13,0] Test separation: 8388608 bytes: pass
(256) [11,13,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux l006034.zseriespenguins.ihost.com 2.4.9-38 #1 SMP Tue Sep 10 00:16:26 CEST 2002 s390 unknown

vendor_id : IBM/S390
# processors : 1
bogomips per cpu: 612.76
processor 0: version = FF, identification = 049321, machine = 9672

==== zseries/SuSE ====
(512) [21,21,1] Test separation: 4096 bytes: pass
(256) [11,11,0] Test separation: 8192 bytes: pass
(512) [21,21,1] Test separation: 16384 bytes: pass
(512) [21,21,1] Test separation: 32768 bytes: pass
(512) [21,21,1] Test separation: 65536 bytes: pass
(512) [22,22,0] Test separation: 131072 bytes: pass
(512) [22,22,0] Test separation: 262144 bytes: pass
(512) [21,21,1] Test separation: 524288 bytes: pass
(512) [21,25,1] Test separation: 1048576 bytes: pass
(512) [22,26,0] Test separation: 2097152 bytes: pass
(256) [11,13,0] Test separation: 4194304 bytes: pass
(512) [22,26,0] Test separation: 8388608 bytes: pass
(512) [21,25,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux lh003022 2.2.16 #6 SMP Wed May 23 16:39:31 EDT 2001 s390 unknown

vendor_id : IBM/S390
# processors : 1
bogomips per cpu: 581.63
processor 0: version = FF, identification = 049321, machine = 9672

2003-09-01 16:34:07

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Larry McVoy wrote:
> I'm a little concerned I have the wrong test, why would a 2.1Ghz Athlon
> say it is too slow?

It's the right test. "too slow" means that where shared memory is
mapped at a certain separation, alternating accesses between the
different virtual addresses are much slower (10-20 times) than if the
underlying mapped memory is not shared.
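
For readers who want to see the shape of that comparison, here is a minimal sketch. It is not the test program posted at the top of this thread. The helper names map_pair(), alternate_seconds() and alias_check() are made up for illustration, and it assumes a POSIX system with mmap(), tmpfile() and clock_gettime(). It maps one file page at two virtual addresses `sep` bytes apart, alternates accesses between them, and then repeats the run with two different file pages at the same spacing as the unshared control.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: map file page 0 at some address A and the page at
   file offset off_b at A + sep.  Returns 0 on success, -1 on failure. */
static int map_pair(int fd, size_t sep, off_t off_b,
                    volatile int **a, volatile int **b)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);

    assert(sep >= page && sep % page == 0);

    /* Reserve the whole range first, so that MAP_FIXED below only ever
       replaces our own placeholder mapping. */
    char *base = mmap(NULL, sep + page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    void *p = mmap(base, page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    void *q = mmap(base + sep, page, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, off_b);
    if (p == MAP_FAILED || q == MAP_FAILED) {
        munmap(base, sep + page);
        return -1;
    }
    *a = p;
    *b = q;
    return 0;
}

/* Time `iters` rounds of alternating read-modify-write through a and b. */
static double alternate_seconds(volatile int *a, volatile int *b, long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        (*a)++;
        (*b)++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double) (t1.tv_sec - t0.tv_sec)
         + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Returns 1 if the aliases behaved coherently, 0 if not, -1 on setup
   failure.  *ratio receives (aliased time) / (unshared control time). */
static int alias_check(size_t sep, long iters, double *ratio)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    volatile int *a, *b;
    FILE *f = tmpfile();

    assert(iters > 0 && iters < (1L << 29));
    if (f == NULL)
        return -1;
    if (ftruncate(fileno(f), (off_t) (2 * page)) != 0
        || map_pair(fileno(f), sep, 0, &a, &b) != 0) {
        fclose(f);
        return -1;
    }

    /* Aliased run: both addresses name the same word of file page 0. */
    *a = 0;
    double shared = alternate_seconds(a, b, iters);
    int coherent = (*a == 2 * iters);
    munmap((void *) a, sep + page);

    /* Control run: same spacing, but two different file pages. */
    if (map_pair(fileno(f), sep, (off_t) page, &a, &b) != 0) {
        fclose(f);
        return -1;
    }
    double unshared = alternate_seconds(a, b, iters);
    munmap((void *) a, sep + page);
    fclose(f);

    *ratio = unshared > 0.0 ? shared / unshared : 0.0;
    return coherent;
}
```

Calling alias_check(sep, iters, &ratio) for each separation and flagging a large ratio corresponds roughly to the "too slow" verdict. What counts as large is a judgment call and is left out of the sketch.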

All Athlons show this slowdown for any virtual address separation
which is not a multiple of 32k. No Intels do, with the possible
exception of a P4 which showed inconsistent results and needs further
investigation.
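
If an application wanted to act on that, the arithmetic is just rounding the alias separation up to the fast multiple. A small sketch follows. The names separation_is_fast() and round_up_separation() are invented here, and the 32k constant is simply the Athlon figure from this thread, not something a program can assume in general.

```c
#include <assert.h>
#include <stddef.h>

/* 32k: the Athlon figure quoted in this thread, used only as an example. */
#define ATHLON_FAST_MULTIPLE ((size_t) 32768)

/* Non-zero if two mappings `sep` bytes apart avoid the slow case. */
static int separation_is_fast(size_t sep, size_t multiple)
{
    assert(multiple != 0);
    return sep % multiple == 0;
}

/* Smallest separation >= sep that is a whole number of multiples. */
static size_t round_up_separation(size_t sep, size_t multiple)
{
    assert(multiple != 0);
    return (sep + multiple - 1) / multiple * multiple;
}
```

With these, the failing 4096, 8192 and 16384 byte separations in the slovax output all round up to 32768, which is the first size that passed there.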

Your freebsds don't say what CPU they are, but let me guess..

freebsd isn't an AMD
freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster

-- Jamie

> ==== freebsd ====
> (512) [32,32,1] Test separation: 4096 bytes: pass
...
> FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998 [email protected]:/usr/src/sys/compile/GENERIC i386

> ==== freebsd3 ====
> (64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow
> (64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow
> (512) [19,26,1] Test separation: 16384 bytes: pass
> VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
>
> FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun 2 11:34:52 PDT 2000 [email protected]:/usr/src/sys/compile/DAVICOM i386
>
> ==== freebsd4 ====
> (256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow
> (256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow
> (1024) [75,101,5] Test separation: 16384 bytes: pass
> VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
>
> FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000 [email protected]:/usr/src/sys/compile/GENERIC i386

2003-09-01 16:52:50

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote:
> > Russell King wrote:
> > > This looks like an old kernel on your NetWinder. Later 2.4 kernels
> > > should get this right (by marking the pages uncacheable in user space.)
> >
> > How do they know which pages to mark uncacheable? Surely not all
> > MAP_SHARED|MAP_FIXED mappings are uncacheable?
>
> By looking at the mappings present in the process. If a process maps the
> same file using MAP_SHARED _and_ we fault the same page of data into two
> or more mappings, we turn off the cache for those pages.

1. That's not necessary when the virtual addresses are separated
by some multiple, is it?

2. The other architectures with incoherent caches set SHMLBA to the
multiple, and they don't do anything special in
update_mmu_cache(), so MAP_FIXED can create incoherent mappings.

Is there any special reason why ARM is different?
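
To make point 2 concrete: on architectures where a safe multiple exists, user space can check whether two candidate addresses are congruent modulo SHMLBA before using MAP_FIXED. A minimal sketch, with a made-up helper name and assuming <sys/shm.h> provides SHMLBA:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/shm.h>

/* Non-zero if p and q sit at the same offset within an SHMLBA-sized
   window, i.e. their separation is a multiple of SHMLBA. */
static int shmlba_congruent(const void *p, const void *q)
{
    uintptr_t lba = (uintptr_t) SHMLBA;

    assert(lba != 0);
    return (uintptr_t) p % lba == (uintptr_t) q % lba;
}
```

shmat() with SHM_RND already rounds the attach address down to a multiple of SHMLBA; nothing comparable is applied to mmap() with MAP_FIXED, which is the gap described above.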

> I've tested on several silicon revisions of StrongARM-110's:
> - H appears buggy (reports as rev. 2)
> - K appears fine (reports as rev. 2)
> - S appears buggy (reports as rev. 3)

It's possible that all of them are buggy, but the write buffer test
doesn't manage to get writes into the buffer with the exact timing
needed to trigger it. Unfortunately, while the write buffer test does
pretty much guarantee a store/store/load instruction sequence, because
it's generic it can't guarantee how those are executed in a
superscalar or out of order pipeline.

> So it seems your test program finds problems which DaveM's aliastest
> program fails to detect... Gah. ;(

Well, it's good to know it was useful :/

Thanks,
-- Jamie

2003-09-01 16:58:58

by Larry McVoy

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, Sep 01, 2003 at 05:33:54PM +0100, Jamie Lokier wrote:
> Your freebsds don't say what CPU they are, but let me guess..
>
> freebsd isn't an AMD
> freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster

Right you are on all points.

freebsd:
CPU: Unknown 80686 (400.91-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x660 Stepping=0
Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,<b16>,<b17>,MMX,<b24>>

freebsd3
CPU: AMD-K6(tm) 3D processor (451.03-MHz 586-class CPU)
Origin = "AuthenticAMD" Id = 0x58c Stepping=12
Features=0x8021bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,PGE,MMX>

freebsd4
CPU: AMD-K6tm w/ multimedia extensions (233.87-MHz 586-class CPU)
Origin = "AuthenticAMD" Id = 0x562 Stepping = 2
Features=0x8001bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,MMX>
AMD Features=0x400<<b10>>
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-01 17:11:55

by Russell King

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, Sep 01, 2003 at 05:52:39PM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > By looking at the mappings present in the process. If a process maps the
> > same file using MAP_SHARED _and_ we fault the same page of data into two
> > or more mappings, we turn off the cache for those pages.
>
> 1. That's not necessary when the virtual addresses are separated
> by some multiple, is it?

Incorrect - with a VIVT, you have alias hell. There is no multiple
which makes it safe.

> > I've tested on several silicon revisions of StrongARM-110's:
> > - H appears buggy (reports as rev. 2)
> > - K appears fine (reports as rev. 2)
> > - S appears buggy (reports as rev. 3)
>
> It's possible that all of them are buggy, but the write buffer test
> doesn't manage to get writes into the buffer with the exact timing
> needed to trigger it.

Well, I've just generated a kernel test which does more or less the
same thing (write to one mapping, write to other, read from first.)
This indicates the same result.

If you take a moment to think about what should be going on -

- first write gets translated to physical address, and the address with
the data is placed in the write buffer.
- second write gets translated to the same physical address, and the
address and data is placed into the write buffer such that we store
the first write then the second write to the same physical memory.
- reading from the first mapping should return the second write's value
no matter what.

But it doesn't in some cases.
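
The store/store/load sequence described above can be sketched in user space as follows. This is not the kernel test mentioned in this message. The function name is invented, and the two pointers are assumed to come from two MAP_SHARED mappings of the same page. The volatile qualifiers only make the compiler emit the three accesses in program order; they say nothing about how the hardware executes them.

```c
#include <assert.h>
#include <stddef.h>

/* Count how often a load through `a` fails to observe the most recent
   store, which was made through `b`.  Both pointers are assumed to name
   the same physical word. */
static long store_store_load_failures(volatile int *a, volatile int *b,
                                      long rounds)
{
    long failures = 0;

    assert(a != NULL && b != NULL && rounds >= 0);
    for (long i = 0; i < rounds; i++) {
        int first = (int) (2 * i);
        int second = (int) (2 * i + 1);

        *a = first;           /* store through the first mapping */
        *b = second;          /* store through the second mapping */
        if (*a != second)     /* load through the first mapping */
            failures++;
    }
    return failures;
}
```

On a correct write buffer the count is always zero. A non-zero count on aliased mappings would match the behaviour reported here.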

> Unfortunately, while the write buffer test does
> pretty much guarantee a store/store/load instruction sequence, because
> it's generic it can't guarantee how those are executed in a
> superscalar or out of order pipeline.

ARM doesn't do any of those tricks.

> > So it seems your test program finds problems which DaveM's aliastest
> > program fails to detect... Gah. ;(
>
> Well, it's good to know it was useful :/

Well, we now have a kernel test to detect the problem, which alters our
behaviour appropriately. Thanks.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-01 17:22:15

by Roland Dreier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache
Matt> is PTPI

Jamie> The cache looks very coherent to me.

Matt (like me) is probably just used to thinking of the IBM PPC 440
chips as non-coherent because they are not cache coherent with respect
to external bus masters (eg they don't snoop the PCI bus). Of course,
this is a different type of coherency from what you are measuring.

- Roland

2003-09-01 19:12:57

by Guennadi Liakhovetski

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On

Processor : Intel XScale-PXA250 rev 3 (v5l)
BogoMIPS : 397.31
Features : swp half thumb fastmult edsp
CPU implementor : 0x69
CPU architecture: 5TE
CPU variant : 0x0
CPU part : 0x290
CPU revision : 3
Cache type : undefined 5
Cache clean : undefined 5
Cache lockdown : undefined 5
Cache unified : Harvard
I size : 32768
I assoc : 32
I line length : 32
I sets : 32
D size : 32768
D assoc : 32
D line length : 32
D sets : 32

and

Processor : StrongARM-1100 rev 9 (v4l)
BogoMIPS : 127.38
Features : swp half 26bit fastmult

version 3 of the test consistently reports "Too slow".

Guennadi
---
Guennadi Liakhovetski



2003-09-02 02:16:52

by Matt Porter

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, Sep 01, 2003 at 10:22:02AM -0700, Roland Dreier wrote:
> Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache
> Matt> is PTPI
>
> Jamie> The cache looks very coherent to me.
>
> Matt (like me) is probably just used to thinking of the IBM PPC 440
> chips as non-coherent because they are not cache coherent with respect
> to external bus masters (eg they don't snoop the PCI bus). Of course,
> this is a different type of coherency from what you are measuring.

Exactly. After reading some other subthreads I see the other version of
"cache coherency" that Jamie is interested in.

-Matt

2003-09-02 05:34:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> > 1. That's not necessary when the virtual addresses are separated
> > by some multiple, is it?
>
> Incorrect - with a VIVT, you have alias hell. There is no multiple
> which makes it safe.

Ok. I guess I was thinking of VIPT, but by now I am just guessing :)

> > > I've tested on several silicon revisions of StrongARM-110's:
> > > - H appears buggy (reports as rev. 2)
> > > - K appears fine (reports as rev. 2)
> > > - S appears buggy (reports as rev. 3)
> >
> > It's possible that all of them are buggy, but the write buffer test
> > doesn't manage to get writes into the buffer with the exact timing
> > needed to trigger it.
>
> Well, I've just generated a kernel test which does more or less the
> same thing (write to one mapping, write to other, read from first.)
> This indicates the same result.
>
> If you take a moment to think about what should be going on -
>
> - first write gets translated to physical address, and the address with
> the data is placed in the write buffer.
> - second write gets translated to the same physical address, and the
> address and data is placed into the write buffer such that we store
> the first write then the second write to the same physical memory.
> - reading from the first mapping should return the second write's value
> no matter what.

That is an incomplete explanation, because it should never be possible
for reads to access data from the write buffer which isn't the most
recent. That would break ordinary programs which don't have alias mappings.

> > Unfortunately, while the write buffer test does
> > pretty much guarantee a store/store/load instruction sequence, because
> > it's generic it can't guarantee how those are executed in a
> > superscalar or out of order pipeline.
>
> ARM doesn't do any of those tricks.

Don't some of the ARMs execute two instructions concurrently, like
the original Pentium? The simple test is only valid if a
store/store/load sequence is guaranteed to pass through the buggy part
of the pipeline in exactly the same way, no matter which programs it
appears in.

> > > So it seems your test program finds problems which DaveM's aliastest
> > > program fails to detect... Gah. ;(
> >
> > Well, it's good to know it was useful :/
>
> Well, we now have a kernel test to detect the problem, which alters our
> behaviour appropriately. Thanks.

Fwiw, PA-RISC shows a similar problem.

-- Jamie

2003-09-02 05:40:46

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Matt Porter wrote:
> Exactly. After reading some other subthreads I see the other version of
> "cache coherency" that Jamie is interested in.

Indeed, quite a lot of systems don't offer cache coherence with
peripherals, other CPUs (if any) and in some cases even with other
tasks on the same CPU. Isn't memory fun? :)

-- Jamie

2003-09-02 08:16:02

by Russell King

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Tue, Sep 02, 2003 at 06:34:15AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > If you take a moment to think about what should be going on -
> >
> > - first write gets translated to physical address, and the address with
> > the data is placed in the write buffer.
> > - second write gets translated to the same physical address, and the
> > address and data is placed into the write buffer such that we store
> > the first write then the second write to the same physical memory.
> > - reading from the first mapping should return the second write's value
> > no matter what.
>
> That is an incomplete explanation, because it should never be possible
> for reads to access data from the write buffer which isn't the most
> recent.

Umm, that's what I said.

> > ARM doesn't do any of those tricks.
>
> Don't some of the ARMs execute two instructions concurrently, like
> the original Pentium?

Nope - they're all single issue CPUs, and, if non-buggy, they guarantee
that stores never bypass loads. (In a later architecture revision, this
is controllable.)

Remember - ARM CPUs aren't a high spec desktop CPU. They're an embedded
CPU where power consumption matters. Superscalar/multiple issue/high
performance isn't viable in many such embedded environments.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-02 11:57:53

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> > > If you take a moment to think about what should be going on -
> > >
> > > - first write gets translated to physical address, and the address with
> > > the data is placed in the write buffer.
> > > - second write gets translated to the same physical address, and the
> > > address and data is placed into the write buffer such that we store
> > > the first write then the second write to the same physical memory.
> > > - reading from the first mapping should return the second write's value
> > > no matter what.
> >
> > That is an incomplete explanation, because it should never be possible
> > for reads to access data from the write buffer which isn't the most
> > recent.
>
> Umm, that's what I said.

You say that "reading from the first mapping _should_ return the
second write value no matter what", but that there's a bug in the
write buffer and it isn't doing that.

I'm saying that the bug can't be that, because such a bug would affect
normal applications.
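
For reference, the sequence being argued about (store through one
mapping, store through an alias, then load through the first mapping)
can be sketched roughly as follows. This is only an illustration and not
the test program from this thread: the names map_twice and
alias_reads_latest are made up here, and a temporary file under /tmp
stands in for whatever backing object a real test would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first page of a fresh temporary file at two
   kernel-chosen addresses.  Returns 0 on success, -1 on failure. */
int map_twice(volatile int **a, volatile int **b)
{
    char path[] = "/tmp/aliasXXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    void *p, *q;

    assert(page > 0);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file persists while it is mapped */
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    q = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* mappings stay valid after close */
    if (p == MAP_FAILED || q == MAP_FAILED)
        return -1;
    *a = p;
    *b = q;
    return 0;
}

/* Store via the first mapping, store via the alias, load via the first.
   Returns 1 if the load sees the second store. */
int alias_reads_latest(volatile int *a, volatile int *b)
{
    *a = 1;
    *b = 2;
    return *a == 2;
}
```

On hardware with coherent aliases, alias_reads_latest() returns 1. A
result of 0 would correspond to the kind of failure discussed here.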

> > Don't some of the ARMs executed two instructions concurrently, like
> > the original Pentium?
>
> Nope - they're all single issue CPUs, and, if non-buggy, they guarantee
> that stores never bypass loads. (In a later architecture revision, this
> is controllable.)
>
> Remember - ARM CPUs aren't a high spec desktop CPU. They're an embedded
> CPU where power consumption matters. Superscalar/multiple issue/high
> performance isn't viable in such many embedded environments.

Fair enough. I recall someone mentioning a dual issue ARM once upon a
time, that's all.

-- Jamie

2003-09-02 18:52:36

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Tue, Sep 02, 2003 at 12:57:31PM +0100, Jamie Lokier wrote:
> You say that "reading from the first mapping _should_ return the
> second write value no matter what", but that there's a bug in the
> write buffer and it isn't doing that.
>
> I'm saying that the bug can't be that, because such a bug would affect
> normal applications.

I know of no other explanation which fits with the information I have
available to me here. If you'd care to speculate further, you may,
but I see further speculation as being rather academic, unless it comes
from one of the people who designed the chip.

All this is, however, immaterial - the facts are that the write buffer
is buggy, this test detects it, and we can take fairly easy measures
to ensure we fix it up.

Multiple mappings of the same object rarely occur in my experience, so
the resulting performance loss caused by working around the cache and
writebuffer is something we can live with.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-02 21:39:46

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Kars de Jong wrote:
> And no, this board has no way of getting a better time resolution than
> the 100 Hz tick timer either ;-)

The coherency test is fine. That's just logic.

The clock granularity got me wondering whether the timing measurement
is meaningful on these machines. It's possible for the shared test to
take 2000 microseconds and the unshared test to take 10 microseconds,
and they can still show up as 10ms if they both cross a clock tick
boundary.
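
A small model of that effect, assuming a clock that only advances in
whole ticks (10000 microseconds at 100 Hz). The function name
ticked_elapsed is made up for this illustration and is not part of the
test program.

```c
#include <assert.h>

/* Time reported for an interval when the clock only advances in whole
   ticks: the difference between the tick-rounded end and start times.
   All values are in microseconds. */
long ticked_elapsed(long start_us, long duration_us, long tick_us)
{
    long start_tick, end_tick;

    assert(tick_us > 0);
    start_tick = start_us / tick_us;
    end_tick = (start_us + duration_us) / tick_us;
    return (end_tick - start_tick) * tick_us;
}
```

With tick_us = 10000, a 10 microsecond run starting at 9995 and a 2000
microsecond run starting at 8500 both report 10000, while either run
starting at 1000 reports 0.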

The minimum of 128 tests of each type is likely to report 0 until
timing_loops is large enough to make all 128 consistently take almost
10ms, depending on when each test starts relative to the tick. Then, as
we only care whether there is an approximately 2:1 ratio or more, it is
fine.

That depends on the timing of each test not being synchronised with
the clock ticks, or when they are, that not affecting the result.

I'm not sure, but I have a feeling that the random shuffle makes it ok.

Hmm.

-- Jamie

2003-09-02 20:30:05

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Larry McVoy wrote:
> Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390
> on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86,
> solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha.

It's interesting to see all the free unixes, Solaris and SCO have no
trouble mapping files. But AIX, HPUX and whatever environment you
have on Windows XP couldn't even do the mmaps.

Would you be able to try the aix/ppc, hpux/parisc and Windows XP (or
any Windows) tests again, but this time try each of these:

1. Compile with -DHAVE_SHM_OPEN
2. Compile with -DHAVE_SYSV_SHM

Thanks again,
-- Jamie

2003-09-02 20:43:13

by Kars de Jong

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 2003-09-01 at 12:08, Jamie Lokier wrote:
> Kars de Jong wrote:
> > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > > got a 68040, that leaves us with:
> > > - 68020+68551
> > > - 68060
> >
> > I can run it on these boxes if no-one else has done it yet before I come
> > home tonight. I'm sure there are more people with a 68060 out there, not
> > too sure about the 68020+68851.
>
> I would prefer that you run the attached program. It fixes a bug in
> the function which tests whether the problem is in the L1 cache or
> store buffer. The bug probably didn't affect the test, but it might
> have.
>
> Ideally you could run the program Geert linked to as well?
> Please remember to compile both with optimisation.

OK, here are my results (I'll skip the 68060 because Roman has already
run the program on that one):

This is on a Plessey PME 68-22. It's sooooo fast... Sam, is there a Sun
slower than this?

Original program:

fikkie:/tmp# ./jamie_test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

New program:

fikkie:/tmp# time ./jamie_test2
(2048) [10000,10000,0] Test separation: 4096 bytes: pass
(2048) [10000,10000,0] Test separation: 8192 bytes: pass
(2048) [10000,10000,0] Test separation: 16384 bytes: pass
(2048) [10000,10000,0] Test separation: 32768 bytes: pass
(2048) [10000,10000,0] Test separation: 65536 bytes: pass
(2048) [10000,10000,0] Test separation: 131072 bytes: pass
(2048) [10000,10000,0] Test separation: 262144 bytes: pass
(2048) [10000,10000,0] Test separation: 524288 bytes: pass
(2048) [10000,10000,0] Test separation: 1048576 bytes: pass
(2048) [10000,10000,0] Test separation: 2097152 bytes: pass
(2048) [10000,10000,0] Test separation: 4194304 bytes: pass
(2048) [10000,10000,0] Test separation: 8388608 bytes: pass
(2048) [10000,10000,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real 1m51.210s
user 1m44.950s
sys 0m4.930s
fikkie:/tmp# cat /proc/cpuinfo
CPU: 68020
MMU: 68851
FPU: 68881
Clocking: 15.6MHz
BogoMips: 3.90
Calibration: 19520 loops
fikkie:/tmp#

And no, this board has no way of getting a better time resolution than
the 100 Hz tick timer either ;-)

Regards,

Kars.

2003-09-02 23:59:09

by Larry McVoy

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote:
> Multiple mappings of the same object rarely occur in my experience, so
> the resulting performance loss caused by working around the cache and
> writebuffer is something we can live with.

Multiple *writable* mappings. Don't forget about libc et al.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-03 07:31:24

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Tue, Sep 02, 2003 at 04:59:00PM -0700, Larry McVoy wrote:
> On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote:
> > Multiple mappings of the same object rarely occur in my experience, so
> > the resulting performance loss caused by working around the cache and
> > writebuffer is something we can live with.
>
> Multiple *writable* mappings. Don't forget about libc et al.

I mean in the same group of threads with the same struct mm, not the whole
system.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-03 07:41:39

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> > > Multiple mappings of the same object rarely occur in my experience, so
> > > the resulting performance loss caused by working around the cache and
> > > writebuffer is something we can live with.
> >
> > Multiple *writable* mappings. Don't forget about libc et al.
>
> I mean in the same group of threads with the same struct mm, not the whole
> system.

Larry means that it's perfectly normal for libc to map the same file
more than once: you have the code section and the data section.

I don't know if ARM's ELF is like the x86, but on the x86 the final
partial page of code or read-only data will be mapped twice, as the
latter part of the page can contain writable data. This avoids
wasting up to a page's worth of bytes in the ELF file.

-- Jamie

2003-09-03 08:04:27

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On 2 Sep 2003, Kars de Jong wrote:
> fikkie:/tmp# ./jamie_test
> Test separation: 4096 bytes: pass
> Test separation: 8192 bytes: pass
> Test separation: 16384 bytes: pass
> Test separation: 32768 bytes: pass
> Test separation: 65536 bytes: pass
> Test separation: 131072 bytes: pass
> Test separation: 262144 bytes: pass
> Test separation: 524288 bytes: pass
> Test separation: 1048576 bytes: pass
> Test separation: 2097152 bytes: pass
> Test separation: 4194304 bytes: pass
> Test separation: 8388608 bytes: pass
> Test separation: 16777216 bytes: pass
> VM page alias coherency test: all sizes passed
>
> New program:
>
> fikkie:/tmp# time ./jamie_test2
> (2048) [10000,10000,0] Test separation: 4096 bytes: pass
> (2048) [10000,10000,0] Test separation: 8192 bytes: pass
> (2048) [10000,10000,0] Test separation: 16384 bytes: pass
> (2048) [10000,10000,0] Test separation: 32768 bytes: pass
> (2048) [10000,10000,0] Test separation: 65536 bytes: pass
> (2048) [10000,10000,0] Test separation: 131072 bytes: pass
> (2048) [10000,10000,0] Test separation: 262144 bytes: pass
> (2048) [10000,10000,0] Test separation: 524288 bytes: pass
> (2048) [10000,10000,0] Test separation: 1048576 bytes: pass
> (2048) [10000,10000,0] Test separation: 2097152 bytes: pass
> (2048) [10000,10000,0] Test separation: 4194304 bytes: pass
> (2048) [10000,10000,0] Test separation: 8388608 bytes: pass
> (2048) [10000,10000,0] Test separation: 16777216 bytes: pass
> VM page alias coherency test: all sizes passed
>
> real 1m51.210s
> user 1m44.950s
> sys 0m4.930s
> fikkie:/tmp# cat /proc/cpuinfo
> CPU: 68020
> MMU: 68851
> FPU: 68881
> Clocking: 15.6MHz
> BogoMips: 3.90
> Calibration: 19520 loops
> fikkie:/tmp#

So the store buffer is coherent on 68020 with external MMU, while it isn't on
68040 with internal MMU...

Now all that's left is the 68030.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-03 08:00:48

by Kars de Jong

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:
> - 68030

Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:

sasscm:/tmp# time ./jamie_test2
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead

real 0m1.149s
user 0m0.240s
sys 0m0.670s
sasscm:/tmp# cat /proc/cpuinfo
CPU: 68030
MMU: 68030
FPU: 68882
Clocking: 19.6MHz
BogoMips: 4.90
Calibration: 24512 loops

Regards,

Kars.

2003-09-03 08:08:12

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On 3 Sep 2003, Kars de Jong wrote:
> On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > got a 68040, that leaves us with:
> > - 68030
>
> Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:
>
> sasscm:/tmp# time ./jamie_test2
> Test separation: 4096 bytes: FAIL - cache not coherent

I guess the Plessey PME 68-22 didn't have cache, since the test passed?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-03 09:14:36

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Geert Uytterhoeven wrote:
> So the store buffer is coherent on 68020 with external MMU, while it
> isn't on 68040 with internal MMU...

Does the 68020 even _have_ the equivalent of a store buffer?

-- Jamie

2003-09-03 09:29:26

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 3 Sep 2003, Jamie Lokier wrote:
> Geert Uytterhoeven wrote:
> > So the store buffer is coherent on 68020 with external MMU, while it
> > isn't on 68040 with internal MMU...
>
> Does the 68020 even _have_ the equivalent of a store buffer?

Good question :-)

After I sent the previous mail, I realized the '030 has 256 bytes I cache and
256 bytes D cache, while the '020 has 256 bytes I cache only.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-03 09:24:18

by Kars de Jong

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 2003-09-03 at 10:05, Geert Uytterhoeven wrote:
> On 3 Sep 2003, Kars de Jong wrote:
> > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > > got a 68040, that leaves us with:
> > > - 68030
> >
> > Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:
> >
> > sasscm:/tmp# time ./jamie_test2
> > Test separation: 4096 bytes: FAIL - cache not coherent
>
> I guess the Plessey PME 68-22 didn't have cache, since the test passed?

No, no cache. Well. A very tiny instruction cache in the 68020 itself.

Regards,

Kars.

2003-09-03 12:17:34

by Roman Zippel

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi,

On Wed, 3 Sep 2003, Geert Uytterhoeven wrote:

> > Does the 68020 even _have_ the equivalent of a store buffer?
>
> Good question :-)
>
> After I sent the previous mail, I realized the '030 has 256 bytes I cache and
> 256 bytes D cache, while the '020 has 256 bytes I cache only.

BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060
caches are PIPT.

bye, Roman

2003-09-03 12:13:15

by Jan-Benedict Glaw

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 2003-09-03 09:59:02 +0200, Geert Uytterhoeven <[email protected]>
wrote in message <[email protected]>:
> On 2 Sep 2003, Kars de Jong wrote:
> Now all that's left is the 68030.

Maybe I'll get my Amiga 3000 installed one of these days... I think it
has got a 68030.

MfG, JBG

--
Jan-Benedict Glaw [email protected] . +49-172-7608481
"Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg
fuer einen Freien Staat voll Freier Bürger" | im Internet! | im Irak!
ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA));


2003-09-03 12:37:25

by Geert Uytterhoeven

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 3 Sep 2003, Roman Zippel wrote:
> On Wed, 3 Sep 2003, Geert Uytterhoeven wrote:
> > > Does the 68020 even _have_ the equivalent of a store buffer?
> >
> > Good question :-)
> >
> > After I sent the previous mail, I realized the '030 has 256 bytes I cache and
> > 256 bytes D cache, while the '020 has 256 bytes I cache only.
>
> BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060
> caches are PIPT.

That explains a bit. But the '060 stores are coherent, while the '040 stores
aren't.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2003-09-03 13:30:23

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Geert Uytterhoeven wrote:
> > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060
> > caches are PIPT.
>
> That explains a bit. But the '060 stores are coherent, while the '040 stores
> aren't.

The L1 cache is coherent on the '040 according to the results. It's
the store buffer snooping which fails. Presumably the CPU core is
looking ahead at recent writes comparing just virtual addresses.
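
To make that presumption concrete, here is a toy model of a store buffer
whose loads search pending stores either by virtual or by physical
address. It is purely illustrative and makes no claim about how the '040
is really built: the structure, the names (toy_cpu, toy_store, toy_load)
and the fixed translation are all invented for this sketch.

```c
#include <assert.h>

enum { MEM_WORDS = 2, SB_SLOTS = 4 };

struct pending_store { unsigned va, pa; int value; };

struct toy_cpu {
    int memory[MEM_WORDS];              /* indexed by physical address */
    struct pending_store sb[SB_SLOTS];  /* stores not yet written to memory */
    int sb_count;
    int match_on_va;                    /* 1: loads search the buffer by virtual address */
};

/* Hypothetical translation: virtual addresses 0 and 8 alias physical
   word 0; anything else maps to physical word 1. */
unsigned toy_translate(unsigned va)
{
    return (va == 0 || va == 8) ? 0 : 1;
}

/* Queue a store; nothing is written to memory in this model. */
void toy_store(struct toy_cpu *cpu, unsigned va, int value)
{
    assert(cpu->sb_count < SB_SLOTS);
    cpu->sb[cpu->sb_count].va = va;
    cpu->sb[cpu->sb_count].pa = toy_translate(va);
    cpu->sb[cpu->sb_count].value = value;
    cpu->sb_count++;
}

/* A load takes the newest matching pending store, else reads memory. */
int toy_load(const struct toy_cpu *cpu, unsigned va)
{
    unsigned pa = toy_translate(va);
    int i;

    for (i = cpu->sb_count - 1; i >= 0; i--) {
        const struct pending_store *s = &cpu->sb[i];
        if (cpu->match_on_va ? s->va == va : s->pa == pa)
            return s->value;
    }
    return cpu->memory[pa];
}
```

In this model, storing 1 at virtual address 0 and then 2 at its alias 8
makes a load from 0 return the stale 1 when matching is by virtual
address, and 2 when matching is by physical address. Programs without
aliases would never notice the difference.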

-- Jamie

2003-09-03 17:45:29

by Bill Davidsen

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

In article <[email protected]>,
David S. Miller <[email protected]> wrote:

| > This is my strategy:
| >
| > mmap MAP_ANON without MAP_FIXED to find a free area
| > mmap MAP_FIXED over the anon area at same address
| > mmap MAP_FIXED over the anon area at larger address
| >
| > I don't see any strategy that lets me establish this kind of circular
| > mapping on Sparc without either (a) knowing the value of SHMLBA, or
| > (b) risking clobbering another thread's mmap.
|
| Why do you need the same piece of data mapped to multiple places
| in the first place, and why at specific addresses? It's purely an
| optimization of some sort, right?

I think he said he was doing DSP... there's a trick of double mapping
the same memory to save one subscript calculation in FFT (or maybe DFT)
inner loop. The only reason I know this is that a friend did a master's
thesis on DSP about 20 years ago, and I absorbed some info I hope to
never need. He also coded an FFT instruction in the LCS (programmable
firmware) of a VAX.

I am only speculating, of course.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-09-03 18:07:58

by Russell King

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, Sep 03, 2003 at 08:41:34AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > > > Multiple mappings of the same object rarely occur in my experience, so
> > > > the resulting performance loss caused by working around the cache and
> > > > writebuffer is something we can live with.
> > >
> > > Multiple *writable* mappings. Don't forget about libc et al.
> >
> > I mean in the same group of threads with the same struct mm, not the whole
> > system.
>
> Larry means that it's perfectly normal for libc to map the same file
> more than once: you have the code section and the data section.

Code is read-only, data is read-write and is copy on write. Therefore
it's a different scenario.
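
A rough sketch of why the copy-on-write case differs: a store through a
MAP_PRIVATE mapping goes to a private copy of the page, so it is never
expected to show up through another mapping of the same file. The
function name and the temporary file are assumptions made for this
illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one zero-filled page of a temporary file twice: read-only shared
   (like code) and read-write private (like data).  Returns 1 if a store
   through the private mapping is invisible through the shared one,
   0 if it is visible, -1 on setup failure. */
int private_write_stays_private(void)
{
    char path[] = "/tmp/cowXXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    const volatile char *ro;
    volatile char *rw;
    int isolated;

    assert(page > 0);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }
    ro = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    rw = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (ro == MAP_FAILED || rw == MAP_FAILED)
        return -1;

    rw[0] = 'x';                /* triggers copy on write */
    isolated = (rw[0] == 'x' && ro[0] == 0);
    munmap((void *)ro, page);
    munmap((void *)rw, page);
    return isolated;
}
```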

Practical tests indicate that the vast majority of applications do not
trip the test.

You're right in theory, but I don't particularly care about theory when
it's real life which matters.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-09-04 04:10:23

by Nagendra Singh Tomar

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Jamie,
Just wondered if the store buffer is snooped in some
architectures. In that case I believe the OS need not do anything for
serialization (except for aliases, if they do not hit the same cache line).
On x86 the store buffer is not snooped, which leads to all these
serialization issues (other CPUs looking at a stale value of data which
is in the store buffer of some other CPU).
Please correct me if I have got anything wrong.

Thanx,
tomar



On Wed, 3 Sep 2003, Jamie Lokier wrote:

> Geert Uytterhoeven wrote:
> > > BTW the 020/030 caches are VIVT (and also only writethrough), the
> 040/060
> > > caches are PIPT.
> >
> > That explains a bit. But the '060 stores are coherent, while the '040
> stores
> > aren't.
>
> The L1 cache is coherent on the '040 according to the results. It's
> the store buffer snooping which fails. Presumably the CPU core is
> looking ahead at recent writes comparing just virtual addresses.
>
> -- Jamie

2003-09-04 05:09:49

by Davide Libenzi

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:

> Jamie,
> Just wondered if the store buffer is snooped in some
> architectures. In that case I believe the OS need not do anything for
> serialization (except for aliases, if they do not hit the same cache line).
> In x86 store buffer is not snooped which leads to all these serialization
> issues (other CPUs looking at stale value of data which is in the store
> buffer of some other CPU).
> Pl correct me if I have got anything wrong/

To avoid the so-called 'load hazard' (which, BTW, triggers read over
writes, which are not allowed on x86) you have two options: snoop the
write buffer, or flush it upon an L1 miss. Otherwise you might end up
getting stale data from L2.



- Davide

2003-09-04 06:06:09

by Nagendra Singh Tomar

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this


On Thu, 4 Sep 2003, Davide Libenzi wrote:

> On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:
>
> > Jamie,
> > Just wondered if the store buffer is snooped in some
> > architectures. In that case I believe the OS need not do anything for
> > serialization (except for aliases, if they do not hit the same cache
> line).
> > In x86 store buffer is not snooped which leads to all these
> serialization
> > issues (other CPUs looking at stale value of data which is in the
> store
> > buffer of some other CPU).
> > Pl correct me if I have got anything wrong/
>
> To avoid the so called 'load hazard' (that, BTW, triggers read over
> writes, that are not allowed in x86) you have two options. Snoop the
> write
> buffer or flush it upon L1 miss. Otherwise you might end up getting
> stale
> data from L2.
>

I meant to ask if the store buffer is snooped by *other CPUs*. To maintain
self coherence the local store buffer has to be anyway consulted by local
loads to give the latest stored value.

Thanx,

tomar
>
>
> - Davide
>

2003-09-04 06:43:45

by Davide Libenzi

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:

> I meant to ask if the store buffer is snooped by *other CPUs*. To maintain
> self coherence the local store buffer has to be anyway consulted by local
> loads to give the latest stored value.

There are CPUs (at least some versions of Alpha, 21064 IIRC) that use
flush upon L1 read miss, so they do not snoop their local WB. IIRC P5 has
internal and external snooping while P6, using a write allocate L1, does
not have external snooping.



- Davide

2003-09-04 11:20:50

by Alan

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mer, 2003-09-03 at 17:07, Nagendra Singh Tomar wrote:
> In x86 store buffer is not snooped which leads to all these serialization
> issues (other CPUs looking at stale value of data which is in the store
> buffer of some other CPU).

x86 gives you coherency and store ordering (barring errata and special
CPU modes)

2003-09-04 17:37:31

by Maciej W. Rozycki

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Mon, 1 Sep 2003, Jamie Lokier wrote:

> Please try the program below, which is the same as before but with
> test_l1_only hopefully improved, and it prints some more helpful
> numbers.

A few MIPS systems:

1. An R3400-based DECstation 5000/240 -- the CPU has a 64kB I-cache and a
64kB D-cache, both are direct mapped, PIPT:

$ uname -a
Linux 3maxp 2.4.21 #3 Thu Aug 14 04:14:33 CEST 2003 mips unknown unknown GNU/Linux
$ time ./test
(256) [155,155,7] Test separation: 4096 bytes: pass
(256) [155,155,7] Test separation: 8192 bytes: pass
(256) [155,155,7] Test separation: 16384 bytes: pass
(256) [155,155,7] Test separation: 32768 bytes: pass
(256) [155,155,7] Test separation: 65536 bytes: pass
(256) [155,155,7] Test separation: 131072 bytes: pass
(256) [155,155,7] Test separation: 262144 bytes: pass
(256) [155,155,7] Test separation: 524288 bytes: pass
(256) [155,155,7] Test separation: 1048576 bytes: pass
(256) [155,155,7] Test separation: 2097152 bytes: pass
(256) [155,155,7] Test separation: 4194304 bytes: pass
(256) [155,155,7] Test separation: 8388608 bytes: pass
(256) [155,155,7] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
1.01user 0.27system 0:01.33elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+44minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type : Digital DECstation 5000/2x0
processor : 0
cpu model : R3000A V3.0 FPU V4.0
BogoMIPS : 39.90
wait instruction : no
microsecond timers : no
tlb_entries : 64
extra interrupt vector : no
hardware watchpoint : no
VCED exceptions : not available
VCEI exceptions : not available

2. An R4400SC-based DECstation 5000/260 -- the CPU has a 16kB primary
I-cache and a 16kB primary D-cache, both are direct mapped, VIPT, and a
1024kB secondary joint (I+D) cache, direct mapped, PIPT:

$ uname -a
Linux 4maxp64 2.4.21 #19 Mon Aug 25 00:16:25 CEST 2003 mips64 unknown unknown GNU/Linux
$ time ./test
(64) [331,17,3] Test separation: 4096 bytes: FAIL - too slow
(64) [331,17,3] Test separation: 8192 bytes: FAIL - too slow
(128) [38,63,3] Test separation: 16384 bytes: pass
(128) [38,63,3] Test separation: 32768 bytes: pass
(128) [38,63,3] Test separation: 65536 bytes: pass
(128) [38,63,3] Test separation: 131072 bytes: pass
(128) [38,63,3] Test separation: 262144 bytes: pass
(128) [38,63,3] Test separation: 524288 bytes: pass
(128) [38,63,3] Test separation: 1048576 bytes: pass
(128) [38,63,3] Test separation: 2097152 bytes: pass
(128) [38,63,3] Test separation: 4194304 bytes: pass
(128) [38,63,3] Test separation: 8388608 bytes: pass
(128) [38,63,3] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
0.34user 0.14system 0:00.53elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+250minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type : Digital DECstation 5000/2x0
processor : 0
cpu model : R4400SC V4.0 FPU V0.0
BogoMIPS : 59.86
wait instruction : no
microsecond timers : yes
tlb_entries : 48
extra interrupt vector : no
hardware watchpoint : yes
VCED exceptions : 464662
VCEI exceptions : 667534

3. A MIPS 5Kc-based Malta -- the CPU has a 16kB I-cache and a 16kB
D-cache, both are 4-way set associative, VIPT:

$ uname -a
Linux malta 2.4.21 #5 Sun Aug 3 21:51:32 CEST 2003 mips unknown unknown GNU/Linux
$ time ./test
(128) [25,23,1] Test separation: 4096 bytes: pass
(128) [25,23,1] Test separation: 8192 bytes: pass
(128) [25,23,1] Test separation: 16384 bytes: pass
(128) [25,23,1] Test separation: 32768 bytes: pass
(256) [49,46,1] Test separation: 65536 bytes: pass
(128) [25,23,1] Test separation: 131072 bytes: pass
(128) [25,23,1] Test separation: 262144 bytes: pass
(256) [49,46,1] Test separation: 524288 bytes: pass
(256) [49,46,1] Test separation: 1048576 bytes: pass
(256) [49,46,1] Test separation: 2097152 bytes: pass
(256) [48,45,2] Test separation: 4194304 bytes: pass
(256) [49,46,1] Test separation: 8388608 bytes: pass
(128) [25,23,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
0.22user 0.06system 0:00.30elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+44minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type : MIPS Malta
processor : 0
cpu model : MIPS 5Kc V0.1
BogoMIPS : 159.74
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes
VCED exceptions : not available
VCEI exceptions : not available

The slowdown for the R4400SC processor is surely the result of Virtual
Coherency Exceptions (reported in cpuinfo for both primary caches) -- the
secondary cache (S-cache) remembers a few bits of the virtual address (VA)
and if there is a hit in the S-cache, but the VA bits don't match, an
exception is taken to write back and invalidate the old entry from the
respective primary cache (P-cache) and reset the VA bits to the new value.
Then a reexecution of the faulting instruction does a refill to the
P-cache from the S-cache. This problem doesn't happen for the two other
processors as neither has an S-cache and also the R3400's P-cache is PIPT.

We avoid the hit resulting from cache aliasing for MIPS by aligning maps
appropriately.
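
As a worked example of the arithmetic behind that alignment, for a
virtually indexed cache: one way spans cache size / associativity bytes,
and aliases only index the same cache lines if they are spaced by a
multiple of that span (or of the page size, whichever is larger). The
sketch below assumes 4kB pages, and the function names are made up here.
It is not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of virtual address space covered by one way of the cache. */
unsigned long way_size(unsigned long cache_bytes, unsigned ways)
{
    assert(ways > 0);
    return cache_bytes / ways;
}

/* Smallest spacing at which two page-aligned aliases index the same
   lines of a virtually indexed cache.  If a way is no larger than a
   page, any page-aligned spacing works. */
unsigned long alias_span(unsigned long cache_bytes, unsigned ways,
                         unsigned long page_bytes)
{
    unsigned long w = way_size(cache_bytes, ways);
    return w > page_bytes ? w : page_bytes;
}

/* 1 if both addresses have the same cache colour.  span must be a
   power of two. */
int same_colour(uintptr_t a, uintptr_t b, unsigned long span)
{
    return ((a ^ b) & (span - 1)) == 0;
}
```

For the R4400SC's 16kB direct mapped P-cache this gives 16384, matching
the "minimum fast spacing" reported above. For the 5Kc's 16kB 4-way
cache it gives 4096, so every page-aligned separation is fine. The
R3400's caches are PIPT, so the calculation doesn't apply there.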

Maciej

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2003-09-04 22:21:32

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Russell King wrote:
> > Larry means that it's perfectly normal for libc to map the same file
> > more than once: you have the code section and the data section.
>
> Code is read-only, data is read-write and is copy on write. Therefore
> its a different scenario.

Yes, a thinko on my part :)

-- Jamie

2003-09-04 22:50:20

by Jamie Lokier

Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

bill davidsen wrote:
> | Why do you need the same piece of data mapped to multiple places
> | in the first place, and why at specific addresses? It's purely an
> | optimization of some sort, right?
>
> I think he said he was doing DSP... there's a trick of double mapping
> the same memory to save one subscript calculation in FFT (or maybe DFT)
> inner loop.

It is for DSP, but nothing to do with FFT. I hadn't ever thought of
using this technique for FFT, and it would probably make little
difference on a modern CPU given the form of FFT algorithms.

No, I use it to make a circular buffer, in which the data always
appears as a contiguous block - no split. This is useful for
operations on streams of data, such as FIR & IIR filters, equalisers,
upconverters, downconverters, etc. Many DSP algorithms fall into this
category.

A characteristic of these algorithms is that they consist of a long,
tight sequence of streaming memory accesses with calculations at each
step.

DSP chips often implement circular buffers by masking the offset into
the buffer's memory.

On a CPU, I prefer to avoid the masking operation which happens for
each address calculation. This saves a couple of registers, as I can
just use an incrementing pointer into the buffer, rather than a base
address, offset and mask value. Especially on x86, a couple of
registers saved is good.
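
A minimal sketch of the difference, assuming a power-of-two buffer
size. The names fir_masked and fir_linear are hypothetical and are not
taken from the test program:

```c
#include <assert.h>
#include <stddef.h>

/* FIR step over a circular buffer of `size` samples (a power of two).
   Every tap access needs the base, the offset and the mask. */
float fir_masked(const float *buf, size_t size, size_t pos,
                 const float *taps, size_t ntaps)
{
    size_t mask = size - 1;
    float acc = 0.0f;
    size_t i;

    assert(size != 0 && (size & mask) == 0);
    for (i = 0; i < ntaps; i++)
        acc += taps[i] * buf[(pos + i) & mask];
    return acc;
}

/* Same sum when the `ntaps` samples starting at `p` are contiguous in
   virtual memory: a single incrementing pointer, no mask. */
float fir_linear(const float *p, const float *taps, size_t ntaps)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < ntaps; i++)
        acc += taps[i] * *p++;
    return acc;
}
```

fir_linear is only valid when the samples at p never wrap, which is
the property the duplicate mapping is meant to provide.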

It's possible to write DSP algorithms which avoid address masking,
after all a circular buffer in an ordinary array is just two separate
regions. But that complicates the algorithms especially with corner
cases, and some of them are complicated enough already.

Using the duplicate mappings, I can use the most straightforward
streaming DSP code, and it runs as fast as possible if the mappings
don't incur a penalty.
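
A minimal sketch of how such a duplicate mapping might be set up on a
POSIX system, assuming an unlinked temporary file as the shared backing
object. The function name map_twice and the /tmp path are hypothetical,
and this is not necessarily how the test program does it:

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the same `size` bytes twice, back to back, and return the base
   address, or NULL on failure.  `size` must be a multiple of the page
   size; on machines with aliasing caches it should also be a multiple
   of whatever distance keeps the two views coherent. */
void *map_twice(size_t size)
{
    char path[] = "/tmp/ringbuf-XXXXXX";
    int fd = mkstemp(path);
    char *base;

    if (fd < 0)
        return NULL;
    unlink(path);                       /* file lives on only via fd */
    if (ftruncate(fd, (off_t) size) < 0) {
        close(fd);
        return NULL;
    }

    /* Reserve 2*size of contiguous address space first. */
    base = mmap(NULL, 2 * size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Overlay both halves of the reservation with the same file pages. */
    if (mmap(base, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED
        || mmap(base + size, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        munmap(base, 2 * size);
        close(fd);
        return NULL;
    }
    close(fd);                          /* the mappings keep the pages */
    return base;
}
```

After a successful call, base[i] and base[i + size] name the same byte,
so a reader can run up to size bytes past any start position without
wrapping.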

When mappings aren't available or are too slow, then I just copy the
contents of the buffer backwards whenever the write pointer will cross
the end of the array. That costs some, but keeps the DSP code simple.
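
A minimal sketch of that fallback, under the assumption that the filter
only ever needs the most recent `history` samples. The struct and the
function name ring_push are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct linear_ring {
    float *data;       /* ordinary array of `capacity` samples */
    size_t capacity;
    size_t history;    /* samples the filter must see contiguously */
    size_t wr;         /* next write index */
};

/* Append one sample.  Returns a pointer to the latest `history`
   samples as one contiguous block, or NULL until that many exist. */
const float *ring_push(struct linear_ring *r, float sample)
{
    assert(r->history >= 1 && r->history <= r->capacity);

    if (r->wr == r->capacity) {
        /* About to run off the end: copy the tail that is still
           needed back to the start and continue from there. */
        size_t keep = r->history - 1;

        memmove(r->data, r->data + r->capacity - keep,
                keep * sizeof *r->data);
        r->wr = keep;
    }
    r->data[r->wr++] = sample;
    return r->wr >= r->history ? r->data + (r->wr - r->history) : NULL;
}
```

The copy happens once per (capacity - history + 1) samples, so a larger
array makes it rarer at the price of a bigger cache footprint.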

Fwiw, the test program assesses whether there's a cost to using
duplicate mappings and whether they work. However, for the above kind
of DSP buffer, the measurement isn't the best it could be (although
it's what I'm using). There's a balance of factors. For a large
buffer, it's ok even if page faults were to be needed as we switch
between alias pages, because the access pattern doesn't do that very
often. Then the occasional page faults are just a potentially faster
version of the copy backwards. On the other hand, if aliased pages
are made coherent by making them uncacheable (as the ARM port does),
even though that's much faster than faulting, it isn't good for the
DSP algorithms.

Fwiw#2, in the DSP I'm working on it's better to use the copying
method for most of my buffers even on x86, because they aren't that
large and fit better into L1 cache without the mappings. Maybe for a
different project, it will get used for more of the buffers. Mainly,
having developed the testing code, I wanted to know if it worked
properly on the different architectures. It's nice to see some
spin-offs, such as finding the ARM write buffer bug.

So thanks to everyone who responded. I'll post a table of the results
soon.

Thanks,
-- Jamie

2003-09-06 21:37:22

by Pavel Machek

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi!

> > In x86 store buffer is not snooped which leads to all these serialization
> > issues (other CPUs looking at stale value of data which is in the store
> > buffer of some other CPU).
>
> x86 gives you coherency and store ordering (barring errata and special
> CPU modes)

Special CPU modes? You mean some special SSE stores?
Pavel

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-06 23:10:11

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Pavel Machek wrote:
> > x86 gives you coherency and store ordering (barring errata and special
> > CPU modes)
>
> Special CPU modes? You mean some special SSE stores?

Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE.

You can change the memory settings to weakly ordered writes, which
means that a plain write isn't suitable for spin_unlock. Presumably
this mode is faster (though I don't see why, if Intel, AMD et al. can
manage good memory performance without weak writes).
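
To make the ordering requirement concrete, here is a sketch in C11
atomics. This is an illustration only, not the kernel's spinlock code,
and the names sketch_lock, sketch_lock_acquire and sketch_lock_release
are hypothetical:

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;    /* 0 = free, 1 = held */
} sketch_lock;

void sketch_lock_acquire(sketch_lock *l)
{
    int expected = 0;

    /* Spin until we change 0 -> 1; acquire ordering on success. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->locked, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;
}

void sketch_lock_release(sketch_lock *l)
{
    /* Release ordering: every write made inside the critical section
       must be visible before the lock is seen as free.  A plain
       store only gives this when the CPU keeps stores in order. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

With normally ordered x86 stores the release store can be an ordinary
write; with weakly ordered writes something stronger is needed at that
point.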

-- Jamie

2003-09-07 13:10:16

by Pavel Machek

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi!

> > > x86 gives you coherency and store ordering (barring errata and special
> > > CPU modes)
> >
> > Special CPU modes? You mean some special SSE stores?
>
> Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE.
>
> You can change the memory settings to weakly ordered writes, which
> means that a plain write isn't suitable for spin_unlock. Presumably
> this mode is faster (though I don't see why, if Intel, AMD et al. can
> manage good memory performance without weak writes).

Wow, seems interesting, how much performance does it buy? [Maybe AMD
and Intel just threw a lot of silicon at the problem and it went
away. Centaur solution might be nicer, though -- spin_unlock is so
uncommon that this seems like a nice optimization.]

--
Horseback riding is like software...
...vgf orggre jura vgf serr.

2003-09-07 13:36:28

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Pavel Machek wrote:
> Wow, seems interesting, how much performance does it buy? [Maybe AMD
> and Intel just threw a lot of silicon at the problem and it went
> away. Centaur solution might be nicer, though -- spin_unlock is so
> uncommon that this seems like a nice optimization.]

I didn't realise Centaur SMP systems existed, but I guess they must do
for weak memory writes to mean anything.

-- Jamie

2003-09-07 13:40:24

by Pavel Machek

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Hi!

Perhaps weak ordering matters when you are writing to the MMIO, too?


> > Wow, seems interesting, how much performance does it buy? [Maybe AMD
> > and Intel just threw a lot of silicon at the problem and it went
> > away. Centaur solution might be nicer, though -- spin_unlock is so
> > uncommon that this seems like a nice optimization.]
>
> I didn't realise Centaur SMP systems existed, but I guess they must do
> for weak memory writes to mean anything.
>
> -- Jamie

--
Horseback riding is like software...
...vgf orggre jura vgf serr.

2003-09-07 13:54:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

Pavel Machek wrote:
> Perhaps weak ordering matters when you are writing to the MMIO, too?

Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try
hard to set weak ordering for RAM, not the whole address space.

-- Jamie

2003-09-07 17:57:58

by Alan

[permalink] [raw]
Subject: Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this

On Sul, 2003-09-07 at 14:53, Jamie Lokier wrote:
> Pavel Machek wrote:
> > Perhaps weak ordering matters when you are writing to the MMIO, too?
>
> Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try
> hard to set weak ordering for RAM, not the whole address space.

There are three cases I know of where you get weak store ordering that
is visible in some way

#1 Pentium Pro due to an erratum, hence the need for lock in the
spin_unlock

#2 Centaur Winchip (where OOSTORE off is worth 10-30% performance on
common tasks). A lot of that has to do with the nature of the CPU and
the old socket 7 bus stuff. It's not SMP, but we have to care about it
for MMIO, not because MMIO is itself out of order (we leave it in order)
but because of DMA. We must ensure that our writes to RAM finish
-before- we kick off the hardware copying the data...

#3 Weak store ordering via SSE-type instructions, where it's intentional
and an sfence is needed eventually