2006-11-09 00:56:36

by Bela Lubkin

Subject: touch_cache() only touches two thirds

Submitted as <http://bugzilla.kernel.org/show_bug.cgi?id=7476>:

I noticed this bug while looking at something else. These comments are based
purely on source inspection without any runtime observations. The bug
remains in the latest sources I could find: 2.6.19-rc5 and rc5-mm1.

Ingo Molnar's patch referred to as "scheduler cache-hot-auto-tune" introduced
the function kernel/sched.c:touch_cache(). Here is the 2.6.19-rc5-mm1
version (my "forward / backward" comments):

/*
 * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
 * is the operation that is timed, so we try to generate unpredictable
 * cachemisses that still end up filling the L2 cache:
 */
static void touch_cache(void *__cache, unsigned long __size)
{
	unsigned long size = __size / sizeof(long);
	unsigned long chunk1 = size / 3;
	unsigned long chunk2 = 2 * size / 3;
	unsigned long *cache = __cache;
	int i;

	for (i = 0; i < size / 6; i += 8) {
		switch (i % 6) {
		case 0: cache[i]++;		/* 1st third, forward */
		case 1: cache[size-1-i]++;	/* 3rd third, backward */
		case 2: cache[chunk1-i]++;	/* 1st third, backward */
		case 3: cache[chunk1+i]++;	/* 2nd third, forward */
		case 4: cache[chunk2-i]++;	/* 2nd third, backward */
		case 5: cache[chunk2+i]++;	/* 3rd third, forward */
		}
	}
}

Notice that the for() loop increments `i' by 8 (earlier versions of the code
used a stride of 4). Since all visited values of i are even, `i % 6' can
never be odd. The switch cases 1/3/5 will never be hit -- so the 3rd third
of the test buffer isn't being touched. Migration costs are actually being
calculated relative to buffers that are only 2/3 as large as intended.

An easy fix is to make the stride relatively prime to the modulus:

--- sched.c.orig 2006-11-08 16:17:37.299500000 -0800
+++ sched.c 2006-11-08 16:28:21.699750000 -0800
@@ -5829,5 +5829,5 @@ static void touch_cache(void *__cache, u
int i;

- for (i = 0; i < size / 6; i += 8) {
+ for (i = 0; i < size / 6; i += 7) {
switch (i % 6) {
case 0: cache[i]++;

>Bela<

PS: I suspect this at least partially explains Kenneth W Chen's observations
in "Variation in measure_migration_cost() with
scheduler-cache-hot-autodetect.patch in -mm",
<http://lkml.org/lkml/2005/6/21/473>:

> I'm consistently getting a smaller than expected cache migration cost
> as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
> in -mm tree.


2006-11-10 08:26:09

by Andi Kleen

Subject: Re: touch_cache() only touches two thirds

"Bela Lubkin" <[email protected]> writes:
>
> /*
> * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
> * is the operation that is timed, so we try to generate unpredictable
> * cachemisses that still end up filling the L2 cache:
> */

The comment is misleading anyway. AFAIK several of the modern
CPUs (at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
predictors advanced enough to follow several streams forward and backward
in parallel.

I hit this while doing NUMA benchmarking, for example.

Most likely, to be really unpredictable you need to use a true RNG
while still making sure the full cache range is covered.


-Andi

2006-11-11 00:13:08

by Bela Lubkin

Subject: RE: touch_cache() only touches two thirds

Andi Kleen wrote:

> "Bela Lubkin" <[email protected]> writes:
> >
> > /*
> > * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
> > * is the operation that is timed, so we try to generate unpredictable
> > * cachemisses that still end up filling the L2 cache:
> > */
>
> The comment is misleading anyway. AFAIK several of the modern
> CPUs (at least K8, later P4s, Core2, POWER4+, PPC970) have prefetch
> predictors advanced enough to follow several streams forward and backward
> in parallel.
>
> I hit this while doing NUMA benchmarking, for example.
>
> Most likely, to be really unpredictable you need to use a true RNG
> while still making sure the full cache range is covered.

The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
covers the full cache range. Granted that modern CPUs may be able to track
multiple simultaneous cache access streams: how many such streams are they
likely to be able to follow at once? It seems like going from 1 to 2 would
be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
incremental difference. So what do the actual implementations in the field
support?

The code (original and corrected) uses 6 simultaneous streams.

I have a modified version that takes a `ways' parameter to use an arbitrary
number of streams. I'll post that onto bugzilla.kernel.org.

>Bela<

2006-11-11 00:23:57

by Andi Kleen

Subject: Re: touch_cache() only touches two thirds


> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range. Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once? It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference. So what do the actual implementations in the field
> support?

I remember reading at some point that a POWER4 could track at least 5+ parallel
streams. I don't know how many the K8 handles, but it is also multiple
(forward and backward).

I don't have more data, but at least the newer Intel CPUs also seem to be
very good at prefetching, and when you look at a die photo the L/S unit
in general is quite big. More than 6 streams handled is certainly a
possibility.

I guess it could be figured out with some clever benchmarking.
>
> The code (original and corrected) uses 6 simultaneous streams.

My gut feeling is that this is not enough.

> I have a modified version that takes a `ways' parameter to use an arbitrary
> number of streams. I'll post that onto bugzilla.kernel.org.

Post it to the list please.

-Andi

2006-11-11 01:48:10

by Bela Lubkin

Subject: Re: touch_cache() only touches two thirds

Andi Kleen wrote:

Bela>> The corrected code in
Bela>> <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4> covers the
Bela>> full cache range. Granted that modern CPUs may be able to track
Bela>> multiple simultaneous cache access streams: how many such streams
Bela>> are they likely to be able to follow at once? It seems like
Bela>> going from 1 to 2 would be a big win, 2 to 3 a small win, beyond
Bela>> that it wouldn't likely make much incremental difference. So
Bela>> what do the actual implementations in the field support?

Andi> I remember reading at some point that a POWER4 could track at
Andi> least 5+ parallel streams. I don't know how many K8 handles, but
Andi> it is multiple too at least (forward and backwards)

Andi> I don't have more data, but at least the newer Intel CPUs seem to
Andi> be also very good at prefetching and when you look at a die photo
Andi> the L/S unit in general is quite big. More than 6 streams handled
Andi> is certainly a possibility.

Andi> I guess it could be figured out with some clever benchmarking.

Bela>> The code (original and corrected) uses 6 simultaneous streams.

Andi> My gut feeling is that this is not enough.

Bela>> I have a modified version that takes a `ways' parameter to
Bela>> use an arbitrary number of streams. I'll post that onto
Bela>> bugzilla.kernel.org.

Andi> Post it to the list please.

Ok, will do. I'm not really sure about list etiquette here. A discussion
is underway on <http://bugzilla.kernel.org/show_bug.cgi?id=7476>. Here
is the code I've posted there. (Slightly newer versions here.)

First is a C program -- a test harness that embeds the new touch_cache()
routine (needs memory management work to go into kernel). Then a shell
script I've been using to torture it.

The torture script tests it with cache lines of up to 66 longs, and with up
to 656-way streaming (artifacts of the shell's $RANDOM ;-)

Moved this to my home account so I have some control over the mailer...

>Bela<

=============================================================================
/*
* Test program to demonstrate touch_cache() algorithms
*
* Bela Lubkin 2006-11-10
*/

#include <stdio.h>
#include <stdlib.h>

/* Elements Per Displayline -- display parameter for self-check code */
#define EPD 64

/*
* The following definitions describe the cache line size of the machine
* architecture:
*
* cache_t, here defined as `long', is a single cache element
* LPC, Longs Per Cacheline, is the number of elements per cache line
*
* For consistency, cache_t should probably be int32_t, and only LPC
* should be varied to match various architectures.
*/

#define LPC 8
int lpc = LPC;
typedef long cache_t;

static void bar(void)
{
	int i;

	printf("+");
	for (i = 0; i < EPD + (EPD / lpc) - 1; i++)
		printf("=");
	printf("+\n");
}

static void clear(cache_t *cache, int size)
{
	int i;

	for (i = -EPD; i < size + EPD; i++)
		cache[i] = 0;
}

/*
 * show() dumps the resulting touched cache and checks it for
 * correctness.
 *
 * The `misplaced' test isn't strictly necessary, as the actual goal is
 * merely to touch each cache line (anywhere within the line). I found
 * the additional restriction useful to promote overall correctness
 * during the process of refining the touch_cache() algorithm.
 */
static void show(cache_t *cache, int size)
{
	int i;
	int missed = 0, underrun = 0, misplaced = 0, overrun = 0;

	for (i = -EPD; i < size + EPD; i++) {
		if ((i + EPD) % EPD == 0)
			printf("|");
		printf("%c", cache[i] ? '0' + cache[i] : (i < 0 || i >= size) ? '-' : '.');
		if ((i + EPD) % EPD == EPD - 1)
			printf("\n");
		else if ((i + EPD) % lpc == lpc - 1)
			printf(":");
		if (i >= 0 && i < size && (i % lpc) == 0 && cache[i] == 0)
			missed++;
		if (cache[i]) {
			if (i < 0)
				underrun += cache[i];
			if (i >= size)
				overrun += cache[i];
			if (i % lpc != 0)
				misplaced += cache[i];
		}
	}
	if ((i + EPD) % EPD != 0)
		printf("\n");
	if (missed)
		printf("ERROR: %d cache lines were missed!\n", missed);
	if (underrun)
		printf("ERROR: %d writes before beginning of buffer!\n", underrun);
	if (overrun)
		printf("ERROR: %d writes after end of buffer!\n", overrun);
	if (misplaced)
		printf("ERROR: %d writes at unexpected alignments within a cache line!\n", misplaced);
	if (missed || underrun || misplaced || overrun)
		exit(1);
}

static int *waystart;

/*
 * When putting into the kernel, use vmalloc()/vfree();
 * change error handling.
 */
static void prep_ways(int ways, int size)
{
	int way, waysize = size / ways;

	/* one extra `way' is used when `ways' is odd */
	/* (actually, only the even elements of this array are used) */
	waystart = (int *)malloc((ways + 1) * sizeof(*waystart));

	if (!waystart) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}

	for (way = 0; way < ways + 1; way++) {
		if ((way & 1) == 0)
			/* even `waystart' points to 1st line in its `way' */
			waystart[way] = way * waysize;
		else
			/* odd `waystart' points to last line in its `way' */
			waystart[way] = (way + 1) * waysize - lpc;
		/* align to next cache line */
		waystart[way] = lpc * ((waystart[way] + lpc - 1) / lpc);
	}
}

static void free_ways(void)
{
	free(waystart);
}

static void touch_cache(cache_t *cache, int ways, int size)
{
	int way, i;

	/*
	 * We access the buffer via `ways' independent 'streams' of
	 * read/write access which are interleaved; every other one
	 * is written backwards. This is supposed to keep the cache
	 * from recognizing any linear access pattern.
	 *
	 * [---> <---|---> <---|---> <---]
	 *
	 * We touch every cacheline in the buffer (32 bytes on 32-bit
	 * systems, 64 bytes on 64-bit systems; actually now `lpc *
	 * sizeof(cache_t)', could be determined at runtime).
	 */
	for (i = 0; i < size / ways; i += lpc) {
		for (way = 0; way < ways; way++) {
			if ((way & 1) == 0)
				cache[waystart[way] + i]++;
			else
				cache[waystart[way] - i]++;
		}
	}
}

int main(int argc, char *argv[])
{
	int size, ways;
	cache_t *cache;

	if (argc < 3) {
		fprintf(stderr, "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}
	size = atoi(argv[1]);
	ways = atoi(argv[2]);
	if (argc > 3)
		lpc = atoi(argv[3]);
	if (ways <= 0 || size < ways) {
		fprintf(stderr, "usage: touch_cache cache_size ways [longs-per-cacheline]\n");
		fprintf(stderr, "       cache_size >= ways\n");
		exit(1);
	}

	/*
	 * This is a bit of a shuck, papering over the fact that it's
	 * hard to get perfect 1:1 cache line coverage in an odd-sized
	 * buffer...
	 */
	if (size % (ways * lpc)) {
		printf("cache_size should be a multiple of %d * ways\n", lpc);
		printf("Raising cache_size...\n");
		size = (lpc * ways) * (1 + size / lpc / ways);
	}

	printf("size: %d, ways: %d, longs-per-cacheline: %d\n", size, ways, lpc);

	/*
	 * Allocate an extra 2*EPD cache elements, two display lines,
	 * to demonstrate not running off the ends of the buffer.
	 */
	cache = (cache_t *)malloc((EPD * 2 + size) * sizeof(*cache));
	if (!cache) {
		fprintf(stderr, "malloc failed\n");
		exit(1);
	}
	cache += EPD;

	clear(cache, size);
	prep_ways(ways, size);

	touch_cache(cache, ways, size);

	free_ways();

	bar();
	show(cache, size);
	bar();

	exit(0);
}
=============================================================================
#!/bin/bash

# Torture the touch_cache() algorithm.
#
# This produces about 24MB of output. Any "ERROR" messages indicate a
# problem; the rest should be rather boring. Run as:
#
# ./touch_cache.test.sh > touch_cache.test.out
# grep -i error touch_cache.test.out

exec 2>&1

i=0
err=0
while [ $i -lt 1000 ]; do
	let i=i+1
	let size=$RANDOM+1
	let ways=$RANDOM/50+1
	case $RANDOM in
	1[01234]???) lpc=4;;
	1[56789]???) lpc=8;;
	2[01234]???) lpc=16;;
	2[56789]???) lpc=32;;
	3????) lpc=64;;
	*) let lpc=$RANDOM/500+1;;
	esac
	if [ $ways -gt $size ]; then
		x=$ways
		ways=$size
		size=$x
	fi
	./touch_cache $size $ways $lpc || let err=err+1
done
if [ $err -gt 0 ]; then
	echo ERROR: errors above, check output
else
	echo Test completed with no errors.
fi

2006-11-18 04:18:14

by dean gaudet

Subject: RE: touch_cache() only touches two thirds

On Fri, 10 Nov 2006, Bela Lubkin wrote:

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range. Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once? It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference. So what do the actual implementations in the field
> support?

p-m family, core, core2 track one stream on each of 12 to 16 pages. in
the earlier ones they split the trackers into some forward-only and some
backward-only, but on core2 i think they're all bidirectional. if i had
to guess they round-robin the trackers, so once you hit 17 pages with
streams they're defeated.

a p4 (0f0403, probably "prescott") i have here is tracking 16 -- seems to
use LRU or pLRU but i haven't tested really, you need to get out past 32
streams before it really starts falling off... and even then the next-line
prefetch in the L2 helps too much (64-byte lines, but 128-byte tags and a
pair of dirty/state bits -- it prefetches the other half of a pair
automatically). oh it can track forward or backward, and is happy with
strides up to 128.

k8 rev F tracks one stream on each of 20 pages (forward or backward). it
also seems to use round-robin, and is defeated as soon as you have 21
streams.

i swear there was an x86 which did 28 streams, but it was a few years ago
that i last looked really closely at the prefetchers and i don't have
access to the data at the moment.

i suggest that streams are the wrong approach. i was actually considering
this same problem this week, happy to see your thread.

the approach i was considering was to set up two pointer chases:

one pointer chase covering enough cache lines (and in a prefetchable
ordering) for "scrubbing" the cache(s).

another pointer chase arranged to fill the L1 (or L2) using many many
pages. i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
then i'd allocate 512 pages and put one line on each page (pages ordered
randomly), but colour them so they fill the L1. this conveniently happens
to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB
pressure from the microbenchmark.

you can actually get away with a pointer every 256 bytes today -- none of
the prefetchers on today's x86 cores consider a 256 byte stride to be
prefetchable. for safety you might want to use 512 byte alignment... this
lets you get away with fewer pages for colouring larger caches.

the benchmark i was considering would be like so:

switch to cpu m
scrub the cache
switch to cpu n
scrub the cache
traverse the coloured list and modify each cache line as we go
switch to cpu m
start timing
traverse the coloured list without modification
stop timing

-dean

2006-11-18 04:31:35

by dean gaudet

Subject: RE: touch_cache() only touches two thirds

On Fri, 17 Nov 2006, dean gaudet wrote:

> another pointer chase arranged to fill the L1 (or L2) using many many
> pages. i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines
> then i'd allocate 512 pages and put one line on each page (pages ordered
> randomly), but colour them so they fill the L1. this conveniently happens
> to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB
> pressure from the microbenchmark.

btw, for L2-sized measurements you don't need to be so clever... you can
get away with a random traversal of the L2 on 128B boundaries. (need to
avoid the "next-line prefetch" issues on p-m/core/core2, p4 model 3 and
later.) there are just so many more pages required to map the L2 than any
reasonable prefetcher is going to track any time soon.

-dean

