2001-12-31 03:30:19

by Dave Jones

[permalink] [raw]
Subject: [patch] Prefetching file_read_actor()

After noticing file_read_actor() showing up in profiles quite
a bit, I grepped old l-k messages, and turned up a post by
Manfred Spraul in which he posted a patch using inline asm
to prefetch read data. This was x86 specific in generic code,
so was a little hackish..
Now that we have the prefetch macros, I decided to play with
this a little tonight, and came up with this patch..


diff -urN --exclude-from=/home/davej/.exclude linux-2.5.2-pre5/mm/filemap.c linux-2.5/mm/filemap.c
--- linux-2.5.2-pre5/mm/filemap.c Sun Dec 16 23:21:24 2001
+++ linux-2.5/mm/filemap.c Mon Dec 31 03:22:51 2001
@@ -1570,6 +1570,15 @@
size = count;

kaddr = kmap(page);
+
+ if (size > 128) {
+ int i;
+ for(i=0; i<size; i+=64) {
+ prefetch (kaddr+offset);
+ prefetch (kaddr+offset+(L1_CACHE_BYTES*2));
+ }
+ }
+
left = __copy_to_user(desc->buf, kaddr + offset, size);
kunmap(page);

The results are an earth shaking 5 seconds off a make dep bzImage
on my P3, I've not benched an Athlon (or other prefetch aware
architecture) yet, but I expect to see similar speed ups..

Comments ?

Dave.

--
Dave Jones. http://www.codemonkey.org.uk
SuSE Labs.


2001-12-31 05:48:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

Dave Jones wrote:
>
> After noticing file_read_actor() showing up in profiles quite
> a bit, I grepped old l-k messages, and turned up a post by
> Manfred Spraul in which he posted a patch using inline asm
> to prefetch read data. This was x86 specific in generic code,
> so was a little hackish..
> Now that we have the prefetch macros, I decided to play with
> this a little tonight, and came up with this patch..

Well I don't know diddly about the prefetch instruction, but.

> ...
> +
> + if (size > 128) {
> + int i;
> + for(i=0; i<size; i+=64) {
> + prefetch (kaddr+offset);
> + prefetch (kaddr+offset+(L1_CACHE_BYTES*2));
> + }
> + }
> +
> left = __copy_to_user(desc->buf, kaddr + offset, size);
> kunmap(page);
>

This needs to be inside ARCH_HAS_PREFETCH. Otherwise, for
architectures which do not implement prefetch, we have

for (i = 0; i < size; i += 64) {
{;}
{;}
}

and the compiler will *not* optimise this away. The compiler deliberately
leaves this construct alone because it assumes the programmer is asking for
a busywait loop.

We shouldn't add a busywait loop to file_read_actor(), yes?

<ot>
Reminds me of a version of the Microsoft C compiler back in about '85 which
"optimise" away

for ( ; ; )
;

Completely.
</ot>

Is prefetching 4k at a time optimal? Should the prefetch be embedded
in copy_*_user?

The code assumes that L1_CACHE_BYTES equates to the prefetch chunk.
Is this reasonable, or should it be abstracted out to a new PREFETCH_BYTES?

That `64' needs to be PREFETCH_BYTES * 2 or L1_CACHE_BYTES * 2, yes?

How come the code keeps prefetching the same address? Shouldn't
we be adding `i' to the address in there?

The code makes no attempt to align the prefetch address to anything.
Should it?

What happens if a prefetch spills over the end of the page and
faults?

> Comments ?

That'll do for now :)

-

2001-12-31 10:59:00

by Manfred Spraul

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

Andrew Morton wrote:
>
>
> Is prefetching 4k at a time optimal? Should the prefetch be embedded
> in copy_*_user?
>
For the Pentium III, that's the Intel recommended sequence.

>
> The code makes no attempt to align the prefetch address to anything.
> Should it?

Not needed.

> What happens if a prefetch spills over the end of the page and
> faults?
>
Doesn't hurt, prefetch instruction never cause page faults.
BUT: Prefetch doesn't preload the TLB. If the TLB entry for the kmap is
not in the TLB, all prefetch instructions are treated as nops.(on pIII).

--
Manfred

2001-12-31 12:50:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

On Mon, 31 Dec 2001, Dave Jones wrote:

> diff -urN --exclude-from=/home/davej/.exclude linux-2.5.2-pre5/mm/filemap.c linux-2.5/mm/filemap.c
> --- linux-2.5.2-pre5/mm/filemap.c Sun Dec 16 23:21:24 2001
> +++ linux-2.5/mm/filemap.c Mon Dec 31 03:22:51 2001
> @@ -1570,6 +1570,15 @@
> size = count;
>
> kaddr = kmap(page);
> +
> + if (size > 128) {
> + int i;
> + for(i=0; i<size; i+=64) {
> + prefetch (kaddr+offset);
> + prefetch (kaddr+offset+(L1_CACHE_BYTES*2));

Neat piece of deep magic, please document it ;)

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-31 12:50:46

by Alan

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

> > +
> > + if (size > 128) {
> > + int i;
> > + for(i=0; i<size; i+=64) {
> > + prefetch (kaddr+offset);
> > + prefetch (kaddr+offset+(L1_CACHE_BYTES*2));
> > + }
> > + }
> > +

Thats almost certainly wrong for most processors. It might work on the PIII
but I wouldnt trust the right results on others. Fix copy_to_user to
have a prefetching version if appropriate

2001-12-31 19:32:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

Manfred Spraul wrote:
>
> BUT: Prefetch doesn't preload the TLB. If the TLB entry for the kmap is
> not in the TLB, all prefetch instructions are treated as nops.(on pIII).

umm.. That will be the case practically 100% of the time in this
application, won't it?

Looks like you'll need to do a __get_user() against the page to
populate the tlb. We're going to need it in the copy_to_user()
anyway, so the cost is negligible.


-

2001-12-31 19:48:09

by Dave Jones

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

On Mon, 31 Dec 2001, Andrew Morton wrote:

> Looks like you'll need to do a __get_user() against the page to
> populate the tlb. We're going to need it in the copy_to_user()
> anyway, so the cost is negligible.

Completly puzzled right now. Moving the prefetching to copy_to_user
(and doing the tlb preload & prefetching the whole chunk to be copied
(or cachesize if smaller)) results in a performance drop instead of a win.

My initial guess is that some of the callers of copy_to_user are
doing something that is harmed the prefetching.
(Maybe they are doing additional prefetch() calls)

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2001-12-31 20:29:32

by Dave Jones

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

On Mon, Dec 31, 2001 at 11:58:27AM +0100, Manfred Spraul wrote:
> > What happens if a prefetch spills over the end of the page and
> > faults?

> Doesn't hurt, prefetch instruction never cause page faults.
> BUT: Prefetch doesn't preload the TLB. If the TLB entry for the kmap is
> not in the TLB, all prefetch instructions are treated as nops.(on pIII).

Wasn't this the reason for the voodoo workaround we have in
the mmx version of fast_copy_page() ?

Dave.

--
Dave Jones. http://www.codemonkey.org.uk
SuSE Labs.

2001-12-31 20:40:52

by Ryan Cumming

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

On December 31, 2001 11:47, Dave Jones wrote:
> Completly puzzled right now. Moving the prefetching to copy_to_user
> (and doing the tlb preload & prefetching the whole chunk to be copied
> (or cachesize if smaller)) results in a performance drop instead of a win.
>
> My initial guess is that some of the callers of copy_to_user are
> doing something that is harmed the prefetching.
> (Maybe they are doing additional prefetch() calls)

Maybe syscalls that only have to move a very small chunk of data
(gettimeofday(2), for instance), are hurt because of the wasted bytes they
are prefetching after the intended data? Also, for sizes greater than 512,
copy_to_user will call mmx_copy_user, which might call mmx_memcpy, which does
prefetching already on x86 CPUs that support it.

-Ryan

2002-01-01 10:19:45

by Alan

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

> My initial guess is that some of the callers of copy_to_user are
> doing something that is harmed the prefetching.
> (Maybe they are doing additional prefetch() calls)

Most callers are dealing with cached data, and unless you checked the
length in advance to make a sensible decision it will just make things ugly.
If you aren't copying 512 bytes+ its probably not worth it.

2002-01-01 10:20:27

by Alan

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

> Looks like you'll need to do a __get_user() against the page to
> populate the tlb. We're going to need it in the copy_to_user()
> anyway, so the cost is negligible.

Please bury such things in arch/cpu specific routines. Most non intel
processor hardware isnt that broken.

2002-01-01 11:13:23

by Anton Blanchard

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()


> Please bury such things in arch/cpu specific routines. Most non intel
> processor hardware isnt that broken.

Exactly, for example ppc32 and ppc64 implement the prefetch macros so we
cant put intel specific prefetch hacks in generic code.

The correct place to do it is in the copy_*_user routines where you can
make decisions based on length etc. Davem already does this for sparc64
and Paulus is working on it for ppc32/64.

I just grepped around for current prefetch usage and noticed we now
use it in the list macros. Converting non struct list code (eg pagecache
hash) to use for_each_list etc might give some benefits.

Perhaps for_each_task could benefit from the same prefetching.

Anton

2002-01-01 12:45:42

by Dave Jones

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

On Tue, 1 Jan 2002, Alan Cox wrote:

> > My initial guess is that some of the callers of copy_to_user are
> > doing something that is harmed the prefetching.
> > (Maybe they are doing additional prefetch() calls)
>
> Most callers are dealing with cached data, and unless you checked the
> length in advance to make a sensible decision it will just make things ugly.
> If you aren't copying 512 bytes+ its probably not worth it.

It didn't make an iota of difference what sizes I played with
(and I tried quite a few different ones)

I've come up with a preload_cache() function (for want of a better name)
and plonked that in prefetch.h for now. Having this #define to nothing
on other arch's would mean we could use this in places like we currently
do prefetch(). I'm still not happy with it, but its cleaner than my
original hack.

It's in -dj10 for the curious.

Dave.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-01-01 15:16:38

by Alan

[permalink] [raw]
Subject: Re: [patch] Prefetching file_read_actor()

> I've come up with a preload_cache() function (for want of a better name)
> and plonked that in prefetch.h for now. Having this #define to nothing
> on other arch's would mean we could use this in places like we currently
> do prefetch(). I'm still not happy with it, but its cleaner than my
> original hack.

copy_*_user_prefetched() is probably the one you need. Then you can prefetch
first for PIII, prefetch as copying for athlon etc. It will also provide
the needed hints about which copies are wins.