2002-09-20 01:06:33

by Andrew Morton

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

Andi Kleen wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > Hirokazu Takahashi wrote:
> > >
> > > ...
> > > > It needs redoing. These differences are really big, and this
> > > > is the kernel's most expensive function.
> > > >
> > > > A little project for someone.
> > >
> > > OK, if there is nobody who wants to do it I'll do it by myself.
> >
> > That would be fantastic - thanks. This is more a measurement
> > and testing exercise than a coding one. And if those measurements
> > are sufficiently nice (eg: >5%) then a 2.4 backport should be done.
>
> Very interesting IMHO would be to find a heuristic to switch between
> a write combining copy and a cache hot copy. Write combining is good
> for blasting huge amounts of data quickly without killing your caches.
> Cache hot is good for everything else.

I expect that caching userspace and not pagecache would be
a reasonable choice.

> But it'll need hints from the higher level code. e.g. read and write
> could turn on write combining for bigger writes (let's say >8K)
> I discovered that just unconditionally turning it on for all copies
> is not good because it forces data out of cache. But I still have hope
> that it helps for selected copies.

Well if it's a really big read then bypassing the CPU cache on
the userspace-side buffer would make sense.

Can you control the cachability of the memory reads as well?

What restrictions are there on these instructions? Would
they force us to bear the cost of the alignment problem?


2002-09-20 01:18:45

by Andi Kleen

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
> > Very interesting IMHO would be to find a heuristic to switch between
> > a write combining copy and a cache hot copy. Write combining is good
> > for blasting huge amounts of data quickly without killing your caches.
> > Cache hot is good for everything else.
>
> I expect that caching userspace and not pagecache would be
> a reasonable choice.

Normally yes, but not always. e.g. for squid you don't really want to
cache user space.

But I guess it would be a reasonable heuristic. Or at least worth a try :-)

>
> > But it'll need hints from the higher level code. e.g. read and write
> > could turn on write combining for bigger writes (let's say >8K)
> > I discovered that just unconditionally turning it on for all copies
> > is not good because it forces data out of cache. But I still have hope
> > that it helps for selected copies.
>
> Well if it's a really big read then bypassing the CPU cache on
> the userspace-side buffer would make sense.
>
> Can you control the cachability of the memory reads as well?

SSE2 has hints for that (prefetchnta and even prefetcht0,1 etc. for different
cache hierarchies), but it's not completely clear how much
the CPUs follow these.

For writing it's much more obvious and usually documented even.

>
> What restrictions are there on these instructions? Would
> they force us to bear the cost of the alignment problem?

They should be aligned, otherwise it makes no sense. When you assume it's
more likely that either the source or the destination is unaligned then you
can easily align one of them. Trick is to choose the right one,
it varies by call site.
(these are for big copies so a small alignment function is lost in the noise)

x86-64 copy_*_user currently aligns the destination, but hardcoding that
is a bit dumb and I'm not completely happy with it.


-Andi
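
A rough sketch of the heuristic discussed above, under stated assumptions: copies larger than a cutoff use non-temporal (write combining) stores with the destination aligned first, and everything else uses an ordinary cached copy. The names `copy_with_hint`, `copy_nontemporal` and `NT_COPY_THRESHOLD` are made up for illustration, the 8K cutoff is just the figure floated in the thread, and this is userspace C with SSE2 intrinsics, not the kernel's copy_*_user.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cutoff, taken from the ">8K" suggestion; needs measuring. */
#define NT_COPY_THRESHOLD 8192

/* Hypothetical helper: copy using non-temporal stores where available. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Align the destination to 16 bytes: movntdq requires an aligned store. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* The source may still be unaligned, so load with an unaligned load. */
    for (; len >= 16; d += 16, s += 16, len -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Streaming stores are weakly ordered; fence before later stores. */
    _mm_sfence();

    /* Copy whatever is left (fewer than 16 bytes). */
    memcpy(d, s, len);
#else
    /* No SSE2 at compile time: fall back to a plain cached copy. */
    memcpy(dst, src, len);
#endif
}

/* Pick the copy flavour from the size alone (the simplest possible hint). */
void copy_with_hint(void *dst, const void *src, size_t len)
{
    if (len > NT_COPY_THRESHOLD)
        copy_nontemporal(dst, src, len);
    else
        memcpy(dst, src, len);
}
```

This hardcodes "align the destination", which is exactly the choice described above as not always right; a real version would take the hint from the call site.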

2002-09-20 01:32:19

by David Miller

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

   From: Andi Kleen <[email protected]>
   Date: Fri, 20 Sep 2002 03:23:46 +0200

   On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
   > Can you control the cachability of the memory reads as well?

   SSE2 has hints for that (prefetchnta and even prefetcht0,1 etc. for different
   cache hierarchies), but it's not completely clear how much
   the CPUs follow these.

   For writing it's much more obvious and usually documented even.

See "movntdq/movnti", the latter of which even works on integer
registers. Ben LaHaise pointed this out to me earlier today.
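
For illustration only, a minimal sketch of a cache-bypassing store from an ordinary integer register. It assumes userspace C where the SSE2 intrinsic `_mm_stream_si32` (which compilers emit as movnti) is available; the function name `copy_words_movnti` is hypothetical.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n ints, storing each one with movnti. */
void copy_words_movnti(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        /* Non-temporal store straight from a general-purpose register. */
        _mm_stream_si32(&dst[i], src[i]);
#else
        /* No SSE2 at compile time: ordinary cached store. */
        dst[i] = src[i];
#endif
    }
#if defined(__SSE2__)
    /* Make the weakly ordered stores visible before anything that follows. */
    _mm_sfence();
#endif
}
```

No FPU or XMM state is touched here, which is the attraction of the integer form.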

2002-09-20 02:01:20

by Andi Kleen

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

> See "movntdq/movnti", the latter of which even works on integer
> registers. Ben LaHaise pointed this out to me earlier today.

The issue is that you really want to do prefetching in these loops
(waiting for the hardware prefetch is too slow because it needs several
cache misses to trigger) so for cache hints on reading only prefetch
instructions are interesting.

-Andi
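
A small sketch of what software prefetching in the copy loop could look like, with a cache hint on the read side. Assumptions: userspace C with the SSE intrinsic `_mm_prefetch`, a 64-byte line, and a prefetch distance of 256 bytes; `copy_with_prefetch`, `CACHE_LINE` and `PREFETCH_AHEAD` are hypothetical names, and both constants would need measuring per CPU.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

#define CACHE_LINE     64   /* assumed cache line size */
#define PREFETCH_AHEAD 256  /* assumed prefetch distance, needs tuning */

/* Hypothetical helper: copy line by line, prefetching the source ahead. */
void copy_with_prefetch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (len - off >= CACHE_LINE) {
#if defined(__SSE__)
        /* Only prefetch addresses that are still inside the source buffer. */
        if (len - off > PREFETCH_AHEAD)
            _mm_prefetch((const char *)(s + off + PREFETCH_AHEAD),
                         _MM_HINT_NTA);
#endif
        memcpy(d + off, s + off, CACHE_LINE);
        off += CACHE_LINE;
    }

    /* Tail shorter than one line. */
    memcpy(d + off, s + off, len - off);
}
```

`_MM_HINT_NTA` asks for the non-temporal flavour (prefetchnta); `_MM_HINT_T0` and friends select the other cache levels. As noted above, how closely a given CPU follows the hint is not guaranteed.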

2002-09-20 02:06:30

by David Miller

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

   From: Andi Kleen <[email protected]>
   Date: Fri, 20 Sep 2002 04:06:19 +0200

   > See "movntdq/movnti", the latter of which even works on integer
   > registers. Ben LaHaise pointed this out to me earlier today.

   The issue is that you really want to do prefetching in these loops
   (waiting for the hardware prefetch is too slow because it needs several
   cache misses to trigger) so for cache hints on reading only prefetch
   instructions are interesting.

I'm talking about using this to bypass the cache on the stores.
The prefetches are a separate issue and I agree with you on that.

2002-09-20 02:23:16

by Andi Kleen

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, Sep 19, 2002 at 07:01:54PM -0700, David S. Miller wrote:
>    From: Andi Kleen <[email protected]>
>    Date: Fri, 20 Sep 2002 04:06:19 +0200
>
>    > See "movntdq/movnti", the latter of which even works on integer
>    > registers. Ben LaHaise pointed this out to me earlier today.
>
>    The issue is that you really want to do prefetching in these loops
>    (waiting for the hardware prefetch is too slow because it needs several
>    cache misses to trigger) so for cache hints on reading only prefetch
>    instructions are interesting.
>
> I'm talking about using this to bypass the cache on the stores.
> The prefetches are a separate issue and I agree with you on that.

I was talking generally. You cannot really use these instructions on Athlon,
because they're microcoded and slow or do not exist. On Athlon it needs
3dnow write combining functions (adding FPU overhead so may not be worth
it). On P3/P4 you can use movnti/movntdq yes.

Just doing it for reads is more tricky/dubious.

-Andi

2002-09-20 02:25:21

by David Miller

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

   From: Andi Kleen <[email protected]>
   Date: Fri, 20 Sep 2002 04:28:19 +0200

   You cannot really use these instructions on Athlon,

I know that Athlon lacks these instructions, they are p4 sse2
only.

2002-09-20 02:30:45

by Andi Kleen

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, Sep 19, 2002 at 07:20:48PM -0700, David S. Miller wrote:
>    From: Andi Kleen <[email protected]>
>    Date: Fri, 20 Sep 2002 04:28:19 +0200
>
>    You cannot really use these instructions on Athlon,
>
> I know that Athlon lacks these instructions, they are p4 sse2
> only.

AFAIK it is an SSE1 feature.

Athlon actually has movnti in newer models, just you do not really want to
use it.

-Andi
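
The point that an instruction can exist and still not be worth using suggests a two-part check, sketched here under assumptions: a feature test plus a separate per-model "is it slow here" flag. `cpu_has_sse2`, `pick_copy_strategy` and the `nt_stores_slow` flag are hypothetical; the feature test relies on the GCC/Clang builtin `__builtin_cpu_supports`, and how the per-model flag would be filled in is left open.

```c
enum copy_strategy { COPY_CACHED, COPY_STREAMING };

/* Returns 1 if the running CPU reports SSE2, 0 otherwise or if unknown. */
int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    /* Not an x86 GCC/Clang build: report "no" rather than guess. */
    return 0;
#endif
}

/*
 * nt_stores_slow is a hypothetical per-model flag, set for CPUs where the
 * non-temporal store exists but is microcoded or otherwise slow.
 */
enum copy_strategy pick_copy_strategy(int has_sse2, int nt_stores_slow)
{
    if (has_sse2 && !nt_stores_slow)
        return COPY_STREAMING;
    return COPY_CACHED;
}
```

A caller would evaluate `pick_copy_strategy(cpu_has_sse2(), nt_stores_slow)` once at startup and keep the result, rather than testing on every copy.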