2009-03-01 14:07:48

by Nick Piggin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Sunday 01 March 2009 12:40:51 H. Peter Anvin wrote:
> Arjan van de Ven wrote:
> > the reason that movntq and co are faster is because you avoid the
> > write-allocate behavior of the caches....
> >
> > the cache polluting part of it I find hard to buy for general use (as
> > this discussion shows)... that will be extremely hard to measure as
> > a real huge thing, while the WA part is like a 1.5x to 2x thing.
>
> Note that hardware *can* (which is not the same thing as hardware
> *will*) elide the write-allocate behavior. We did that at Transmeta for
> rep movs and certain other instructions which provably filled in entire
> cache lines. I haven't investigated if newer Intel CPUs do that in the
> "fast rep movs" case.

I would expect any high performance CPU these days to combine entries
in the store queue, even for normal store instructions (especially for
linear memcpy patterns). Isn't this likely to be the case?


2009-03-02 04:50:51

by H. Peter Anvin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

Nick Piggin wrote:
>
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?
>

Actually, that is often not the case simply because it doesn't buy that
much. The big win comes when you don't read a whole cache line in from
memory, but that is a property of the cache, not the store queue.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-02 06:19:03

by Nick Piggin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Monday 02 March 2009 15:46:03 H. Peter Anvin wrote:
> Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> Actually, that is often not the case simply because it doesn't buy that
> much. The big win comes when you don't read a whole cache line in from
> memory, but that is a property of the cache, not the store queue.

Hm, maybe I'm confused. As I understood it, you could avoid the
RMW write-allocate behaviour either by bypassing the cache on a store
miss, or by combining stores into full-cacheline blocks before they
leave the store queue.
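
To make the second option concrete, here is a minimal user-space
sketch (x86-64, GCC inline asm; the function name is invented, and a
64-byte line and an aligned destination are assumed) of a
full-cacheline non-temporal store, the case where the hardware can
skip the read-for-ownership entirely:

static inline void nt_store_line(unsigned long *line, unsigned long val)
{
        int i;

        /* movnti bypasses the cache entirely, and writing all eight
           8-byte words of a 64-byte line lets the write-combining
           buffer flush it as one full-line burst, with no read */
        for (i = 0; i < 8; i++)
                asm volatile("movnti %1, %0" : "=m" (line[i]) : "r" (val));

        /* movnti stores are weakly ordered; fence before the data
           may be observed by anyone else */
        asm volatile("sfence" ::: "memory");
}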

2009-03-02 21:17:51

by Linus Torvalds

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()



On Mon, 2 Mar 2009, Nick Piggin wrote:
>
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?

None of this really matters.

The big issue is that before you can do any write to any cacheline, if the
memory is cacheable, it needs the cache coherency protocol to synchronize
with any other CPU's that may have that line in the cache.

The _only_ time a write is "free" is when you already have that cacheline
in your own cache, and in an "exclusive" state. If that is the case, then
you know that you don't need to do anything else.

In _any_ other case, before you do the write, you need to make sure that
no other CPU in the system has that line in its cache. Whether you do that
with a "write and invalidate" model (which would be how a store buffer
would do it or a write-through cache would work), or whether you do it
with a "acquire exclusive cacheline" (which is how the cache coherency
protocol would do it), it's going to end up using cache coherency
bandwidth.

Of course, what will be the limiting factor is unclear. On a single-socket
thing, you don't have any cache coherency issues, and the only bandwidth
you'd end up using is the actual memory write at the memory controller
(which may be on-die, and entirely separate from the cache coherency
protocol). It may be idle and the write queue may be deep enough that you
reach memory speeds and the write buffer is the optimal approach.

On many sockets, the limiting factor will almost certainly be the cache
coherency overhead (since the cache coherency traffic needs to go to _all_
sockets, rather than just one stream to memory), at least unless you have
a good cache coherency filter that can filter out part of the traffic
based on whether it could be cached or not on some socket(s).

IOW, it's almost impossible to tell what is the best approach. It will
depend on number of sockets, it will depend on size of cache, and it will
depend on the capabilities and performance of the memory controllers vs
the cache coherency protocol.

On a "single shared bus" model, the "write with invalidate" is fine, and
it basically ends up working a lot like a single socket even if you
actually have multiple sockets - it just won't scale much beyond two
sockets. With HT or QPI, things are different, and the presence or absence
of a snoop filter could make a big difference for 4+ socket situations.

There simply is no single answer.

And we really should keep that in mind. There is no right answer, and the
right answer will depend on hardware. Playing cache games in software is
almost always futile. It can be a huge improvement, but it can be a huge
deprovement too, and it really tends to be only worth it if you (a) know
your hardware really quite well and (b) know your _load_ pretty well too.

We can play games in the kernel. We do know how many sockets there are. We
do know the cache size. We _could_ try to make an educated guess at
whether the next user of the data will be DMA or not. So there are
unquestionably heuristics we could apply, but I also do suspect that
they'd inevitably be pretty arbitrary.

I suspect that we could make some boot-time (or maybe CPU hotplug time)
decision that simply just sets a threshold value for when it is worth
using non-temporal stores. With smaller caches, and with a single socket
(or a single bus), it likely makes sense to use non-temporal stores
earlier.

But even with some rough heuristic, it will be wrong part of the time. So
I think "simple and predictable" in the end tends to be better than
"complex and still known to be broken".

Btw, the "simple and predictable" could literally look at _where_ in the
file the IO is. Because I know there are papers on the likelihood of
re-use of data depending on where in the file it is written. Data written
to low offsets is more likely to be accessed again (think temp-files),
while data written to big offsets are much more likely to be random or to
be written out (think databases or simply just large streaming files).

So I suspect a "simple and predictable" algorithm could literally be
something like

- use nontemporal stores only if you are writing a whole page, and the
  page's byte offset within the file is larger than 'x', where 'x' may
  optionally even depend on the size of the cache.
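
As an illustration only (nothing in this thread proposes these exact
names or numbers), such a heuristic might look like the sketch below;
use_nontemporal(), nt_offset_threshold and the scaling factor are all
invented placeholders:

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL                /* assumed x86 page size */

static size_t nt_offset_threshold;      /* picked once at boot time */

static void nt_threshold_init(size_t llc_bytes)
{
        /* invented scaling: only go nontemporal once the write is far
           enough into the file that the cache could not have held all
           of it anyway */
        nt_offset_threshold = 2 * llc_bytes;
}

static bool use_nontemporal(unsigned long long file_offset, size_t len)
{
        /* whole aligned page, beyond the cache-size-derived 'x' */
        return len == PAGE_SIZE &&
               (file_offset % PAGE_SIZE) == 0 &&
               file_offset > nt_offset_threshold;
}

How to pick llc_bytes and the factor of two is, of course, exactly the
arbitrary part.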

But removing it entirely may be fine too.

What I _don't_ think is fine is to think that you've "solved" it, or that
you should even try!

Linus

2009-03-02 21:26:16

by Ingo Molnar

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()


* Linus Torvalds <[email protected]> wrote:

> We can play games in the kernel. We do know how many sockets
> there are. We do know the cache size. We _could_ try to make
> an educated guess at whether the next user of the data will be
> DMA or not. So there are unquestionably heuristics we could
> apply, but I also do suspect that they'd inevitably be pretty
> arbitrary.

There's a higher-level meta-code argument to consider as well,
and I find it equally important: finding this rather obvious and
easy-to-measure performance regression caused by the
introduction of MOVNT took us two years.

That's _way_ too long - and adding more heuristics and more
complexity will just increase this latency. (as we will create
small niches of special cases with special workloads where we
might or might not regress)

So, if any such change is done, we can only do it if we have the
latency of performance-regression-finding on the order of days
or at most weeks - not years.

Ingo

2009-03-03 04:21:48

by Nick Piggin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> On Mon, 2 Mar 2009, Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> None of this really matters.

Well, that's just what I was replying to. Of course nontemporal/uncached
stores can't avoid cc operations either, but somebody was hoping that
they would avoid the write-allocate / RMW behaviour. I just replied because
I think that modern CPUs can combine stores in their store queues to get
the same result for cacheable stores.

Of course it doesn't make it free, especially if it is a cc protocol that
has to go on the interconnect anyway. But avoiding the RAM read is a
good thing regardless.


> The big issue is that before you can do any write to any cacheline, if the
> memory is cacheable, it needs the cache coherency protocol to synchronize
> with any other CPU's that may have that line in the cache.
>
> The _only_ time a write is "free" is when you already have that cacheline
> in your own cache, and in an "exclusive" state. If that is the case, then
> you know that you don't need to do anything else.
>
> In _any_ other case, before you do the write, you need to make sure that
> no other CPU in the system has that line in its cache. Whether you do that
> with a "write and invalidate" model (which would be how a store buffer
> would do it or a write-through cache would work), or whether you do it
> with a "acquire exclusive cacheline" (which is how the cache coherency
> protocol would do it), it's going to end up using cache coherency
> bandwidth.
>
> Of course, what will be the limiting factor is unclear. On a single-socket
> thing, you don't have any cache coherency issues, and the only bandwidth
> you'd end up using is the actual memory write at the memory controller
> (which may be on-die, and entirely separate from the cache coherency
> protocol). It may be idle and the write queue may be deep enough that you
> reach memory speeds and the write buffer is the optimal approach.
>
> On many sockets, the limiting factor will almost certainly be the cache
> coherency overhead (since the cache coherency traffic needs to go to _all_
> sockets, rather than just one stream to memory), at least unless you have
> a good cache coherency filter that can filter out part of the traffic
> based on whether it could be cached or not on some socket(s).
>
> IOW, it's almost impossible to tell what is the best approach. It will
> depend on number of sockets, it will depend on size of cache, and it will
> depend on the capabilities and performance of the memory controllers vs
> the cache coherency protocol.
>
> On a "single shared bus" model, the "write with invalidate" is fine, and
> it basically ends up working a lot like a single socket even if you
> actually have multiple sockets - it just won't scale much beyond two
> sockets. With HT or QPI, things are different, and the presence or absence
> of a snoop filter could make a big difference for 4+ socket situations.
>
> There simply is no single answer.
>
> And we really should keep that in mind. There is no right answer, and the
> right answer will depend on hardware. Playing cache games in software is
> almost always futile. It can be a huge improvement, but it can be a huge
> deprovement too, and it really tends to be only worth it if you (a) know
> your hardware really quite well and (b) know your _load_ pretty well too.
>
> We can play games in the kernel. We do know how many sockets there are. We
> do know the cache size. We _could_ try to make an educated guess at
> whether the next user of the data will be DMA or not. So there are
> unquestionably heuristics we could apply, but I also do suspect that
> they'd inevitably be pretty arbitrary.
>
> I suspect that we could make some boot-time (or maybe CPU hotplug time)
> decision that simply just sets a threshold value for when it is worth
> using non-temporal stores. With smaller caches, and with a single socket
> (or a single bus), it likely makes sense to use non-temporal stores
> earlier.
>
> But even with some rough heuristic, it will be wrong part of the time. So
> I think "simple and predictable" in the end tends to be better than
> "complex and still known to be broken".
>
> Btw, the "simple and predictable" could literally look at _where_ in the
> file the IO is. Because I know there are papers on the likelihood of
> re-use of data depending on where in the file it is written. Data written
> to low offsets is more likely to be accessed again (think temp-files),
> while data written to big offsets are much more likely to be random or to
> be written out (think databases or simply just large streaming files).
>
> So I suspect a "simple and predictable" algorithm could literally be
> something like
>
> - use nontemporal stores only if you are writing a whole page, and the
>   page's byte offset within the file is larger than 'x', where 'x' may
>   optionally even depend on the size of the cache.
>
> But removing it entirely may be fine too.
>
> What I _don't_ think is fine is to think that you've "solved" it, or that
> you should even try!

Right. I don't know if you misunderstood me or aimed this post at the
general discussion rather than my reply specifically.

I know that even if a CPU does write combining in the store buffer, and
even if it has "big-hammer" nontemporal stores like x86 apparently does,
there are still cases where nontemporal stores will win if the data
doesn't get used by the CPU again.

I agree that if a heuristic can't get it right a *significant* amount of
the time, then it is not worthwhile. Even if it gets it right a little more
often than wrong, the unpredictability is a negative factor. I agree
completely with you there :)

I would like to remove it, as in Ingo's last patch, FWIW. But I can see
obviously there are cases where nontemporal helps, so there will never be
a "right" answer.

2009-03-03 04:30:54

by Nick Piggin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Tuesday 03 March 2009 08:25:43 Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
> > We can play games in the kernel. We do know how many sockets
> > there are. We do know the cache size. We _could_ try to make
> > an educated guess at whether the next user of the data will be
> > DMA or not. So there are unquestionably heuristics we could
> > apply, but I also do suspect that they'd inevitably be pretty
> > arbitrary.
>
> There's a higher-level meta-code argument to consider as well,
> and I find it equally important: finding this rather obvious and
> easy-to-measure performance regression caused by the
> introduction of MOVNT took us two years.
>
> That's _way_ too long - and adding more heuristics and more
> complexity will just increase this latency. (as we will create
> small niches of special cases with special workloads where we
> might or might not regress)
>
> So, if any such change is done, we can only do it if we have the
> latency of performance-regression-finding on the order of days
> or at most weeks - not years.

Something like temporal vs nontemporal stores into the pagecache
is a fundamental tradeoff that depends on userspace access
patterns that the kernel can't know about. For the kernel side of
the equation, we could query the underlying backing store to see
whether it is going to do any CPU operations on the data before
writeout, e.g. checksumming or encryption, but even then the
latency between dirtying the pagecache and initiating the writeout
means the data has probably left the cache by the time of
writeout anyway. So the primary factor really is userspace access
patterns, I think.

In situations like that, I think the only way to really "solve"
the problem is to provide a way for userspace to ask for temporal
or nontemporal access. Yeah, this is just passing the buck, and
probably most apps that try to use it will do no better a job than
the kernel :) But it allows the ones that really care to give the
kernel much better information.
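
For what it's worth, the closest existing interface is
posix_fadvise(). A hedged sketch of how a streaming writer might pass
that kind of hint today is below; the helper name is invented, and
how much any given kernel actually does with POSIX_FADV_NOREUSE is a
separate question (it was long a no-op on Linux):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Invented helper: write data the application promises not to read
   back, then hint that the kernel need not keep it cached.  The
   posix_fadvise() call and the flag are real; whether the kernel
   acts on the hint is another matter. */
static ssize_t write_streaming(int fd, const void *buf, size_t len,
                               off_t off)
{
        ssize_t ret = pwrite(fd, buf, len, off);

        if (ret > 0)
                posix_fadvise(fd, off, ret, POSIX_FADV_NOREUSE);
        return ret;
}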

2009-03-03 09:03:36

by Ingo Molnar

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()


* Nick Piggin <[email protected]> wrote:

> On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> > On Mon, 2 Mar 2009, Nick Piggin wrote:
> > > I would expect any high performance CPU these days to combine entries
> > > in the store queue, even for normal store instructions (especially for
> > > linear memcpy patterns). Isn't this likely to be the case?
> >
> > None of this really matters.
>
> Well, that's just what I was replying to. Of course
> nontemporal/uncached stores can't avoid cc operations either,
> but somebody was hoping that they would avoid the
> write-allocate / RMW behaviour. I just replied because I think
> that modern CPUs can combine stores in their store queues to
> get the same result for cacheable stores.
>
> Of course it doesn't make it free, especially if it is a cc
> protocol that has to go on the interconnect anyway. But
> avoiding the RAM read is a good thing regardless.

Hm, why do you assume that there is a RAM read? A sufficiently
advanced x86 CPU will have good string moves with full cacheline
transfers - removing partial cachelines and removing the need
for the physical read.

The cacheline still has to be flushed/queried/transferred across
the cc domain according to the cc protocol in use, to make sure
there's no stale cached data elsewhere, but that is not a RAM
read and in the common case (when the address is not present in
any cache) it can be quite cheap.

The only cost is the dirty cacheline that is left around, which
increases the flush-out pressure on the cache. (The CPU might be
smart about this detail too, so in practice a lot of
write-allocates might not cause that much trouble.)

Ingo

2009-03-03 16:38:17

by Nick Piggin

Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Tuesday 03 March 2009 20:02:52 Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> > > On Mon, 2 Mar 2009, Nick Piggin wrote:
> > > > I would expect any high performance CPU these days to combine entries
> > > > in the store queue, even for normal store instructions (especially
> > > > for linear memcpy patterns). Isn't this likely to be the case?
> > >
> > > None of this really matters.
> >
> > Well, that's just what I was replying to. Of course
> > nontemporal/uncached stores can't avoid cc operations either,
> > but somebody was hoping that they would avoid the
> > write-allocate / RMW behaviour. I just replied because I think
> > that modern CPUs can combine stores in their store queues to
> > get the same result for cacheable stores.
> >
> > Of course it doesn't make it free, especially if it is a cc
> > protocol that has to go on the interconnect anyway. But
> > avoiding the RAM read is a good thing regardless.
>
> Hm, why do you assume that there is a RAM read?

I don't ;) Re-read back a few posts. I thought that nontemporal stores
would not necessarily have an advantage in avoiding write-allocate
behaviour, because I thought CPUs should combine stores in their store
buffer.

Some simple tests show that a nontemporal store takes about 0.7x the
time of a rep stosq here, if the destination is much larger than the
cache. So the CPU isn't quite as clever as I assumed.
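
Roughly the kind of comparison being described, as a hedged
user-space sketch (x86-64 only; the buffer size, fill value and
helper names are arbitrary, and the ratio will of course vary by
machine). Build with gcc -O2; older glibc also wants -lrt for
clock_gettime():

#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (256UL << 20)         /* well beyond any cache */

static void fill_rep_stosq(uint64_t *dst, uint64_t val, size_t qwords)
{
        /* plain cacheable stores via the string-store machinery */
        asm volatile("rep stosq"
                     : "+D" (dst), "+c" (qwords)
                     : "a" (val)
                     : "memory");
}

static void fill_movnti(uint64_t *dst, uint64_t val, size_t qwords)
{
        size_t i;

        /* nontemporal stores: no write-allocate of the destination */
        for (i = 0; i < qwords; i++)
                asm volatile("movnti %1, %0" : "=m" (dst[i]) : "r" (val));
        asm volatile("sfence" ::: "memory");
}

static double timed(void (*fill)(uint64_t *, uint64_t, size_t),
                    uint64_t *buf)
{
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        fill(buf, 0x0102030405060708ULL, BUF_BYTES / 8);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        uint64_t *buf;

        if (posix_memalign((void **)&buf, 64, BUF_BYTES))
                return 1;
        fill_rep_stosq(buf, 0, BUF_BYTES / 8);  /* fault pages in first */
        printf("rep stosq: %.3fs\n", timed(fill_rep_stosq, buf));
        printf("movnti:    %.3fs\n", timed(fill_movnti, buf));
        free(buf);
        return 0;
}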

I can't find any references to back up my assumption, but I thought I
heard it somewhere. It might have been in relation to some powerpc CPUs
not requiring their cacheline-clear instruction (dcbz) because they
combine store buffer entries. But I could be way off.


> A sufficiently
> advanced x86 CPU will have good string moves with full cacheline
> transfers - removing partial cachelines and removing the need
> for the physical read.

I thought this would be the case even with a plain sequence of normal
stores. But that's taking about 1.4x the time of rep stosq, so again
maybe I overestimated. I don't know.