From: Nick Piggin
To: Linus Torvalds
Cc: "H. Peter Anvin", Arjan van de Ven, Andi Kleen, David Miller, mingo@elte.hu, sqazi@google.com, linux-kernel@vger.kernel.org, tglx@linutronix.de
Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
Date: Tue, 3 Mar 2009 15:20:59 +1100
User-Agent: KMail/1.9.51 (KDE/4.0.4; ; )
References: <200903020106.51865.nickpiggin@yahoo.com.au>
Message-Id: <200903031521.00217.nickpiggin@yahoo.com.au>

On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> On Mon, 2 Mar 2009, Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> None of this really matters.

Well, that is just what I was replying to. Of course nontemporal/uncached
stores can't avoid cache coherency operations either, but somebody was
hoping that they would avoid the write-allocate / RMW behaviour. I replied
because I think modern CPUs can combine stores in their store queues and
get the same result with plain cacheable stores. That doesn't make the
stores free, especially when the coherency traffic has to go out on the
interconnect anyway, but avoiding the read from RAM is a win regardless.
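To illustrate what I mean by the non-temporal path, here is a rough
userspace sketch, assuming SSE2 and 16-byte-aligned buffers whose length
is a multiple of 16. It is the flavour of thing __copy_user_nocache does,
not its actual code:

#include <emmintrin.h>
#include <stddef.h>

/*
 * Rough illustration of a non-temporal copy: movntdq stores go through
 * the write-combining buffers straight to memory, so the destination
 * lines are never read into the cache first (no write-allocate / RMW).
 * Assumes SSE2, 16-byte-aligned dst/src, and len a multiple of 16.
 */
void copy_nocache(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	while (len >= 16) {
		_mm_stream_si128(d++, _mm_load_si128(s++));
		len -= 16;
	}
	_mm_sfence();	/* order the NT stores before later accesses */
}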
> The big issue is that before you can do any write to any cacheline,
> if the memory is cacheable, it needs the cache coherency protocol to
> synchronize with any other CPUs that may have that line in their
> caches.
>
> The _only_ time a write is "free" is when you already have that
> cacheline in your own cache, and in an "exclusive" state. If that is
> the case, then you know that you don't need to do anything else.
>
> In _any_ other case, before you do the write, you need to make sure
> that no other CPU in the system has that line in its cache. Whether
> you do that with a "write and invalidate" model (which is how a store
> buffer or a write-through cache would do it), or with an "acquire
> exclusive cacheline" model (which is how the cache coherency protocol
> would do it), it's going to end up using cache coherency bandwidth.
>
> Of course, what will be the limiting factor is unclear. On a
> single-socket thing, you don't have any cache coherency issues, and
> the only bandwidth you'd end up using is the actual memory write at
> the memory controller (which may be on-die, and entirely separate
> from the cache coherency protocol). It may be idle, and the write
> queue may be deep enough that you reach memory speeds, so the write
> buffer is the optimal approach.
>
> On many sockets, the limiting factor will almost certainly be the
> cache coherency overhead (since the cache coherency traffic needs to
> go to _all_ sockets, rather than just one stream to memory), at least
> unless you have a good cache coherency filter that can filter out
> part of the traffic based on whether it could be cached or not on
> some socket(s).
>
> IOW, it's almost impossible to tell which approach is best. It will
> depend on the number of sockets, on the size of the cache, and on the
> capabilities and performance of the memory controllers vs the cache
> coherency protocol.
>
> On a "single shared bus" model, the "write with invalidate" is fine,
> and it basically ends up working a lot like a single socket even if
> you actually have multiple sockets - it just won't scale much beyond
> two sockets. With HT or QPI, things are different, and the presence
> or absence of a snoop filter could make a big difference for 4+
> socket situations.
>
> There simply is no single answer.
>
> And we really should keep that in mind. There is no right answer, and
> the right answer will depend on hardware. Playing cache games in
> software is almost always futile. It can be a huge improvement, but
> it can be a huge detriment too, and it really tends to be worth it
> only if you (a) know your hardware really quite well and (b) know
> your _load_ pretty well too.
>
> We can play games in the kernel. We do know how many sockets there
> are. We do know the cache size. We _could_ try to make an educated
> guess at whether the next user of the data will be DMA or not. So
> there are unquestionably heuristics we could apply, but I also
> suspect that they'd inevitably be pretty arbitrary.
>
> I suspect that we could make some boot-time (or maybe CPU-hotplug
> time) decision that simply sets a threshold value for when it is
> worth using non-temporal stores. With smaller caches, and with a
> single socket (or a single bus), it likely makes sense to use
> non-temporal stores earlier.
>
> But even with some rough heuristic, it will be wrong part of the
> time. So I think "simple and predictable" in the end tends to be
> better than "complex and still known to be broken".
>
> Btw, the "simple and predictable" could literally look at _where_ in
> the file the IO is. Because I know there are papers on the likelihood
> of re-use of data depending on where in the file it is written. Data
> written to low offsets is more likely to be accessed again (think
> temp files), while data written to big offsets is much more likely to
> be random, or to be written out (think databases or simply just large
> streaming files).
>
> So I suspect a "simple and predictable" algorithm could literally be
> something like
>
>  - use nontemporal stores only if you are writing a whole page, and
>    the byte offset of the page is larger than 'x', where 'x' may
>    optionally even depend on the size of the cache.
>
> But removing it entirely may be fine too.
>
> What I _don't_ think is fine is to think that you've "solved" it, or
> that you should even try!

Right. I don't know if you misunderstood me, or aimed this post at the
general discussion rather than at my reply specifically. I know that
even if a CPU does write combining in the store buffer, and even if it
has "big-hammer" nontemporal stores like x86 apparently does, there are
still cases where nontemporal stores will win: when the data doesn't
get used by the CPU again.
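For concreteness, the check you describe might look something like the
sketch below. The function name and the cutoff are invented for
illustration, and a real 'x' would presumably be derived from the cache
size at boot rather than hard-coded:

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE		4096UL
/* invented cutoff standing in for Linus's 'x' */
#define NT_FILE_OFFSET_CUTOFF	(8UL << 20)

/*
 * Use non-temporal stores only for whole, page-aligned pages written
 * beyond some offset into the file: low offsets smell like temp files
 * that will be read back soon, high offsets smell like streaming.
 */
bool want_nontemporal(unsigned long file_offset, size_t len)
{
	if (len != PAGE_SIZE || (file_offset & (PAGE_SIZE - 1)))
		return false;
	return file_offset > NT_FILE_OFFSET_CUTOFF;
}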
I agree that if a heuristic can't get it right a *significant* amount
of the time, then it is not worthwhile. Even if it gets it right a
little more often than wrong, the unpredictability is a negative factor
in itself. I agree completely with you there :)

I would like to remove it, as in Ingo's last patch, FWIW. But obviously
I can see there are cases where nontemporal stores help, so there will
never be a "right" answer.
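For reference, the shape of the decision that the 'total' argument in
the subject line makes possible, in an invented, userspace-compilable
form. None of this is actual tree code; it reuses the copy_nocache()
sketch from earlier in this mail, and the point is only that the
cached/non-temporal choice keys off the whole transfer size rather than
the length of the current segment:

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* the SSE2 non-temporal copy sketched earlier in this mail */
extern void copy_nocache(void *dst, const void *src, size_t len);

/*
 * Invented sketch: pick the copy flavour from the *total* transfer
 * size the caller knows about, not from this one segment's length,
 * so a big write built from many small iovec segments still gets
 * the non-temporal treatment.
 */
void copy_segment(void *dst, const void *src, size_t len, size_t total)
{
	if (total >= PAGE_SIZE)		/* big streaming write */
		copy_nocache(dst, src, len);
	else				/* small write, likely re-used soon */
		memcpy(dst, src, len);
}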