FYI, we just released a new white paper describing memory ordering for
Intel processors:
http://developer.intel.com/products/processor/manuals/index.htm
Should help answer some questions about some of the ordering primitives
we use on i386 and x86_64.
Jesse
On Saturday 08 September 2007 08:26, Jesse Barnes wrote:
> FYI, we just released a new white paper describing memory ordering for
> Intel processors:
> http://developer.intel.com/products/processor/manuals/index.htm
>
> Should help answer some questions about some of the ordering primitives
> we use on i386 and x86_64.
So, can we finally noop smp_rmb and smp_wmb on x86?
On Sat, 8 Sep 2007, Nick Piggin wrote:
>
> So, can we finally noop smp_rmb and smp_wmb on x86?
Did AMD already release their version? If so, we should probably add a
commit that does that in somewhat early 2.6.24 rc, and add the pointers to
the whitepapers in the commit message.
Linus
On Saturday 08 September 2007 09:20, Linus Torvalds wrote:
> On Sat, 8 Sep 2007, Nick Piggin wrote:
> > So, can we finally noop smp_rmb and smp_wmb on x86?
>
> Did AMD already release their version? If so, we should probably add a
> commit that does that in somewhat early 2.6.24 rc, and add the pointers to
> the whitepapers in the commit message.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
AMD64 Architecture Programmer's Manual Volume 2: System Programming
section 7.2: Multiprocessor Memory Access Ordering, a paragraph on the
first page says
"Loads do not pass previous loads (loads are not re-ordered). Stores do
not pass previous stores (stores are not re-ordered)"
So, yes, it should be easy to do.
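For illustration only, here is a rough sketch of what "no-op" would mean in practice: on x86 the two barriers would shrink to compiler-only barriers, on the assumption that ordinary write-back memory keeps load-load and store-store order in hardware. This is not the kernel's code. The names sketch_smp_rmb(), sketch_smp_wmb(), publish() and try_consume() are made up for the example, and non-x86 builds fall back to real fences.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier: emits no instruction, but stops the compiler
 * from moving memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__i386__) || defined(__x86_64__)
/* Assumption: write-back memory only, no non-temporal stores involved. */
#define sketch_smp_rmb() compiler_barrier()
#define sketch_smp_wmb() compiler_barrier()
#else
/* Other architectures still need a real fence. */
#define sketch_smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define sketch_smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

static volatile int payload;   /* data being handed over */
static volatile int ready;     /* nonzero once payload is valid */

/* Producer: store the data, then the flag, in that order. */
static void publish(int value)
{
    payload = value;
    sketch_smp_wmb();          /* payload store before flag store */
    ready = 1;
}

/* Consumer: returns 1 and fills *out if data was published, else 0. */
static int try_consume(int *out)
{
    assert(out != NULL);
    if (!ready)
        return 0;
    sketch_smp_rmb();          /* flag load before payload load */
    *out = payload;
    return 1;
}
```

The compiler barrier has to stay even if the hardware fence goes, otherwise the compiler is free to reorder the two stores (or the two loads) itself.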
On Sunday 09 September 2007 03:34, Nick Piggin wrote:
> On Saturday 08 September 2007 09:20, Linus Torvalds wrote:
> > On Sat, 8 Sep 2007, Nick Piggin wrote:
> > > So, can we finally noop smp_rmb and smp_wmb on x86?
> >
> > Did AMD already release their version? If so, we should probably add a
> > commit that does that in somewhat early 2.6.24 rc, and add the pointers
> > to the whitepapers in the commit message.
>
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
>
> AMD64 Architecture Programmer's Manual Volume 2: System Programming
> section 7.2: Multiprocessor Memory Access Ordering, a paragraph on the
> first page says
>
> "Loads do not pass previous loads (loads are not re-ordered). Stores do
> not pass previous stores (stores are not re-ordered)"
>
> So, yes, it should be easy to do.
There is some suggestion in the source code that non-temporal stores
(movntq) are weakly ordered. But AFAIKS from the documents, it is ordered
when operating on wb memory. What's the situation there?
I've also heard that string operations do not follow the normal ordering, but
that's just with respect to individual loads/stores in the one operation, I
hope? And they will still follow ordering rules WRT surrounding loads and
stores?
On Sunday 09 September 2007 03:48, Nick Piggin wrote:
> There is some suggestion in the source code that non-temporal stores
> (movntq) are weakly ordered. But AFAIKS from the documents, it is ordered
> when operating on wb memory. What's the situation there?
Sorry, it looks from the AMD document like nontemporal stores to wb
memory can go out of order. It is a bit hard to decipher what the types
mean.
If this is the case, we can either retain the sfence in smp_wmb(), or noop
it, and put explicit sfences around any place that performs nontemporal
stores...
Anyway, the lfence should be able to go away without so much trouble.
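A minimal sketch of the second option (explicit sfence at the place that does the nontemporal stores), assuming a userspace build with the SSE2 intrinsics from <emmintrin.h>. The function name nt_copy_ints() is hypothetical; builds without SSE2 fall back to plain stores plus a release fence.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32(), _mm_sfence() */
#endif

/* Copy n ints with non-temporal stores where available, then fence so
 * the copy is ordered before any store the caller issues afterwards. */
static void nt_copy_ints(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], src[i]);  /* movnti: weakly ordered */
    _mm_sfence();                          /* NT stores before later stores */
#else
    /* Fallback for builds without SSE2. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}
```

With the fence kept inside the copy routine, callers that publish the buffer afterwards would not depend on smp_wmb() containing an sfence.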
On Friday 07 September 2007 20:13:12 Nick Piggin wrote:
> On Sunday 09 September 2007 03:48, Nick Piggin wrote:
>
> > There is some suggestion in the source code that non-temporal stores
> > (movntq) are weakly ordered. But AFAIKS from the documents, it is ordered
> > when operating on wb memory. What's the situation there?
>
> Sorry, it looks from the AMD document like nontemporal stores to wb
> memory can go out of order.
Yes, that is how NT stores are defined.
> If this is the case, we can either retain the sfence in smp_wmb(), or noop
> it, and put explicit sfences around any place that performs nontemporal
> stores...
We do this already, but in most cases it doesn't matter anyways. We AFAIK
do not rely on any ordering for copy_*_user for example. There are not
that many users of nt so it's not a huge issue.
>
> Anyway, the lfence should be able to go away without so much trouble.
You mean sfence? lfence in rmb is definitely needed.
sfence on x86-64 is not strictly needed, but also shouldn't hurt very much
so I always kept it in.
-Andi
On Saturday 08 September 2007 18:53, Andi Kleen wrote:
> On Friday 07 September 2007 20:13:12 Nick Piggin wrote:
> > On Sunday 09 September 2007 03:48, Nick Piggin wrote:
> > > There is some suggestion in the source code that non-temporal stores
> > > (movntq) are weakly ordered. But AFAIKS from the documents, it is
> > > ordered when operating on wb memory. What's the situation there?
> >
> > Sorry, it looks from the AMD document like nontemporal stores to wb
> > memory can go out of order.
>
> Yes, that is how NT stores are defined.
>
> > If this is the case, we can either retain the sfence in smp_wmb(), or
> > noop it, and put explicit sfences around any place that performs
> > nontemporal stores...
>
> We do this already, but in most cases it doesn't matter anyways. We AFAIK
> do not rely on any ordering for copy_*_user for example. There are not
> that many users of nt so it's not a huge issue.
OK, but we just don't want to be making lots of little exceptions. For
bulk copies, I don't see it being a big issue to always sfence around
them (it would be a relatively minor cost).
> > Anyway, the lfence should be able to go away without so much trouble.
>
> You mean sfence? lfence in rmb is definitely needed.
I mean lfence in smp_rmb().
> sfence on x86-64 is not strictly needed, but also shouldn't hurt very much
> so I always kept it in.
>
> -Andi
On Friday 07 September 2007 21:57:35 Nick Piggin wrote:
>
> > > Anyway, the lfence should be able to go away without so much trouble.
> >
> > You mean sfence? lfence in rmb is definitely needed.
>
> I mean lfence in smp_rmb().
One point of rmb is to stop speculative loads and I don't think we
can get that without lfence.
-Andi
On Fri, 7 Sep 2007 15:26:50 -0700
Jesse Barnes wrote:
> FYI, we just released a new white paper describing memory ordering for
> Intel processors:
> http://developer.intel.com/products/processor/manuals/index.htm
>
> Should help answer some questions about some of the ordering primitives
> we use on i386 and x86_64.
Nice - but it appears to be 64bit only - and indeed it appears to be
untrue for real 32bit because of the Pentium Pro fencing errata.
The kernel also runs on IDT Winchip, Cyrix and AMD processors not all of
which have exactly the same behaviour (the IDT Winchip as we run it
profoundly differs)
Alan
On Sat, 8 Sep 2007 18:54:57 +1000
Nick Piggin wrote:
> On Saturday 08 September 2007 08:26, Jesse Barnes wrote:
> > FYI, we just released a new white paper describing memory ordering for
> > Intel processors:
> > http://developer.intel.com/products/processor/manuals/index.htm
> >
> > Should help answer some questions about some of the ordering primitives
> > we use on i386 and x86_64.
>
> So, can we finally noop smp_rmb and smp_wmb on x86?
Nakked-by: Alan Cox
You can only no-op it on 64bit Intel processors. On 32bit it needs to be
conditional on your processor family (or back compat for it), as
the Pentium Pro has some serious store ordering errata (hence the way it
needs lock decb for spin_unlock).
Alan
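To make "conditional on processor family" concrete, here is a hedged sketch of an unlock path that picks a locked instruction only when a build switch is set. SKETCH_PPRO_FENCE is a hypothetical stand-in for CONFIG_X86_PPRO_FENCE, the lock type and function names are invented, and the GCC/Clang __atomic builtins are used in place of the hand-written lock-prefixed assembly mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch standing in for CONFIG_X86_PPRO_FENCE. */
/* #define SKETCH_PPRO_FENCE 1 */

typedef struct {
    volatile unsigned char locked;   /* 0 = free, 1 = held */
} sketch_spinlock_t;

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int sketch_spin_trylock(sketch_spinlock_t *l)
{
    assert(l != NULL);
    /* Atomic exchange: an implicitly locked xchg on x86. */
    return __atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 0;
}

static void sketch_spin_unlock(sketch_spinlock_t *l)
{
    assert(l != NULL);
#ifdef SKETCH_PPRO_FENCE
    /* Parts with store ordering errata: release with a locked
     * read-modify-write instead of a plain store. */
    (void)__atomic_exchange_n(&l->locked, 0, __ATOMIC_SEQ_CST);
#else
    /* Everything else: a plain store releases the lock on x86. */
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
#endif
}
```

The same switch could gate the barrier macros, so that only kernels built for the affected family pay for the locked operation.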
On Saturday 08 September 2007 20:19, Andi Kleen wrote:
> On Friday 07 September 2007 21:57:35 Nick Piggin wrote:
> > > > Anyway, the lfence should be able to go away without so much trouble.
> > >
> > > You mean sfence? lfence in rmb is definitely needed.
> >
> > I mean lfence in smp_rmb().
>
> One point of rmb is to stop speculative loads and I don't think we
> can get that without lfence.
smp_rmb() should not need to do anything because loads are done
in order anyway. Both AMD and Intel have committed to this now.
The important point is that they *appear* to be done in order. AFAIK,
the CPUs can still do speculative and out of order loads, but throw
out the results if they could be wrong.
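As an example of code that leans on loads appearing in order, here is a sketch of a seqcount-style reader and writer. It is an illustration under assumptions, not the kernel's seqlock: the names are hypothetical, a single writer is assumed, and on x86 the barriers are compiler-only as proposed above.

```c
#include <assert.h>
#include <stddef.h>

#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__i386__) || defined(__x86_64__)
#define sketch_smp_rmb() compiler_barrier()
#define sketch_smp_wmb() compiler_barrier()
#else
#define sketch_smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define sketch_smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

struct sketch_pair {
    volatile unsigned seq;   /* odd while a write is in progress */
    volatile int a, b;       /* writer keeps b == a + 1 */
};

/* Single writer assumed: bump seq, update both fields, bump seq again. */
static void sketch_write(struct sketch_pair *p, int a)
{
    assert(p != NULL);
    p->seq++;                /* now odd */
    sketch_smp_wmb();        /* seq store before data stores */
    p->a = a;
    p->b = a + 1;
    sketch_smp_wmb();        /* data stores before final seq store */
    p->seq++;                /* even again */
}

/* Reader retries until it sees the same even seq before and after. */
static void sketch_read(const struct sketch_pair *p, int *a, int *b)
{
    unsigned start;

    assert(p != NULL && a != NULL && b != NULL);
    do {
        start = p->seq;
        sketch_smp_rmb();    /* seq load before data loads */
        *a = p->a;
        *b = p->b;
        sketch_smp_rmb();    /* data loads before seq re-check */
    } while ((start & 1u) || start != p->seq);
}
```

If the CPU speculates the data loads early and the line changes underneath it, the claim in this thread is that the speculation is discarded, so the reader still observes the loads as if they ran in program order.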
On Saturday 08 September 2007 20:30, Alan Cox wrote:
> On Sat, 8 Sep 2007 18:54:57 +1000
>
> Nick Piggin wrote:
> > On Saturday 08 September 2007 08:26, Jesse Barnes wrote:
> > > FYI, we just released a new white paper describing memory ordering for
> > > Intel processors:
> > > http://developer.intel.com/products/processor/manuals/index.htm
> > >
> > > Should help answer some questions about some of the ordering primitives
> > > we use on i386 and x86_64.
> >
> > So, can we finally noop smp_rmb and smp_wmb on x86?
>
> Nakked-by: Alan Cox
>
> You can only no-op it on 64bit Intel processors. On 32bit it needs to be
> conditional on your processor family (or back compat for it), as
> the Pentium Pro has some serious store ordering errata (hence the way it
> needs lock decb for spin_unlock).
We already noop smp_wmb on i386 even when CONFIG_X86_PPRO_FENCE.
I'm not sure if either errata can be solved completely by adding lock ops
in barrier instructions anyway: they both seem to involve situations where
there is just a single problematic cacheline in question.
On Saturday 08 September 2007 20:29, Alan Cox wrote:
> On Fri, 7 Sep 2007 15:26:50 -0700
>
> Jesse Barnes wrote:
> > FYI, we just released a new white paper describing memory ordering for
> > Intel processors:
> > http://developer.intel.com/products/processor/manuals/index.htm
> >
> > Should help answer some questions about some of the ordering primitives
> > we use on i386 and x86_64.
>
> Nice - but it appears to be 64bit only - and indeed it appears to be
> untrue for real 32bit because of the Pentium Pro fencing errata.
As I said, we're not doing anything special in barriers for the ppro errata
today anyway.
> The kernel also runs on IDT Winchip, Cyrix and AMD processors not all of
> which have exactly the same behaviour (the IDT Winchip as we run it
> profoundly differs)
AMD processors guarantee loads are ordered and stores are ordered
(with exceptions of non-temporal, and non-wb policy).
As for the others that do out of order stores, are any of them SMP?
On Sun, 9 Sep 2007, Nick Piggin wrote:
> I've also heard that string operations do not follow the normal ordering, but
> that's just with respect to individual loads/stores in the one operation, I
> hope? And they will still follow ordering rules WRT surrounding loads and
> stores?
see section 7.2.3 of intel volume 3A...
"Code dependent upon sequential store ordering should not use the string
operations for the entire data structure to be stored. Data and semaphores
should be separated. Order dependent code should use a discrete semaphore
uniquely stored to after any string operations to allow correctly ordered
data to be seen by all processors."
i think we need sfence after things like copy_page, clear_page, and
possibly copy_user... at least on intel processors with fast strings
option enabled.
-dean
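A sketch of the pattern the quoted manual text describes, with the extra sfence suggested above: bulk data goes in with a string operation, then a fence, then a discrete flag written by its own store. Whether that fence is really required is exactly what this thread is debating, so treat it as an illustration. All names and the 4096-byte page size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096    /* assumed page size for the example */

struct sketch_page_slot {
    unsigned char data[SKETCH_PAGE_SIZE];
    volatile int ready;          /* discrete flag, stored separately */
};

/* Bulk copy with a string instruction on x86, memcpy() elsewhere. */
static void sketch_string_copy(void *dst, const void *src, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Store fence: sfence where the build can assume it exists. */
static void sketch_store_fence(void)
{
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE__))
    __asm__ __volatile__("sfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}

static void sketch_publish_page(struct sketch_page_slot *slot, const void *src)
{
    assert(slot != NULL && src != NULL);
    sketch_string_copy(slot->data, src, SKETCH_PAGE_SIZE);
    sketch_store_fence();        /* string stores before the flag store */
    slot->ready = 1;             /* flag written by its own store */
}
```

The flag is deliberately not part of the region covered by the string operation, which is the "data and semaphores should be separated" advice from the manual.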
dean gaudet wrote:
> On Sun, 9 Sep 2007, Nick Piggin wrote:
>
>> I've also heard that string operations do not follow the normal ordering, but
>> that's just with respect to individual loads/stores in the one operation, I
>> hope? And they will still follow ordering rules WRT surrounding loads and
>> stores?
>
> see section 7.2.3 of intel volume 3A...
>
> "Code dependent upon sequential store ordering should not use the string
> operations for the entire data structure to be stored. Data and semaphores
> should be separated. Order dependent code should use a discrete semaphore
> uniquely stored to after any string operations to allow correctly ordered
> data to be seen by all processors."
>
> i think we need sfence after things like copy_page, clear_page, and
> possibly copy_user... at least on intel processors with fast strings
> option enabled.
I do not think so. I believe that the authors are trying to say that
struct { uint8 lock; uint8 data; } x;
lea (x.data),%edi
mov $2,%ecx
std
rep movsb
to set both data and lock does not guarantee that x.lock will be set
after x.data and that you should do
lea (x.data),%edi
std
movsb
movsb # or mov (%esi),%al; mov %al,(%edi), but movsb looks discrete
      # enough to me
instead (and yes, I know that my example is silly).
Petr
On Sat, 8 Sep 2007, Petr Vandrovec wrote:
> dean gaudet wrote:
> > On Sun, 9 Sep 2007, Nick Piggin wrote:
> >
> > > I've also heard that string operations do not follow the normal
> > > ordering, but that's just with respect to individual loads/stores in
> > > the one operation, I hope? And they will still follow ordering rules
> > > WRT surrounding loads and stores?
> >
> > see section 7.2.3 of intel volume 3A...
> >
> > "Code dependent upon sequential store ordering should not use the string
> > operations for the entire data structure to be stored. Data and semaphores
> > should be separated. Order dependent code should use a discrete semaphore
> > uniquely stored to after any string operations to allow correctly ordered
> > data to be seen by all processors."
> >
> > i think we need sfence after things like copy_page, clear_page, and possibly
> > copy_user... at least on intel processors with fast strings option enabled.
>
> I do not think so. I believe that the authors are trying to say that
>
> struct { uint8 lock; uint8 data; } x;
>
> lea (x.data),%edi
> mov $2,%ecx
> std
> rep movsb
>
> to set both data and lock does not guarantee that x.lock will be set after
> x.data and that you should do
>
> lea (x.data),%edi
> std
> movsb
> movsb # or mov (%esi),%al; mov %al,(%edi), but movsb looks discrete
>       # enough to me
>
> instead (and yes, I know that my example is silly).
no it's worse than that -- intel fast string stores can become globally
visible in any order at all w.r.t. normal loads or stores... so take all
those great examples in their recent whitepaper and throw out all the
ordering guarantees for addresses on different cachelines if any of the
stores are rep string.
for example transitive store ordering for locations on multiple cachelines
is not guaranteed at all. the kernel could return a zero page and one
core could see the zeroes out of order with another core performing some
sort of lockless data structure operation.
fast strings don't break ordering from the point of view of the core
performing the rep string operation, but externally there are no
guarantees (it's right there in the docs).
-dean
> AMD processors guarantee loads are ordered and stores are ordered
> (with exceptions of non-temporal, and non-wb policy).
>
> As for the others that do out of order stores, are any of them SMP?
IDT winchip isn't, Geode isn't
Nick Piggin wrote:
> smp_rmb() should not need to do anything because loads are done
> in order anyway. Both AMD and Intel have committed to this now.
>
> The important point is that they *appear* to be done in order. AFAIK,
> the CPUs can still do speculative and out of order loads, but throw
> out the results if they could be wrong.
Is there anything even semiofficial from VIA? Not that the x86
architecture isn't pretty much definable as the AMD-Intel consensus...
-hpa
* Jesse Barnes wrote:
> FYI, we just released a new white paper describing memory ordering for
> Intel processors:
> http://developer.intel.com/products/processor/manuals/index.htm
>
> Should help answer some questions about some of the ordering primitives
> we use on i386 and x86_64.
Hi Jesse,
Thanks for letting everyone know about that paper, however - it
has confused me somewhat; there seem to be differences in that
description and that described in the 'Intel 64 and IA-32 Architectures
Software Developer's Manual' and I'd like to understand whether
this paper is designed just to explain points or is actually
intended to change what can be expected of the processor.
That ordering doc states:
'Loads are not reordered with other loads'
Vol3a section 7.2.1 of the architecture manual states:
'Reads can be carried out speculatively and in any order.'
Is this a:
1) Change in the definition of the architecture that existing
processors actually follow anyway.
2) A difference between what the processor does and what is visible
to the software (the intro to this paper does seem to emphasize
software visibility more than the architecture manual).
3) Some other difference I haven't spotted.
The other thing that made me think about it was that the Itanium
Architecture Software Dev Manual vol2 2.1.2 states that the Itanium
uses ld.acq/st.rel (acquire/release) references to
'operate according to the IA-32 ordering model.' which I think means
that all those loads are in order relative to all the other acquire
loads?
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
On Wednesday, September 12, 2007 11:26 am Dr. David Alan Gilbert wrote:
> * Jesse Barnes wrote:
> > FYI, we just released a new white paper describing memory ordering
> > for Intel processors:
> > http://developer.intel.com/products/processor/manuals/index.htm
> >
> > Should help answer some questions about some of the ordering
> > primitives we use on i386 and x86_64.
>
> Hi Jesse,
> Thanks for letting everyone know about that paper, however - it
> has confused me somewhat; there seem to be differences in that
> description and that described in the 'Intel 64 and IA-32
> Architectures Software Developer's Manual' and I'd like to understand
> whether this paper is designed just to explain points or is actually
> intended to change what can be expected of the processor.
>
> That ordering doc states:
> 'Loads are not reordered with other loads'
>
> Vol3a section 7.2.1 of the architecture manual states:
>
> 'Reads can be carried out speculatively and in any order.'
>
> Is this a:
> 1) Change in the definition of the architecture that existing
> processors actually follow anyway.
> 2) A difference between what the processor does and what is visible
> to the software (the intro to this paper does seem to emphasize
> software visibility more than the architecture manual).
> 3) Some other difference I haven't spotted.
It's really both (1) and (2). This document will become part of the
regular manuals when the next version is published. And yes,
processors may do something different internally, but software can rely
on the behavior described by the rules in the document.
Jesse
Jesse Barnes writes:
>
> It's really both (1) and (2). This document will become part of the
> regular manuals when the next version is published. And yes,
> processors may do something different internally, but software can rely
> on the behavior described by the rules in the document.
... until the first erratum comes around. With the multitude of x86
cores being introduced all the time (how many did only Intel just announce at
IDF?) that is going to happen sooner or later.
i386 with full legacy enabled already has to care about old PPros and
those seriously violate write ordering.
-Andi