> According to latest memory ordering specification documents from Intel
> and AMD, both manufacturers are committed to in-order loads from
> cacheable memory for the x86 architecture. Hence, smp_rmb() may be a
> simple barrier.
>
> http://developer.intel.com/products/processor/manuals/318147.pdf
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
Hi
I'm just wondering about one thing --- what is the LFENCE instruction
good for?

SFENCE is for enforcing ordering in write-combining buffers (it makes
no sense in write-back cache mode).
MFENCE is for preventing stores from being moved past later loads.

But what is LFENCE for? I read the above documents and they already say
that CPUs have ordered loads.

In the Intel instruction reference, the description of LFENCE is copied
from SFENCE (with the word "store" replaced with the word "load"), so it
doesn't really give much insight into the operation of the instruction.

Or is LFENCE just a no-op reserved for the possibility that Intel would
relax the ordering rules?
Mikulas
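
For readers following along, here is a minimal userspace sketch of the three fences being compared. The wrapper names (load_fence, store_fence, full_fence) and the function announce_then_check are hypothetical and exist only for this illustration; they are not kernel interfaces. The example shows the one case the message attributes to MFENCE: keeping a later load from completing ahead of an earlier store.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper names for this sketch; they are not kernel APIs. */
#if defined(__x86_64__) || defined(__i386__)
/* lfence and mfence need SSE2, sfence needs SSE; every x86-64 CPU has both. */
static inline void load_fence(void)  { __asm__ __volatile__("lfence" ::: "memory"); }
static inline void store_fence(void) { __asm__ __volatile__("sfence" ::: "memory"); }
static inline void full_fence(void)  { __asm__ __volatile__("mfence" ::: "memory"); }
#else
/* Fallback so the sketch builds elsewhere: one full barrier stands in for all three. */
static inline void load_fence(void)  { __sync_synchronize(); }
static inline void store_fence(void) { __sync_synchronize(); }
static inline void full_fence(void)  { __sync_synchronize(); }
#endif

static volatile int my_flag;

/* Raise our own flag, then look at the peer's flag (one half of a
 * Dekker-style handshake). */
int announce_then_check(const volatile int *peer_flag)
{
    assert(peer_flag != NULL);
    my_flag = 1;
    full_fence();   /* without it, the load below may complete before the store is visible */
    return *peer_flag;
}
```

On cacheable write-back memory, load_fence() and store_fence() would add nothing here that x86 does not already guarantee, which is exactly the puzzle raised in the message.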
On Mon, 15 Oct 2007 22:47:42 +0200 (CEST)
Mikulas Patocka <[email protected]> wrote:
> > According to latest memory ordering specification documents from
> > Intel and AMD, both manufacturers are committed to in-order loads
> > from cacheable memory for the x86 architecture. Hence, smp_rmb()
> > may be a simple barrier.
> >
> > http://developer.intel.com/products/processor/manuals/318147.pdf
> > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
>
> Hi
>
> I'm just wondering about one thing --- what is LFENCE instruction
> good for?
>
> SFENCE is for enforcing ordering in write-combining buffers (it
> doesn't have sense in write-back cache mode).
> MFENCE is for preventing of moving stores past loads.
>
> But what is LFENCE for? I read the above documents and they already
> say that CPUs have ordered loads.
>
The cpus also have an explicit set of instructions that deliberately do
unordered stores/loads, and s/lfence etc are mostly designed for those.
> On Mon, 15 Oct 2007 22:47:42 +0200 (CEST)
> Mikulas Patocka <[email protected]> wrote:
>
> > > According to latest memory ordering specification documents from
> > > Intel and AMD, both manufacturers are committed to in-order loads
> > > from cacheable memory for the x86 architecture. Hence, smp_rmb()
> > > may be a simple barrier.
> > >
> > > http://developer.intel.com/products/processor/manuals/318147.pdf
> > > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> >
> > Hi
> >
> > I'm just wondering about one thing --- what is LFENCE instruction
> > good for?
> >
> > SFENCE is for enforcing ordering in write-combining buffers (it
> > doesn't have sense in write-back cache mode).
> > MFENCE is for preventing of moving stores past loads.
> >
> > But what is LFENCE for? I read the above documents and they already
> > say that CPUs have ordered loads.
> >
>
> The cpus also have an explicit set of instructions that deliberately do
> unordered stores/loads, and s/lfence etc are mostly designed for those.
I know about unordered stores (movnti & similar) --- they basically use
the write-combining method on memory that is normally write-back --- and
they need sfence. But which instruction does an unordered load and needs
lfence?
Mikulas
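
A minimal sketch of the movnti-plus-sfence pattern mentioned above, assuming a compiler that provides the SSE2 intrinsics in <emmintrin.h>. The function fill_and_publish and the flag published are hypothetical names for this illustration. When SSE2 is not available the sketch falls back to ordinary stores, which need no fence on x86.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti), _mm_sfence */
#endif

static volatile int published;

/* Fill buf with non-temporal stores, then set a flag that a consumer
 * could poll. The sfence keeps the flag from becoming visible before
 * the weakly ordered streaming stores. */
void fill_and_publish(int *buf, size_t n, int value)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&buf[i], value);   /* movnti: write-combining style store */
#else
        buf[i] = value;                    /* ordinary, ordered store */
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();                          /* drain the streaming stores first */
#endif
    published = 1;
}
```

Calling fill_and_publish(buf, 8, 42) leaves every element of buf equal to 42 and published set to 1; the fence only matters to another CPU that reads published and then buf.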
Mikulas Patocka wrote:
>
> I know about unordered stores (movnti & similar) --- they basically use
> write-combining method on memory that is normally write-back --- and they
> need sfence. But which one instruction does unordered load and needs
> lefence?
>
PREFETCHNTA.
-hpa
On Tue, Oct 16, 2007 at 12:08:01AM +0200, Mikulas Patocka wrote:
> > On Mon, 15 Oct 2007 22:47:42 +0200 (CEST)
> > Mikulas Patocka <[email protected]> wrote:
> >
> > > > According to latest memory ordering specification documents from
> > > > Intel and AMD, both manufacturers are committed to in-order loads
> > > > from cacheable memory for the x86 architecture. Hence, smp_rmb()
> > > > may be a simple barrier.
> > > >
> > > > http://developer.intel.com/products/processor/manuals/318147.pdf
> > > > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> > >
> > > Hi
> > >
> > > I'm just wondering about one thing --- what is LFENCE instruction
> > > good for?
> > >
> > > SFENCE is for enforcing ordering in write-combining buffers (it
> > > doesn't have sense in write-back cache mode).
> > > MFENCE is for preventing of moving stores past loads.
> > >
> > > But what is LFENCE for? I read the above documents and they already
> > > say that CPUs have ordered loads.
> > >
> >
> > The cpus also have an explicit set of instructions that deliberately do
> > unordered stores/loads, and s/lfence etc are mostly designed for those.
>
> I know about unordered stores (movnti & similar) --- they basically use
> write-combining method on memory that is normally write-back --- and they
> need sfence. But which one instruction does unordered load and needs
> lefence?
Also, for non-wb memory. I don't think the Intel document referenced
says anything about this, but the AMD document says that loads can pass
loads (page 8, rule b).
This is why our rmb() is still an lfence.
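
A hedged sketch of the split described here: a mandatory read barrier that emits lfence, next to an SMP-only read barrier that is just a compiler barrier on x86. The sketch_* macro names and read_status_then_data are hypothetical; the register layout (status word at index 0, data word at index 1) is an assumption made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* Mandatory barrier: also orders loads from non-WB (e.g. WC) memory. */
#define sketch_rmb()     __asm__ __volatile__("lfence" ::: "memory")
/* SMP barrier: loads from cacheable WB memory are already ordered. */
#define sketch_smp_rmb() sketch_barrier()
#else
/* Fallback so the sketch builds elsewhere. */
#define sketch_rmb()     __sync_synchronize()
#define sketch_smp_rmb() __sync_synchronize()
#endif

/* Read a status word, and only if its low bit is set read the data word.
 * The region may be mapped with any memory type, so the mandatory
 * barrier is used rather than sketch_smp_rmb(). */
int read_status_then_data(const volatile uint32_t *regs, uint32_t *data)
{
    assert(regs != NULL && data != NULL);
    if ((regs[0] & 1u) == 0)
        return 0;
    sketch_rmb();        /* the data load must not pass the status load */
    *data = regs[1];
    return 1;
}
```

If both words were known to live in ordinary write-back memory, sketch_smp_rmb() would be enough on x86, and that is the case the smp_rmb() change quoted at the top of the thread relies on.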
On Mon, 15 Oct 2007, H. Peter Anvin wrote:
> Mikulas Patocka wrote:
> >
> > I know about unordered stores (movnti & similar) --- they basically use
> > write-combining method on memory that is normally write-back --- and they
> > need sfence. But which one instruction does unordered load and needs
> > lefence?
> >
>
> PREFETCHNTA.
PREFETCH* doesn't change program semantics. The processor is allowed to
ignore a prefetch instruction if it doesn't have the resources needed for
the prefetch. It is not ordered wrt. fences.

PREFETCHNTA was implemented as a prefetch into the L1 cache, omitting the
L2 cache, on Pentium 3 and M --- and it is implemented as a prefetch into
the L2 cache on others --- so it doesn't really use any special buffers.
Mikulas
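
To illustrate the point that a prefetch is only a hint, here is a small sketch using the GCC/Clang builtin __builtin_prefetch. The function names and the prefetch distance of 16 elements are assumptions chosen for the example. With a locality argument of 0 the compiler typically emits PREFETCHNTA on x86, but it is free to emit nothing at all.

```c
#include <assert.h>
#include <stddef.h>

/* Plain reference loop. */
long sum_plain(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Same loop with a non-temporal read hint a few elements ahead. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 0 /* no temporal locality */);
        total += a[i];
    }
    return total;
}
```

Both functions return the same value for any input; only timing can differ, which is why no fence is needed to make the prefetching version correct.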
> -hpa
>
On Tue, 16 Oct 2007, Nick Piggin wrote:
> > > The cpus also have an explicit set of instructions that deliberately do
> > > unordered stores/loads, and s/lfence etc are mostly designed for those.
> >
> > I know about unordered stores (movnti & similar) --- they basically use
> > write-combining method on memory that is normally write-back --- and they
> > need sfence. But which one instruction does unordered load and needs
> > lefence?
>
> Also, for non-wb memory. I don't think the Intel document referenced
> says anything about this, but the AMD document says that loads can pass
> loads (page 8, rule b).
>
> This is why our rmb() is still an lfence.
I see, AMD says that WC memory loads can be out-of-order.
There is very little use for it --- the framebuffer and the AGP aperture
are the only pieces of memory that are WC, and no kernel structures are
placed there, so it is possible to remove that lfence.
Mikulas
Mikulas Patocka wrote:
> On Mon, 15 Oct 2007, H. Peter Anvin wrote:
>
>> Mikulas Patocka wrote:
>>> I know about unordered stores (movnti & similar) --- they basically use
>>> write-combining method on memory that is normally write-back --- and they
>>> need sfence. But which one instruction does unordered load and needs
>>> lefence?
>>>
>> PREFETCHNTA.
>
> PREFETCH* doesn't change program semantics. The processor is allowed to
> ignore prefetch instruction if it doesn't have resources needed for
> prefetch. It not ordered wrt. fences.
>
> PREFETCHNTA was implemented as prefetch into L1 cache and omitting L2
> cache on Pentium 3 and M --- and it is implemented as prefetch into L2
> cache on other --- do it doesn't really use any special buffers.
>
Its semantics allow it to, though. It's not clear to me whether it is
actually necessary on existing chips.

I believe it does a way-restricted prefetch on existing silicon.
-hpa
On Tue, 16 Oct 2007, H. Peter Anvin wrote:
> Mikulas Patocka wrote:
> >
> > PREFETCH* doesn't change program semantics. The processor is allowed to
> > ignore prefetch instruction if it doesn't have resources needed for
> > prefetch. It not ordered wrt. fences.
> >
> > PREFETCHNTA was implemented as prefetch into L1 cache and omitting L2 cache
> > on Pentium 3 and M --- and it is implemented as prefetch into L2 cache on
> > other --- do it doesn't really use any special buffers.
> >
>
> It's semantics allows it to, though. It's not clear to me whether it is
> actually necessary on existing chips.
>
> It does, I believe, way-restricted prefetch on existing silicon.
It is allowed to use special buffers for prefetch, but --- because
prefetch doesn't change program semantics, these special buffers must be
kept consistent just like caches --- they must be snooped for bus
transactions and they must be checked each time something writes to cache.
So I doubt anyone will ever implement it this way --- it's too much
silicon for too little effect.
Mikulas
> -hpa
On Tue, Oct 16, 2007 at 12:33:54PM +0200, Mikulas Patocka wrote:
>
>
> On Tue, 16 Oct 2007, Nick Piggin wrote:
>
> > > > The cpus also have an explicit set of instructions that deliberately do
> > > > unordered stores/loads, and s/lfence etc are mostly designed for those.
> > >
> > > I know about unordered stores (movnti & similar) --- they basically use
> > > write-combining method on memory that is normally write-back --- and they
> > > need sfence. But which one instruction does unordered load and needs
> > > lefence?
> >
> > Also, for non-wb memory. I don't think the Intel document referenced
> > says anything about this, but the AMD document says that loads can pass
> > loads (page 8, rule b).
> >
> > This is why our rmb() is still an lfence.
>
> I see, AMD says that WC memory loads can be out-of-order.
>
> There is very little usability to it --- framebuffer and AGP aperture is
> the only piece of memory that is WC and no kernel structures are placed
> there, so it is possible to remove that lfence.
No. In the Linux kernel, rmb() means that all previous loads, including
those to any IO regions, will be executed before any subsequent load.

How can you possibly get rid of lfence from there just because you may
happen to *know* that it isn't used? (Btw. the IO serialisation isn't for
kernel data structures, it is for actual IO operations, generally.)

Doing that would lead to an unmaintainable mess. If drivers don't need rmb,
then they don't call it.
> > I see, AMD says that WC memory loads can be out-of-order.
> >
> > There is very little usability to it --- framebuffer and AGP aperture is
> > the only piece of memory that is WC and no kernel structures are placed
> > there, so it is possible to remove that lfence.
>
> No. In Linux kernel, rmb() means that all previous loads, including to
> any IO regions, will be executed before any subsequent load.
You already must not place any data structures into WC memory --- for
example, spinlocks wouldn't work there. wmb() also won't work on WC
memory, because it assumes that writes are ordered.
> How can you possibly get rid of lfence from there just because you may
> happen to *know* that it isn't used (btw. the IO serialisation isn't for
> kernel data structures, it is for actual IO operations, generally).
IO regions are in uncached memory, and x86 already serializes it fine. It
flushes any write buffers on access to uncached memory.
(BTW. what is the general portable rule for serializing writel() and
readl()? On x86 they are serialized in hardware, but what about other archs?)
> Doing that would lead to an unmaintainable mess. If drivers don't need rmb,
> then they don't call it.
If wmb() doesn't currently work on write-combining memory, why should
rmb() work there?
The purpose of rmb() is to enforce ordering on architectures that don't
enforce it in hardware --- that is not the case on x86.
Mikulas
On Wed, Oct 17, 2007 at 01:05:16AM +0200, Mikulas Patocka wrote:
> > > I see, AMD says that WC memory loads can be out-of-order.
> > >
> > > There is very little usability to it --- framebuffer and AGP aperture is
> > > the only piece of memory that is WC and no kernel structures are placed
> > > there, so it is possible to remove that lfence.
> >
> > No. In Linux kernel, rmb() means that all previous loads, including to
> > any IO regions, will be executed before any subsequent load.
>
> You already must not place any data structures into WC memory --- for
> example, spinlocks wouldn't work there.
What do you mean "already"? If we already have drivers loading data from
WC memory, then rmb() needs to order them, whether or not they actually
need it. If that were prohibitively costly, then we'd introduce a new
barrier which does not order WC memory, right?
> wmb() also won't work on WC
> memory, because it assumes that writes are ordered.
You mean the one defined like this:
#define wmb() asm volatile("sfence" ::: "memory")
? If it assumed writes are ordered, then it would just be a barrier().
> > How can you possibly get rid of lfence from there just because you may
> > happen to *know* that it isn't used (btw. the IO serialisation isn't for
> > kernel data structures, it is for actual IO operations, generally).
>
> IO regions are in uncached memory, and x86 already serializes it fine. It
> flushes any write buffers on access to uncached memory.
>
> (BTW. what is the general portable rule for serializing writel() and
> readl()? On x86 they are serialized in hardware, but what on other archs?)
Most tend to order them strongly these days. There are also relaxed
variants for architectures that can take advantage of them.
> > Doing that would lead to an unmaintainable mess. If drivers don't need rmb,
> > then they don't call it.
>
> If wmb() doesn't currently work on write-combining memory, why should
> rmb() work there?
I don't understand why you say wmb() doesn't work on WC memory. What part
of which spec are you reading (or, given your mistrust of specs, what CPU
are you seeing failures with)?
> The purpose of rmb() is to enforce ordering on architectures that don't
> force it in hardware --- that is not the case of x86.
Well it clearly is the case because I just pointed you to a document
that says they can go out of order. If you want to argue that existing
implementations do not, then by all means go ahead and send a patch to
Linus and see what he says about it ;)
> > You already must not place any data structures into WC memory --- for
> > example, spinlocks wouldn't work there.
>
> What do you mean "already"?
I mean "in current kernel" (I checked it in 2.6.22)
> If we already have drivers loading data from
> WC memory, then rmb() needs to order them, whether or not they actually
> need it. If that were prohibitively costly, then we'd introduce a new
> barrier which does not order WC memory, right?
>
>
> > wmb() also won't work on WC
> > memory, because it assumes that writes are ordered.
>
> You mean the one defined like this:
> #define wmb() asm volatile("sfence" ::: "memory")
> ? If it assumed writes are ordered, then it would just be a barrier().
You read the wrong part of the include file. Really, it is
(2.6.22, include/asm-i386/system.h):
#ifdef CONFIG_X86_OOSTORE
#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
#else
#define wmb() __asm__ __volatile__ ("": : :"memory")
#endif
CONFIG_X86_OOSTORE is dependent on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
--- so on Intel and AMD, it is really just barrier().
So drivers can't assume that wmb() works on write-combining memory.
> > > Doing that would lead to an unmaintainable mess. If drivers don't
> > > need rmb, then they don't call it.
> >
> > If wmb() doesn't currently work on write-combining memory, why should
> > rmb() work there?
>
> I don't understand why you say wmb() doesn't work on WC memory.
Because it is defined as __asm__ __volatile__ ("": : :"memory")
And WC memory can reorder writes (WB memory can't).
> > The purpose of rmb() is to enforce ordering on architectures that don't
> > force it in hardware --- that is not the case of x86.
>
> Well it clearly is the case because I just pointed you to a document
> that says they can go out of order.
> If you want to argue that existing
> implementations do not, then by all means go ahead and send a patch to
> Linus and see what he says about it ;)
I mean this: wmb() assumes that the data to be ordered are not in WC
memory. rmb() assumes that the data can be in WC memory (lfence is only
useful on WC --- it doesn't have any effect on other memory types).
Mikulas
Nick Piggin <[email protected]> wrote:
>
> Also, for non-wb memory. I don't think the Intel document referenced
> says anything about this, but the AMD document says that loads can pass
> loads (page 8, rule b).
>
> This is why our rmb() is still an lfence.
BTW, Xen (in particular, the code in drivers/xen) uses mb/rmb/wmb
instead of smp_mb/smp_rmb/smp_wmb when it accesses memory that's
shared with other Xen domains or the hypervisor.
This is necessary because even if a Xen domain is
UP, the hypervisor might be SMP.

It would be nice if we could have these adopt the new SMP barriers
on x86 instead of the IO ones as they currently do.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Wed, Oct 17, 2007 at 02:30:32AM +0200, Mikulas Patocka wrote:
> > > You already must not place any data structures into WC memory --- for
> > > example, spinlocks wouldn't work there.
> >
> > What do you mean "already"?
>
> I mean "in current kernel" (I checked it in 2.6.22)
Ahh, that's not "current kernel", though ;)
4071c718555d955a35e9651f77086096ad87d498
> > If we already have drivers loading data from
> > WC memory, then rmb() needs to order them, whether or not they actually
> > need it. If that were prohibitively costly, then we'd introduce a new
> > barrier which does not order WC memory, right?
> >
> >
> > > wmb() also won't work on WC
> > > memory, because it assumes that writes are ordered.
> >
> > You mean the one defined like this:
> > #define wmb() asm volatile("sfence" ::: "memory")
> > ? If it assumed writes are ordered, then it would just be a barrier().
>
> You read wrong part of the include file. Really, it is
> (2.6.22,include/asm-i386/system.h):
> #ifdef CONFIG_X86_OOSTORE
> #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence",
> X86_FEATURE_XMM)
> #else
> #define wmb() __asm__ __volatile__ ("": : :"memory")
> #endif
>
> CONFIG_X86_OOSTORE is dependent on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6
> --- so on Intel and AMD, it is really just barrier().
>
> So drivers can't assume that wmb() works on write-combining memory.
Drivers should be able to assume that wmb() orders _everything_ (except
some whacky Altix thing, which I really want to fold under wmb at some
point anyway).
So I decided that the old x86 semantics aren't right, and now it really
is a lock op / sfence everywhere.
On Wed, Oct 17, 2007 at 01:51:17PM +0800, Herbert Xu wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > Also, for non-wb memory. I don't think the Intel document referenced
> > says anything about this, but the AMD document says that loads can pass
> > loads (page 8, rule b).
> >
> > This is why our rmb() is still an lfence.
>
> BTW, Xen (in particular, the code in drivers/xen) uses mb/rmb/wmb
> instead of smp_mb/smp_rmb/smp_wmb when it accesses memory that's
> shared with other Xen domains or the hypervisor.
>
> The reason this is necessary is because even if a Xen domain is
> UP the hypervisor might be SMP.
>
> It would be nice if we can have these adopt the new SMP barriers
> on x86 instead of the IO ones as they currently do.
That's a good point actually. Something like raw_smp_*mb, which
always orders memory, but only for regular WB operations. I could
put that on the todo list...
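
A rough sketch of what such barriers might look like, under the assumption stated in this thread that x86 keeps loads and stores to write-back memory in order. The names sketch_raw_smp_rmb and sketch_raw_smp_wmb are hypothetical, as is the tiny ring used to exercise them; unlike smp_*mb() they would not compile away in a UP build, and unlike rmb()/wmb() they would not pay for lfence/sfence. The ring has no overflow handling and is only meant to show where the barriers sit.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8u

#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* WB memory only: the hardware already orders load/load and store/store,
 * so restraining the compiler is enough. */
#define sketch_raw_smp_rmb() sketch_barrier()
#define sketch_raw_smp_wmb() sketch_barrier()
#else
/* Fallback so the sketch builds elsewhere. */
#define sketch_raw_smp_rmb() __sync_synchronize()
#define sketch_raw_smp_wmb() __sync_synchronize()
#endif

/* Memory shared with another domain: a producer index and some slots. */
struct ring {
    volatile unsigned prod;
    int slot[RING_SLOTS];
};

/* Producer side: fill the slot, then publish the new index. */
void ring_put(struct ring *r, int v)
{
    assert(r != NULL);
    r->slot[r->prod % RING_SLOTS] = v;
    sketch_raw_smp_wmb();   /* slot contents before the index update */
    r->prod = r->prod + 1;
}

/* Consumer side: returns 1 and stores a value in *out if one is available. */
int ring_get(struct ring *r, unsigned *cons, int *out)
{
    assert(r != NULL && cons != NULL && out != NULL);
    if (*cons == r->prod)
        return 0;
    sketch_raw_smp_rmb();   /* index load before the slot load */
    *out = r->slot[*cons % RING_SLOTS];
    *cons = *cons + 1;
    return 1;
}
```

A full barrier in this family (a raw_smp_mb equivalent) would still need a real instruction on x86, since stores followed by loads are the one case the hardware does reorder.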
> > > > You already must not place any data structures into WC memory --- for
> > > > example, spinlocks wouldn't work there.
> > >
> > > What do you mean "already"?
> >
> > I mean "in current kernel" (I checked it in 2.6.22)
>
> Ahh, that's not "current kernel", though ;)
>
> 4071c718555d955a35e9651f77086096ad87d498
>
> > So drivers can't assume that wmb() works on write-combining memory.
>
> Drivers should be able to assume that wmb() orders _everything_ (except
> some whacky Altix thing, which I really want to fold under wmb at some
> point anyway).
>
> So I decided that old x86 semantics isn't right, and now it really is a
> lock op / sfence everywhere.
I see. I'm just curious --- is there any real usage for WC memory, except
graphics card memory?
Mikulas