2007-10-04 05:22:07

by Nick Piggin

[permalink] [raw]
Subject: [rfc][patch 1/3] x86_64: fence nontemporal stores

Hi,

Here are a couple of patches to improve the memory barrier situation on x86.
They probably aren't going upstream until after the x86 merge; however, I'm
posting them here for RFC, and in case anybody wants to backport them into
stable trees.

---
movnt* instructions are not strongly ordered with respect to other stores,
so if we are to assume stores are strongly ordered in the rest of the x86_64
kernel, we must fence these off (see similar examples in i386 kernel).

[ The AMD memory ordering document seems to say that nontemporal stores can
also pass earlier regular stores, so maybe we need sfences _before_ movnt*
everywhere too? ]

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/arch/x86_64/lib/copy_user_nocache.S
===================================================================
--- linux-2.6.orig/arch/x86_64/lib/copy_user_nocache.S
+++ linux-2.6/arch/x86_64/lib/copy_user_nocache.S
@@ -117,6 +117,7 @@ ENTRY(__copy_user_nocache)
popq %rbx
CFI_ADJUST_CFA_OFFSET -8
CFI_RESTORE rbx
+ sfence
ret
CFI_RESTORE_STATE


2007-10-04 05:23:17

by Nick Piggin

[permalink] [raw]
Subject: [rfc][patch 2/3] x86: fix IO write barriers


wmb() on x86 must always include a hardware barrier instruction, because
stores can go out of order in many cases when dealing with devices (eg. WC
memory).
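
[ Illustration only, not from the patch: a hedged sketch of the kind of
driver pattern that breaks if wmb() is just a compiler barrier. The
function and names are made up, and wc_buf is assumed to point at a
write-combining mapping set up elsewhere. ]

#include <linux/io.h>

/* Fill a write-combining buffer, then ring an MMIO doorbell.  The WC
 * stores can leave the CPU out of order or sit in write-combining
 * buffers, so a real sfence is needed before the doorbell store. */
static void hypothetical_ring_doorbell(u32 *wc_buf, void __iomem *doorbell,
				       const u32 *data, int n)
{
	int i;

	for (i = 0; i < n; i++)
		wc_buf[i] = data[i];	/* stores to WC memory: weakly ordered */

	wmb();				/* must flush/order the WC writes */
	writel(1, doorbell);		/* only now tell the device to fetch */
}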

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -216,6 +216,7 @@ static inline unsigned long get_limit(un

#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)

/**
* read_barrier_depends - Flush all pending reads that subsequents reads
@@ -271,18 +272,14 @@ static inline unsigned long get_limit(un

#define read_barrier_depends() do { } while(0)

-#ifdef CONFIG_X86_OOSTORE
-/* Actually there are no OOO store capable CPUs for now that do SSE,
- but make it already an possibility. */
-#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
-#else
-#define wmb() __asm__ __volatile__ ("": : :"memory")
-#endif
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif
#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -159,12 +159,8 @@ static inline void write_cr8(unsigned lo
*/
#define mb() asm volatile("mfence":::"memory")
#define rmb() asm volatile("lfence":::"memory")
-
-#ifdef CONFIG_UNORDERED_IO
#define wmb() asm volatile("sfence" ::: "memory")
-#else
-#define wmb() asm volatile("" ::: "memory")
-#endif
+
#define read_barrier_depends() do {} while(0)
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)

2007-10-04 05:23:58

by Nick Piggin

[permalink] [raw]
Subject: [rfc][patch 3/3] x86: optimise barriers


According to latest memory ordering specification documents from Intel and
AMD, both manufacturers are committed to in-order loads from cacheable memory
for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(eg. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Special string stores are safe -- their
constituent stores may be out of order, but they must complete in order WRT
surrounding stores. Nontemporal stores to WB memory can go out of order, and so
they should be fenced explicitly to make them appear in-order WRT other stores.
Hence, smp_wmb() may be a simple barrier.
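
[ A hedged sketch of the pattern these two barriers protect; the names
(data, ready, publish, consume) are made up. With cacheable loads and
stores ordered as described above, the compiler barrier is all x86 needs
here; weakly ordered architectures expand the same macros to real fences. ]

static int data;
static int ready;

static void publish(int value)		/* producer */
{
	data = value;
	smp_wmb();	/* store to data visible before store to ready */
	ready = 1;
}

static int consume(void)		/* consumer */
{
	while (!ready)
		cpu_relax();
	smp_rmb();	/* load of ready ordered before load of data */
	return data;
}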

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

In userspace microbenchmarks on a core2 system, fence instructions range
anywhere from around 15 cycles to 50, which may not be totally insignificant
in performance critical paths (code size will go down too).
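
[ Roughly how such a number can be measured from userspace -- a hedged
sketch, not the actual benchmark used here: time a tight loop of fences
with rdtsc and divide. The result is approximate, since rdtsc itself is
not serialising. ]

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 1000000 };
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < N; i++)
		__asm__ __volatile__("mfence" ::: "memory");
	t1 = rdtsc();

	printf("~%llu cycles per mfence\n",
	       (unsigned long long)((t1 - t0) / N));
	return 0;
}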

However, the primary motivation for this is to have the canonical barrier
implementation for the x86 architecture.

smp_rmb on buggy pentium pros remains a locked op, which is apparently
required.

Signed-off-by: Nick Piggin <[email protected]>

---
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -274,7 +274,11 @@ static inline unsigned long get_limit(un

#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() rmb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb() rmb()
+#else
+# define smp_rmb() barrier()
+#endif
#ifdef CONFIG_X86_OOSTORE
# define smp_wmb() wmb()
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -141,8 +141,8 @@ static inline void write_cr8(unsigned lo

#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
#define smp_read_barrier_depends() do {} while(0)
#else
#define smp_mb() barrier()

2007-10-04 17:32:19

by Dave Jones

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thu, Oct 04, 2007 at 07:22:58AM +0200, Nick Piggin wrote:

> -#ifdef CONFIG_X86_OOSTORE
> -/* Actually there are no OOO store capable CPUs for now that do SSE,
> - but make it already an possibility. */
> -#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
> -#else
> -#define wmb() __asm__ __volatile__ ("": : :"memory")
> -#endif
> -
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> -#define smp_wmb() wmb()
> +#ifdef CONFIG_X86_OOSTORE
> +# define smp_wmb() wmb()
> +#else
> +# define smp_wmb() barrier()
> +#endif

The only vendor that ever implemented OOSTOREs was Centaur, and they
only did in the Winchip generation of the CPUs. When they dropped it
from the C3, I asked whether they intended to bring it back, and the
answer was "extremely unlikely".

So we can probably just drop that "just in case" clause above, and just
do..

#define smp_wmb() barrier()


Dave

--
http://www.codemonkey.org.uk

2007-10-04 17:55:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers


> The only vendor that ever implemented OOSTOREs was Centaur, and they
> only did in the Winchip generation of the CPUs. When they dropped it
> from the C3, I asked whether they intended to bring it back, and the
> answer was "extremely unlikely".
>

Do you know if it made a big performance difference?

But yes we should probably just remove this special case to make
maintenance easier.

-Andi

2007-10-04 18:10:58

by Dave Jones

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
>
> > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > only did in the Winchip generation of the CPUs. When they dropped it
> > from the C3, I asked whether they intended to bring it back, and the
> > answer was "extremely unlikely".
> >
>
> Do you know if it made a big performance difference?

On the winchip, it was a huge win. I can't remember exact numbers,
but pretty much every benchmark I threw at it at the time showed
significant improvement.

> But yes we should probably just remove this special case to make
> maintenance easier.

It's CONFIG_SMP anyway, which none of the winchips were.
SMP+OOSTORE just didn't happen, and I'd be surprised if
any vendor makes it happen any time soon.
(Even if so, it's likely we'd need to make additional changes
anyway, so adding it back shouldn't be a big deal.)

Dave

--
http://www.codemonkey.org.uk

2007-10-04 18:22:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> >
> > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > only did in the Winchip generation of the CPUs. When they dropped it
> > > from the C3, I asked whether they intended to bring it back, and the
> > > answer was "extremely unlikely".
> > >
> >
> > Do you know if it made a big performance difference?
>
> On the winchip, it was a huge win. I can't remember exact numbers,
> but pretty much every benchmark I threw at it at the time showed
> significant improvement.

Significant as in >10%?


> > But yes we should probably just remove this special case to make
> > maintenance easier.
>
> It's CONFIG_SMP anyway, which none of the winchips were.

It's not. And we need memory barriers even without SMP
when talking to device drivers. Only the smp_*b()s get noped
on UP.

-Andi

2007-10-04 18:41:23

by Dave Jones

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > >
> > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > from the C3, I asked whether they intended to bring it back, and the
> > > > answer was "extremely unlikely".
> > > >
> > >
> > > Do you know if it made a big performance difference?
> >
> > On the winchip, it was a huge win. I can't remember exact numbers,
> > but pretty much every benchmark I threw at it at the time showed
> > significant improvement.
>
> Significant as in >10%?

"Worth about 10-20% performance" according to the 2.4.18pre9-ac4
release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN

> > > But yes we should probably just remove this special case to make
> > > maintenance easier.
> > It's CONFIG_SMP anyway, which none of the winchips were.
>
> It's not.

You're right, it isn't now, but Nick's patch seems to change it so that it is.

...

#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif

> And we need memory barriers even without SMP
> when talking to device drivers. Only the smp_*b()s get noped
> on UP.

Good point.

Dave

--
http://www.codemonkey.org.uk

2007-10-04 18:58:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thursday 04 October 2007 20:41:07 Dave Jones wrote:
> On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > > >
> > > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > > from the C3, I asked whether they intended to bring it back, and the
> > > > > answer was "extremely unlikely".
> > > > >
> > > >
> > > > Do you know if it made a big performance difference?
> > >
> > > On the winchip, it was a huge win. I can't remember exact numbers,
> > > but pretty much every benchmark I threw at it at the time showed
> > > significant improvement.
> >
> > Significant as in >10%?
>
> "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
> release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN

Are there numbers for a newer kernel available too?

> > > > But yes we should probably just remove this special case to make
> > > > maintenance easier.
> > > It's CONFIG_SMP anyway, which none of the winchips were.
> >
> > It's not.
>
> You're right, it isn't now, but Nick's patch seems to change it so that it is.
>
> ...
>
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> -#define smp_wmb() wmb()
> +#ifdef CONFIG_X86_OOSTORE
> +# define smp_wmb() wmb()
> +#else
> +# define smp_wmb() barrier()
> +#endif

That is only for smp_wmb(), which is always SMP-only.

-Andi

2007-10-04 19:08:50

by Dave Jones

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

On Thu, Oct 04, 2007 at 08:58:27PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:41:07 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> > > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > > > >
> > > > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > > > from the C3, I asked whether they intended to bring it back, and the
> > > > > > answer was "extremely unlikely".
> > > > > >
> > > > >
> > > > > Do you know if it made a big performance difference?
> > > >
> > > > On the winchip, it was a huge win. I can't remember exact numbers,
> > > > but pretty much every benchmark I threw at it at the time showed
> > > > significant improvement.
> > >
> > > Significant as in >10%?
> >
> > "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
> > release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN
>
> Are there numbers for a newer kernel available too?

no idea, my winchips died about 5 years ago.

Dave

--
http://www.codemonkey.org.uk

2007-10-04 20:49:00

by Alan

[permalink] [raw]
Subject: Re: [rfc][patch 2/3] x86: fix IO write barriers

> > > "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
> > > release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN
> >
> > Are there numbers for a newer kernel available too?
>
> no idea, my winchips died about 5 years ago

Got a couple here just need a mainboard 8)

2007-10-12 08:22:52

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On 04-10-2007 07:23, Nick Piggin wrote:
> According to latest memory ordering specification documents from Intel and
> AMD, both manufacturers are committed to in-order loads from cacheable memory
> for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
...

Great news!

First it looks like a really great thing that it's revealed at last.
But then... there is probably some confusion: did we have to use
ineffective code for so long?

First again, we could try to blame Intel etc. But then, wait a minute:
is it such a mystery knowledge? If this reordering is done there are
some easy rules broken (just like in examples from these manuals). And
if somebody cared to do this for optimization, then this is probably
noticeable optimization, let's say 5 or 10%. Then any test shouldn't
need to take very long to tell the truth in less than 100 loops!

So, maybe linux needs something like this, instead of waiting a few
years with each new model for vendors' goodwill? IMHO, even for less
popular processors, this could be checked under some debugging option
at the system start (after disabling the suspicious barrier for a while
plus some WARN_ONs).

Thanks,
Jarek P.

2007-10-12 08:44:23

by Helge Hafting

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

Jarek Poplawski wrote:
> On 04-10-2007 07:23, Nick Piggin wrote:
>
>> According to latest memory ordering specification documents from Intel and
>> AMD, both manufacturers are committed to in-order loads from cacheable memory
>> for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
>>
> ...
>
> Great news!
>
> First it looks like a really great thing that it's revealed at last.
> But then... there is probably some confusion: did we have to use
> ineffective code for so long?
>
You could have tried the optimization before, and
gotten better performance. But without solid knowledge that
the optimization is _valid_, you risk having a kernel
that performs great but suffers the occasional glitch and
therefore is unstable and crashes the machine "now and then".
This sort of thing can't really be figured out by experimentation, because
the bad cases might happen only with some processors, some
combinations of memory/chipsets, or with some minimum
number of processors. Such problems can be very hard
to find, especially considering that other plain bugs also
cause crashes.

Therefore, the "ineffective code" was used because it was
the only safe alternative. Now we know, so now we may optimize.


Helge Hafting

2007-10-12 08:58:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 10:25:34AM +0200, Jarek Poplawski wrote:
> On 04-10-2007 07:23, Nick Piggin wrote:
> > According to latest memory ordering specification documents from Intel and
> > AMD, both manufacturers are committed to in-order loads from cacheable memory
> > for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> ...
>
> Great news!
>
> First it looks like a really great thing that it's revealed at last.
> But then... there is probably some confusion: did we have to use
> ineffective code for so long?

I'm not sure exactly what the situation is with the manufacturers,
but maybe they (at least Intel) wanted to keep their options open
WRT their barrier semantics, even if current implementations were
not taking full liberty of them.


> First again, we could try to blame Intel etc. But then, wait a minute:
> is it such a mystery knowledge? If this reordering is done there are
> some easy rules broken (just like in examples from these manuals). And
> if somebody cared to do this for optimization, then this is probably
> noticeable optimization, let's say 5 or 10%. Then any test shouldn't
> need to take very long to tell the truth in less than 100 loops!

I don't know quite what you're saying... the CPUs could probably get
performance by having weakly ordered loads, OTOH I think the Intel
ones might already do this speculatively so they appear in order but
essentially have the performance of weak order.

If you're just talking about this patch, then it probably isn't much
performance gain. I'm guessing you'd be lucky to measure it from
userspace.


> So, maybe linux needs something like this, instead of waiting a few
> years with each new model for vendors' goodwill? IMHO, even for less
> popular processors, this could be checked under some debugging option
> at the system start (after disabling the suspicious barrier for a while
> plus some WARN_ONs).

I don't know if that would be worthwhile. It actually isn't always
trivial to trigger reordering. For example, on my dual-core core2,
in order to see reads pass writes, I have to do work on a set that
exceeds the cache size and does a huge amount of work to ensure it
is going to trigger that. If you can actually come up with a test
case that triggers load/load or store/store reordering, I'm sure
Intel / AMD would like to see it ;)
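
[ A hedged sketch of what such a userspace litmus test looks like (names
made up; compile with -pthread). This particular one is the classic
"store buffer" test: both r0 and r1 ending up zero in one run shows a
load passing an earlier *store*, which x86 explicitly allows. A test for
load/load or store/store reordering would have the same shape; making
one actually fire is the hard part. ]

#include <pthread.h>
#include <stdio.h>

static volatile int x, y, r0, r1;
static pthread_barrier_t bar;

static void *thread0(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&bar);
	x = 1;		/* store... */
	r0 = y;		/* ...then load the other thread's flag */
	return NULL;
}

static void *thread1(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&bar);
	y = 1;
	r1 = x;
	return NULL;
}

int main(void)
{
	long i, hits = 0;

	for (i = 0; i < 1000000; i++) {
		pthread_t a, b;

		x = y = r0 = r1 = 0;
		pthread_barrier_init(&bar, NULL, 2);
		pthread_create(&a, NULL, thread0, NULL);
		pthread_create(&b, NULL, thread1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		pthread_barrier_destroy(&bar);

		if (r0 == 0 && r1 == 0)
			hits++;		/* a load passed an earlier store */
	}
	printf("both-zero outcomes: %ld of 1000000\n", hits);
	return 0;
}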

All existing processors as far as we know are in-order WRT loads vs
loads and stores vs stores. It was just a matter of getting the docs
clarified, which gives us more confidence that we're correct and a
reasonable guarantee of forward compatibility.

So, I think the plan is just to merge these 3 patches during the
current window.

2007-10-12 09:21:10

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 10:42:34AM +0200, Helge Hafting wrote:
> Jarek Poplawski wrote:
> >On 04-10-2007 07:23, Nick Piggin wrote:
> >
> >>According to latest memory ordering specification documents from Intel and
> >>AMD, both manufacturers are committed to in-order loads from cacheable
> >>memory
> >>for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> >>
> >...
> >
> >Great news!
> >
> >First it looks like a really great thing that it's revealed at last.
> >But then... there is probably some confusion: did we have to use
> >ineffective code for so long?
> >
> You could have tried the optimization before, and
> gotten better performance. But without solid knowledge that
> the optimization is _valid_, you risk having a kernel
> that performs great but suffers the occasional glitch and
> therefore is unstable and crashes the machine "now and then".
> This sort of thing can't really be figured out by experimentation, because
> the bad cases might happen only with some processors, some
> combinations of memory/chipsets, or with some minimum
> number of processors. Such problems can be very hard
> to find, especially considering that other plain bugs also
> cause crashes.
>
> Therefore, the "ineffective code" was used because it was
> the only safe alternative. Now we know, so now we may optimize.

Sorry, I don't understand this logic at all. Since bad cases
happen independently from any specifications and Intel doesn't
take any legal responsibility for such information, it seems we
should better still not optimize?

Jarek P.

2007-10-12 09:44:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 11:12:13AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 10:42:34AM +0200, Helge Hafting wrote:
> > Jarek Poplawski wrote:
> > >On 04-10-2007 07:23, Nick Piggin wrote:
> > >
> > >>According to latest memory ordering specification documents from Intel and
> > >>AMD, both manufacturers are committed to in-order loads from cacheable
> > >>memory
> > >>for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> > >>
> > >...
> > >
> > >Great news!
> > >
> > >First it looks like a really great thing that it's revealed at last.
> > >But then... there is probably some confusion: did we have to use
> > >ineffective code for so long?
> > >
> > You could have tried the optimization before, and
> > gotten better performance. But without solid knowledge that
> > the optimization is _valid_, you risk having a kernel
> > that performs great but suffers the occasional glitch and
> > therefore is unstable and crashes the machine "now and then".
> > This sort of thing can't really be figured out by experimentation, because
> > the bad cases might happen only with some processors, some
> > combinations of memory/chipsets, or with some minimum
> > number of processors. Such problems can be very hard
> > to find, especially considering that other plain bugs also
> > cause crashes.
> >
> > Therefore, the "ineffective code" was used because it was
> > the only safe alternative. Now we know, so now we may optimize.
>
> Sorry, I don't understand this logic at all. Since bad cases
> happen independently from any specifications and Intel doesn't
> take any legal responsibility for such information, it seems we
> should better still not optimize?

We already do so in cases that are probably more critical and more
liable to be problematic (notably, spin_unlock).
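
[ For reference, a hedged sketch of what is meant by that: the i386 unlock
fast path of this era is essentially a compiler barrier plus a plain store
(simplified here; the real code is in asm-i386/spinlock.h). It is only
correct because the releasing store cannot pass the earlier loads and
stores of the critical section. ]

static inline void example_spin_unlock(volatile int *lock)
{
	barrier();	/* keep the critical section before the release */
	*lock = 1;	/* plain store releases the lock, no fence needed */
}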

So unless there is reasonable information for us to believe this
will be a problem, IMO the best thing to do is stick with the
specs. Intel is pretty reasonable with documenting errata I think.

With memory barriers specifically, I'm sure we have many more bugs
in the kernel than AMD or Intel have in their chips ;)

2007-10-12 09:52:22

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 10:57:33AM +0200, Nick Piggin wrote:
> On Fri, Oct 12, 2007 at 10:25:34AM +0200, Jarek Poplawski wrote:
> > On 04-10-2007 07:23, Nick Piggin wrote:
> > > According to latest memory ordering specification documents from Intel and
> > > AMD, both manufacturers are committed to in-order loads from cacheable memory
> > > for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
> > ...
> >
> > Great news!
> >
> > First it looks like a really great thing that it's revealed at last.
> > But then... there is probably some confusion: did we have to use
> > ineffective code for so long?
>
> I'm not sure exactly what the situation is with the manufacturers,
> but maybe they (at least Intel) wanted to keep their options open
> WRT their barrier semantics, even if current implementations were
> not taking full liberty of them.
>
>
> > First again, we could try to blame Intel etc. But then, wait a minute:
> > is it such a mystery knowledge? If this reordering is done there are
> > some easy rules broken (just like in examples from these manuals). And
> > if somebody cared to do this for optimization, then this is probably
> > noticeable optimization, let's say 5 or 10%. Then any test shouldn't
> > need to take very long to tell the truth in less than 100 loops!
>
> I don't know quite what you're saying... the CPUs could probably get
> performance by having weakly ordered loads, OTOH I think the Intel
> ones might already do this speculatively so they appear in order but
> essentially have the performance of weak order.

I meant: if there is any reordering possible this should be quite
distinctly visible, because why would any vendor enable such nasty
things if not for performance. But now I start to doubt: of course
there is such a possibility someone makes this reordering for some
other reasons which could be so rare it's hard to check. And this
someone knows its processors are seen as less efficient because of eg.
mostly unneeded read barriers used by operating systems...

>
> If you're just talking about this patch, then it probably isn't much
> performance gain. I'm guessing you'd be lucky to measure it from
> userspace.

No, it's only about the comment to this patch: "Hence, smp_rmb() may be
a simple barrier".

>
>
> > So, maybe linux needs something like this, instead of waiting a few
> > years with each new model for vendors' goodwill? IMHO, even for less
> > popular processors, this could be checked under some debugging option
> > at the system start (after disabling the suspicious barrier for a while
> > plus some WARN_ONs).
>
> I don't know if that would be worthwhile. It actually isn't always
> trivial to trigger reordering. For example, on my dual-core core2,
> in order to see reads pass writes, I have to do work on a set that
> exceeds the cache size and does a huge amount of work to ensure it
> is going to trigger that. If you can actually come up with a test
> case that triggers load/load or store/store reordering, I'm sure
> Intel / AMD would like to see it ;)

Anyway, it seems any heavy testing such as yours should give us the
same information years earlier than any vendor's manual, and then any
gain is multiplied by millions of users. Then only still doubtful
cases could be treated with additional caution and some debugging
code.

>
> All existing processors as far as we know are in-order WRT loads vs
> loads and stores vs stores. It was just a matter of getting the docs
> clarified, which gives us more confidence that we're correct and a
> > reasonable guarantee of forward compatibility.

After reading this Intel's legal information I don't think you should
feel so much more forward confident...

>
> So, I think the plan is just to merge these 3 patches during the
> current window.
>

And they really should be!

Jarek P.

2007-10-12 10:01:35

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 11:44:27AM +0200, Nick Piggin wrote:
...
> So unless there is reasonable information for us to believe this
> will be a problem, IMO the best thing to do is stick with the
> specs. Intel is pretty reasonable with documenting errata I think.

100% right - if there are any specs. But it seems for a few years
this spec was missing, or there has been some change of mind, I presume?

Jarek P.

2007-10-12 10:42:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 11:55:05AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 10:57:33AM +0200, Nick Piggin wrote:
> >
> > I don't know quite what you're saying... the CPUs could probably get
> > performance by having weakly ordered loads, OTOH I think the Intel
> > ones might already do this speculatively so they appear in order but
> > essentially have the performance of weak order.
>
> I meant: if there is any reordering possible this should be quite
> distinctly visible.

It's not. Not in the cases where it is explicitly allowed and actively
exploited (loads passing stores), but most definitely not distinctly
visible in errata cases that have slipped through all the V&V.


> because why would any vendor enable such nasty
> things if not for performance. But now I start to doubt: of course
> there is such a possibility someone makes this reordering for some
> other reasons which could be so rare it's hard to check.

Yes: it isn't the explicitly allowed reorderings that we care
about here (because obviously we're retaining the barriers for those).
It would be cases of bugs in the CPUs meaning they don't follow the
standard. But how far do you take your mistrust of a CPU? You could
ask gcc to insert locked ops between every load and store operation?
Or keep it switched off to ensure no bugs ;)


> Anyway, it seems any heavy testing such as yours should give us the
> same information years earlier than any vendor's manual, and then any
> gain is multiplied by millions of users. Then only still doubtful
> cases could be treated with additional caution and some debugging
> code.

Firstly, while it can be possible to write code that shows up reordering,
it is really hard (ie. impossible) to guarantee no reordering happens. For
example, it may have only shown up on SMT+SMP P4 CPUs with some obscure
interactions between threads and cores involving more than 2 threads.

Secondly, even if we were sure that no current implementations reordered
loads, we don't want to go outside the bounds of the specification
because we might break on some future CPUs. This isn't a big performance
win.


> > All existing processors as far as we know are in-order WRT loads vs
> > loads and stores vs stores. It was just a matter of getting the docs
> > clarified, which gives us more confidence that we're correct and a
> > reasonable guarnatee of forward compatibility.
>
> After reading this Intel's legal information I don't think you should
> feel so much more forward confident...

Yes, but that's the same way I feel after reading *any* legal "information" ;)

2007-10-12 11:52:30

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 12:42:38PM +0200, Nick Piggin wrote:
> On Fri, Oct 12, 2007 at 11:55:05AM +0200, Jarek Poplawski wrote:
> > On Fri, Oct 12, 2007 at 10:57:33AM +0200, Nick Piggin wrote:
> > >
> > > I don't know quite what you're saying... the CPUs could probably get
> > > performance by having weakly ordered loads, OTOH I think the Intel
> > > ones might already do this speculatively so they appear in order but
> > > essentially have the performance of weak order.
> >
> > I meant: if there is any reordering possible this should be quite
> > distinctly visible.
>
> It's not. Not in the cases where it is explicitly allowed and actively
> exploited (loads passing stores), but most definitely not distinctly
> visible in errata cases that have slipped through all the V&V.
>
>
> > because why would any vendor enable such nasty
> > things if not for performance. But now I start to doubt: of course
> > there is such a possibility someone makes this reordering for some
> > other reasons which could be so rare it's hard to check.
>
> Yes: it isn't the explicitly allowed reorderings that we care
> about here (because obviously we're retaining the barriers for those).
> It would be cases of bugs in the CPUs meaning they don't follow the
> standard. But how far do you take your mistrust of a CPU? You could
> ask gcc to insert locked ops between every load and store operation?
> Or keep it switched off to ensure no bugs ;)

I'm not sure of your point, but it seems we don't differ here, and
after all there is quirks code for such things.

>
>
> > Anyway, it seems any heavy testing such as yours should give us the
> > same information years earlier than any vendor's manual, and then any
> > gain is multiplied by millions of users. Then only still doubtful
> > cases could be treated with additional caution and some debugging
> > code.
>
> Firstly, while it can be possible to write code that shows up reordering,
> it is really hard (ie. impossible) to guarantee no reordering happens. For
> example, it may have only shown up on SMT+SMP P4 CPUs with some obscure
> interactions between threads and cores involving more than 2 threads.

I'm not sure how consistent all of the above is wrt. this earlier
opinion:

> [...] If you can actually come up with a test
> case that triggers load/load or store/store reordering, I'm sure
> Intel / AMD would like to see it ;)

It seems, after testing only (plus no official spec against this idea),
you could be almost sure there is no such test possible. And, if it
were done a few years ago, you think it still should be not enough to
make a decision on changing this smp_rmb because of lack of official
specs? Besides, there is probably so much features guessing in arch
and drivers sections, this reorder testing should look as solid as a
math proof wrt. them.

>
> Secondly, even if we were sure that no current implementations reordered
> loads, we don't want to go outside the bounds of the specification
> because we might break on some future CPUs. This isn't a big performance
> win.

I don't agree with this - IMO we should care only about currently used
CPUs, and test the new ones each time.


> > > All existing processors as far as we know are in-order WRT loads vs
> > > loads and stores vs stores. It was just a matter of getting the docs
> > > clarified, which gives us more confidence that we're correct and a
> > > reasonable guarnatee of forward compatibility.
> >
> > After reading this Intel's legal information I don't think you should
> > feel so much more forward confident...
>
> Yes, but that's the same way I feel after reading *any* legal "information" ;)
>

Strange... I feel exactly the opposite. Are you sure you've chosen the
right job (...and the right system)?

Jarek P.

2007-10-12 12:08:02

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 01:55:10PM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 12:42:38PM +0200, Nick Piggin wrote:
...
> > [...] If you can actually come up with a test
> > case that triggers load/load or store/store reordering, I'm sure
> > Intel / AMD would like to see it ;)
>
> It seems, after testing only (plus no official spec against this idea),

(...plus of course proper smp_rmb & smp_wmb vs. smp_mb interpretation
probably available from Paul McKenney or Davide Libenzi before this
Intel spec, as well...)

Jarek P.

2007-10-12 12:46:49

by Helge Hafting

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 10:42:34AM +0200, Helge Hafting wrote:
>
>> Jarek Poplawski wrote:
>>
>>> On 04-10-2007 07:23, Nick Piggin wrote:
>>>
>>>
>>>> According to latest memory ordering specification documents from Intel and
>>>> AMD, both manufacturers are committed to in-order loads from cacheable
>>>> memory
>>>> for the x86 architecture. Hence, smp_rmb() may be a simple barrier.
>>>>
>>>>
>>> ...
>>>
>>> Great news!
>>>
>>> First it looks like a really great thing that it's revealed at last.
>>> But then... there is probably some confusion: did we have to use
>>> ineffective code for so long?
>>>
>>>
>> You could have tried the optimization before, and
>> gotten better performance. But without solid knowledge that
>> the optimization is _valid_, you risk having a kernel
>> that performs great but suffers the occasional glitch and
>> therefore is unstable and crashes the machine "now and then".
>> This sort of thing can't really be figured out by experimentation, because
>> the bad cases might happen only with some processors, some
>> combinations of memory/chipsets, or with some minimum
>> number of processors. Such problems can be very hard
>> to find, especially considering that other plain bugs also
>> cause crashes.
>>
>> Therefore, the "ineffective code" was used because it was
>> the only safe alternative. Now we know, so now we may optimize.
>>
>
> Sorry, I don't understand this logic at all. Since bad cases
> happen independently from any specifications and Intel doesn't
> take any legal responsibility for such information, it seems we
> should better still not optimize?
>
The point is that we _trust_ intel when they say "this will work".
Therefore, we can use the optimizations. It was never about
legal matters. If we didn't trust intel, then we couldn't
use their processors at all.

We couldn't take the chance before. It was not documented
to work, verification by testing would not be trivial at all for
this case.
Linux is about "stability first, then performance".
Now we _know_ that we can have this optimization without
compromising stability. Nobody knew before!

Helge Hafting

2007-10-12 13:26:46

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote:
...
> The point is that we _trust_ intel when they say "this will work".
> Therefore, we can use the optimizations. It was never about
> legal matters. If we didn't trust intel, then we couldn't
> use their processors at all.

But there was nothing about trust. Usually you don't trust somebody
but somebody's opinions. The problem is there was no valid opinion,
or this opinion has been changed now (no reason to not trust yet...).

> We couldn't take the chance before. It was not documented
> to work, verification by testing would not be trivial at all for
> this case.
> Linux is about "stability first, then performance".
> Now we _know_ that we can have this optimization without
> compromising stability. Nobody knew before!

So, you think this would be the first or the least credibly
verified undocumented feature used in linux? Then, it seems
I can try to install this linux on my laptop at last! (...
And, I can trust you, it will not break anything...?)

Thanks,
Jarek P.

2007-10-12 15:14:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers



On Fri, 12 Oct 2007, Jarek Poplawski wrote:
>
> First it looks like a really great thing that it's revealed at last.
> But then... there is probably some confusion: did we have to use
> ineffective code for so long?

I think the chip manufacturers really wanted to keep their options open.

Having the option to re-order loads in architecturally visible ways was
something that they probably felt they really wanted to have. On the other
hand:

- I bet they had noticed that things break, and some applications depend
on fairly strong ordering (not necessarily in Linux-land, but..)

I suspect hw manufacturers go through life hoping that "software
improves". They probably thought that getting rid of the old 16-bit
windows would mean that less people depended on undefined behaviour.

And I suspect that they started noticing that no, with threads and
JVM's and things, *more* people started depending on fairly strong
memory ordering.

- I suspect Intel in particular noticed that they can do a lot of very
aggressive re-ordering at a microarchitectural level, but can still
guarantee that *architecturally* they never show it (dynamic detection
of reordered loads being replayed on cache dirty events etc).

IOW, I suspect that both Intel and AMD noticed that while they had wanted
to keep their options open, those options weren't really realistic, and
not something that the market wanted (aggressive use of threading wants
*stricter* memory ordering, not looser), and they could work well enough
with a fairly strict memory model.

> So, maybe linux needs something like this, instead of waiting a few
> years with each new model for vendors' goodwill? IMHO, even for less
> popular processors, this could be checked under some debugging option
> at the system start (after disabling the suspicious barrier for a while
> plus some WARN_ONs).

Quite frankly, even *within* Intel and AMD, there are damn few people who
understand exactly what the memory ordering requirements and guarantees
are and historically were for the different CPU's.

I would bet that had you asked a random (but still competent) Intel/AMD
engineer that wasn't really intimately involved with the actual design of
the cache protocols and memory pipelines, they would absolutely not have
been able to tell you how the CPU actually worked.

So no, there's no way a software person could have afforded to say "it
seems to work on my setup even without the barrier". On a dual-socket
setup with a shared bus, that says absolutely *nothing* about the
behaviour of the exact same CPU when used with a multi-bus chipset. Not to
mention another revisions of the same CPU - much less a whole other
microarchitecture.

So yes, I've personally been aware for about a year that the memory
ordering was going to likely be documented, but no way was I going to
depend on it until Intel and AMD were ready to state so *publicly*.
Because before that happens, they may have noticed errata etc that made it
not work out.

Also, please note that we didn't even just change the barriers immediately
when the docs came out. I want to do it soon - still *early* in the 2.6.24
development cycle - exactly because bugs happen, and if somebody notices
something strange, we'll have more time to perhaps decide that "oops,
there's something bad going on, let's undo this for the real 2.6.24
release until we can figure out the exact pattern".

Linus

2007-10-15 07:41:20

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Fri, Oct 12, 2007 at 08:13:52AM -0700, Linus Torvalds wrote:
>
>
> On Fri, 12 Oct 2007, Jarek Poplawski wrote:
...
> So no, there's no way a software person could have afforded to say "it
> seems to work on my setup even without the barrier". On a dual-socket
> setup with a shared bus, that says absolutely *nothing* about the
> behaviour of the exact same CPU when used with a multi-bus chipset. Not to
> mention another revisions of the same CPU - much less a whole other
> microarchitecture.

Yes, I still can't believe this, but after some more reading I start
to admit such things can happen in computer "science" too... I've
mentioned a lost performance, but as a matter of fact I've been more
concerned with the problem of truth:

From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Volume 3A:

"7.2.2 Memory Ordering in P6 and More Recent Processor Families
...
1. Reads can be carried out speculatively and in any order.
..."

So, it looks to me like almost the 1-st Commandment. Some people (like
me) did believe this, others tried to check, and it was respected for
years notwithstanding nobody had ever seen such an event.

And then, a few years later, we have this:

From: Intel(R) 64 Architecture Memory Ordering White Paper

"2 Memory ordering for write-back (WB) memory
...
Intel 64 memory ordering obeys the following principles:
1. Loads are not reordered with other loads.
..."

I know, technically this doesn't have to be a contradiction (for not
WB), but to me it's something like: "OK, Elvis lives and this guy is
not real Paul McCartney too" in an official CIA statement!

...
> Also, please note that we didn't even just change the barriers immediately
> when the docs came out. I want to do it soon - still *early* in the 2.6.24
> development cycle - exactly because bugs happen, and if somebody notices
> something strange, we'll have more time to perhaps decide that "oops,
> there's something bad going on, let's undo this for the real 2.6.24
> release until we can figure out the exact pattern".

I'm still so "dazed and confused" that I can't tell this (or anything)
is right...

Thanks very much for so extensive and sound explanation,

Jarek P.

PS: Btw, I apologize to Helge for not trusting the "verification by
testing would not be trivial" words.

2007-10-15 08:09:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Mon, Oct 15, 2007 at 09:44:05AM +0200, Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 08:13:52AM -0700, Linus Torvalds wrote:
> >
> >
> > On Fri, 12 Oct 2007, Jarek Poplawski wrote:
> ...
> > So no, there's no way a software person could have afforded to say "it
> > seems to work on my setup even without the barrier". On a dual-socket
> > setup with a shared bus, that says absolutely *nothing* about the
> > behaviour of the exact same CPU when used with a multi-bus chipset. Not to
> > mention another revisions of the same CPU - much less a whole other
> > microarchitecture.
>
> Yes, I still can't believe this, but after some more reading I start
> to admit such things can happen in computer "science" too... I've
> mentioned a lost performance, but as a matter of fact I've been more
> concerned with the problem of truth:
>
> From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
> Volume 3A:
>
> "7.2.2 Memory Ordering in P6 and More Recent Processor Families
> ...
> 1. Reads can be carried out speculatively and in any order.
> ..."
>
> So, it looks to me like almost the 1-st Commandment. Some people (like
> me) did believe this, others tried to check, and it was respected for
> years notwithstanding nobody had ever seen such an event.

I'd say that's exactly what Intel wanted. It's pretty common (we do
it all the time in the kernel too) to create an API which places a
stronger requirement on the caller than is actually required. It can
make changes much less painful.

Has performance really been much problem for you? (even before the
lfence instruction, when you theoretically had to use a locked op)?
I mean, I'd struggle to find a place in the Linux kernel where there
is actually a measurable difference anywhere... and we're pretty
performance critical and I think we have a reasonable amount of lockless
code (I guess we may not have a lot of tight computational loops, though).
I'd be interested to know what, if any, application had found these
barriers to be problematic...


> And then, a few years later, we have this:
>
> From: Intel(R) 64 Architecture Memory Ordering White Paper
>
> "2 Memory ordering for write-back (WB) memory
> ...
> Intel 64 memory ordering obeys the following principles:
> 1. Loads are not reordered with other loads.
> ..."
>
> I know, technically this doesn't have to be a contradiction (for not
> WB), but to me it's something like: "OK, Elvis lives and this guy is
> not real Paul McCartney too" in an official CIA statement!

The thing is that those documents are not defining what a particular
implementation does, but how the architecture is defined (ie. what
must some arbitrary software/hardware provide and what may it expect).

It's pretty natural that Intel started out with a weaker guarantee
than their CPUs of the time actually supported, and tightened it up
after (presumably) deciding not to implement such relaxed semantics
for the foreseeable future.

2007-10-15 09:07:32

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:
...
> Has performance really been much problem for you? (even before the
> lfence instruction, when you theoretically had to use a locked op)?
> I mean, I'd struggle to find a place in the Linux kernel where there
> is actually a measurable difference anywhere... and we're pretty
> performance critical and I think we have a reasonable amount of lockless
> code (I guess we may not have a lot of tight computational loops, though).
> I'd be interested to know what, if any, application had found these
> barriers to be problematic...

I'm not performance-words at all, so I can't help you, sorry. But, I
understand people who care about this, and think there is a popular
conviction barriers and locked instructions are costly, so I'm
surprised there is any "problem" now with finding these gains...

> The thing is that those documents are not defining what a particular
> implementation does, but how the architecture is defined (ie. what
> must some arbitrary software/hardware provide and what may it expect).

I'm not sure this is the right way to tell it. If there is no
distinction between what is and what could be, how can I believe in
similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
like they describe some real hardware mechanisms, and sometimes they
mention possible changes and reserved features too. So, when
they don't mention it, you could think it's present behavior.

> It's pretty natural that Intel started out with a weaker guarantee
> than their CPUs of the time actually supported, and tightened it up
> after (presumably) deciding not to implement such relaxed semantics
> for the forseeable future.

As a matter of fact it's not natural for me at all. I expected the
other direction, and I still doubt programmers' intentions could be
"automatically" predicted good enough, so IMHO, it's not for long.
Of course, it doesn't seem to be any help for linux or bsd
programmers, which still have to think about different architectures.

Regards,
Jarek P.

2007-10-15 09:21:22

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Mon, Oct 15, 2007 at 11:09:59AM +0200, Jarek Poplawski wrote:
...
> I'm not performance-words at all, so I can't help you, sorry. But, I

...performance-wards?!

Looks like serious: I don't even now who I'm not now!

Jarek P.

2007-10-15 10:19:54

by Helge Hafting

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

Jarek Poplawski wrote:
> On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote:
> ...
>
>> The point is that we _trust_ intel when they say "this will work".
>> Therefore, we can use the optimizations. It was never about
>> legal matters. If we didn't trust intel, then we couldn't
>> use their processors at all.
>>
>
> But there was nothing about trust. Usually you don't trust somebody
> but somebody's opinions. The problem is there was no valid opinion,
> or this opinion has been changed now (no reason to not trust yet...).
>
"Trusting people or their opinions" is only about use of the
english language, and not that interesting to bring up here.
Surely you know that lots of people here have english as
a secondary language only. Interesting for me to know, but
probably not for everybody else.

>> We couldn't take the chance before. It was not documented
>> to work, verification by testing would not be trivial at all for
>> this case.
>> Linux is about "stability first, then performance".
>> Now we _know_ that we can have this optimization without
>> compromising stability. Nobody knew before!
>>
>
> So, you think this would be the first or the least credibly
> verified undocumented feature used in linux? Then, it seems
> I can try to install this linux on my laptop at last! (...
> And, I can trust you, it will not break anything...?)
>
I never claimed that linux will work on your laptop, so no:
You can't take my word for that, because I never gave it!
It is well known that some laptops don't work with linux,
I have no idea if yours will work, I don't even know what kind it is.

I told you the reasoning behind using _this particular optimization_,
the same does _not_ apply to everything else. If you think every
kernel decision is made the same way, then you are mistaken.
Things don't work that way.
First, several people are involved - they think differently.
Second, "what kind of tricks to use" is not an all-or-nothing
approach. If linux were to use every undocumented trick
that might or might not work, then linux would fail on
lots of hardware. It would not be useful.
If linux took the other approach and never used any "tricks",
then it'd be slow and boring.

Some things are much easier to test - you construct a testcase
or just build a test kernel and benchmark it. If all is ok, then
the "trick" is useable. Some cases are a clear win for lots of
machines, and the possible failure cases involves
very rare hardware. So it might get used. Some tricks have
a failure mode that is rare but completely obvious when it happens.
So it gets used, and "troublesome hardware" is added to a blacklist
as needed.

Some "tricks" however, are hard to figure out without docs.
There may be no good way to test. The tricks
may cause instability that will be very hard to track down, and this could
happen on a wide range of hardware. So such tricks don't get used until
adequate documentation appears. In this case, it seems like intel,
who make and design the processors in question and therefore
know them well enough, provided such documentation. That
makes a previously dubious optimization safe.

Helge Hafting


2007-10-15 11:51:46

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Mon, Oct 15, 2007 at 12:17:40PM +0200, Helge Hafting wrote:
> Jarek Poplawski wrote:
> >On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote:
> >...
> >
> >>The point is that we _trust_ intel when they say "this will work".
> >>Therefore, we can use the optimizations. It was never about
> >>legal matters. If we didn't trust intel, then we couldn't
> >>use their processors at all.
> >>
> >
> >But there was nothing about trust. Usually you don't trust somebody
> >but somebody's opinions. The problem is there was no valid opinion,
> >or this opinion has been changed now (no reason to not trust yet...).
> >
> "Trusting people or their opinions" is only about use of the
> english language, and not that interesting to bring up here.
> Surely you know that lots of people here have english as
> a secondary language only. Interesting for me to know, but
> probably not for everybody else.

Of course, I know this problem: sometimes it's very hard to make people
believe it's my secondary language! But this time I didn't see any
language problem. I simply pointed out that sometimes trusting might not
be enough - not necessarily in this case.

> >>We couldn't take the chance before. It was not documented
> >>to work, verification by testing would not be trivial at all for
> >>this case.
> >>Linux is about "stability first, then performance".
> >>Now we _know_ that we can have this optimization without
> >>compromising stability. Nobody knew before!
> >>
> >
> >So, you think this would be the first or the least credibly
> >verified undocumented feature used in linux? Then, it seems
> >I can try to install this linux on my laptop at last! (...
> >And, I can trust you, it will not break anything...?)
> >
> I never claimed that linux will work on your laptop, so no:
> You can't take my word for that, because I never gave it!
> It is well known that some laptops don't work with linux,
> I have no idea if yours will work, I don't even know what kind it is.

OK, this was supposed to be a joke... (Btw, can you remember burning
linux laptops?) I thought this "stability first" a bit funny, but this
was a really bad joke, sorry.

Thanks for these additional explanations - you are completely right!

Regards,
Jarek P.

2007-10-15 14:38:44

by David Schwartz

[permalink] [raw]
Subject: RE: [rfc][patch 3/3] x86: optimise barriers


> From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
> Volume 3A:
>
> "7.2.2 Memory Ordering in P6 and More Recent Processor Families
> ...
> 1. Reads can be carried out speculatively and in any order.
> ..."
>
> So, it looks to me like almost the 1-st Commandment. Some people (like
> me) did believe this, others tried to check, and it was respected for
> years notwithstanding nobody had ever seen such an event.

When Intel first added speculative loads to the x86 family, they pegged the
speculative load to the cache line. If the cache line is invalidated, so is
the speculative load. As a result, out-of-order reads to normal memory are
invisible to software. If a write to the same memory location on another CPU
would make the fetched value invalid, it will make the cache line invalid,
which invalidates the fetch.
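
As a rough illustration (the names below are made up, not taken from any
of the patches), this is the sort of flag/data publication pattern that
depends on that guarantee; if out-of-order reads were visible, the
consumer could see the flag set and still load stale data:

/* Hypothetical lockless publication pattern, relying on the kernel's
 * smp_wmb()/smp_rmb()/cpu_relax() primitives. */
static int data;
static int flag;

void producer(void)
{
        data = 42;
        smp_wmb();              /* order the data store before the flag store */
        flag = 1;
}

int consumer(void)
{
        while (!flag)
                cpu_relax();    /* spin until the producer publishes */
        smp_rmb();              /* order the flag load before the data load */
        return data;            /* must observe 42, not a stale value */
}

With loads from cacheable memory kept in order, the smp_rmb() above only
needs to stop the compiler, not the CPU.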

I think it's extremely unlikely that any x86 CPU will do this any
differently. It's hard to imagine Intel and AMD would go to all this trouble
for so long just to stop so late in the line's lifetime.

DS


2007-10-16 00:50:44

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Mon, Oct 15, 2007 at 11:10:00AM +0200, Jarek Poplawski wrote:
> On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:
> ...
> > Has performance really been much problem for you? (even before the
> > lfence instruction, when you theoretically had to use a locked op)?
> > I mean, I'd struggle to find a place in the Linux kernel where there
> > is actually a measurable difference anywhere... and we're pretty
> > performance critical and I think we have a reasonable amount of lockless
> > code (I guess we may not have a lot of tight computational loops, though).
> > I'd be interested to know what, if any, application had found these
> > barriers to be problematic...
>
> I'm not into performance at all, so I can't help you, sorry. But I
> understand people who care about this, and think there is a popular
> conviction that barriers and locked instructions are costly, so I'm
> surprised there is any "problem" now with finding these gains...

It's more expensive than nothing, sure. However in real code, algorithmic
complexity, cache misses and cacheline bouncing tend to be much bigger
issues.

I can't think of a place in the kernel where smp_rmb matters _that_ much.
seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
lfence is not going to hurt. Maybe we even gain 0.01% performance in
someone's workload.
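
For reference, the seqlock read side looks roughly like this (simplified
from include/linux/seqlock.h, not the exact source); the smp_rmb()s here
are the barriers being discussed:

/* Simplified seqlock read side.  The writer makes ->sequence odd before
 * updating the protected data and even again when it is done. */
typedef struct {
        unsigned sequence;
        spinlock_t lock;
} seqlock_t;

static inline unsigned read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl->sequence;
        smp_rmb();      /* read the sequence before the protected data */
        return ret;
}

static inline int read_seqretry(const seqlock_t *sl, unsigned start)
{
        smp_rmb();      /* read the protected data before rechecking */
        return (start & 1) | (sl->sequence ^ start);
}

/* Typical reader, e.g. the time code sampling xtime:
 *
 *      do {
 *              seq = read_seqbegin(&xtime_lock);
 *              now = xtime;
 *      } while (read_seqretry(&xtime_lock, seq));
 */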

Also, remember: if loads are already in-order, then lfence is a noop,
right? (in practice it seems to have to do a _little_ bit of work, but
it's like a dozen cycles).
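
A quick way to get a ballpark figure is a user-space loop like the one
below (only a sketch: it ignores rdtsc serialisation details and includes
the loop overhead, so treat the result as a rough estimate):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        const int iters = 1000000;
        uint64_t start = rdtsc();
        int i;

        for (i = 0; i < iters; i++)
                asm volatile("lfence" ::: "memory");

        printf("~%.1f cycles per iteration (lfence plus loop overhead)\n",
               (double)(rdtsc() - start) / iters);
        return 0;
}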


> > The thing is that those documents are not defining what a particular
> > implementation does, but how the architecture is defined (ie. what
> > must some arbitrary software/hardware provide and what may it expect).
>
> I'm not sure this is the right way to put it. If there is no
> distinction between what is and what could be, how can I believe in
> similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
> like they describe some real hardware mechanisms, and sometimes they
> mention possible changes and reserved features too. So, when they
> don't mention something, you could take it as present behavior.

No. Why are you reading that much into it? I know for a fact that some
non-x86 architectures' actual implementations have stronger ordering than
their ISA guarantees. It has nothing to do with you "believing" how the
hardware works. That's not what the document is describing (directly).


> > It's pretty natural that Intel started out with a weaker guarantee
> > than their CPUs of the time actually supported, and tightened it up
> > after (presumably) deciding not to implement such relaxed semantics
> > for the foreseeable future.
>
> As a matter of fact it's not natural for me at all. I expected the
> other direction, and I still doubt programmers' intentions could be
> "automatically" predicted good enough, so IMHO, it's not for long.

Really? Consider the consequences if, instead of releasing this latest
document tightening consistency, Intel found that out of order loads
were worth 5% more performance and implemented them in their next chip.
The chip could be completely backwards compatible, but all your old code
would break, because it was broken to begin with (because it was outside
the spec).

IMO Intel did exactly the right thing from an engineering perspective,
and so did Linux to always follow the spec.

2007-10-16 08:57:19

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Tue, Oct 16, 2007 at 02:50:33AM +0200, Nick Piggin wrote:
> On Mon, Oct 15, 2007 at 11:10:00AM +0200, Jarek Poplawski wrote:
...
> > I'm not into performance at all, so I can't help you, sorry. But I
> > understand people who care about this, and think there is a popular
> > conviction that barriers and locked instructions are costly, so I'm
> > surprised there is any "problem" now with finding these gains...
>
> It's more expensive than nothing, sure. However in real code, algorithmic
> complexity, cache misses and cacheline bouncing tend to be much bigger
> issues.
>
> I can't think of a place in the kernel where smp_rmb matters _that_ much.
> seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
> lfence is not going to hurt. Maybe we even gain 0.01% performance in
> someone's workload.
>
> Also, remember: if loads are already in-order, then lfence is a noop,
> right? (in practice it seems to have to do a _little_ bit of work, but
> it's like a dozen cycles).

You are right: considering current CPUs, there may be no performance
problem at all. Removing LOCKs for older ones should probably matter
more, but as a matter of fact, now I wouldn't bet even on that - it
could be mostly a no-op as well.

> > As a matter of fact it's not natural for me at all. I expected the
> > other direction, and I still doubt programmers' intentions could be
> > "automatically" predicted good enough, so IMHO, it's not for long.
>
> Really? Consider the consequences if, instead of releasing this latest
> document tightening consistency, Intel found that out of order loads
> were worth 5% more performance and implemented them in their next chip.
> The chip could be completely backwards compatible, but all your old code
> would break, because it was broken to begin with (because it was outside
> the spec).

I have a different opinion on this: I expect any spec to describe the
current implementation. Before issuing new models, any changes of
implementation should be made public with a proper margin of time. Then
the system could be adjusted optimally to the real hardware, instead of
to hardware that is only planned but possibly never realized (plus doing
such unused things with the old means is usually more costly: lock vs.
lfence). There is still the problem of the specs' completeness: there
are probably often some things left unspecified which could break on a
new model, so there is never a 100% guarantee anyway.

> IMO Intel did exactly the right thing from an engineering perspective,
> and so did Linux to always follow the spec.

But if you follow the spec, then you don't really follow the spec! Why
do you ignore this part of Intel's spec so completely:

"This document contains information which Intel may change at any
time without notice. Do not finalize a design with this information."

Maybe it's a real Intel intention and not for lawyers only? (Btw, it
seems we have an example.)

Regards,
Jarek P.

2007-10-16 09:10:34

by David Lang

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Tue, 16 Oct 2007, Jarek Poplawski wrote:

> On Tue, Oct 16, 2007 at 02:50:33AM +0200, Nick Piggin wrote:
>> On Mon, Oct 15, 2007 at 11:10:00AM +0200, Jarek Poplawski wrote:
> ...
>
>>> As a matter of fact it's not natural for me at all. I expected the
>>> other direction, and I still doubt programmers' intentions could be
>>> "automatically" predicted good enough, so IMHO, it's not for long.
>>
>> Really? Consider the consequences if, instead of releasing this latest
>> document tightening consistency, Intel found that out of order loads
>> were worth 5% more performance and implemented them in their next chip.
>> The chip could be completely backwards compatible, but all your old code
>> would break, because it was broken to begin with (because it was outside
>> the spec).
>
> I have a different opinion on this: I expect any spec to describe the
> current implementation. Before issuing new models, any changes of
> implementation should be made public with a proper margin of time. Then
> the system could be adjusted optimally to the real hardware, instead of
> to hardware that is only planned but possibly never realized (plus doing
> such unused things with the old means is usually more costly: lock vs.
> lfence). There is still the problem of the specs' completeness: there
> are probably often some things left unspecified which could break on a
> new model, so there is never a 100% guarantee anyway.

What you don't realize is that Intel (and AMD) have built their business
on making sure that their new CPUs run existing software with no
modifications (and almost always faster than the old versions). Remember
that for most of the world, getting the software modified would mean
buying a new version, if the vendor bothered to make a different version
for the new chip.

If they required everyone to buy new software to use a new chip, it
wouldn't work well. In fact Intel tried to do exactly that with the
Itanium, and it has been a spectacular failure (or at the very least, not
a spectacular success).

>> IMO Intel did exactly the right thing from an engineering perspective,
>> and so did Linux to always follow the spec.
>
> But if you follow the spec, then you don't really follow the spec! Why
> do you ignore this part of Intel's spec so completely:
>
> "This document contains information which Intel may change at any
> time without notice. Do not finalize a design with this information."
>
> Maybe it's a real Intel intention and not for lawyers only? (Btw, it
> seems we have an example.)

In theory they could change anything at any time; in practice, if they
break old software they won't sell the chips, so the modifications tend
to be along the lines of this one: adding detail to the specifications
so that programmers can get more performance.

David Lang

2007-10-16 12:46:24

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [rfc][patch 3/3] x86: optimise barriers

On Tue, Oct 16, 2007 at 02:14:17AM -0700, [email protected] wrote:
...
> What you don't realize is that Intel (and AMD) have built their business
> on making sure that their new CPUs run existing software with no
> modifications (and almost always faster than the old versions). Remember
> that for most of the world, getting the software modified would mean
> buying a new version, if the vendor bothered to make a different version
> for the new chip.

It's a good point to always consider when you analyze how something
new should work if it's used with older programs too. But with newer
things like SMP or multithreading they probably have more choice, and
you can only guess what is done 'officially'.

> If they required everyone to buy new software to use a new chip, it
> wouldn't work well. In fact Intel tried to do exactly that with the
> Itanium, and it has been a spectacular failure (or at the very least, not
> a spectacular success).

The failure of an architecture doesn't mean all the specific new
technologies used in Itanium were failures too, so they could still come
back when needed (if there is nothing better in reserve).

...
> In theory they could change anything at any time; in practice, if they
> break old software they won't sell the chips, so the modifications tend
> to be along the lines of this one: adding detail to the specifications
> so that programmers can get more performance.

I don't think 'not breaking' is much of a problem here; the question is
rather how to use all the new features (which you seem to ignore a bit)
to get maximum performance without breaking older things. Or, like the
current problem: be rational and remove useless (according to the new
specs) things even without a performance gain, or stay 'safe'?

Jarek P.