In some circumstances, a CPU may perform stores to memory in non-program
order, or held in on-chip store buffers for a potentially long time.
These kinds of circumstances include:
- Stores to a PCI MMIO region that has write combining enabled
- Use of non-temporal store instructions
- The CPU's memory model permitting weak store ordering
This patch introduces a new macro, flush_wc(), that flushes any pending
stores from the local CPU's write combining store buffers, if the CPU has
such a capability. If the CPU doesn't provide explicit control over write
combining, flush_wc() is simply an alias for wmb(). Here is an example:
store A
store B
store C
flush_wc()
store D
store E
flush_wc()
flush_wc() says nothing about whether {A,B,C} may be reordered with
respect to each other, or whether {D,E} may, but it guarantees that
{A,B,C} will make it off-CPU before {D,E}. An arch that implements
flush_wc() should make the stores occur immediately, if possible.
In the case of writes to remote NUMA memory or PCI MMIO space, a
bridge chip may still hold on to stores after they've been issued by
the CPU. Thus flush_wc() makes no guarantees regarding the visibility
of stores to other CPUs, remote memory, or PCI devices.
The intention is that flush_wc() will typically be used for
latency-sensitive MMIO writes where the full cost of guaranteeing that
the writes have made it all the way to their target is not necessary
or desirable.
It is implemented by way of a new header file, <linux/system.h>.
This header can be a site for oft-replicated code from <asm-*/system.h>,
but isn't just yet.
Signed-off-by: Bryan O'Sullivan <[email protected]>
diff -r c89918da5f7b -r e27c8e0061e0 include/asm-i386/system.h
--- a/include/asm-i386/system.h Sat Feb 25 08:01:07 2006 +0800
+++ b/include/asm-i386/system.h Wed Mar 8 13:28:29 2006 -0800
@@ -502,6 +502,12 @@ struct alt_instr {
#define wmb() __asm__ __volatile__ ("": : :"memory")
#endif
+/*
+ * Flush the CPU's write combining store buffers.
+ */
+#define flush_wc() alternative("lock; addl $0,0(%%esp)", "sfence", \
+ X86_FEATURE_XMM)
+
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
diff -r c89918da5f7b -r e27c8e0061e0 include/asm-powerpc/system.h
--- a/include/asm-powerpc/system.h Sat Feb 25 08:01:07 2006 +0800
+++ b/include/asm-powerpc/system.h Wed Mar 8 13:28:29 2006 -0800
@@ -53,6 +53,12 @@
#define smp_wmb() barrier()
#define smp_read_barrier_depends() do { } while(0)
#endif /* CONFIG_SMP */
+
+/*
+ * Guarantee store/store ordering. On some arches, this flushes the CPU's
+ * write combining store buffers.
+ */
+#define flush_wc() __asm__ __volatile__ ("eieio" : : : "memory")
struct task_struct;
struct pt_regs;
diff -r c89918da5f7b -r e27c8e0061e0 include/asm-x86_64/system.h
--- a/include/asm-x86_64/system.h Sat Feb 25 08:01:07 2006 +0800
+++ b/include/asm-x86_64/system.h Wed Mar 8 13:28:29 2006 -0800
@@ -330,6 +330,11 @@ static inline unsigned long __cmpxchg(vo
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
#define set_wmb(var, value) do { var = value; wmb(); } while (0)
+/*
+ * Flush the CPU's write combining store buffers.
+ */
+#define flush_wc() asm volatile("sfence" ::: "memory")
+
#define warn_if_not_ulong(x) do { unsigned long foo; (void) (&(x) == &foo); } while (0)
/* interrupt control.. */
diff -r c89918da5f7b -r e27c8e0061e0 include/linux/system.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/include/linux/system.h Wed Mar 8 13:28:29 2006 -0800
@@ -0,0 +1,27 @@
+/*
+ * Copyright 2006 PathScale, Inc. All Rights Reserved.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SYSTEM_H
+#define _LINUX_SYSTEM_H
+
+#include <asm/system.h>
+
+#ifndef flush_wc
+#define flush_wc() wmb()
+#endif
+
+#endif /* _LINUX_SYSTEM_H */
On Wed, 2006-03-08 at 13:31 -0800, Bryan O'Sullivan wrote:
> In some circumstances, a CPU may perform stores to memory in non-program
> order, or held in on-chip store buffers for a potentially long time.
> These kinds of circumstances include:
>
> - Stores to a PCI MMIO region that has write combining enabled
>
> - Use of non-temporal store instructions
>
> - The CPU's memory model permitting weak store ordering
>
> This patch introduces a new macro, flush_wc(), that flushes any pending
> stores from the local CPU's write combining store buffers, if the CPU has
> such a capability. If the CPU doesn't provide explicit control over write
> combining, flush_wc() is simply an alias for wmb(). Here is an example:
I think people already don't undersatnd the existing gazillion of
barriers we have with quite unclear semantics in some cases, it's not
time to add a new one ...
Ben.
On Thu, 2006-03-09 at 08:38 +1100, Benjamin Herrenschmidt wrote:
> I think people already don't undersatnd the existing gazillion of
> barriers we have with quite unclear semantics in some cases, it's not
> time to add a new one ...
What do you suggest I do, then? This makes a substantial difference to
performance for us. Should I confine this somehow to the ipath driver
directory and have a nest of ifdefs in an include file there?
<b
On Thu, Mar 09, 2006 at 08:38:39AM +1100, Benjamin Herrenschmidt wrote:
> I think people already don't undersatnd the existing gazillion of
> barriers we have with quite unclear semantics in some cases, it's not
> time to add a new one ...
We went over the details of this in a pretty long thread, and you can't
use any of the existing memory barriers to solve this particular problem.
I think the definition of this one is well done now.
-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.
On Wednesday 08 March 2006 22:43, Bryan O'Sullivan wrote:
> On Thu, 2006-03-09 at 08:38 +1100, Benjamin Herrenschmidt wrote:
>
> > I think people already don't undersatnd the existing gazillion of
> > barriers we have with quite unclear semantics in some cases, it's not
> > time to add a new one ...
>
> What do you suggest I do, then? This makes a substantial difference to
> performance for us. Should I confine this somehow to the ipath driver
> directory and have a nest of ifdefs in an include file there?
I think doing it privately is the better solution because I don't think you
have established it has an universal semantic that works
on all X86-64 systems.
And we don't have a portable way to do WC anyways, so there is
no portable way to use it.
So just put an ifdef in.
-Andi
On Wed, 2006-03-08 at 15:21 +0100, Andi Kleen wrote:
> I think doing it privately is the better solution because I don't think you
> have established it has an universal semantic that works
> on all X86-64 systems.
No, I quoted chapter and verse of the relevant Intel and AMD x86_64 docs
for you, complete with URLs and page numbers so it wouldn't take any
effort to verify what I was asserting.
I don't know what else I could have done (it was enough for bcrl, at
least), and you have come up any with suggestions as to what *would*
satisfy you, so I'm stuck.
> And we don't have a portable way to do WC anyways, so there is
> no portable way to use it.
>
> So just put an ifdef in.
That's fine. Whatever satisfies reviewers :-)
<b
On Mer, 2006-03-08 at 13:31 -0800, Bryan O'Sullivan wrote:
> flush_wc() says nothing about whether {A,B,C} may be reordered with
> respect to each other, or whether {D,E} may, but it guarantees that
> {A,B,C} will make it off-CPU before {D,E}. An arch that implements
> flush_wc() should make the stores occur immediately, if possible.
How is this different to mmiowb() ?
On Wednesday 08 March 2006 23:06, Alan Cox wrote:
> On Mer, 2006-03-08 at 13:31 -0800, Bryan O'Sullivan wrote:
> > flush_wc() says nothing about whether {A,B,C} may be reordered with
> > respect to each other, or whether {D,E} may, but it guarantees that
> > {A,B,C} will make it off-CPU before {D,E}. An arch that implements
> > flush_wc() should make the stores occur immediately, if possible.
>
> How is this different to mmiowb() ?
I think he intends it to be a flush instead of an ordering.
(something like CLFLUSH for WC areas)
-Andi
On Wednesday 08 March 2006 22:57, Bryan O'Sullivan wrote:
> On Wed, 2006-03-08 at 15:21 +0100, Andi Kleen wrote:
>
> > I think doing it privately is the better solution because I don't think you
> > have established it has an universal semantic that works
> > on all X86-64 systems.
>
> No, I quoted chapter and verse of the relevant Intel and AMD x86_64 docs
> for you, complete with URLs and page numbers so it wouldn't take any
> effort to verify what I was asserting.
>
> I don't know what else I could have done (it was enough for bcrl, at
> least), and you have come up any with suggestions as to what *would*
> satisfy you, so I'm stuck.
Hmm, I reread the thread and with the "i don't need a flush, just ordering"
part of your description it makes sense.
My second objection still stands though. Maybe we should add this as
part of a generic portable PAT/WC infrastructure. But isolated
it doesn't make sense.
-Andi
On Wed, 2006-03-08 at 15:35 +0100, Andi Kleen wrote:
> > How is this different to mmiowb() ?
>
> I think he intends it to be a flush instead of an ordering.
> (something like CLFLUSH for WC areas)
Exactly. mmiowb guarantees ordering, but says nothing about timing.
This would guarantee ordering, affect WC store buffers if present, and
try to work in a timely manner.
<b
--
Bryan O'Sullivan <[email protected]>
On Wednesday 08 March 2006 23:05, Bryan O'Sullivan wrote:
> On Wed, 2006-03-08 at 15:35 +0100, Andi Kleen wrote:
>
> > > How is this different to mmiowb() ?
> >
> > I think he intends it to be a flush instead of an ordering.
> > (something like CLFLUSH for WC areas)
>
> Exactly. mmiowb guarantees ordering, but says nothing about timing.
> This would guarantee ordering, affect WC store buffers if present, and
> try to work in a timely manner.
Well if you need the flush, not the ordering then I'm not convinced
SFENCE will do that for you. My understanding is that it only guarantees
ordering.
But at least in some earlier message you said you just needed ordering.
-Andi
On Wed, 2006-03-08 at 13:43 -0800, Bryan O'Sullivan wrote:
> On Thu, 2006-03-09 at 08:38 +1100, Benjamin Herrenschmidt wrote:
>
> > I think people already don't undersatnd the existing gazillion of
> > barriers we have with quite unclear semantics in some cases, it's not
> > time to add a new one ...
>
> What do you suggest I do, then? This makes a substantial difference to
> performance for us. Should I confine this somehow to the ipath driver
> directory and have a nest of ifdefs in an include file there?
What bothers me is that because of that exact same argument "it makes
substantial difference for us", we end up with basically barriers
tailored for architectures... that is as many kind of barriers as we
have architectuers... like mmiowb :)
Currently, PowerPC is losing significant performances for example to try
to "fit" in the linux barriers for things like IOs which would leak out
of spinlocks if we didn't have a strong barrier in pretty much every
writeX() or things like that. We could use the same argument you are
making to come up with yet another set of barriers that are more
friendly to ppc ...
At the end of the day, however, the problem is that pretty much no
driver write understand anything about any of the barriers and get them
all wrong...
Which makes me thing we are trying to use the wrong tool or providing
the wrong level of abstraction or something there...
Ben.
On Wed, 2006-03-08 at 15:38 +0100, Andi Kleen wrote:
> Hmm, I reread the thread and with the "i don't need a flush, just ordering"
> part of your description it makes sense.
Actually, the sfence architecturally guarantees a WC store buffer flush
on x86_64 too, which I need. Obviously, on CPUs that don't have WC
on-chip, this is not an issue.
> My second objection still stands though. Maybe we should add this as
> part of a generic portable PAT/WC infrastructure. But isolated
> it doesn't make sense.
Fine. I'll put it in the driver for now, and work on the generic WC
infrastructure in parallel.
<b
--
Bryan O'Sullivan <[email protected]>
On Thu, 2006-03-09 at 10:40 +1100, Benjamin Herrenschmidt wrote:
> What bothers me is that because of that exact same argument "it makes
> substantial difference for us", we end up with basically barriers
> tailored for architectures... that is as many kind of barriers as we
> have architectuers... like mmiowb :)
Unfortunately, it does express an untidy facet of reality, which is that
you really *do* want about as many different kinds of barrier as you
have kinds of distinct semantics, so that you have some hope of
expressing fairly portable concepts in a somewhat high-performance way.
I can see four problems at hand. You can choose the order of their
importance yourself :-)
1. Linux has a set of barriers that is too small to write
performant code in a portable manner.
2. The semantics of the existing barriers are not well understood.
3. People don't understand when barriers are appropriate in the
first place. Witness the current thread on this topic.
4. If you add more subtle barriers, they'll get misused (see #3)
and inevitably introduce bugs that are a nightmare to find.
Regarding #2, it's not obvious to lots of people why e.g. the
atomic_*_return kinds of routines need to have memory barrier semantics,
in some cases even after reading Documentation/atomic_ops.txt. So this
is a nasty education problem.
Regardless, in the case of the ipath driver, we squeeze every cycle out
of the hardware that we can, and this is our selling point. I'm fine
with a driver-specific hack in this case because it suits my immediate
needs, but it would be nice to have something portable and well-defined
to rely on.
<b