On some platforms, a regular wmb() is not sufficient to guarantee that
PCI writes have been flushed to the bus if write combining is in effect.
This change introduces a new macro, wc_wmb(), that makes this guarantee.
It does so by way of a new header file, <linux/system.h>. This header
can be a site for oft-replicated code from <asm-*/system.h>, but isn't
just yet.
We also define a version of wc_wmb() with the required semantics
on x86_64.
Signed-off-by: Bryan O'Sullivan <[email protected]>
diff -r c89918da5f7b -r 94d372e00ccd include/asm-x86_64/system.h
--- a/include/asm-x86_64/system.h Sat Feb 25 08:01:07 2006 +0800
+++ b/include/asm-x86_64/system.h Fri Feb 24 20:15:07 2006 -0800
@@ -321,6 +321,8 @@ static inline unsigned long __cmpxchg(vo
 #define mb() asm volatile("mfence":::"memory")
 #define rmb() asm volatile("lfence":::"memory")
+#define wc_wmb() asm volatile("sfence" ::: "memory")
+
 #ifdef CONFIG_UNORDERED_IO
 #define wmb() asm volatile("sfence" ::: "memory")
 #else
diff -r c89918da5f7b -r 94d372e00ccd include/linux/system.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/include/linux/system.h Fri Feb 24 20:15:07 2006 -0800
@@ -0,0 +1,27 @@
+/*
+ * Copyright 2006 PathScale, Inc. All Rights Reserved.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SYSTEM_H
+#define _LINUX_SYSTEM_H
+
+#include <asm/system.h>
+
+#ifndef wc_wmb
+#define wc_wmb() wmb()
+#endif
+
+#endif /* _LINUX_SYSTEM_H */
On Saturday 25 February 2006 05:20, Bryan O'Sullivan wrote:
> On some platforms, a regular wmb() is not sufficient to guarantee that
> PCI writes have been flushed to the bus if write combining is in effect.
On what platforms?
> It does so by way of a new header file, <linux/system.h>. This header
> can be a site for oft-replicated code from <asm-*/system.h>, but isn't
> just yet.
linux/system.h looks unnatural to me.
> We also define a version of wc_wmb() with the required semantics
> on x86_64.
Leaving i386 out in the cold?
-Andi
On Sat, 2006-02-25 at 05:43 +0100, Andi Kleen wrote:
> On Saturday 25 February 2006 05:20, Bryan O'Sullivan wrote:
> > On some platforms, a regular wmb() is not sufficient to guarantee that
> > PCI writes have been flushed to the bus if write combining is in effect.
>
> On what platforms?
On x86_64 in particular, if CONFIG_UNORDERED_IO is defined. Regular
wmb() is implemented as a compiler memory barrier then, which isn't
sufficient for PCI write combining.
> linux/system.h looks unnatural to me.
I used this for symmetry with <asm/system.h> where other barriers are
defined. It could obviously go into io.h instead, since it's an
MMIO-related barrier, but I didn't want to separate it from other
barriers.
If you have a location you'd prefer, please let me know.
> > We also define a version of wc_wmb() with the required semantics
> > on x86_64.
>
> Leaving i386 out in the cold?
Looks like it should be defined there, too, something like this perhaps:
#define wc_wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
Does that look right?
<b
On Saturday 25 February 2006 08:34, Bryan O'Sullivan wrote:
> On Sat, 2006-02-25 at 05:43 +0100, Andi Kleen wrote:
> > On Saturday 25 February 2006 05:20, Bryan O'Sullivan wrote:
> > > On some platforms, a regular wmb() is not sufficient to guarantee that
> > > PCI writes have been flushed to the bus if write combining is in
> > > effect.
> >
> > On what platforms?
>
> On x86_64 in particular, if CONFIG_UNORDERED_IO is defined.
I guess you mean undefined.
> Regular
> wmb() is implemented as a compiler memory barrier then, which isn't
> sufficient for PCI write combining.
It's still totally unclear how you can write portable write combining code.
I think the basic idea is weird. You're basically writing something
that doesn't follow the normal MMIO rules of Linux drivers and you're
trying to do this portably, which won't work anyway because
there is no portable way to do this.
Before we can add such a macro I suspect you would first
need to provide some spec how that "portable write combining"
is supposed to work and get feedback from the other architectures.
Or keep it in architecture specific code, then the generic macro
isn't needed.
-Andi
On Fri, Feb 24, 2006 at 08:20:50PM -0800, Bryan O'Sullivan wrote:
> On some platforms, a regular wmb() is not sufficient to guarantee that
> PCI writes have been flushed to the bus if write combining is in effect.
>
> This change introduces a new macro, wc_wmb(), that makes this guarantee.
>
> It does so by way of a new header file, <linux/system.h>. This header
> can be a site for oft-replicated code from <asm-*/system.h>, but isn't
> just yet.
>
> We also define a version of wc_wmb() with the required semantics
> on x86_64.
This does not do what you're trying to accomplish. sfence has no impact
on the flushing of the write combining buffers -- they're normally flushed
after a few cycles of no subsequent merges. If it is really necessary to
flush such a write, you will have to do a read from the target PCI device
(well, bus would do).
-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.
On Sat, 2006-02-25 at 09:28 -0500, Benjamin LaHaise wrote:
> sfence has no impact
> on the flushing of the write combining buffers
The sfence instruction guarantees that every store that precedes it in
program order is globally visible, including over the likes of the PCI
bus, before any store instruction that follows the fence.
<b
On Sat, 2006-02-25 at 14:28 +0100, Andi Kleen wrote:
> It's still totally unclear how you can write portable write combining code.
>
> I think the basic idea is weird. You're basically writing something
> that doesn't follow the normal MMIO rules of Linux drivers and you're
> trying to do this portably, which won't work anyway because
> there is no portable way to do this.
This is definitely a problem. We have an x86-specific hack in our
driver that diddles the MTRRs to make sure that our MMIO space has write
combining enabled.
On other platforms, it looks like write combining, if supported at all,
is implemented in the northbridge. For those, I think we'd need to mark
the device's mapping as cacheable.
> Before we can add such a macro I suspect you would first
> need to provide some spec how that "portable write combining"
> is supposed to work and get feedback from the other architectures.
>
> Or keep it in architecture specific code, then the generic macro
> isn't needed.
OK, thanks for the feedback.
<b
--
Bryan O'Sullivan <[email protected]>
On Sat, Feb 25, 2006 at 09:11:57AM -0800, Bryan O'Sullivan wrote:
> On Sat, 2006-02-25 at 09:28 -0500, Benjamin LaHaise wrote:
>
> > sfence has no impact
> > on the flushing of the write combining buffers
>
> The sfence instruction guarantees that every store that precedes it in
> program order is globally visible, including over the likes of the PCI
> bus, before any store instruction that follows the fence.
That is not the same as saying the write buffers are flushed and wholly
visible to their destination, it just means that subsequent reads or
writes will not be reordered prior to the sfence instruction. It is
entirely possible that the writes will remain in buffers on the CPU until
well after the sfence instruction has executed, sfence only affects the
order in which they become visible on the bus. If you want to force a
flush, a read from the PCI device is the only way to accomplish that.
-ben
On Sat, 2006-02-25 at 14:28 +0100, Andi Kleen wrote:
> Before we can add such a macro I suspect you would first
> need to provide some spec how that "portable write combining"
> is supposed to work and get feedback from the other architectures.
It seems like we'd need a function that tries to enable or disable write
combining on an MMIO memory range. This would be implemented by arches
that support it, and would fail on others. Drivers could then try to
enable write combining, and if it failed, either bail, print a warning
message, or do something else appropriate.
So on i386 and x86_64, this function would fiddle with the MTRRs. On
powerpc, it would either configure the northbridge appropriately or
fail. On other arches, I don't know enough to say, so the default would
be to fail.
Is this reasonable? I can code a strawman implementation up, if the
basic idea looks sane.
<b
>>>>> "Bryan" == Bryan O'Sullivan <[email protected]> writes:
Bryan> On some platforms, a regular wmb() is not sufficient to
Bryan> guarantee that PCI writes have been flushed to the bus if write
Bryan> combining is in effect.
Bryan> This change introduces a new macro, wc_wmb(), that makes this
Bryan> guarantee.
Bryan,
Could you explain why the current mmiowb() API won't suffice for this?
It seems that this is basically trying to achieve the same thing.
Cheers,
Jes
Jes> Could you explain why the current mmiowb() API won't suffice
Jes> for this? It seems that this is basically trying to achieve
Jes> the same thing.
I don't believe mmiowb() is at all the same thing. mmiowb() is all
about ordering writes between _different_ CPUs without incurring the
cost of flushing posted writes by issuing a read on the bus. wc_wmb()
would just act like a true wmb(), even when using write-combining
regions on x86 -- in other words, there would be no cross-CPU synchronization.
- R.
Roland Dreier wrote:
> Jes> Could you explain why the current mmiowb() API won't suffice
> Jes> for this? It seems that this is basically trying to achieve
> Jes> the same thing.
>
> I don't believe mmiowb() is at all the same thing. mmiowb() is all
> about ordering writes between _different_ CPUs without incurring the
> cost of flushing posted writes by issuing a read on the bus.
Not quite correct as far as I understand it. mmiowb() is supposed to
guarantee that writes to MMIO space have completed before continuing.
That of course covers the multi-CPU case, but it should also cover the
write-combining case.
> wc_wmb()
> would just act like a true wmb(), even when using write-combining
> regions on x86 -- in other words, there would be no cross-CPU synchronization.
I'm wary of adding yet another variation unless there is a clear
distinction between them that is easy to understand for driver authors.
Cheers,
Jes
Jes> Not quite correct as far as I understand it. mmiowb() is
Jes> supposed to guarantee that writes to MMIO space have
Jes> completed before continuing. That of course covers the
Jes> multi-CPU case, but it should also cover the write-combining
Jes> case.
I don't believe this is correct. mmiowb() does not guarantee that
writes have completed -- they may still be pending in a buffer in a
bridge somewhere. The _only_ effect of mmiowb() is to make sure that
writes which have been ordered between CPUs using some other mechanism
(i.e. a lock) are properly ordered by the rest of the system. This
only has an effect systems like very large ia64 systems, where (as I
understand it), writes can pass each other on the way to the PCI bus.
In fact, mmiowb() is a NOP on essentially every architecture.
- R.
On Tuesday, February 28, 2006 8:08 am, Jes Sorensen wrote:
> Roland Dreier wrote:
> > Jes> Could you explain why the current mmiowb() API won't
> > suffice Jes> for this? It seems that this is basically trying to
> > achieve Jes> the same thing.
> >
> > I don't believe mmiowb() is at all the same thing. mmiowb() is all
> > about ordering writes between _different_ CPUs without incurring the
> > cost of flushing posted writes by issuing a read on the bus.
>
> Not quite correct as far as I understand it. mmiowb() is supposed to
> guarantee that writes to MMIO space have completed before continuing.
> That of course covers the multi-CPU case, but it should also cover the
> write-combining case.
It only guarantees that any outstanding writes will hit the device before
any subsequent writes. mmiowb() doesn't make any guarantees about when
the data will actually arrive at the device though.
> I'm wary of adding yet another variation unless there is a clear
> distinction between them that is easy to understand for driver
> authors.
I think that's a valid concern; there are so many ill-understood barriers
floating around that adding another one will create even more confusion.
Are they all documented somewhere? Are we sure that we don't have
duplicates?
At any rate, any new ones we add should be very well documented (I think
Andi suggested this implicitly when he asked for the semantics to be
well-defined).
Jesse
On Tuesday, February 28, 2006 9:02 am, Roland Dreier wrote:
> Jes> Not quite correct as far as I understand it. mmiowb() is
> Jes> supposed to guarantee that writes to MMIO space have
> Jes> completed before continuing. That of course covers the
> Jes> multi-CPU case, but it should also cover the write-combining
> Jes> case.
>
> I don't believe this is correct. mmiowb() does not guarantee that
> writes have completed -- they may still be pending in a buffer in a
> bridge somewhere. The _only_ effect of mmiowb() is to make sure that
> writes which have been ordered between CPUs using some other mechanism
> (i.e. a lock) are properly ordered by the rest of the system. This
> only has an effect on systems like very large ia64 systems, where (as I
> understand it), writes can pass each other on the way to the PCI bus.
> In fact, mmiowb() is a NOP on essentially every architecture.
I think it could be implemented meaningfully on ppc64, mips64, and
perhaps some parisc systems, but I don't think their respective
maintainers have gotten around to that yet.
Anyway, it looks like the write combine ordering Bryan is talking about
really is a distinct semantic. Not sure if it's possible (or desirable)
to overload an existing barrier op to include the semantics he wants.
Jesse
Roland Dreier wrote:
> Jes> Not quite correct as far as I understand it. mmiowb() is
> Jes> supposed to guarantee that writes to MMIO space have
> Jes> completed before continuing. That of course covers the
> Jes> multi-CPU case, but it should also cover the write-combining
> Jes> case.
>
> I don't believe this is correct. mmiowb() does not guarantee that
> writes have completed -- they may still be pending in a buffer in a
> bridge somewhere. The _only_ effect of mmiowb() is to make sure that
> writes which have been ordered between CPUs using some other mechanism
> (i.e. a lock) are properly ordered by the rest of the system. This
> only has an effect on systems like very large ia64 systems, where (as I
> understand it), writes can pass each other on the way to the PCI bus.
> In fact, mmiowb() is a NOP on essentially every architecture.
Hmmmm
That could be, seems like Jesse agrees that it could all be in the
pipeline somewhere. Considering Jesse was responsible for mmiowb() I'll
take his word for it ;-)
In any case, I'd strongly recommend that any new barrier version is
clearly documented. The jungle is very dense already ;(
Cheers,
Jes
On Saturday, February 25, 2006 11:01 am, Bryan O'Sullivan wrote:
> On Sat, 2006-02-25 at 14:28 +0100, Andi Kleen wrote:
> > Before we can add such a macro I suspect you would first
> > need to provide some spec how that "portable write combining"
> > is supposed to work and get feedback from the other architectures.
>
> It seems like we'd need a function that tries to enable or disable
> write combining on an MMIO memory range. This would be implemented by
> arches that support it, and would fail on others. Drivers could then
> try to enable write combining, and if it failed, either bail, print a
> warning message, or do something else appropriate.
>
> So on i386 and x86_64, this function would fiddle with the MTRRs. On
> powerpc, it would either configure the northbridge appropriately or
> fail. On other arches, I don't know enough to say, so the default
> would be to fail.
>
> Is this reasonable? I can code a strawman implementation up, if the
> basic idea looks sane.
Something like this would be really handy. Check out fbmem.c:fb_mmap for
a bad example of what can happen w/o it.
In fact, I think it might make sense to export WC functionality via an
mmap flag (on an advisory basis since the platform may not support it or
there may be aliasing issues that prevent it); having an arch
independent routine to request it would make that addition easy to do in
generic code. (In particular I wanted this for the sysfs PCI interface.
Userspace apps can map PCI resources there and it would be nice if they
could map them with WC semantics if requested.)
I don't think it addresses the flushing issue you seem to be concerned
about though. I don't know the exact semantics of sfence, but I think
bcrl is likely right that it won't absolutely guarantee that your writes
have hit the device before proceeding (though it may do that on some CPU
implementations).
Thanks,
Jesse
On Sat, 2006-02-25 at 12:41 -0500, Benjamin LaHaise wrote:
> That is not the same as saying the write buffers are flushed and wholly
> visible to their destination, it just means that subsequent reads or
> writes will not be reordered prior to the sfence instruction.
Right. I don't need the writes to be visible at the destination device,
in this particular case; a write followed by a barrier just has to show
up at the device before the next write.
Here's what a write to our chip looks like, for sending a packet:
* write a 64-bit control register that includes the length of the
segment being written
* do a barrier to make sure it gets to the device before any other
writes
 * perform a pile of writes using __iowrite32_copy, not caring
about the order in which they reach the chip
* do another barrier
* perform one final 32-bit write using __raw_writel
* do one final barrier
The last 32-bit write triggers the chip to put the packet on the wire.
We make sure it happens after the earlier bulk write using a barrier.
<b
On Tuesday, February 28, 2006 9:44 am, Jesse Barnes wrote:
> Something like this would be really handy. Check out fbmem.c:fb_mmap
> for a bad example of what can happen w/o it.
>
> In fact, I think it might make sense to export WC functionality via an
> mmap flag (on an advisory basis since the platform may not support it
> or there may be aliasing issues that prevent it); having an arch
> independent routine to request it would make that addition easy to do
> in generic code. (In particular I wanted this for the sysfs PCI
> interface. Userspace apps can map PCI resources there and it would be
> nice if they could map them with WC semantics if requested.)
Oh, forgot to mention fallback semantics. Instead of almost every driver
doing:
if (!(iocookie = ioremap_writecombine(addr, size)))
iocookie = ioremap(addr, size); /* fallback to uncached */
maybe it would be best to have something like
iocookie = ioremap_wc_or_uc(addr, size);
that tries write combine first but silently falls back to UC if the
former is impossible (or a new ioremap with flags or priority args or
whatever). OTOH some drivers may want to be notified if the WC mapping
fails? Just a thought...
Jesse
Jesse> I don't think it addresses the flushing issue you seem to
Jesse> be concerned about though. I don't know the exact
Jesse> semantics of sfence, but I think bcrl is likely right that
Jesse> it won't absolutely guarantee that your writes have hit the
Jesse> device before proceeding (though it may do that on some CPU
Jesse> implementations).
Yeah, I think that Bryan just wrote something different than what he
meant: there is no desire for wc_wmb() to make sure that writes via a
write-combining mapping have gone all the way to the device, any more
than a normal wmb() makes sure normal writes have gone all the way to
the device. All that wc_wmb() needs to do is make sure that writes
via a write-combining mapping don't get passed by later writes.
This does speak to the need for precise documentation though :)
- R.
On Tue, 2006-02-28 at 09:44 -0800, Jesse Barnes wrote:
> Something like this would be really handy. Check out fbmem.c:fb_mmap for
> a bad example of what can happen w/o it.
:-)
> In fact, I think it might make sense to export WC functionality via an
> mmap flag (on an advisory basis since the platform may not support it or
> there may be aliasing issues that prevent it); having an arch
> independent routine to request it would make that addition easy to do in
> generic code.
Yes.
> (In particular I wanted this for the sysfs PCI interface.
> Userspace apps can map PCI resources there and it would be nice if they
> could map them with WC semantics if requested.)
They already sort of can. It just happens that most arches ignore the
WC parameters.
> I don't think it addresses the flushing issue you seem to be concerned
> about though.
Yes, I think I could have made my original wording a bit clearer. I
don't care if writes have hit the device, merely that they do so in an
order that I control.
<b
On Tue, 2006-02-28 at 05:01 -0500, Jes Sorensen wrote:
> Could you explain why the current mmiowb() API won't suffice for this?
> It seems that this is basically trying to achieve the same thing.
It's a no-op on every arch I care about:
#define mmiowb()
Which makes it useless. Also, based on the comments in the qla driver,
mmiowb() seems to have inter-CPU ordering semantics that I don't want.
I'm thus hesitant to appropriate it for my needs.
<b
On Tuesday, February 28, 2006 9:52 am, Bryan O'Sullivan wrote:
> > (In particular I wanted this for the sysfs PCI interface.
> > Userspace apps can map PCI resources there and it would be nice if
> > they could map them with WC semantics if requested.)
>
> They already sort of can. It just happens that most arches ignore the
> WC parameters.
Only on a per-driver basis though at this point, right? The sysfs code
is generic across architectures and exports PCI resources independently
from the driver, so it needs some way of letting the user communicate
that they want WC behavior.
> > I don't think it addresses the flushing issue you seem to be
> > concerned about though.
>
> Yes, I think I could have made my original wording a bit clearer. I
> don't care if writes have hit the device, merely that they do so in an
> order that I control.
Ah, ok. In that case maybe mmiowb *is* what you want. Or at the very
least mmiowcb :).
Jesse
On Tue, Feb 28, 2006 at 09:50:08AM -0800, Bryan O'Sullivan wrote:
> The last 32-bit write triggers the chip to put the packet on the wire.
> We make sure it happens after the earlier bulk write using a barrier.
The barrier you're looking for is wmb() in asm/system.h, which is defined
on both SMP and UP. On x86 you do not need the sfence as writes will show
up on the bus in program order, but you do need wmb() to prevent gcc from
reordering your code. Adding a new primitive only makes things slower as
the sfence isn't necessary most of the time (it rightly has a dependency on
CONFIG_UNORDERED_IO, which the jury is still out on).
-ben
On Tue, 28 Feb 2006, Bryan O'Sullivan wrote:
> On Tue, 2006-02-28 at 05:01 -0500, Jes Sorensen wrote:
>
>> Could you explain why the current mmiowb() API won't suffice for this?
>> It seems that this is basically trying to achieve the same thing.
>
> It's a no-op on every arch I care about:
>
> #define mmiowb()
>
> Which makes it useless. Also, based on the comments in the qla driver,
> mmiowb() seems to have inter-CPU ordering semantics that I don't want.
> I'm thus hesitant to appropriate it for my needs.
>
> <b
>
When accessing PCI, you can cause any/all write combinations to
occur and any/all pending writes to get written to the devices
simply by executing a read. If the code requires that all previous
writes be written NOW, then the code should not hide that in
some macro. It should issue a read in its PCI space. Also, the
PCI bus is a FIFO. Nothing gets reordered. Everything will
get to the devices in the order written, but a byte or word on
a longword boundary may be subject to write-combining if all
the components are present in the FIFO.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.54 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
On Tue, 2006-02-28 at 12:58 -0500, Benjamin LaHaise wrote:
> On Tue, Feb 28, 2006 at 09:50:08AM -0800, Bryan O'Sullivan wrote:
> > The last 32-bit write triggers the chip to put the packet on the wire.
> > We make sure it happens after the earlier bulk write using a barrier.
>
> The barrier you're looking for is wmb() in asm/system.h, which is defined
> on both SMP and UP.
No. We're writing to a region that we've marked as write combining, so
the processor or north bridge will not write in program order. It's
free to write out the write combining store buffers in whatever order it
feels like, unless forced to do otherwise.
<b
Benjamin LaHaise wrote:
> On Tue, Feb 28, 2006 at 09:50:08AM -0800, Bryan O'Sullivan wrote:
>
>>The last 32-bit write triggers the chip to put the packet on the wire.
>>We make sure it happens after the earlier bulk write using a barrier.
>
>
> The barrier you're looking for is wmb() in asm/system.h, which is defined
> on both SMP and UP.
That will synchronize with other CPUs as well, which may not necessarily
be needed.
On PPC for instance, you could implement the desired semantics using
"eieio" (enforce in-order execution of IO). This is lighter weight than
a full "sync", which is what wmb() maps to.
Chris
linux-os (Dick Johnson) wrote:
> Also, the
> PCI bus is a FIFO. Nothing gets reordered.
The CPU itself can reorder the write before it hits the PCI bus. I
don't think x86 will do this, but others will.
Chris
On Tue, Feb 28, 2006 at 10:20:14AM -0800, Bryan O'Sullivan wrote:
> No. We're writing to a region that we've marked as write combining, so
> the processor or north bridge will not write in program order. It's
> free to write out the write combining store buffers in whatever order it
> feels like, unless forced to do otherwise.
The rules are a bit more complex than that -- writes are weakly ordered,
but still ordered. They may be delayed with respect to other stores,
but will never occur before other stores are visible to the cache
hierarchy. In terms of how writes to the write combining region are
delayed, you'll have to look at the addresses of the registers you are
writing to. If they map to the same write combining buffer (that is each
one can combine with the previous write) and are increasing in address,
then you don't need the explicit barrier. Most hardware is laid out so
that this is possible.
The case the write combining buffers affect memory ordering in an 'unexpected'
way is if your writes combine and you write to registers in an order that
is opposite from that in which they combine. Ie, a write to address 8
followed by a write to address 0 that combines will show up on the bus
as 0, 8 (assuming an 8-byte write at 0). Besides that, the write combining
buffers can introduce a delay of a few clocks while the cpu defers the write
in the hope that it will combine with another write, but that delay applies
to all writes that go through the write combining buffers and thus do not
change the memory ordering (except as previously noted).
Memory barriers are not cheap. At least for the example you provided,
it looks like things are overdone and performance is going to suck, so
it needs to be avoided if at all possible. I really think that you
should be using wmb().
-ben
On Tue, 2006-02-28 at 14:03 -0500, Benjamin LaHaise wrote:
> Memory barriers are not cheap. At least for the example you provided,
> it looks like things are overdone and performance is going to suck, so
> it needs to be avoided if at all possible.
There's more to it than that :-)
We added the memory barrier to *improve* performance, in addition to
helping correctness and portability. Without it, the CPU or north
bridge is free to hold onto the pending writes for a while; the exact
details of how long it will wait depend on the CPU and NB
implementation, but on AMD64 CPUs the delay is up to 16 cycles.
So if we just use wmb(), we incur a 16-cycle penalty on every packet we
send. This has a deleterious and measurable effect on performance.
<b
On Tuesday 28 February 2006 20:20, Bryan O'Sullivan wrote:
> We added the memory barrier to *improve* performance, in addition to
> helping correctness and portability. Without it, the CPU or north
> bridge is free to hold onto the pending writes for a while; the exact
> details of how long it will wait depend on the CPU and NB
> implementation, but on AMD64 CPUs the delay is up to 16 cycles.
Are you sure you used the right instruction? Normally CLFLUSH is used
for such things, not a write barrier which really only changes ordering.
The documentation is not fully clear, but it sounds like it could
apply to the store buffers too.
Anyway, if MFENCE improved performance you're probably relying
on some very specific artifact of the microarchitecture of your
CPU or Northbridge. I don't think it's an architecturally guaranteed
feature.
-Andi
On Tue, Feb 28, 2006 at 11:20:24AM -0800, Bryan O'Sullivan wrote:
> So if we just use wmb(), we incur a 16-cycle penalty on every packet we
> send. This has a deleterious and measurable effect on performance.
sfence doesn't guarantee that the write will be flushed, especially if
the chipset gets involved. The only way to do that is the same way any
pci write can be flushed, which is to read from a register on the device
in question.
-ben
On Tue, 2006-02-28 at 20:33 +0100, Andi Kleen wrote:
> Are you sure you used the right instruction? Normally CLFLUSH is used
> for such things, not a write barrier which really only changes ordering.
Hmm. It's possible we're just getting lucky because another write
happens somewhere else soon after the last write we perform as part of a
packet send. Perhaps, for complete correctness, we should be using
CLFLUSH instead of the last barrier. I'll have to look into this.
Thanks,
<b
On Tue, Feb 28, 2006 at 06:20:32PM +0100, Jes Sorensen wrote:
> Roland Dreier wrote:
> > Jes> Not quite correct as far as I understand it. mmiowb() is
> > Jes> supposed to guarantee that writes to MMIO space have
> > Jes> completed before continuing. That of course covers the
> > Jes> multi-CPU case, but it should also cover the write-combining
> > Jes> case.
> >
> >I don't believe this is correct. mmiowb() does not guarantee that
> >writes have completed -- they may still be pending in a buffer in a
> >bridge somewhere. The _only_ effect of mmiowb() is to make sure that
> >writes which have been ordered between CPUs using some other mechanism
> >(i.e. a lock) are properly ordered by the rest of the system. This
> >only has an effect on systems like very large ia64 systems, where (as I
> >understand it), writes can pass each other on the way to the PCI bus.
> >In fact, mmiowb() is a NOP on essentially every architecture.
>
> Hmmmm
>
> That could be, seems like Jesse agrees that it could all be in the
> pipeline somewhere. Considering Jesse was responsible for mmiowb() I'll
> take his word for it ;-)
Right. On the Altix, the mmiowb ensures that the write is into a
FIFO on the destination node (the node the I/O device is attached to), so
that later writes from other nodes will be ordered after it. But
it doesn't actually force it to the bus. That's one reason why it's
so much quicker than using a read for ordering.
jeremy
On Tue, Feb 28, 2006 at 06:20:32PM +0100, Jes Sorensen wrote:
> Roland Dreier wrote:
> > Jes> Not quite correct as far as I understand it. mmiowb() is
> > Jes> supposed to guarantee that writes to MMIO space have
> > Jes> completed before continuing. That of course covers the
> > Jes> multi-CPU case, but it should also cover the write-combining
> > Jes> case.
> >
> >I don't believe this is correct. mmiowb() does not guarantee that
> >writes have completed -- they may still be pending in a buffer in a
> >bridge somewhere. The _only_ effect of mmiowb() is to make sure that
> >writes which have been ordered between CPUs using some other mechanism
> >(i.e. a lock) are properly ordered by the rest of the system. This
> >only has an effect on systems like very large ia64 systems, where (as I
> >understand it), writes can pass each other on the way to the PCI bus.
> >In fact, mmiowb() is a NOP on essentially every architecture.
>
> Hmmmm
>
> That could be, seems like Jesse agrees that it could all be in the
> pipeline somewhere. Considering Jesse was responsible for mmiowb() I'll
> take his word for it ;-)
>
> In any case, I'd strongly recommend that any new barrier version is
> clearly documented. The jungle is very dense already ;(
I wonder if wc_wmb() is the best name.
mmiowb expands to memory-mapped I/O write barrier, which more or less
describes what it does, whereas wc_wmb expands (I'm guessing) to
write-combine write memory barrier. But it's for mmio writes.
Also, the wmb() does not actually "guarantee that PCI writes have been
flushed to the bus", at least on IA64. Even for memory transactions,
it only guarantees ordering on IA64, much like mmiowb does for mmio
transactions.
jeremy
Bryan O'Sullivan wrote:
> On Tue, 2006-02-28 at 05:01 -0500, Jes Sorensen wrote:
>
>
>>Could you explain why the current mmiowb() API won't suffice for this?
>>It seems that this is basically trying to achieve the same thing.
>
>
> It's a no-op on every arch I care about:
>
> #define mmiowb()
>
> Which makes it useless. Also, based on the comments in the qla driver,
> mmiowb() seems to have inter-CPU ordering semantics that I don't want.
> I'm thus hesitant to appropriate it for my needs.
The fact that it's a no-op may simply be because nobody on a specific
arch got to the point where it made sense to define it yet.
Anyway, based on Jesse's and Jeremy's comments, maybe the semantics
here are different. However, I do think the name wc_wmb() doesn't quite
describe it. If it's only to be used on mmio space, something like
mmio_wc_wmb() would probably be more descriptive.
Cheers,
Jes
Jes> Anyway, based on Jesse's and Jeremy's comments, maybe the
Jes> semantics here are different. However, I do think the name
Jes> wc_wmb() doesn't quite describe it. If it's only to be used on
Jes> mmio space, something like mmio_wc_wmb() would probably be
Jes> more descriptive.
I don't see any restriction to MMIO space -- it would be a strange
thing to do, but conceivably one might want to order writes that are
made to ordinary RAM via a write-combining mapping.
- R.
On Tue, 2006-02-28 at 20:33 +0100, Andi Kleen wrote:
> Anyways if MFENCE improved performance you're probably relying
> on some very specific artifact of the microarchitecture of your
> CPU or Northbridge. I don't think it's an architecturally guaranteed
> feature.
I looked this up, and you appear to be wrong here.
Here's the appropriate quote from page 246 of the PDF of "AMD64
Architecture Programmer's Manual Volume 2: System Programming":
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_24593.pdf
Section 7.4.1 specifically describes what happens to write buffers:
[...] the processor completely empties the write buffer by
writing the contents to memory as a result of performing any of
the following operations:
SFENCE Instruction
Executing a store-fence (SFENCE) instruction forces all memory
writes before the SFENCE (in program order) to be written into
memory before memory writes that follow the SFENCE instruction.
The memory-fence (MFENCE) instruction has a similar effect, but
it forces the ordering of loads in addition to stores.
[...]
So in fact SFENCE is the appropriate, architecturally guaranteed, thing
for us to be doing on x86_64.
With respect to Ben's contention that wmb() will suffice instead, that
isn't true, either, even on x86-class hardware. The writes absolutely
travel over the HT bus in non-ascending order on AMD64 systems unless we
fence them, and we've verified this using an HT bus analyser.
<b
On Wednesday 01 March 2006 20:20, Bryan O'Sullivan wrote:
>
> [...] the processor completely empties the write buffer by
> writing the contents to memory as a result of performing any of
> the following operations:
>
> SFENCE Instruction
> Executing a store-fence (SFENCE) instruction forces all memory
> writes before the SFENCE (in program order) to be written into
> memory before memory writes that follow the SFENCE instruction.
> The memory-fence (MFENCE) instruction has a similar effect, but
> it forces the ordering of loads in addition to stores.
> [...]
>
> So in fact SFENCE is the appropriate, architecturally guaranteed, thing
> for us to be doing on x86_64.
I don't interpret it as being a full synchronous write. It's just a barrier
preventing reordering. So the writes before could be in theory stuck
forever in some buffer - it just means they won't be later than the writes
after the fence.
Implementing the fences in the way you're suggesting would be very costly
because it could make them potentially stall for thousands of cycles.
I don't have a quote handy right now, but volume 3 of the Intel/AMD manuals
have their own chapters on the memory ordering rules, elaborating on this
in much more detail.
-Andi
On Wed, 2006-03-01 at 20:27 +0100, Andi Kleen wrote:
> I don't interpret it as being a full synchronous write.
Nor do I. We don't expect or need the write to make it all the way to
the device. We already do config space reads for the tiny handful of
non-performance-critical cases where we need to know that a write has in
fact made it to the device.
> It's just a barrier
> preventing reordering. So the writes before could be in theory stuck
> forever in some buffer - it just means they won't be later than the writes
> after the fence.
At worst, they'll be off the CPU (guaranteed per spec, as I already
quoted for you), and on the northbridge. It might hold on to them for a
bit in turn, but there's nothing we can do about that except a config
space read, and we don't need to guarantee that the write has completely
gone to the device. Real northbridges just don't buffer writes for very
long.
> Implementing the fences in the way you're suggesting would be very costly
> because it could make them potentially stall for thousands of cycles.
But it *doesn't*. On existing CPUs and systems, today, the phantom
worst-case semantics you are conjuring up simply do not exist. If
someone builds such an asinine system, the right approach is to handle
it once it exists.
<b
--
Bryan O'Sullivan <[email protected]>
On Wednesday 01 March 2006 20:43, Bryan O'Sullivan wrote:
> > Implementing the fences in the way you're suggesting would be very costly
> > because it could make them potentially stall for thousands of cycles.
>
> But it *doesn't*. On existing CPUs and systems, today, the phantom
> worst-case semantics you are conjuring up simply do not exist. If
> someone builds such an asinine system, the right approach is to handle
> it once it exists.
Normally we write code to the defined architecture.
Relying on undocumented side effects of instructions, as you're trying
to do here, is not very reliable and would likely cause breakage later,
especially when encoding it into the generic Linux interface.
-Andi
On Wed, 2006-03-01 at 20:49 +0100, Andi Kleen wrote:
> Relying on undocumented side effects of instructions as you're trying
> to do here is not very reliable and would likely cause breakage later.
I just quoted you the precise relevant semantics of sfence two messages
earlier, from the AMD64 optimisation guide. Here's the same thing from
the EM64T optimisation guide. See page 444 of the following PDF:
ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
Here's the relevant quote:
Writes may be delayed and combined in the write combining buffer
(WC buffer) to reduce memory accesses. If the WC buffer is
partially filled, the writes may be delayed until the next
occurrence of a serializing event; such as, an SFENCE or MFENCE
instruction, [...]
If you read section 10.3.1 of the same document (page 446), you'll see
that a CLFLUSH (as you suggested earlier) won't even work to get the WC
buffers to flush, because they're not part of the regular cache
hierarchy.
This section also makes it clear yet again that wmb() is absolutely not
sufficient to get program store order semantics in the presence of WC;
you *have* to use an explicit synchronising instruction of some kind.
This is pretty frustrating. I'm quoting the manufacturer documentation
at you, and you're handwaving back at me with complete nonsense.
<b
On Wed, Mar 01, 2006 at 12:05:07PM -0800, Bryan O'Sullivan wrote:
> This section also makes it clear yet again that wmb() is absolutely not
> sufficient to get program store order semantics in the presence of WC;
> you *have* to use an explicit synchronising instruction of some kind.
The semantics your code seems to care about are not those of a memory
barrier (which deals with ordering), but of a flush of the write combining
buffers. That's an important high-level distinction, as they are implemented
differently across architectures. Please rename the macro something like
flush_wc() and document it as such, at which point I remove my objection.
-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.
On Wed, 2006-03-01 at 15:26 -0500, Benjamin LaHaise wrote:
> Please rename the macro something like
> flush_wc() and document it as such, at which point I remove my objection.
Thanks. That's a useful suggestion; I don't want to give the thing a
misleading name.
<b
--
Bryan O'Sullivan <[email protected]>