2006-09-11 04:04:10

by Benjamin Herrenschmidt

Subject: [RFC] MMIO accessors & barriers documentation


Ok, here's formal documentation of the proposed accessor semantics. It
still contains a couple of questions (see [* Question]) that need
answering before we can start implementing anything, so I'm waiting for
feedback here. The Questions are grouped at the end of the document to
avoid cluttering it.

I've deliberately not included Segher's proposal of having the
"writel/readl" type accessors behave differently based on an ioremap
flag. There are pros and cons to that approach, but it is almost a
separate debate: we should first define the semantics we need, and
that's what this document attempts to do.


*** Definitions of MMIO accessors and IO related barriers semantics ***


* I * Ordering requirements:
============================

First, let's define 4 types of ordering requirements that can be
provided by MMIO accessors:

1- MMIO + MMIO: This type of ordering means that two consecutive MMIO
accesses performed by one processor are issued in program order on the
bus. Reads can't cross writes. Writes can't be re-ordered vs. each
other. There is no implication for MMIOs issued by different CPUs nor
for non-MMIO accesses.

2- memory W + MMIO W: This type of ordering means that a store to main
memory that is performed in program order before an MMIO store to a
device must be visible to that device before the MMIO store reaches it.
For example: an update of a DMA descriptor in memory is visible to the
chip before the MMIO write that causes the chip to go fetch it (see the
sketch at the end of this section). This is purely a store ordering; no
assumption is made about reads.
3- MMIO R + memory R: This type of ordering means that an MMIO read will
be effectively performed (the result returned by the device to the
processor) before a following read from memory. That is, the value
returned by that following read is what was present in the coherency
domain after the MMIO read completed. For example: reading a DMA
"pointer" from a device with an MMIO read, and then fetching the data in
memory up to that pointer.

4- MMIO W + spin_unlock: This type of ordering means that MMIO stores
followed by a spin unlock will have all reached the host PCI bridge
before the unlocking is visible to other CPUs. For example, two CPUs
have a locked section (same spinlock) issuing some MMIO stores to the
same device. Such ordering means that both sets of MMIO stores will not
be interleaved when reaching the host PCI controller (and thus the
device). All MMIO stores from one locked section will be performed
before all MMIO stores from the other.

[Note] Rule #4 is strictly specific to MMIO stores followed by a
spin_unlock(). There is no ordering requirement provided by Linux to
ensure ordering of an MMIO store followed by a generic memory store
unless an explicit barrier is used:

[ -> Question 1]
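
To make rules #2, #3 and #4 concrete, here is a minimal driver-like
sketch using the fully ordered accessors. The descriptor layout, register
offsets, lock and function names are hypothetical, not taken from any
real driver:

#include <linux/io.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct my_desc {                        /* hypothetical DMA descriptor */
        u32 addr;
        u32 len_flags;
};

static DEFINE_SPINLOCK(my_hw_lock);

/* Rules #2 and #4: memory stores, then an MMIO doorbell, under a lock. */
static void my_tx_kick(struct my_desc *ring, int idx, u32 buf, u32 len,
                       void __iomem *regs)
{
        spin_lock(&my_hw_lock);
        ring[idx].addr = buf;                   /* memory W */
        ring[idx].len_flags = len | 0x80000000; /* memory W: mark valid */
        /* Rule #2: both memory stores above are visible to the device
         * before this MMIO store reaches it. */
        writel(idx, regs + 0x10);               /* hypothetical DESC_KICK */
        /* Rule #4: the MMIO store has reached the host bridge before the
         * unlock is visible, so two CPUs' doorbells can't interleave. */
        spin_unlock(&my_hw_lock);
}

/* Rule #3: an MMIO read followed by memory reads of DMA'd data. */
static u32 my_rx_peek(struct my_desc *ring, void __iomem *regs)
{
        u32 tail = readl(regs + 0x14);          /* hypothetical RX_TAIL */
        /* Rule #3: this memory read sees what was in the coherency
         * domain after the MMIO read completed. */
        return ring[tail].len_flags;
}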

* II * Accessors:
=================

We provide 3 classes of accessors:

Class 1: Ordered accessors
--------------------------

[Note] None of these accessors will provide write combining

1- {read,write}{b,w,l,q} : Those accessors provide all MMIO ordering
requirements. They are thus called "fully ordered". That is #1, #2 and
#4 for writes and #1 and #3 for reads.

[ -> Question 2]

2- PIO accessors (all of them, that is inb...inl, ins*, out
equivalents,...): Those are fully ordered, all ordering rules apply. They
are slow anyways :)

3- memcpy_to_io, memcpy_from_io: #1 semantics apply (all MMIO loads or
stores are performed in order to each other). #2+#4 (stores) or #3
(loads) semantics apply to the operation as a whole. That is #2: all
previous memory stores are globally visible before the first MMIO store
of memcpy_to_io; #3: the last MMIO read (and thus all previous ones too,
due to rule #1) has been fully performed before a subsequent memory
read is performed by memcpy_from_io; and #4: all MMIO stores performed
by memcpy_to_io will have reached the host bridge before the effects of a
subsequent spin_unlock are visible.

4- io{read,write}{8,16,32}[be]: Those have the same semantics as 1 for
MMIO and the same semantics as 2 for PIO. As for the "repeat" versions
of those, they follow the semantics of memcpy_to_io and memcpy_from_io
(the only difference being the lack of increment of the MMIO address).

Class 2: Partially relaxed accessors
------------------------------------

[Note] Stores using those accessors will provide write combining on MMIO
(not PIO) regions which have been mapped with the appropriate <insert
call name here, TBD, possibly ioremap_wc>

1- __{read,write}{b,w,l,q} : Those accessors provide only ordering rule
#1. That is, MMIOs are ordered vs. each other as issued by one CPU.
Barriers are required to ensure ordering vs. memory and vs. locks (see
"Barriers" section).

2- __io{read,write}{8,16,32}[be] (optional ?) : Those have the same
semantics as 1 for MMIO, and provide the full ordering requirements as
defined in Class 1 for PIO.

3- __memcpy_to_io, __memcpy_from_io: Those provide only requirement #1;
that is, the MMIOs within the copy are performed in order and are in
order vs. preceding and subsequent MMIOs executed on the same CPU.

Class 3: Fully relaxed accessors
--------------------------------

[Note] Stores using those accessors will provide write combining the
same way as Class 2 accessors.

1- __raw_{read,write}{b,w,l,q} : Those accessors provide no ordering
rule whatsoever. They also provide no endian swapping. They are
essentially the equivalent of a direct load/store instruction to/from
the MMIO space. Access is done in platform native endian.


* III * IO related barriers
===========================

Some of the above accessors do not provide all ordering rules defined in
* I *, thus explicit barriers are provided to enforce those ordering
rules:

1- io_to_io_barrier() : This barrier provides ordering requirement #1
between two MMIO accesses. It's to be used in conjunction with fully
relaxed accessors of Class 3.

2- memory_to_io_wb() : This barrier provides ordering requirement #2
between a memory store and an MMIO store. It can be used in conjunction
with write accessors of Class 2 and 3.

3- io_to_memory_rb(value) : This barrier provides ordering requirement
#3 between an MMIO read and a subsequent read from memory. For
implementation purposes on some architectures, the value actually read
by the MMIO read shall be passed as an argument to this barrier. (This
makes it possible to generate the appropriate CPU instruction magic to
force the CPU to consider the value as being "used" and thus force the
read to be performed immediately). It can be used in conjunction with
read accessors of Class 2 and 3.

4- io_to_lock_wb() : This barrier provides ordering requirement #4
between an MMIO store and a subsequent spin_unlock(). It can be used in
conjunction with write accessors of Class 2 and 3.

[ -> Question 3]
[ -> Question 4]
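
To illustrate how those barriers pair with the relaxed accessors, here
is a minimal sketch re-using the hypothetical my_desc ring and my_hw_lock
from the earlier sketch. The relaxed accessors and barrier names are the
proposed ones from this document (they do not exist as such in the tree
yet) and the register offsets are hypothetical:

/* Class 3 + barrier #1: two raw MMIO stores that must stay in order. */
static void my_fifo_push(void __iomem *regs, u32 data)
{
        __raw_writel(data, regs + 0x00);   /* FIFO data (no byteswap) */
        io_to_io_barrier();                /* keep the two MMIO stores ordered */
        __raw_writel(1, regs + 0x04);      /* FIFO "consume" trigger */
}

/* Class 2 + barriers #2 and #4: relaxed doorbell write under a lock. */
static void my_tx_kick_relaxed(struct my_desc *ring, int idx, u32 buf,
                               u32 len, void __iomem *regs)
{
        spin_lock(&my_hw_lock);
        ring[idx].addr = buf;
        ring[idx].len_flags = len | 0x80000000;
        memory_to_io_wb();                 /* rule #2: memory W before MMIO W */
        __writel(idx, regs + 0x10);        /* relaxed doorbell write */
        io_to_lock_wb();                   /* rule #4: MMIO W before the unlock */
        spin_unlock(&my_hw_lock);
}

/* Class 2 + barrier #3: relaxed MMIO read before reading DMA'd memory. */
static u32 my_rx_peek_relaxed(struct my_desc *ring, void __iomem *regs)
{
        u32 tail = __readl(regs + 0x14);
        io_to_memory_rb(tail);             /* pass the value actually read */
        return ring[tail].len_flags;
}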

[Note] Barriers commonly used by drivers and not described here are the
memory-to-memory read and write barriers (rmb, wmb, mb). Those are
necessary when manipulating data structures in memory that are accessed
at the same time via DMA. The rules here are identical to the usual SMP
data ordering rules and are beyond the scope of this document.

[ -> Question 5]

* IV * Mixing of accessors
==========================

There are few rules concerning the mixing of accessors of the different
ordering Classes. Basically, when accessors of different classes share
an ordering rule, that rule applies. If not, it doesn't. For
example:

writel followed by __writel : both accessors provide rule #1, thus it
applies and stores are visible in order. Since the previous writel will
have ordered previous stores to memory, the second __writel naturally
benefits from this despite the fact that __writel doesn't normally
provide that semantic.
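
As a minimal sketch of the example above (hypothetical register
offsets):

static void my_mixing_example(void __iomem *regs, u32 cfg)
{
        writel(1, regs + 0x00);     /* ordered store: also orders prior
                                     * memory stores (rule #2) */
        __writel(cfg, regs + 0x04); /* relaxed store: rule #1 is shared,
                                     * so it is still issued after the
                                     * writel above */
}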

Ben.


Questions:
==========

[* Question 1] Should Rule #4 be generalized to MMIO store followed by a
memory store ? (as spin_unlock is essentially a wmb followed by a
memory store) or do we need to keep a rule specific for locks to avoid
arch-specific pitfalls on some architectures ? In that case, do we need a
specific barrier to provide MMIO store followed by a memory store ? That
sort of ordering is not generally useful and is generally expensive as
it requires accessing the PCI host bridge to enforce that the previous
MMIO stores have reached the bus. Drivers generally don't need such a
rule or a barrier, as they have to deal with write posting anyway, and
thus use an MMIO read to provide the necessary synchronisation when it
makes sense.

[* Question 2] : Do we actually want the "ordered" accessors to also provide
ordering rule #4 in the general case ? This can be very expensive on
some architectures like ia64 where, I think, it has to actually access
the PCI host bridge to provide the guarantee that the previous MMIO
stores have reached it before the unlock is made visible to the
coherency domain. If we decide not to, then an explicit barrier will
still be needed in most drivers before spin_unlock(). This is the
current mmiowb() barrier that I'm proposing to rename (section * III *).
A way to provide that ordering requirement with less performance impact
is to instead set a per-cpu flag in writeX(), and test it in
spin_unlock() which would then do the barrier only if the flag is set.
It's to be measured whether the impact on unrelated spin_unlock() is low
enough to make that solution realistic.
If we decide to not enforce rule #4 for ordered accessors, and thus
require the barrier before spin_unlock, the above trick could still be
implemented as a debug option to "detect" the lack of appropriate
barriers.
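
For reference, a minimal sketch of that per-cpu flag trick, assuming the
proposed io_to_lock_wb() barrier from section * III *; the flag and
wrapper names are hypothetical, this is not an existing kernel API:

#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/io.h>

static DEFINE_PER_CPU(int, mmio_pending);

/* Hypothetical ordered store that records "an MMIO store is pending";
 * assumes it runs with preemption disabled (e.g. under a spinlock). */
static inline void my_writel_tracked(u32 val, void __iomem *addr)
{
        writel(val, addr);
        __this_cpu_write(mmio_pending, 1);
}

/* Hypothetical unlock that only pays for the barrier when needed. */
static inline void my_spin_unlock(spinlock_t *lock)
{
        if (__this_cpu_read(mmio_pending)) {
                io_to_lock_wb();        /* proposed barrier, section * III * */
                __this_cpu_write(mmio_pending, 0);
        }
        spin_unlock(lock);
}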

[* Question 3] If we decide that accessors of Class 1 do not provide rule
#4, then this barrier is to be used for all classes of accessors, except
maybe PIO which should always be fully ordered.

[* Question 4] Would it be a useful optimisation on archs like ia64 to
require this accessor to take the struct device of the device as an
argument (which can be NULL for a "generic" barrier) or does it not matter ?

[* Question 5] Should we document the rules for memory-memory barriers
here as well ? (and give examples, like live updating of a network
driver ring descriptor entry)



2006-09-11 08:35:10

by Alan

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 14:03 +1000, Benjamin Herrenschmidt wrote:
> be interleaved when reaching the host PCI controller (and thus the

"a host PCI controller". The semantics with multiple independant PCI
busses are otherwise evil.

> 1- {read,write}{b,w,l,q} : Those accessors provide all MMIO ordering
> requirements. They are thus called "fully ordered". That is #1, #2 and
> #4 for writes and #1 and #3 for reads.

#4 may be incredibly expensive on NUMA boxes.

> 3- memcpy_to_io, memcpy_from_io: #1 semantics apply (all MMIO loads or
> stores are performed in order to each other). #2+#4 (stores) or #3

What is "in order" here. "In ascending order of address" would be
tighter.

> 1- __{read,write}{b,w,l,q} : Those accessors provide only ordering rule
> #1. That is, MMIOs are ordered vs. each other as issued by one CPU.
> Barriers are required to ensure ordering vs. memory and vs. locks (see
> "Barriers" section).

"Except where the underlying device is marked as cachable or
prefetchable"

Q2:
> coherency domain. If we decide not to, then an explicit barrier will
> still be needed in most drivers before spin_unlock(). This is the
> current mmiowb() barrier that I'm proposing to rename (section * III *).

I think we need mmiowb() still anyway (for __writel etc)

> If we decide to not enforce rule #4 for ordered accessors, and thus
> require the barrier before spin_unlock, the above trick, could still be
> implemented as a debug option to "detect" the lack of appropriate
> barriers.

This I think is an excellent idea.

> [* Question 3] If we decide that accessors of Class 1 do not provide rule
> #4, then this barrier is to be used for all classes of accessors, except
> maybe PIO which should always be fully ordered.

On x86 PIO (outb/inb) etc are always ordered and always stall until the
cycle completes on the device.

> [* Question 5] Should we document the rules for memory-memory barriers
> here as well ? (and give examples, like live updating of a network
> driver ring descriptor entry)
>

Update the existing docs


2006-09-11 09:18:45

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 09:57 +0100, Alan Cox wrote:
> On Mon, 2006-09-11 at 14:03 +1000, Benjamin Herrenschmidt wrote:
> > be interleaved when reaching the host PCI controller (and thus the
>
> "a host PCI controller". The semantics with multiple independant PCI
> busses are otherwise evil.

Ok.

> > 1- {read,write}{b,w,l,q} : Those accessors provide all MMIO ordering
> > requirements. They are thus called "fully ordered". That is #1, #2 and
> > #4 for writes and #1 and #3 for reads.
>
> #4 may be incredibly expensive on NUMA boxes.

Yes, and that's why there is Question #2 :)

I don't care either way for PowerPC at this point, but it's an open
question and I'd like folks like you to tell me what you prefer.

> > 3- memcpy_to_io, memcpy_from_io: #1 semantics apply (all MMIO loads or
> > stores are performed in order to each other). #2+#4 (stores) or #3
>
> What is "in order" here. "In ascending order of address" would be
> tighter.

In program order. Every time I say "in order", I mean "in program
order". I agree that this is not enough precision as it's not obvious
that memcpy will copy in ascending order of addresses (it doesn't have
to), I'll add that precision... or not. That could be another question.
What do we want here ? I would rather have those strongly ordered for
Class 1.

> > 1- __{read,write}{b,w,l,q} : Those accessors provide only ordering rule
> > #1. That is, MMIOs are ordered vs. each other as issued by one CPU.
> > Barriers are required to ensure ordering vs. memory and vs. locks (see
> > "Barriers" section).
>
> "Except where the underlying device is marked as cachable or
> prefetchable"

You aren't supposed to use MMIO accessors on cacheable memory, are you ?
On PowerPC, even if using cacheable mappings, they would still be
visible in order to the coherency domain, though being cacheable, there
is indeed no saying in what order they'll end up hitting the PCI host
bridge. In fact, I know of platforms (like Apple G5s) which cannot cope
with cacheable mappings of anything behind HT... I'd keep use of
cacheable mappings as an arch-specific special case for now, and that
definitely doesn't allow for MMIO accessors ...

> Q2:
> > coherency domain. If we decide not to, then an explicit barrier will
> > still be needed in most drivers before spin_unlock(). This is the
> > current mmiowb() barrier that I'm proposing to rename (section * III *).
>
> I think we need mmiowb() still anyway (for __writel etc)

Oh, we surely have a barrier providing that semantic (I call it
io_to_lock_wb() in my proposal, and it can be #defined to mmiowb to ease
driver migration). The question is whether we want rule #4 to be enforced
by accessors of Class 1 or not ..

> > If we decide to not enforce rule #4 for ordered accessors, and thus
> > require the barrier before spin_unlock, the above trick, could still be
> > implemented as a debug option to "detect" the lack of appropriate
> > barriers.
>
> This I think is an excellent idea.

Thanks :)

> > [* Question 3] If we decide that accessors of Class 1 do not provide rule
> > #4, then this barrier is to be used for all classes of accessors, except
> > maybe PIO which should always be fully ordered.
>
> On x86 PIO (outb/inb) etc are always ordered and always stall until the
> cycle completes on the device.

Yes and I think that as far as PIO is concerned, we shall remain as
close as possible to x86. PIO is mostly used by "old stuff", that is
drivers that are likely not to have been adapted/audited to understand
ordering issues, and is generally slow anyway. Thus even if we decide to
relax rule #4 for Class 1 MMIO accessors, I'd be tempted to keep it for
PIO (and config space too btw)

> > [* Question 5] Should we document the rules for memory-memory barriers
> > here as well ? (and give examples, like live updating of a network
> > driver ring descriptor entry)
> >
>
> Update the existing docs

Ok.

Thanks for your comments. I'll wait for more of these and post an
updated version tomorrow. I'm still waiting for your preference
regarding whether or not to include rule #4 for Class 1 (ordered) MMIO accessors.

Cheers,
Ben.


2006-09-11 09:46:29

by Alan

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 19:17 +1000, Benjamin Herrenschmidt wrote:
> > > 3- memcpy_to_io, memcpy_from_io: #1 semantics apply (all MMIO loads or
> > > stores are performed in order to each other). #2+#4 (stores) or #3
> >
> > What is "in order" here. "In ascending order of address" would be
> > tighter.
>
> In program order. Every time I say "in order", I mean "in program
> order". I agree that this is not enough precision as it's not obvious
> that memcpy will copy in ascending order of addresses (it doesn't have
> to), I'll add that precision... or not. THat could be another question.
> What do we want here ? I would rather have those strongly ordered for
> Class 1.

I'd rather memcpy_to/from_io only made guarantees about the start/end of
the transfer and not order of read/writes or size of read/writes. The
reason being that a more restrictive sequence can be efficiently
expressed using read/writefoo but the reverse is not true.

> > "Except where the underlying device is marked as cachable or
> > prefetchable"
>
> You aren't supposed to use MMIO accessors on cacheable memory, are you ?

Why not? Providing it is in MMIO space: consider ROMs for example, or
for the write path consider frame buffers.

> with cacheable mappings of anything behind HT... I'd keep use of
> cacheable mapping as an arch specific special case for now, and that
> definitely doesn't allow for MMIO accessors ...

I'm describing existing semantics 8)


2006-09-11 10:00:51

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation

> I'd rather memcpy_to/from_io only made guarantees about the start/end of
> the transfer and not order of read/writes or size of read/writes. The
> reason being that a more restrictive sequence can be efficiently
> expressed using read/writefoo but the reverse is not true.

Ok, so we would define ordering on the first and last accesses (being
the first and last in ascending addresses order) and leave it free to
the implementation to do what it wants in between. Is that ok ?

> > > "Except where the underlying device is marked as cachable or
> > > prefetchable"
> >
> > You aren't supposed to use MMIO accessors on cacheable memory, are you ?
>
> Why not. Providing it is in MMIO space, consider ROMs for example or
> write path consider frame buffers.

If we consider cacheable accesses, we need to also provide cache
flushing primitives as MMIO devices are generally not coherent. Take for
example the case of the frame buffer: you may want to upload a texture,
and later use it with the engine. You need a way in between to make sure
all the cached dirty lines have been pushed to the device before you
start the engine. Since we provide no generically usable functions for
doing such cache coherency on MMIO space, I'd rather keep usage of MMIO
accessors on cacheable storage undefined. That is, add a simple note at
the top of the file that the rules defined here only apply to
non-cacheable mappings. Is that ok ?

> > with cacheable mappings of anything behind HT... I'd keep use of
> > cacheable mapping as an arch specific special case for now, and that
> > definitely doesn't allow for MMIO accessors ...
>
> I'm describing existing semantics 8)

Well, there are no clear existing semantics, at least not globally across
all archs, for cacheable access to MMIO, so yeah, let's say that ordering
on cacheable storage is left undefined :)

Ben.


2006-09-11 17:06:13

by Alan

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 19:59 +1000, Benjamin Herrenschmidt wrote:
> Ok, so we would define ordering on the first and last accesses (being
> the first and last in ascending addresses order) and leave it free to
> the implementation to do what it wants in between. Is that ok ?

Not sure you can go that far. I'd stick to "_fromio/_toio" transfer
blocks of data efficiently between host and bus addresses. The
guarantees are the same as readl/writel respectively with respect to the
start and end of the transfer.

[How do you define start and end addresses with memcpy_fromio(foo, bar,
4) for example ]


2006-09-11 18:39:12

by Jesse Barnes

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Sunday, September 10, 2006 9:03 pm, Benjamin Herrenschmidt wrote:
> 1- {read,write}{b,w,l,q} : Those accessors provide all MMIO ordering
> requirements. They are thus called "fully ordered". That is #1, #2 and
> #4 for writes and #1 and #3 for reads.

Fine.

> 2- PIO accessors (all of them, that is inb...inl, ins*, out
> equivalents,...): Those are fully ordered, all ordering rules apply. They
> are slow anyways :)

Yeah, I think these are already defined to operate this way. Not sure if that
fact is documented clearly though (haven't checked).

> 3- memcpy_to_io, memcpy_from_io: #1 semantics apply (all MMIO loads or
> stores are performed in order to each other). #2+#4 (stores) or #3
> (loads) semantics apply to the operation as a whole. That is #2: all
> previous memory stores are globally visible before the first MMIO store
> of memcpy_to_io, #3: The last MMIO read (and thus all previous ones too
> due to rule #1) have been fully performed before a subsequent memory
> read is performed by memcpy_from_io. And #4: all MMIO stores performed
> by memcpy_to_io will have reached the host bridge before the effect of a
> subsequent spin_unlock are visible.

See Alan's comments here. I don't think the intra-memcpy semantics have to be
defined as strongly as you say here... it should be enough to treat the whole
memcpy as a unit, not specifying what happens inside but rather defining it
to be strongly ordered wrt previous and subsequent code.

> 4- io{read,write}{8,16,32}[be]: Those have the same semantics as 1 for
> MMIO and the same semantics as 2 for PIO. As for the "repeat" versions
> of those, they follow the semantics of memcpy_to_io and memcpy_from_io
> (the only difference being the lack of increment of the MMIO address).

This reminds me... when these routines were added I asked that they be defined
as having weak ordering wrt DMA (does linux-arch have archives?), but then I
think Linus changed his mind?

> 1- __{read,write}{b,w,l,q} : Those accessors provide only ordering rule
> #1. That is, MMIOs are ordered vs. each other as issued by one CPU.
> Barriers are required to ensure ordering vs. memory and vs. locks (see
> "Barriers" section).

Ok, but I still don't like the naming. __ implies some sort of implementation
detail and doesn't communicate meaning very clearly. But I'm not going to
argue too much about it.

> Some of the above accessors do not provide all ordering rules define in
> * I *, thus explicit barriers are provided to enforce those ordering
> rules:
>
> 1- io_to_io_barrier() : This barrier provides ordering requirement #1
> between two MMIO accesses. It's to be used in conjuction with fully
> relaxed accessors of Class 3.

Ok, basically mb() but for I/O space.

> 2- memory_to_io_wb() : This barrier provides ordering requirement #2
> between a memory store and an MMIO store. It can be used in conjunction
> with write accessors of Class 2 and 3.
>
> 3- io_to_memory_rb(value) : This barrier provides ordering requirement
> #3 between an MMIO read and a subsequent read from memory. For
> implementation purposes on some architectures, the value actually read
> by the MMIO read shall be passed as an argument to this barrier. (This
> allows to generate the appropriate CPU instruction magic to force the
> CPU to consider the value as being "used" and thus force the read to be
> performed immediately). It can be used in conjunction with read
> accessors of Class 2 and 3

These sound fine. I think PPC64 is the only platform that will need them?

> 4- io_to_lock_wb() : This barrier provides ordering requirement #4
> between an MMIO store and a subsequent spin_unlock(). It can be used in
> conjunction with write accessors of Class 2 and 3.

Ok.

> [Note] A barrier commonly used by drivers and not described here are the
> memory-to-memory read and write barriers (rmb, wmb, mb). Those are
> necessary when manipulating data structures in memory that are accessed
> at the same time via DMA. The rules here are identical to the usual SMP
> data ordering rules and are beyond the scope of this document.

Unless as Alan suggests these barriers are also documented in
memory-barriers.txt (probably a good place).

> [* Question 1] Should Rule #4 be generalized to MMIO store followed by a
> memory store ? (as spin_unlock are essentially a wmb followed by a
> memory store) or do we need to keep a rule specific for locks to avoid
> arch specific pitfalls on some architecture ? In that case, do we need a
> specific barrier to provide MMIO store followed by a memory store ? That
> sort of ordering is not generally useful and is generally expensive as
> it requires to access the PCI host bridge to enforce that the previous
> MMIO stores have reached the bus. Drivers generally don't need such a
> rule or a barrier, as they have to deal with write posting anyway, and
> thus use an MMIO read to provide the necessary synchronisation when it
> makes sense.

But isn't this how you'll implement io_to_lock_wb() on PPC anyway? If so,
might be best to name it and document it that way (though keeping the idea of
barriering before unlocking prominent in the documentation).

> [* Question 2] : Do we actually want the "ordered" accessors to also
> provide ordering rule #4 in the general case ?

Isn't that the whole point of making the regular readX/writeX strongly
ordered? To get rid of the need for mmiowb() in the general case and make it
into a performance optimization to be used in conjunction with __writeX?

> If we decide to not enforce rule #4 for ordered accessors, and thus
> require the barrier before spin_unlock, the above trick, could still be
> implemented as a debug option to "detect" the lack of appropriate
> barriers.

I think this should be done in any case, and I think it can be done in generic
code (using per-cpu counters in the spinlock and mmiowb() routines); it's a
good idea.

> [* Question 3] If we decide that accessors of Class 1 do not provide rule
> #4, then this barrier is to be used for all classes of accessors, except
> maybe PIO which should always be fully ordered.

Right, though see above about my understanding of the genesis of this
discussion. :)

> [* Question 4] Would it be a useful optimisation on archs like ia64 to
> require this accessor to take the struct device of the device as an
> argument (with can NULL for a "generic" barrier) or it doesn't matter ?

For ia64 in particular it doesn't matter, though there was speculation several
years ago that it might be necessary. No actual examples stepped forward though,
so the current implementation doesn't take an argument.

> [* Question 5] Should we document the rules for memory-memory barriers
> here as well ? (and give examples, like live updating of a network
> driver ring descriptor entry)

Should probably be added to memory-barriers.txt.

Thanks,
Jesse

2006-09-11 21:34:34

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 18:26 +0100, Alan Cox wrote:
> On Mon, 2006-09-11 at 19:59 +1000, Benjamin Herrenschmidt wrote:
> > Ok, so we would define ordering on the first and last accesses (being
> > the first and last in ascending addresses order) and leave it free to
> > the implementation to do what it wants in between. Is that ok ?
>
> Not sure you can go that far. I'd stick to "_fromio/_toio" transfer
> blocks of data efficiently between host and bus addresses. The
> guarantees are the same as readl/writel respectively with respect to the
> start and end of the transfer.
>
> [How do you define start and end addresses with memcpy_fromio(foo, bar,
> 4) for example ]

Ok. So they behave like a writel or a readl globally with respect to
other accesses, but there is no guarantee about the order or size of the
individual transfers making them up.

Ben


2006-09-11 21:46:33

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation


> > 2- memory_to_io_wb() : This barrier provides ordering requirement #2
> > between a memory store and an MMIO store. It can be used in conjunction
> > with write accessors of Class 2 and 3.
> >
> > 3- io_to_memory_rb(value) : This barrier provides ordering requirement
> > #3 between an MMIO read and a subsequent read from memory. For
> > implementation purposes on some architectures, the value actually read
> > by the MMIO read shall be passed as an argument to this barrier. (This
> > allows to generate the appropriate CPU instruction magic to force the
> > CPU to consider the value as being "used" and thus force the read to be
> > performed immediately). It can be used in conjunction with read
> > accessors of Class 2 and 3
>
> These sound fine. I think PPC64 is the only platform that will need them?

Ah ? What about the comment in e1000 saying that it needs a wmb()
between descriptor updates in memory and the mmio to kick them ? That
would typically be a memory_to_io_wb(). Or are your MMIOs ordered vs.
your cacheable stores ?

> > 4- io_to_lock_wb() : This barrier provides ordering requirement #4
> > between an MMIO store and a subsequent spin_unlock(). It can be used in
> > conjunction with write accessors of Class 2 and 3.
>
> Ok.
>
> > [Note] A barrier commonly used by drivers and not described here are the
> > memory-to-memory read and write barriers (rmb, wmb, mb). Those are
> > necessary when manipulating data structures in memory that are accessed
> > at the same time via DMA. The rules here are identical to the usual SMP
> > data ordering rules and are beyond the scope of this document.
>
> Unless as Alan suggests these barriers are also documented in
> memory-barriers.txt (probably a good place).

They are, but I was thinking about providing more IO-like examples. I
suppose I could refer to memory-barriers.txt from here and update it
with IO-like examples.

> > [* Question 1] Should Rule #4 be generalized to MMIO store followed by a
> > memory store ? (as spin_unlock are essentially a wmb followed by a
> > memory store) or do we need to keep a rule specific for locks to avoid
> > arch specific pitfalls on some architecture ? In that case, do we need a
> > specific barrier to provide MMIO store followed by a memory store ? That
> > sort of ordering is not generally useful and is generally expensive as
> > it requires to access the PCI host bridge to enforce that the previous
> > MMIO stores have reached the bus. Drivers generally don't need such a
> > rule or a barrier, as they have to deal with write posting anyway, and
> > thus use an MMIO read to provide the necessary synchronisation when it
> > makes sense.
>
> But isn't this how you'll implement io_to_lock_wb() on PPC anyway? If so,
> might be best to name it and document it that way (though keeping the idea of
> barriering before unlocking prominent in the documentation).

Well, the whole question is what the Linux semantics guarantee to
driver writers (across archs), not what PowerPC implements :) I'd
rather not add guarantees that aren't useful to drivers even if all
current implementations happen to provide them. I'm trying to find a
case where ordering MMIO W + memory W is useful and I can't see any,
since the MMIO W can take an arbitrary amount of time to reach the
device anyway. The lock rule seems to be the only useful one, thus the
only one I think I'll guarantee.

> > [* Question 2] : Do we actually want the "ordered" accessors to also
> > provide ordering rule #4 in the general case ?
>
> Isn't that the whole point of making the regular readX/writeX strongly
> ordered? To get rid of the need for mmiowb() in the general case and make it
> into a performance optimization to be used in conjunction with __writeX?

Well, as far as I'm concerned, the whole point is rule #2 and #3 :)
Those are the ones biting us on PowerPC (we haven't seen the lock
problem but then it can't happen the way our current accessors are
written. However, if we change our accessors to provide rule #2 more
specifically, we'll end up with 2 sync instructions in writel, one for
rule #2 before the store and one for rule #4, thus we go from expensive
to very expensive). It's also my understanding that mmiowb is very
expensive on ia64 and gets worse as the box grows bigger.

Hence the question: do we provide -fully- ordered accessors in class 1,
or do we provide -mostly- ordered accessors, ordered in all ways except
rule #4 vs. locks? ia64 is afaik by far the platform taking the biggest
hit if you have to provide #4, so I'm interested in your point of view
here.

> > If we decide to not enforce rule #4 for ordered accessors, and thus
> > require the barrier before spin_unlock, the above trick, could still be
> > implemented as a debug option to "detect" the lack of appropriate
> > barriers.
>
> I think this should be done in any case, and I think it can be done in generic
> code (using per-cpu counters in the spinlock and mmiowb() routines); it's a
> good idea.

We don't need counters, just a flag. We did a test implementation, seems
to work. We also clear the flag in spin_lock. That means that MMIOs
issued before a lock aren't ordered vs. the locked section. But because
of rule #1, they should be ordered vs. other MMIOs inside the locked
section and thus implicitly get ordered anyway.

> > [* Question 3] If we decide that accessors of Class 1 do not provide rule
> > #4, then this barrier is to be used for all classes of accessors, except
> > maybe PIO which should always be fully ordered.
>
> Right, though see above about my understanding of the genesis of this
> discussion. :)

As far as I'm concerned, the genesis of this discussion is rules #2 and #3,
not #4 :) Though the latter quickly came up, of course.

> > [* Question 4] Would it be a useful optimisation on archs like ia64 to
> > require this accessor to take the struct device of the device as an
> > argument (with can NULL for a "generic" barrier) or it doesn't matter ?
>
> For ia64 in particular it doesn't matter, though there was speculation several
> years that it might be necessary. No actual examples stepped forward though,
> so the current implementation doesn't take an argument.

Ok. My question is whether it would improve the implementation to take
it. If we define a new macro with a new name, we can do it....

> > [* Question 5] Should we document the rules for memory-memory barriers
> > here as well ? (and give examples, like live updating of a network
> > driver ring descriptor entry)
>
> Should probably be added to memory-barriers.txt.

Yup, agreed.

Cheers,
Ben.


2006-09-11 21:54:33

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

Benjamin Herrenschmidt wrote:
> Ah ? What about the comment in e1000 saying that it needs a wmb()
> between descriptor updates in memory and the mmio to kick them ? That
> would typically be a memory_to_io_wb(). Or are your MMIOs ordered cs.
> your cacheable stores ?

That's likely just following existing practice found in many network
drivers. The following two design patterns have been copied across a
great many network drivers:

1) When in a loop, reading through a DMA ring, put an "rmb()" at the top
of the loop, to ensure that the compiler does not optimize out all
memory loads after the first.

2) Use "wmb()" to ensure that just-written-to memory is visible to a PCI
device that will be reading said memory region via DMA.

I don't claim that either of these is correct, just that it's existing
practice, perhaps in some cases perpetuated by my own arch ignorance.
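
For illustration, a hedged sketch of how those two patterns typically
look; the descriptor layout, helpers and register offset are
hypothetical, not from any particular driver:

/* Pattern 1: rmb() at the top of the RX loop, after the ownership test. */
static void my_rx_clean(struct my_rx_desc *ring, int *idx, int ring_size)
{
        while (my_desc_owned_by_host(&ring[*idx])) {    /* hypothetical test */
                rmb();  /* don't read the rest of the descriptor before
                         * the ownership check has really been done */
                my_process_rx(&ring[*idx]);             /* hypothetical helper */
                *idx = (*idx + 1) % ring_size;
        }
}

/* Pattern 2: wmb() so a just-written descriptor is visible to the device
 * before the doorbell MMIO tells it to go fetch it. */
static void my_xmit_kick(struct my_tx_desc *ring, int idx,
                         void __iomem *regs)
{
        my_fill_tx_desc(&ring[idx]);    /* hypothetical helper: memory stores */
        wmb();
        writel(idx, regs + 0x18);       /* hypothetical TX tail register */
}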

So, in a perfect world where I was designing my own API, I would create
two new API functions:

prepare_to_read_dma_memory()
and
make_memory_writes_visible_to_dmaing_devices()

and leave the existing APIs untouched. Those are the two fundamental
operations that are needed.

Jeff


2006-09-11 22:05:36

by Jesse Barnes

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Monday, September 11, 2006 2:45 pm, Benjamin Herrenschmidt wrote:
> > These sound fine. I think PPC64 is the only platform that will need
> > them?
>
> Ah ? What about the comment in e1000 saying that it needs a wmb()
> between descriptor updates in memory and the mmio to kick them ? That
> would typically be a memory_to_io_wb(). Or are your MMIOs ordered cs.
> your cacheable stores ?

I think that's a separate issue? As Jeff points out, those macros are
intended to provide memory vs. I/O ordering, but isn't PPC the only platform
that will reorder accesses so aggressively and independently? I don't think
ia64 for example will reorder them separately, so a regular memory barrier
*should* be enough to ensure ordering in both domains.

> They are, but I was thinking about providing more IO-like examples. I
> suppose I could refer to memory-barriers.txt from here and update it
> with IO-like examples.

Yeah, either way. Not sure if adding more I/O examples to the existing doc is
better or worse than an I/O specific document.

> > But isn't this how you'll implement io_to_lock_wb() on PPC anyway? If
> > so, might be best to name it and document it that way (though keeping the
> > idea of barriering before unlocking prominent in the documentation).
>
> Well, the whole question is what does the linux semantics guarantee to
> driver writers (accross archs), not what PowerPC implements :) I'd
> rather not add guarantees that aren't useful to drivers even if all
> current implementations happen to provide them. I'm trying to find a
> case where ordering MMIO W + memory W is useful and I can't see any
> since the MMIO W will take any time to go to the device anyway. The lock
> rule seems to be the only useful, thus the only I think I'll guarantee.

Sure, that's fair. If any potential application of the more precise semantics
is just theoretical, we may as well limit our guarantees to locks only.

> Well, as far as I'm concerned, the whole point is rule #2 and #3 :)
> Those are the ones biting us on PowerPC (we haven't seen the lock
> problem but then it can't happen the way our current accessors are
> written. However, if we change our accessors to provide rule #2 more
> specifically, we'll end up with 2 sync instructions in writel, one for
> rule #2 before the store and one for rule #4, thus we go from expensive
> to very expensive). It's also my understanding that mmiowb is very
> expensive on ia64 and gets worse as the box grows bigger.

Yeah, that's true (I see your point about being more worried about other
things on PPC as well ;).

> Hence the question: do we provide -fully- ordered accessors in class 1,
> or do we provide -mostly- ordered accessors, ordered in all means except
> rule #4 vs locks. ia64 is afaik by far the platform taking the biggest
> hit if you have to provide #4, so I'm interesting in your point of view
> here.

Either way is fine with me as long as we have a way to get at the fast and
loose stuff (and required barriers of course) in a portable way. And that we
don't regress the existing users of mmiowb().

> We don't need counters, just a flag. We did a test implementation, seems
> to work. We also clear the flag in spin_lock. That means that MMIOs
> issued before a lock aren't ordered vs. the locked section. But because
> of rule #1, they should be ordered vs. other MMIOs inside the locked
> section and thus implicitely get ordered anyway.

Oh right, a flag would be enough. Is it good enough for -mm yet? Might be
fun to run on an Altix machine with a bunch of supported devices (not that I
work with them anymore...).

> > For ia64 in particular it doesn't matter, though there was speculation
> > several years that it might be necessary. No actual examples stepped
> > forward though, so the current implementation doesn't take an argument.
>
> Ok. My question is wether it would improve the implementation to take
> it. If we define a new macro with a new name, we can do it....

Right, but unless there's a real need at this point, we probably shouldn't
bother. Let the poor sucker with the future machine needing the device
argument do the work. :)

Thanks,
Jesse

2006-09-11 22:57:44

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation

On Mon, 2006-09-11 at 17:54 -0400, Jeff Garzik wrote:
> Benjamin Herrenschmidt wrote:
> > Ah ? What about the comment in e1000 saying that it needs a wmb()
> > between descriptor updates in memory and the mmio to kick them ? That
> > would typically be a memory_to_io_wb(). Or are your MMIOs ordered cs.
> > your cacheable stores ?
>
> That's likely just following existing practice found in many network
> drivers. The following two design patterns have been copied across a
> great many network drivers:

Well, I was mentioning that one specifically because of this comment:

/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch. (Only
 * applicable for weak-ordered memory model archs,
 * such as IA-64). */

Which made me ask whether ia64 was or was not ordering a memory store
followed by an MMIO store; that is, do ia64's -current- accessors provide
rule #2 (memory W + MMIO W) or not, and would it benefit from not having
to provide it with my new partially relaxed accessors ?

> 1) When in a loop, reading through a DMA ring, put an "rmb()" at the top
> of the loop, to ensure that the compiler does not optimize out all
> memory loads after the first.

and rmb is heavy-handed for a compiler barrier :) what you might need on
some platforms is an rmb between the MMIO read of whatever status/index
register and the following memory reads of descriptors, and you may want
an rmb in cases where it matters whether the chip has been changing a
value behind your back (which it generally doesn't) but that's pretty
much it....

> 2) Use "wmb()" to ensure that just-written-to memory is visible to a PCI
> device that will be reading said memory region via DMA.

That will definitely help on PowerPC with our current accessors which
are mostly ordered except for that rule #2 I mentioned above.

> I don't claim that either of these is correct, just that's existing
> practice, perhaps in some case perpetuated by my own arch ignorance.

No worries :) That's also why I'm trying to describe precisely what
semantics are provided by the MMIO accessors with real world examples in
a way that is not arch-dependent. The 4 "rules" I've listed in the first
part are precisely what should be needed for drivers, then I list the
accessors and what rules they are guaranteed to comply with, then I list
the barriers that allow enforcing those ordering rules when the
accessors don't.

> So, in a perfect world where I was designing my own API, I would create
> two new API functions:
>
> prepare_to_read_dma_memory()
> and
> make_memory_writes_visible_to_dmaing_devices()
>
> and leave the existing APIs untouched. Those are the two fundamental
> operations that are needed.

Well, the argument currently is to make writel and readl imply the above
barriers by making them fully ordered (and slow on some platforms), and
to also provide more weakly ordered routines along with barriers for
people who know what they are doing. The above 2 barriers are what I've called
io_to_memory_rb() and memory_to_io_wb() (actually,
prepare_to_read_dma_memory() by itself doesn't really make much sense.
It does in conjunction with an MMIO read to flush DMA buffers, in which
case the barrier provides an ordering guarantee that the memory reads
will only be performed after the MMIO read has fully completed).

Ben.


2006-09-11 23:01:37

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation


> I think that's a separate issue? As Jeff points out, those macros are
> intended to provide memory vs. I/O ordering, but isn't PPC the only platform
> that will reorder accesses so aggressively and independently? I don't think
> ia64 for example will reorder them separately, so a regular memory barrier
> *should* be enough to ensure ordering in both domains.

Well, I don't know, that's what I'm asking since the comment in the
driver specifically mentions IA64 :)

> > Hence the question: do we provide -fully- ordered accessors in class 1,
> > or do we provide -mostly- ordered accessors, ordered in all means except
> > rule #4 vs locks. ia64 is afaik by far the platform taking the biggest
> > hit if you have to provide #4, so I'm interesting in your point of view
> > here.
>
> Either way is fine with me as long as we have a way to get at the fast and
> loose stuff (and required barriers of course) in a portable way. And that we
> don't regress the existing users of mmiowb().

Well, existing users of mmiowb() will regress in performance if we
decide that class 1 (ordered) accessors do imply rule #4 (ordering with
locks) since they'll end up doing redundant mmiowb's ;) but then,
they'll be affected anyway due to the sheer number of mmiowb's (one per
IO) unless you implement the trick I described, which would bring down
the cost to nothing except maybe the test in spin_unlock (which I still
need to measure on PowerPC).

> > We don't need counters, just a flag. We did a test implementation, seems
> > to work. We also clear the flag in spin_lock. That means that MMIOs
> > issued before a lock aren't ordered vs. the locked section. But because
> > of rule #1, they should be ordered vs. other MMIOs inside the locked
> > section and thus implicitely get ordered anyway.
>
> Oh right, a flag would be enough. Is it good enough for -mm yet? Might be
> fun to run on an Altix machine with a bunch of supported devices (not that I
> work with them anymore...).

The PowerPC patch is probably good enough for 2.6.18 in fact :) I'll let
Paulus post what he has. It's fairly ppc specific in the actual
implementation though.

> > > For ia64 in particular it doesn't matter, though there was speculation
> > > several years that it might be necessary. No actual examples stepped
> > > forward though, so the current implementation doesn't take an argument.
> >
> > Ok. My question is wether it would improve the implementation to take
> > it. If we define a new macro with a new name, we can do it....
>
> Right, but unless there's a real need at this point, we probably shouldn't
> bother. Let the poor sucker with the future machine needing the device
> argument do the work. :)

Ok :)

Ben.


2006-09-11 23:08:29

by Roland Dreier

Subject: Re: [RFC] MMIO accessors & barriers documentation

Benjamin> and rmb is heavy handed for a compiler barrier :) what
Benjamin> you might need on some platforms is an rmb between the
Benjamin> MMIO read of whatever status/index register and the
Benjamin> following memory reads of descriptors, and you may want
Benjamin> an rmb in case where it matters if the chip has been
Benjamin> changing a value behind your back (which it generally
Benjamin> doesn't) but that's pretty much it....

In drivers/infiniband/hw/mthca/mthca_eq.c, there is:

while ((eqe = next_eqe_sw(eq))) {
        /*
         * Make sure we read EQ entry contents after we've
         * checked the ownership bit.
         */
        rmb();

        switch (eqe->type) {

where next_eqe_sw() checks a "valid" bit of a 32-byte event queue
entry that is DMA-ed into memory by the device. The device is careful
to write the valid bit (byte actually) last, but on PowerPC 970
without the rmb(), we actually saw the CPU reordering the read of
eqe->type (which is another field of the EQ entry written by the
device) so it happened before the entry was valid, but then executing
the check of the valid bit far enough into the future so that the
entry tested as valid.

This isn't that surprising: if you had two CPUs, with one CPU writing
into a queue and the other CPU polling the queue, you would obviously
need smp_rmb() on the CPU doing the reading. But somehow it's not
quite as obvious when a device plays the role of one of the CPUs.

Of course there's no MMIO anywhere in sight here, so this isn't
directly applicable I guess.

- R.

2006-09-11 23:19:33

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation


> where next_eqe_sw() checks a "valid" bit of a 32-byte event queue
> entry that is DMA-ed into memory by the device. The device is careful
> to write the valid bit (byte actually) last, but on PowerPC 970
> without the rmb(), we actually saw the CPU reordering the read of
> eqe->type (which is another field of the EQ entry written by the
> device) so it happened before the entry was valid, but then executing
> the check of the valid bit far enough into the future so that the
> entry tested as valid.

Yes, the CPU can perfectly well load it before the previous load, indeed.
I'm sure that wouldn't be PowerPC-specific. In this case, it would be a
speculative load (since there is a data dependency you would think
it's ok, but it's not on CPUs that do speculative execution).

> This isn't that surprising: if you had two CPUs, with one CPU writing
> into a queue and the other CPU polling the queue, you would obviously
> need smp_rmb() on the CPU doing the reading. But somehow it's not
> quite as obvious when a device plays the role of one of the CPUs.
>
> Of course there's no MMIO anywhere in sight here, so this isn't
> directly applicable I guess.

It's a "normal" case memory barrier in this case. Same as for SMP. Yup.

Ben.


2006-09-11 23:24:54

by Jeff Garzik

Subject: Re: [RFC] MMIO accessors & barriers documentation

Benjamin Herrenschmidt wrote:
> Well, the argument currently is to make writel and readl imply the above
> barriers by making them fully ordered (and slow on some platforms) and
> so also provide more weakly ordered routines along with barriers for
> people who know what they do. The above 2 barriers are what I've called
> io_to_memory_rb() and memory_to_io_wb() (actually,
> prepare_to_read_dma_memory() by itself doesn't really make much sense.
> It does in conjunction with an MMIO read to flush DMA buffers, in which
> case the barrier provides an ordering guarantee that the memory reads
> will only be performed after the MMIO read has fully completed).

<jgarzik throws a monkey wrench into the works>

I think focusing on MMIO just confuses the issue.

wmb() is often used to make sure a memory store is visible to a
busmastering PCI device... before the code proceeds with some more
transactions in the memory space shared by the host and PCI device.

prepare_to_read_dma_memory() is the operation that an ethernet driver's
RX code wants. And this is _completely_ unrelated to MMIO. It just
wants to make sure that the device and host are looking at the same
data. Often this involves polling a DMA descriptor (or index, stored
inside DMA-able memory) looking for changes.

flush_my_writes_to_dma_memory() is the operation that an ethernet
driver's TX code wants, to precede either an MMIO "poke" or any other
non-MMIO operation where the driver needs to be certain that the write
is visible to the PCI device, should the PCI device desire to read that
area of memory.

Jeff




2006-09-12 00:47:19

by Benjamin Herrenschmidt

Subject: Re: [RFC] MMIO accessors & barriers documentation


> wmb() is often used to make sure a memory store is visible to a
> busmastering PCI device... before the code proceeds with some more
> transactions in the memory space shared by the host and PCI device.

Yes and that's a different issue. It's purely a matter of memory-to-memory
barriers, and we already have these well defined.

The problem _is_ with MMIO :) There, you have some ordering issues
happening with some processors that we need to handle, hence the whole
discussion. See my discussion of your examples below.

> prepare_to_read_dma_memory() is the operation that an ethernet driver's
> RX code wants. And this is _completely_ unrelated to MMIO. It just
> wants to make sure that the device and host are looking at the same
> data. Often this involves polling a DMA descriptor (or index, stored
> inside DMA-able memory) looking for changes.

Why would you need a barrier other than a compiler barrier() for that ?

All you need for such operations that do not involve MMIO are the
standard wmb(), rmb() and mb() with their usual semantics, and polling
for something to change isn't something that requires any of these. Only
a compiler barrier (or an ugly volatile, maybe). Though having a
subsequent read from memory that must be done after that change happened
is indeed the job of rmb().

This has nothing to do with MMIO and is not what I'm describing in the
document. MMIO has its own issues, especially when it comes to MMIO vs.
memory coherency. I thought I described them well enough; apparently
not.

> flush_my_writes_to_dma_memory() is the operation that an ethernet
> driver's TX code wants, to precede either an MMIO "poke" or any other
> non-MMIO operation where the driver needs to be certain that the write
> is visible to the PCI device, should the PCI device desire to read that
> area of memory.

That's the problem. You need -different- types of barriers depending on
whether the subsequent operation to "poke" the device is an MMIO or an
update in memory. Again, the whole problem is that on some out-of-order
architectures, non-cacheable storage is in a completely different domain
than cacheable storage, and ordering between them requires specific
barriers unless you want to ditch performance.

Thus in your 2 above examples, we have:

1- Descriptor update followed by MMIO poke. That needs ordering rule #2
in my list (memory W + MMIO W), which is not provided today by the
PowerPC writel(), but should be according to the discussions we had, and
which would be provided by the memory_to_io_wb() barrier in my list if
you chose to use the relaxed-ordering __writel() version instead for
performance.

2- Descriptor update followed by update of an index in memory (so no
MMIO involved). This is a standard memory ordering issue and thus a
simple wmb() is needed there.

Currently, the PowerPC writel(), as I just said, doesn't provide
ordering for your example #1, but the PowerPC wmb() does provide the
semantics of both memory/memory coherency and memory/MMIO coherency
(thus making it more expensive than necessary in the memory/memory
case).
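
To make the two cases concrete, here's a hedged sketch; the ring layout
is hypothetical and memory_to_io_wb()/__writel() are the proposed names
from my document, not existing functions:

/* Case 1: descriptor update followed by an MMIO poke: needs rule #2,
 * supplied here explicitly since __writel() is relaxed. */
static void my_case1_mmio_poke(struct my_desc *ring, int idx, u32 len,
                               void __iomem *regs)
{
        ring[idx].len_flags = len | 0x80000000; /* memory store */
        memory_to_io_wb();                      /* memory W before MMIO W */
        __writel(idx, regs + 0x10);             /* hypothetical doorbell */
}

/* Case 2: descriptor update followed by an index update in memory (no
 * MMIO involved): a plain wmb() is all that is needed. */
static void my_case2_memory_index(struct my_desc *ring, int idx, u32 len,
                                  u32 *shared_idx)
{
        ring[idx].len_flags = len | 0x80000000; /* memory store */
        wmb();                                  /* memory W before memory W */
        *shared_idx = idx;                      /* index in DMA-able memory */
}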

My goal here is to:

- remove the problem for people who don't understand the issues by
making writel() etc... fully ordered vs. memory for the cases that
matter to drivers. Thus the -only- case that driver writers would have
to care about if using those accessors is the memory-memory case in your
second example.

- provide relaxed __writel etc... for people who -do- understand those
issues and want to improve the performance of the hot path of their driver.
In order to make this actually optimal and safe, I need to precisely
define in what way it is relaxed, what precise ordering semantics are
provided, and provide specific barriers for each of these.

That's what I documented. If you think my document is not clear enough,
I would be happy to have your input on how to make it clearer. Maybe
some introduction explaining the difference above ? (re-using your
examples).

There are still a few questions that I listed about what we want to
provide. The main one is the ordering of MMIO vs. spin_unlock. Do we
want to provide that in the default writel or do we accept that we still
require a barrier in that case even when using "ordered" versions of the
accessors because the performance cost would be too high.

So far, I tend to prefer being fully ordered (and thus not requiring the
barrier) but I wanted some feedback there. So far, everybody has
carefully avoided voicing a firm opinion on that one though :)

Ben.


2006-09-12 05:33:23

by Albert Cahalan

Subject: Re: [RFC] MMIO accessors & barriers documentation

Benjamin Herrenschmidt writes:

> 1- io_to_io_barrier() : This barrier provides ordering requirement #1
> between two MMIO accesses. It's to be used in conjunction with fully
> relaxed accessors of Class 3.
>
> 2- memory_to_io_wb() : This barrier provides ordering requirement #2
> between a memory store and an MMIO store. It can be used in conjunction
> with write accessors of Class 2 and 3.
>
> 3- io_to_memory_rb(value) : This barrier provides ordering requirement
> #3 between an MMIO read and a subsequent read from memory. For
> implementation purposes on some architectures, the value actually read
> by the MMIO read shall be passed as an argument to this barrier. (This
> allows to generate the appropriate CPU instruction magic to force the
> CPU to consider the value as being "used" and thus force the read to be
> performed immediately). It can be used in conjunction with read
> accessors of Class 2 and 3
>
> 4- io_to_lock_wb() : This barrier provides ordering requirement #4
> between an MMIO store and a subsequent spin_unlock(). It can be used in
> conjunction with write accessors of Class 2 and 3.

These can really multiply: read or write, RAM and various types
of IO space, etc.

Let's have a generic arch-provided macro and let gcc do some work
for us.

Example usage:
fence(FENCE_READ_RAM|FENCE_READ_PCI_IO, FENCE_WRITE_PCI_MMIO);

Example implementation for PowerPC:

#define PPC_RAM   (FENCE_READ_RAM | FENCE_WRITE_RAM)
#define PPC_MMIO  (FENCE_READ_PCI_MMIO  | FENCE_READ_PCI_CONFIG  | \
                   FENCE_READ_PCI_RAM   | FENCE_READ_PCI_IO      | \
                   FENCE_WRITE_PCI_MMIO | FENCE_WRITE_PCI_CONFIG | \
                   FENCE_WRITE_PCI_RAM  | FENCE_WRITE_PCI_IO)
#define PPC_OTHER (~(PPC_RAM | PPC_MMIO))

#define fence(before, after) do {                                      \
        if ((before) & PPC_RAM && (after) & PPC_MMIO)                  \
                __asm__ __volatile__ ("sync" : : : "memory");          \
        else if ((before) & PPC_MMIO && (after) & PPC_RAM)             \
                __asm__ __volatile__ ("sync" : : : "memory");          \
        else if (((before) | (after)) & PPC_OTHER)                     \
                __asm__ __volatile__ ("sync" : : : "memory");          \
        else if ((before) && (after))                                  \
                __asm__ __volatile__ ("eieio" : : : "memory");         \
} while (0)

2006-09-12 05:48:33

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation


> > 4- io_to_lock_wb() : This barrier provides ordering requirement #4
> > between an MMIO store and a subsequent spin_unlock(). It can be used in
> > conjunction with write accessors of Class 2 and 3.
>
> These can really multiply: read or write, RAM and various types
> of IO space, etc.

No, they can't. They don't depend on the bus type but on the processor
memory model. Only #4 might have some more annoying dependencies, but in
practice it's still manageable. I think I've defined the 4 base rules
that are useful for drivers and the barriers that provide them -- unless
you can show me an example where something else is needed.

> Let's have a generic arch-provided macro and let gcc do some work
> for us.
>
> Example usage:
> fence(FENCE_READ_RAM|FENCE_READ_PCI_IO, FENCE_WRITE_PCI_MMIO);

<snip>

That's terribly ugly imho.

Ben.


2006-09-12 05:50:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

Alan Cox <[email protected]> writes:

>> > "Except where the underlying device is marked as cachable or
>> > prefetchable"
>>
>> You aren't supposed to use MMIO accessors on cacheable memory, are you ?
>
> Why not. Providing it is in MMIO space, consider ROMs for example or
> write path consider frame buffers.

Frame buffers are rarely cacheable as such; on x86 they are usually
write-combining, which means that the writes can be merged and
possibly reordered while they are being written, but they can't be
cached. Most arches, I believe, have something that roughly corresponds
to write combining.

Ensuring we can still use this optimization for MMIO space is
moderately important.

Eric



2006-09-12 05:57:27

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation


> Frame buffers are rarely cacheable as such; on x86 they are usually
> write-combining, which means that the writes can be merged and
> possibly reordered while they are being written, but they can't be
> cached. Most arches, I believe, have something that roughly corresponds
> to write combining.
>
> Ensuring we can still use this optimization for MMIO space is
> moderately important.

I haven't gone into too much detail about write combining (we need to do
something about it but I don't want to mix problems) but I did define
that the ordered accessors aren't guaranteed to provide write combining
on storage mapped with WC enabled, while the relaxed or non-ordered ones
are. That should be enough at this point.

Later, we should look into providing an ioremap_wc() and possibly page
table flags for write combining userland mappings. Time to get rid of
MTRRs for graphics :) And infiniband-style stuff seems to want that too.
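For what it's worth, the driver-side usage could eventually look like
the sketch below (ioremap_wc() doesn't exist yet, the name is just the
one floated above, and the BAR handling is made up):

    void __iomem *fb = ioremap_wc(pci_resource_start(pdev, 0),
                                  pci_resource_len(pdev, 0));
    if (fb) {
        memset_io(fb, 0, pci_resource_len(pdev, 0)); /* stores may combine */
        iounmap(fb);
    }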

Ben.


2006-09-12 06:29:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

Benjamin Herrenschmidt <[email protected]> writes:

> I haven't gone into too much detail about write combining (we need to do
> something about it but I don't want to mix problems) but I did define
> that the ordered accessors aren't guaranteed to provide write combining
> on storage mapped with WC enabled, while the relaxed or non-ordered ones
> are. That should be enough at this point.

Sounds good.

> Later, we should look into providing an ioremap_wc() and possibly page
> table flags for write combining userland mappings. Time to get rid of
> MTRRs for graphics :) And infiniband-style stuff seems to want that too.

ioremap_wc is actually the easy half. I have an old patch that handles
that. The trick is to make certain multiple people don't map the same
thing with different attributes. Unfortunately I haven't had time to
work through that one yet.

Eric

2006-09-12 07:13:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation


> ioremap_wc is actually the easy half. I have an old patch that handles
> that. The trick is to make certain multiple people don't map the same
> thing with different attributes. Unfortunately I haven't had time to
> work through that one yet.

Actually, that's interesting, because I need exactly the opposite on
PowerPC, I think... That is, people will -need- to do both a WC and a
non-WC mapping if they want to be able to issue stores that are
guaranteed not to be combined.

The problem I've seen is that at least one processor (the Cell), and
maybe more, seems to be combining between threads on the same CPU
(unless the stores are issued to a guarded mapping, which prevents
combining completely; that is the sort of mapping we currently do with
ioremap).

That means that it's impossible to prevent combining with explicit
barriers. For example:

Thread 0                        Thread 1
store to A                      store to A+1
barrier                         barrier
          \                     /
           \                   /
            \                 /
        the store unit might see:
            store to A
            store to A+1
            barrier
            barrier

That is, the stores aren't tagged with their source thread, and thus the
non-cacheable store unit will not prevent combining between them.

Again, it might just be a Cell CPU bug in which case we may have to just
disable use of WC on that processor, period. But it might be a more
generic problem too, we need to investigate.

If the problem ends up being widespread, the only ways I see to prevent
the combining from happening are to do a dual mapping as I explained
earlier, or maybe to have drivers always do the stores that must not be
combined under a spinlock, with appropriate use of
io_to_lock_barrier() (mmiowb()).
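For reference, that spinlock variant is basically just the usual
mmiowb() pattern (a sketch; whether it actually defeats the cross-thread
combining on Cell is exactly what needs to be verified -- the register
names are made up):

    spin_lock(&dev->lock);
    writel(val, dev->wc_regs + REG_FOO);  /* store that must not be combined */
    io_to_lock_barrier();                 /* i.e. mmiowb() today */
    spin_unlock(&dev->lock);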

Anyway, let's not pollute this discussion with that too much now :)

Ben.


2006-09-12 15:20:44

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

> Actually, that's interesting, because I need exactly the opposite on
> PowerPC, I think... That is, people will -need- to do both a WC and a
> non-WC mapping if they want to be able to issue stores that are
> guaranteed not to be combined.

Or you do the sane thing and just not allow two threads of execution
access to the same I/O device at the same time.

> The problem I've seen is that at least one processor (the Cell), and
> maybe more, seems to be combining between threads on the same CPU
> (unless the stores are issued to a guarded mapping, which prevents
> combining completely; that is the sort of mapping we currently do with
> ioremap).
>
> That means that it's impossible to prevent combining with explicit
> barriers. For example:

Now compare this with the similar scenario for "normal" MMIO, where
we do store;sync (or sync;store or even sync;store;sync) for every
writel() -- exactly the same problem.

> Again, it might just be a Cell CPU bug in which case we may have to
> just
> disable use of WC on that processor, period. But it might be a more
> generic problem too, we need to investigate.

It's a bit like why IA64 has mmiowb(). Not quite the same, but similar.

> If the problem ends up being widespread, the only ways I see to
> prevent the combining from happening are to do a dual mapping as I
> explained earlier, or maybe to have drivers always do the stores that
> must not be combined under a spinlock, with appropriate use of
> io_to_lock_barrier() (mmiowb()).

Better lock at a higher level than just per instruction.

Some devices that want to support multiple clients at the same time
have multiple identical "register files", one for each client, to
prevent this and other problems (and it's useful anyway).

> Anyway, let's not pollute this discussion with that too much now :)

Au contraire -- if you're proposing to hugely invasively change some
core interface, and add millions of little barriers(*), you better
explain how this is going to help us tackle the problems (like WC) that
we are starting to see already, and that will be a big deal in the
near future.

Now I'm saying there's no way to make the barriers needed for write-
combining efficient, unless those barriers can take advantage of the
ordering rules of the path all the way from the CPU to the device;
i.e. make those barriers bus-specific. The MMIO and memory-like-space
read/write accessors will have to follow suit. Non-WC stuff can take
advantage of bus-specific rules as well, e.g. the things you are
proposing, which, face it, are really just designed for PCI.

And even today, looking only at PCI, we already have two different
kinds of drivers. On one hand, the ones that use the PCI ordering
rules, with wmb() and mmiowb() [your #2 and #4; #1 is implicit on PCI
(everything pushes posted writes); and #3 is covered by the twi;isync
we have in readX()], which work correctly on PowerPC today. On the
other hand, the drivers that pretend PCI is a bus where everything is
strongly ordered (even vs. main memory), which do not all work on
PowerPC in today's kernel [devices not doing DMA might seem to work
fine, since #4 is hard to break and even if you do it's not often
fatal or bad at all; heck, we have this one device now where breaking
#2 "almost" works -- it took almost two full kernel release cycles for
anyone to notice].
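To spell out what that first kind of driver does today (a sketch only;
the register and field names are made up):

    /* #2: descriptor store must reach memory before the doorbell write */
    desc->status = cpu_to_le32(DESC_READY);
    wmb();
    writel(DOORBELL_KICK, ioaddr + REG_DOORBELL);

    /* #4: MMIO stores ordered before the unlock */
    mmiowb();
    spin_unlock(&lp->lock);

    /* #3: handled inside readl() itself on PowerPC (the twi;isync) */
    head = readl(ioaddr + REG_RX_HEAD);
    len  = le32_to_cpu(rx_ring[head].len);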

If you change the rules you'll have to audit *all* existing device
drivers.

So, again: unless we make the I/O accessors and barriers bus-specific,
we'll end up with millions(**) of slightly different barriers and
whatnot, in an attempt to get decent performance out of our devices;
and we will never reach that goal. Also, no device driver author
will ever know what barrier to use where and when.

Now if we _do_ make it all bus-specific, we still might have quite
a few barriers in total, but only a few per bus type -- and they
can have descriptive names that explain where to use them. Maybe,
just maybe, we'll for the first time see a device driver that gets
it right ;-)

I still like the idea of overloading the semantics of readX()/writeX()
to do whatever is needed for the region that is mapped for their
arguments, but you can introduce pci_readl() and friends for all I care,
it's a separate issue... If you want to keep the nice short names
with different semantics though, well, have fun fixing device drivers
for the next twenty(***) years.


Segher



(*) Yes I know I'm exaggerating.
(**) It's a habit :-)
(***) Did it again... It's more like fifteen years really.

2006-09-12 15:32:25

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

> prepare_to_read_dma_memory() is the operation that an ethernet
> driver's RX code wants. And this is _completely_ unrelated to
> MMIO. It just wants to make sure that the device and host are
> looking at the same data. Often this involves polling a DMA
> descriptor (or index, stored inside DMA-able memory) looking for
> changes.
>
> flush_my_writes_to_dma_memory() is the operation that an ethernet
> driver's TX code wants, to precede either an MMIO "poke" or any
> other non-MMIO operation where the driver needs to be certain that
> the write is visible to the PCI device, should the PCI device
> desire to read that area of memory.

Because those are the operations, those should be the actual
function names, too (well, prefixed with pci_). Architectures
can implement them in whatever way is appropriate, or perhaps default
to some ultra-strong semantics if they prefer; driver writers
should not have to know about the underlying mechanics (like why
we need which barriers).
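Purely to illustrate the naming idea (the pci_ prefix is the suggestion
above; mapping them straight onto the existing barriers like this is
just one possible implementation, on a platform where PCI DMA memory is
kept coherent):

    /* drivers say what they mean... */
    #define pci_flush_my_writes_to_dma_memory()   wmb()
    #define pci_prepare_to_read_dma_memory()      rmb()

    /* ...and each arch maps that onto whatever its memory model needs. */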


Segher

2006-09-12 21:23:45

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation


> Or you do the sane thing and just not allow two threads of execution
> access to the same I/O device at the same time.

Why ? Some devices are designed to be able to handle that...

> Now compare this with the similar scenario for "normal" MMIO, where
> we do store;sync (or sync;store or even sync;store;sync) for every
> writel() -- exactly the same problem.

What problem? "Normal" MMIO doesn't get combined, thus there is no
problem. Of course there is no guarantee of ordering of the stores from
the 2 CPUs unless there is a spinlock etc. etc... but we are talking
about a case where that is acceptable here. However, combining is not.

> Better lock at a higher level than just per instruction.
>
> Some devices that want to support multiple clients at the same time
> have multiple identical "register files", one for each client, to
> prevent this and other problems (and it's useful anyway).

Yes, they do, and what happens if those register "files" happen to be
consecutive in the address space and the CPU suddenly combines a store
to the last register of one "file" and an unrelated store from another
thread to the first register of the other?

This is a very specific problem that has nothing to do with your "grand
general case". It means that, at least on Cell, you cannot use explicit
barriers to guarantee the absence of write combining. It's as simple as
that. All I need to figure out now is whether that problem is specific
to one CPU implementation or more general; in the latter case, we'll
have to figure out some way to provide an interface.

> > Anyway, let's not pollute this discussion with that too much now :)
>
> Au contraire -- if you're proposing to hugely invasively change some
> core interface, and add millions of little barriers(*), you better
> explain how this is going to help us tackle the problems (like WC) that
> we are starting to see already, and that will be a big deal in the
> near future.

No, this is totally irrelevant. I'm proposing a simple change (nothing
invasive there) to the MMIO accessors of weakly ordered platforms only,
to make them guarantee ordering like x86 etc., and I'm proposing the
-addition- (which is not something I would call invasive) of -one-
class of partially relaxed accessors and the -few- (damn, there are only
4 of them) barriers that precisely match the semantics that drivers
need. Oh, and making sure those semantics are well defined, or they are
useless.

This has strictly nothing to do with WC and mixing things up will only
confuse the discussion and guarantee that we'll never get anything done.

<snip useless digression>

Ben.


2006-09-13 00:12:32

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation

>> Or you do the sane thing and just not allow two threads of execution
>> access to the same I/O device at the same time.
>
> Why ? Some devices are designed to be able to handle that...

Sure, but not many -- and even then, you normally get a separate
MMIO area to write to for each thread. Not really different.

>> Now compare this with the similar scenario for "normal" MMIO, where
>> we do store;sync (or sync;store or even sync;store;sync) for every
>> writel() -- exactly the same problem.
>
> What problem? "Normal" MMIO doesn't get combined, thus there is no
> problem. Of course there is no guarantee of ordering of the stores
> from the 2 CPUs unless there is a spinlock etc. etc... but we are
> talking about a case where that is acceptable here. However,
> combining is not.

As an example, the first access might set off a DMA, and the 2nd MMIO
interferes. That's not necessarily acceptable. Now you might point
me to the spinlock again, but I'll just point you right back to your
original example, because that's my whole point.

>> Better lock at a higher level than just per instruction.
>>
>> Some devices that want to support multiple clients at the same time
>> have multiple identical "register files", one for each client, to
>> prevent this and other problems (and it's useful anyway).
>
> Yes, they do, and what happens if those register "files" happen to be
> consecutive in the address space and the CPU suddenly combines a store
> to the last register of one "file" and an unrelated store from another
> thread to the first register of the other?

That's why those devices rely on the CPU's not combining over the edges
of (typically) 4kB pages.

> This is a very specific problem that has nothing to do with your
> "grand general case".

Oh, I have no "grand general case"; my main argument still is to have
accessors _per bus_ (per bus type really, archs can make it more
specific if they want).

In the "grand general case", you have to do lowest-common-denominator
for everything, and you're increasingly forcing yourself into that
corner.

>>> Anyway, let's not pollute this discussion with that too much now :)
>>
>> Au contraire -- if you're proposing to hugely invasively change some
>> core interface, and add millions of little barriers(*), you better
>> explain how this is going to help us tackle the problems (like WC)
>> that
>> we are starting to see already, and that will be a big deal in the
>> near future.
>
> No, this is totally irrelevant.

"The (near) future [and it's only not right now because Linux is
dragging
behind] is totally irrelevant, only my current this-second itch is?"

> I'm proposing a simple change (nothing
> invasive there) to the MMIO accessors of weakly ordered platforms
> only,
> to make them guarantee ordering like x86 etc...

Please explain what drivers will need changes because of this. Not just
the few you really care about, but _all_ that could be plugged into
PowerPC
machines' PCI busses, and might need changes because of changing the
ordering semantics of readX()/writeX() from the supposed standard Linux
semantics (i.e., the x86 semantics).

> and I'm proposing the
> -addition- (which is not something I would call invasive) of -one-
> class of partially relaxed accessors and the -few- (damn, there are
> only 4 of them) barriers that precisely match the semantics that
> drivers need. Oh, and making sure those semantics are well defined, or
> they are useless.

Erm, wait a minute, I might start to understand now... You want all
drivers that you care about to be converted to use __readX()/__writeX()
instead? How is this going to help, exactly?

> This has strictly nothing to do with WC and mixing things up will only
> confuse the discussion and guarantee that we'll never get anything
> done.

No, it _has_ to do with WC. If the Linux I/O API is going to be
changed/amended/expanded/mot du jour, we'd better do it in such a way
that we get a positive outlook on the problems that we will have to
face next (or, in this case, that we should be handling already,
really).

> <snip useless digression>

Very constructive.


Segher

2006-09-13 01:34:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] MMIO accessors & barriers documentation


> Please explain what drivers will need changes because of this. Not just
> the few you really care about, but _all_ that could be plugged into
> PowerPC
> machines' PCI busses, and might need changes because of changing the
> ordering semantics of readX()/writeX() from the supposed standard Linux
> semantics (i.e., the x86 semantics).

They won't. They will still work, and in some (many?) cases work
better, due to the removal of a potential bug, since lots of drivers
don't have a barrier where they should with the relaxed semantics. So
the net effect is positive here.

Now, it also means that we -can- start improving the drivers we care
about to use the relaxed semantics and benefit from that. And since the
semantics are well defined, all archs with some sort of relaxed ordering
will be able to benefit in one way or another.

In addition, it will allow a small optimisation on PowerPC vs. the
current situation by slightly relaxing wmb(), which currently has to be
a full sync because it might be used to order memory vs. MMIO; it will
no longer need to do that (it will go back to being a pure memory store
barrier).
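On PowerPC, that relaxation would be something along the lines of the
sketch below (assuming lwsync is sufficient to order cacheable stores
against each other on the processors we care about). Currently:

    /* has to be a full sync, because wmb() may be ordering a memory
     * store vs. a subsequent MMIO store */
    #define wmb()   __asm__ __volatile__ ("sync" : : : "memory")

versus, once MMIO ordering is handled by writel() and the explicit IO
barriers:

    /* only needs to order cacheable stores against each other */
    #define wmb()   __asm__ __volatile__ ("lwsync" : : : "memory")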

Anyway, Paul has a patch we are testing that makes our writel/readl
synchronous (by moving the sync to before the store in writel, adding an
eieio before readl, and doing the per-cpu trick so spin_unlock magically
does a sync when a writel occurred). With that, we'll get full
correctness with no more syncs in writel than we had before. We are
running some benchmarks here now to see what kind of performance impact
it has overall, and if we are happy, that can make it into 2.6.18 and
close the problem of drivers assuming MMIO is ordered vs. memory, at
least.
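Roughly, the idea is something like the sketch below. This is not
Paul's actual patch, just the shape of the mechanism; the io_sync field
and the SYNC_IO name are my own:

    /* writel(): sync before the store so that prior memory stores are
     * visible to the device first (rule #2), then remember that MMIO
     * was done so the next spin_unlock() can order it (rule #4) */
    static inline void writel(u32 val, volatile void __iomem *addr)
    {
        __asm__ __volatile__ ("sync" : : : "memory");
        *(volatile u32 *)addr = cpu_to_le32(val);  /* little-endian store */
        get_paca()->io_sync = 1;
    }

    /* called from spin_unlock(): only pay for the sync if MMIO
     * actually happened inside the critical section */
    #define SYNC_IO() do {                                      \
            if (unlikely(get_paca()->io_sync)) {                \
                    __asm__ __volatile__ ("sync" : : : "memory"); \
                    get_paca()->io_sync = 0;                    \
            }                                                   \
    } while (0)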

Then, in a -separate- step, we can provide a set of relaxed accessors
that will allow for additional performance improvements on the hot path
of selected drivers.

I'm tired of arguing the same things over and over again here anyway;
I'll post a new version of the document including some of the feedback
we already got, and will submit it for inclusion along with a
__writel/__readl implementation for powerpc (and a generic one that
defaults to readl/writel) for the 2.6.19 timeframe.

We'll see from there if there are more constructive comments.

Ben.