2004-01-07 17:59:02

by Jesse Barnes

Subject: [RFC] Relaxed PIO read vs. DMA write ordering

I've already talked with Grant a little about this, but I'm having
second thoughts about the approach we discussed. PCI-X allows PIO read
responses to 'pass' DMA writes to system memory when the relaxed
ordering bit is set in the PCI-X command word _and_ the transaction has
the relaxed ordering bit set (so-called "Relaxed Read Ordering" in
section 11.2 of the PCI-X addendum). This effectively 'unserializes'
PIO vs. DMA transactions so that PIO reads don't get stuck behind
unrelated DMA writes from the same device, something which can
potentially take a while since cacheline ownership has to be acquired,
etc.

I'd like Linux to support relaxed read ordering in some way since on
large systems having PIO reads stuck behind DMA writes can end up eating
into CPU time and limit IOPS (do I have this right, Jeremy?).

The proposal I gave to Grant added a new readX() variant,
readX_relaxed(), that drivers could use when they don't need strict
ordering semantics (this may actually be the majority of cases, but it's
safer to be strict by default than create a read_ordered and open a
window for data corruption). It might be confusing, however, to add yet
another readX() routine, and there are other ways we might go about it.
One suggestion was to overload the pci_sync_* calls so that they'd
explicitly flush DMA writes to system memory, implying that all reads on
some platforms would use relaxed semantics, but that we'd have to modify
drivers to add in pci_sync_* calls where needed.

Thoughts?

Thanks,
Jesse


2004-01-07 19:02:13

by Matthew Wilcox

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 09:58:02AM -0800, Jesse Barnes wrote:
> I've already talked with Grant a little about this, but I'm having
> second thoughts about the approach we discussed. PCI-X allows PIO read
> responses to 'pass' DMA writes to system memory when the relaxed
> ordering bit is set in the PCI-X command word _and_ the transaction has
> the relaxed ordering bit set (so-called "Relaxed Read Ordering" in
> section 11.2 of the PCI-X addendum). This effectively 'unserializes'
> PIO vs. DMA transactions so that PIO reads don't get stuck behind
> unrelated DMA writes from the same device, something which can
> potentially take a while since cacheline ownership has to be acquired,
> etc.

So we want a pci_set_relaxed() macro / function() to set this bit
(otherwise dozens of drivers will start to try to set the bit themselves,
badly). If this bit *isn't* set, setting the bit in the transaction will have
no effect, right?
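
A rough sketch of what such a helper might look like, run against a simulated config space so it can be tried in user space. The register offsets and bit values follow Linux's pci_regs.h naming; `struct fake_pci_dev`, `cfg_read16()`, `cfg_write16()`, `find_cap()` and this `pci_set_relaxed()` are hypothetical names invented for illustration. A real driver would go through pci_read_config_word() and friends instead.

```c
#include <stdint.h>

#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10
#define PCI_CAPABILITY_LIST   0x34
#define PCI_CAP_ID_PCIX       0x07
#define PCI_X_CMD             2       /* command word offset inside the capability */
#define PCI_X_CMD_ERO         0x0002  /* Enable Relaxed Ordering */

/* Simulated 256-byte config space (assumption: stands in for a real device). */
struct fake_pci_dev {
    uint8_t cfg[256];
};

static uint16_t cfg_read16(const struct fake_pci_dev *d, unsigned off)
{
    return (uint16_t)(d->cfg[off] | (d->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_pci_dev *d, unsigned off, uint16_t v)
{
    d->cfg[off] = (uint8_t)(v & 0xff);
    d->cfg[off + 1] = (uint8_t)(v >> 8);
}

/* Walk the capability list; return the offset of capability `id`, or 0. */
static unsigned find_cap(const struct fake_pci_dev *d, uint8_t id)
{
    if (!(cfg_read16(d, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;
    unsigned pos = d->cfg[PCI_CAPABILITY_LIST] & ~3u;
    for (int ttl = 48; pos >= 0x40 && ttl > 0; ttl--) {
        if (d->cfg[pos] == id)
            return pos;
        pos = d->cfg[pos + 1] & ~3u;
    }
    return 0;
}

/* Hypothetical helper: 0 on success, -1 if no PCI-X capability is found. */
static int pci_set_relaxed(struct fake_pci_dev *d)
{
    unsigned pos = find_cap(d, PCI_CAP_ID_PCIX);
    if (!pos)
        return -1;
    cfg_write16(d, pos + PCI_X_CMD,
                cfg_read16(d, pos + PCI_X_CMD) | PCI_X_CMD_ERO);
    return 0;
}
```

Keeping the read-modify-write in one place is the point: drivers only ever see a success or failure return, and never touch the command word directly.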

> The proposal I gave to Grant added a new readX() variant,
> readX_relaxed(), that drivers could use when they don't need strict
> ordering semantics (this may actually be the majority of cases, but it's
> safer to be strict by default than create a read_ordered and open a
> window for data corruption). It might be confusing, however, to add yet
> another readX() routine, and there are other ways we might go about it.
> One suggestion was to overload the pci_sync_* calls so that they'd
> explicitly flush DMA writes to system memory, implying that all reads on
> some platforms would use relaxed semantics, but that we'd have to modify
> drivers to add in pci_sync_* calls where needed.

How about always setting the bit in readb() and having a readb_ordered()
which doesn't set the bit in the transaction? That way, drivers which
call pci_set_relaxed() have the responsibility to verify they're not
relying on these semantics and use readb_ordered() in any places that
they are.

No doubt you're going to smack this idea down by telling me what SN2
firmware currently does ...

--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain

2004-01-07 22:21:46

by Grant Grundler

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 07:02:06PM +0000, Matthew Wilcox wrote:
> So we want a pci_set_relaxed() macro / function() to set this bit
> (otherwise dozens of drivers will start to try to set the bit themselves,
> badly). If this bit *isn't* set, setting the bit in the transaction will have
> no effect, right?

I think that's correct if the platform chipset ignores RO signal
by default. I'm not real comfortable with that assumption though.
I want the driver to advertise to PCI services the intent to use
RO capability.

> How about always setting the bit in readb() and having a readb_ordered()
> which doesn't set the bit in the transaction?

I was under the impression the driver can't control RO for
each transaction though. The PCI-X device controls which
transactions set RO "signal" in the PCI-X command on read-return.
The Read-Return is a separate transaction from the Read-Request.

If anyone has data that specific devices are "smart" and set/clear
RO appropriately, it would be safe to enable RO for them.

On HP ZX1, the "Allow Relaxed Ordering" is only implemented for outbound
DMA/PIO Writes *while they pass through the ZX1 chip*. Ie RO bit settings
don't explicitly apply since we aren't talking about PCI-X bus transactions
even though the system chipset needs to honor PCI-X rules.

> That way, drivers which
> call pci_set_relaxed() have the responsibility to verify they're not
> relying on these semantics and use readb_ordered() in any places that
> they are.

if new variants of readb() are ok, then yours sounds better.

But I wasn't too keen on introducing readb variants to solve what
looks like a DMA flushing problem. I've come to the conclusion
that systems which implement (and enable) RO for inbound DMA are
effectively not coherent. The data the CPU expects to be visible is not.

DMA-mapping.txt already has support (pci_dma_sync_xx() or pci_dma_unmap_xx())
to deal with common forms of non-coherence and synchronize caches
for streaming mappings but not for consistent mappings.
DMA-ABI.txt (2.6 only) has a method to handle non-coherent systems and
I have to reread/study it to see if the provided interface is sufficient
for the case of relaxed ordering. Jesse, have you looked at this already?

hth,
grant


2004-01-07 22:58:52

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 07:02:06PM +0000, Matthew Wilcox wrote:
> So we want a pci_set_relaxed() macro / function() to set this bit
> (otherwise dozens of drivers will start to try to set the bit themselves,
> badly). If this bit *isn't* set, setting the bit in the transaction will have
> no effect, right?

Right, we'd want that call too. And actually, if the bit in the command
word isn't set, we're not allowed to set it in individual transactions.

> How about always setting the bit in readb() and having a readb_ordered()
> which doesn't set the bit in the transaction? That way, drivers which
> call pci_set_relaxed() have the responsibility to verify they're not
> relying on these semantics and use readb_ordered() in any places that
> they are.

Yep, that would work too.

Jesse

2004-01-07 23:07:20

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:21:42PM -0700, Grant Grundler wrote:
> > How about always setting the bit in readb() and having a readb_ordered()
> > which doesn't set the bit in the transaction?
>
> I was under the impression the driver can't control RO for
> each transaction though. The PCI-X device controls which
> transactions set RO "signal" in the PCI-X command on read-return.
> The Read-Return is a separate transaction from the Read-Request.

My understanding is that you need both. And if we used the
pci_enable_relaxed() routine, we'd have to add a check to readX() for it
so that we don't accidentally set the RO bit in the transaction when the
command word has it clear.
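
The check described here could be sketched roughly as follows. This is a toy user-space model, not kernel code: `struct gate_dev` and the `gate_*` functions are hypothetical names, and the counters exist only so the behaviour can be observed.

```c
#include <stdint.h>

/* Toy device state (assumption: one flag mirrors the command-word bit). */
struct gate_dev {
    int      relaxed_enabled;  /* set once the enable call has succeeded */
    uint32_t reg;              /* the register being read */
    int      ordered_reads;    /* instrumentation only */
    int      relaxed_reads;    /* instrumentation only */
};

/* Stand-in for the proposed enable call. */
static void gate_enable_relaxed(struct gate_dev *d)
{
    d->relaxed_enabled = 1;
}

/* Ordinary, strictly ordered read. */
static uint32_t gate_read(struct gate_dev *d)
{
    d->ordered_reads++;
    return d->reg;
}

/* Relaxed read that silently falls back to the ordered path when
 * relaxed ordering was never enabled for this device. */
static uint32_t gate_read_relaxed(struct gate_dev *d)
{
    if (!d->relaxed_enabled)
        return gate_read(d);
    d->relaxed_reads++;
    return d->reg;
}
```

With this shape, calling the relaxed variant without enabling the feature first is merely slow, never incorrect.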

> If anyone has data that specific devices are "smart" and set/clear
> RO appropriately, it would be safe to enable RO for them.

I don't know of any that do it automatically...

> On HP ZX1, the "Allow Relaxed Ordering" is only implemented for outbound
> DMA/PIO Writes *while they pass through the ZX1 chip*. Ie RO bit settings
> don't explicitly apply since we aren't talking about PCI-X bus transactions
> even though the system chipset needs to honor PCI-X rules.

So this wouldn't be helpful for your chipset then.

> > That way, drivers which
> > call pci_set_relaxed() have the responsibility to verify they're not
> > relying on these semantics and use readb_ordered() in any places that
> > they are.
>
> if new variants of readb() are ok, then yours sounds better.
>
> But I wasn't too keen on introducing readb variants to solve what
> looks like a DMA flushing problem. I've come to the conclusion
> that systems which implement (and enable) RO for inbound DMA are
> effectively not coherent. The data the CPU expects to be visible is not.

Ahh... that's a bit of a stretch of the definition of non-coherence I
think, but it might be close enough to use the sync semantics.

> DMA-mapping.txt already has support (pci_dma_sync_xx() or
> pci_dma_unmap_xx()) to deal with common forms of non-coherence and
> synchronize caches for streaming mappings but not for consistent
> mappings. DMA-ABI.txt (2.6 only) has a method to handle non-coherent

Right, that's another option--adding a pci_sync_consistent() call.

> systems and I have to reread/study it to see if the provided interface
> is sufficient for the case of relaxed ordering. Jesse, have you
> looked at this already?

All of them are easy enough to do... so I see our options as one
of the following:

1) add pcix_enable_relaxed() and read_relaxed() (read() would always be
ordered)
2) add pcix_enable_relaxed() and read_ordered() (read() would be
relaxed after the pcix_enable_relaxed() call)
3) add pcix_enable_relaxed() and pci_sync_consistent() (read() would
be relaxed after the pcix_enable_relaxed() call)

Thanks,
Jesse

2004-01-07 23:28:55

by Greg KH

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
>
> 1) add pcix_enable_relaxed() and read_relaxed() (read() would always be
> ordered)

This probably preserves the current situation best, enabling driver
writers to be explicit in knowing what is happening.

thanks,

greg k-h

2004-01-07 23:56:45

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:27:54PM -0800, Greg KH wrote:
> On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
> >
> > 1) add pcix_enable_relaxed() and read_relaxed() (read() would always be
> > ordered)
>
> This probably preserves the current situation best, enabling driver
> writers to be explicit in knowing what is happening.

That's what I figured too. It also seems like it has the lowest
probability of introducing PIO vs. DMA races, since you have to
explicitly change a read() call.

What about compatibility though? How should the interface behave if
it's accessing a PCI-X device that happens to be in PCI mode? Ideally,
we could add these calls in and introduce no penalty for platforms that
don't support it...

Thanks,
Jesse

2004-01-08 00:09:05

by Jeremy Higdon

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:27:54PM -0800, Greg KH wrote:
> On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
> >
> > 1) add pcix_enable_relaxed() and read_relaxed() (read() would always be
> > ordered)
>
> This probably preserves the current situation best, enabling driver
> writers to be explicit in knowing what is happening.
>
> thanks,
>
> greg k-h

I like this best too. That way, a driver can enable a relaxed read
in the performance path and not have to audit the other reads.

So in a generic PCI driver, you'd call pcix_enable_relaxed() and
then use read() for initialization, error recovery, etc., and
use read_relaxed() in the main execution path where it is determined
to be safe.

The default would be the standard behavior that we have.

One question I have is what the need for pcix_enable_relaxed() is.
Are we thinking that this sets some bit in one or more registers?
What happens if you use read_relaxed() and you didn't call
pcix_enable_relaxed() previously?

jeremy

2004-01-08 00:34:59

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:56:33PM -0800, Jesse Barnes wrote:
> On Wed, Jan 07, 2004 at 03:27:54PM -0800, Greg KH wrote:
> > On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
> > >
> > > 1) add pcix_enable_relaxed() and read_relaxed() (read() would always be
> > > ordered)
> >
> > This probably preserves the current situation best, enabling driver
> > writers to be explicit in knowing what is happening.

This is also the easiest solution to implement for the sn2 platform.
Honestly, I haven't used any PCI-X chipsets (nor do I know of any) that
exploit this new relaxed ordering feature, so I'm only guessing at how
it might be usefully exported to the driver API.

The sn2 platform actually _always_ behaves as though relaxed ordering
were enabled, so all we really need to implement this correctly is a
read_relaxed(), which will be a read() but without the software
workaround we put in place to conform to the PCI PIO/DMA semantics.
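
To make the distinction concrete, here is a toy user-space model of the two calls. It is only a sketch: `struct toy_dev` and the `toy_*` functions are hypothetical, and the "flush" simply stands in for whatever a platform does to make earlier DMA writes visible before a PIO read returns.

```c
#include <stdint.h>

/* Toy device: one status register plus one DMA write that has been
 * issued but has not yet landed in system memory. */
struct toy_dev {
    uint32_t  status;          /* register read via PIO */
    uint32_t  inflight;        /* DMA data still in flight */
    int       inflight_valid;  /* nonzero while the write is pending */
    uint32_t *dma_target;      /* where the DMA write will land */
};

/* Deliver any in-flight DMA write to memory. */
static void toy_flush_dma(struct toy_dev *d)
{
    if (d->inflight_valid) {
        *d->dma_target = d->inflight;
        d->inflight_valid = 0;
    }
}

/* Ordered read: earlier DMA writes are visible before the value returns. */
static uint32_t toy_read(struct toy_dev *d)
{
    toy_flush_dma(d);
    return d->status;
}

/* Relaxed read: the response may pass DMA writes that are still in flight. */
static uint32_t toy_read_relaxed(struct toy_dev *d)
{
    return d->status;
}
```

A driver that reads a "DMA done" status with the relaxed variant and then looks at the buffer can see stale data in this model, which is exactly the race the ordered default protects against.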

Maybe we can just add read_relaxed() for now and deal with other
chipsets that allow relaxed ordering as they appear?

Thanks,
Jesse

2004-01-08 06:38:35

by Grant Grundler

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
> > If anyone has data that specific devices are "smart" and set/clear
> > RO appropriately, it would be safe to enable RO for them.
>
> I don't know of any that do it automatically...

....maybe it would be better if more folks read the PCI-X spec.
This quote is from v1.0a PCI-X Addendum to PCI Local Bus Spec,
"Appendix 11 - Use Of Relaxed Ordering" (bottom of page 221):

| In general, read and write transactions to or from I/O devices are
| classified as payload or control. (PCI 2.2 Appendix E refers to payload
| as Data and control as Flag and Status.) If the payload traffic requires
| multiple data phases or multiple transactions, such payload traffic
| rarely requires ordered transactions. That is, the order in which the
| bytes of the payload arrive is inconsequential, if they all arrive before
| the corresponding control traffic. However, control traffic generally does
| require ordered transactions. I/O devices that follow this programming
| model could use this distinction to set the Relaxed Ordering attribute
| in hardware with no device driver intervention.

Read that last sentence again.
It suggests using readb() variants are the wrong approach.

| Such a device could set the Relaxed Ordering attribute bit for all
| payload read and write transactions and not set the attribute for
| all control read and write transactions. Other devices may want to
| provide a means (beyond the scope of the PCI-X specification) for
| their device driver to indicate when it is permissible to set the
| Relaxed Ordering attribute. In all cases, no requester is allowed
| to set the Relaxed Ordering attribute bit if the Enable Relaxed
| Ordering bit in the PCI-X Command register is cleared.

I interpret this to mean:
Setting the RO bit in the PCI-X Command Register only enables
the device to choose when to set RO Attribute bit when the device
generates a PCI-X bus cycle.
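
Read that way, the device-side decision might be modelled like this. The sketch assumes a device that classifies its own traffic as payload or control, as the appendix suggests; `enum toy_traffic` and `toy_device_sets_ro()` are hypothetical names, and `PCI_X_CMD_ERO` uses the same value as Linux's pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_X_CMD_ERO 0x0002  /* Enable Relaxed Ordering, PCI-X command register */

/* How the (hypothetical) device classifies a transaction it initiates. */
enum toy_traffic { TOY_PAYLOAD, TOY_CONTROL };

/* Should the device set the Relaxed Ordering attribute on this transaction?
 * Never when the enable bit is clear; otherwise only for payload traffic. */
static bool toy_device_sets_ro(uint16_t pcix_cmd, enum toy_traffic kind)
{
    if (!(pcix_cmd & PCI_X_CMD_ERO))
        return false;
    return kind == TOY_PAYLOAD;
}
```

Note that the host appears nowhere in this decision beyond the one enable bit, which is why a per-call readb() variant has nothing to hook into here.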

My gut feeling is few PCI-X HW developers have had time or experience
to get this right. Most will either ignore RO bit or always set it
for all transactions. But that's just my speculation. Driver writers
for each device will have to know this and I suspect most won't care.

Secondly, I've convinced myself RO bit can not be set per transaction
(and only per device) by the host. I just re-read the first sentences
in section "2.5. Attributes" (talks about PCI-X bus signalling):

| Attributes are additional information included with each transaction
| that further defines the transaction. The initiator of every
| transaction drives attributes on the C/BE[3::0]# and AD[31::00]
| buses in the attribute phase.

The CPU does not directly generate transactions on the PCI-X bus.
At least not in the current crop of CPUs. Ergo we can only program
the PCI-X bus controller beforehand or alias address bits to be
attributes (similar to cache/uncached address ranges on ia64).
Is SN2 doing the latter for PCI-X MMIO reads?

And is the read return transaction going to reflect the same attributes
used for read request?

> > On HP ZX1, the "Allow Relaxed Ordering" is only implemented for outbound
> > DMA/PIO Writes *while they pass through the ZX1 chip*. Ie RO bit settings
> > don't explicitly apply since we aren't talking about PCI-X bus transactions
> > even though the system chipset needs to honor PCI-X rules.
>
> So this wouldn't be helpful for your chipset then.

Right. s/your/HP/. But HP has more than one chipset and I'm not
that familiar with SX1000 chipset. Though I don't expect
it supports anything different in this regard.


> Ahh... that's a bit of a stretch of the definition of non-coherence I
> think, but it might be close enough to use the sync semantics.

Can you give a better definition of non-coherence?
I'll defend the following (a variant of what I said before):
Data written by the IO device is not visible to the CPU
when the CPU expects the data to be visible.

I'll assert SN2 is non-coherent with RO enabled.
"mostly coherent" is probably the right level of fuzziness.
But linux doesn't have a "mostly coherent" DMA API. :^)

[ James (Bottomley) - I couldn't find a definition of "non-consistent
memory machine" in DMA-ABI.txt. Was that intentional or could you
include a variant of the above definition?
I guess if one needed to include a definition, then the reader
shouldn't be using the interfaces described in Part II.
But this is a key distinction from DMA-mapping.txt. ]


> Right, that's another option--adding a pci_sync_consistent() call.

yes - something like this would be my preference mostly because it's
less intrusive to the drivers, less confusing for driver writers,
and can be a complete NOP on most platforms.
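
One way to picture the "complete NOP on most platforms" property is a per-platform hook that is simply absent where nothing needs doing. This is a user-space sketch with hypothetical names throughout (`struct toy_dma_ops`, `toy_pci_sync_consistent()`, `toy_sn2_flush()`); a kernel implementation would more likely resolve this at compile time per architecture.

```c
#include <stddef.h>

/* Hypothetical per-platform hook table; a NULL hook means "nothing to do". */
struct toy_dma_ops {
    void (*sync_consistent)(void *dev);
};

static int toy_flush_count;  /* instrumentation for the example only */

/* Stand-in for "wait until in-flight DMA from dev has reached memory". */
static void toy_sn2_flush(void *dev)
{
    (void)dev;
    toy_flush_count++;
}

/* Fully coherent platform: no hook, so the call costs one branch. */
static const struct toy_dma_ops toy_coherent_ops = { NULL };

/* Platform with relaxed ordering always on: the hook does real work. */
static const struct toy_dma_ops toy_relaxed_ops = { toy_sn2_flush };

static void toy_pci_sync_consistent(const struct toy_dma_ops *ops, void *dev)
{
    if (ops->sync_consistent)
        ops->sync_consistent(dev);
}
```

Drivers would call the sync wherever they depend on DMA data being visible, and only platforms that need it would pay for it.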

BTW, Jesse, did you look at part II of Documentation/DMA-ABI.txt?

thanks,
grant

2004-01-08 10:02:02

by Jes Sorensen

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

>>>>> "Greg" == Greg KH writes:

Greg> On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
>> 1) add pcix_enable_relaxed() and read_relaxed() (read() would
>> always be ordered)

Greg> This probably preserves the current situation best, enabling
Greg> driver writers to be explicit in knowing what is happening.

I concur, it also matches the current convention we have with
__raw_readX()

Cheers,
Jes

2004-01-08 16:25:42

by Leonid Grossman

Subject: RE: [RFC] Relaxed PIO read vs. DMA write ordering



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Grant Grundler
> Sent: Wednesday, January 07, 2004 10:38 PM
> To: Jesse Barnes
> Cc: [email protected]; [email protected]; Matthew
> Wilcox; [email protected];
> [email protected]
> Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering
>
>
> On Wed, Jan 07, 2004 at 03:07:12PM -0800, Jesse Barnes wrote:
> > > If anyone has data that specific devices are "smart" and set/clear
> > > RO appropriately, it would be safe to enable RO for them.
> >
> > I don't know of any that do it automatically...
>
> ....maybe it would be better if more folks read the PCI-X spec.
> This quote is from v1.0a PCI-X Addendum to PCI Local Bus Spec,
> "Appendix 11 - Use Of Relaxed Ordering" (bottom of page 221):
>
> | In general, read and write transactions to or from I/O devices are
> | classified as payload or control. (PCI 2.2 Appendix E refers to payload
> | as Data and control as Flag and Status.) If the payload traffic requires
> | multiple data phases or multiple transactions, such payload traffic
> | rarely requires ordered transactions. That is, the order in which the
> | bytes of the payload arrive is inconsequential, if they all arrive before
> | the corresponding control traffic. However, control traffic generally does
> | require ordered transactions. I/O devices that follow this programming
> | model could use this distinction to set the Relaxed Ordering attribute
> | in hardware with no device driver intervention.
>
> Read that last sentence again.
> It suggests using readb() variants are the wrong approach.
>
> | Such a device could set the Relaxed Ordering attribute bit for all
> | payload read and write transactions and not set the attribute for
> | all control read and write transactions. Other devices may want to
> | provide a means (beyond the scope of the PCI-X specification) for
> | their device driver to indicate when it is permissible to set the
> | Relaxed Ordering attribute. In all cases, no requester is allowed
> | to set the Relaxed Ordering attribute bit if the Enable Relaxed
> | Ordering bit in the PCI-X Command register is cleared.
>
> I interpret this to mean:
> Setting the RO bit in the PCI-X Command Register only enables
> the device to choose when to set RO Attribute bit when the device
> generates a PCI-X bus cycle.

Yes, this is exactly how (at least our 10GbE) PCI-X ASICs work.
If the RO bit is set, the device decides whether the transaction
requires strong ordering, and sets RO attribute accordingly.

Leonid



2004-01-08 17:37:31

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Wed, Jan 07, 2004 at 11:38:29PM -0700, Grant Grundler wrote:
> ....maybe it would be better if more folks read the PCI-X spec.
> This quote is from v1.0a PCI-X Addendum to PCI Local Bus Spec,
> "Appendix 11 - Use Of Relaxed Ordering" (bottom of page 221):
>
> | In general, read and write transactions to or from I/O devices are
> | classified as payload or control. (PCI 2.2 Appendix E refers to payload
> | as Data and control as Flag and Status.) If the payload traffic requires
> | multiple data phases or multiple transactions, such payload traffic
> | rarely requires ordered transactions. That is, the order in which the
> | bytes of the payload arrive is inconsequential, if they all arrive before
> | the corresponding control traffic. However, control traffic generally does
> | require ordered transactions. I/O devices that follow this programming
> | model could use this distinction to set the Relaxed Ordering attribute
> | in hardware with no device driver intervention.
>
> Read that last sentence again.
> It suggests using readb() variants are the wrong approach.

Yep, you're right. Adding readX() variants would definitely be the wrong thing
to do if we want to support PCI-X RO correctly.

> I'll assert SN2 is non-coherent with RO enabled.
> "mostly coherent" is probably the right level of fuzziness.
> But linux doesn't have a "mostly coherent" DMA API. :^)

I'll buy that.

> [ James (Bottomley) - I couldn't find a definition of "non-consistent
> memory machine" in DMA-ABI.txt. Was that intentional or could you
> include a variant of the above definition?
> I guess if one needed to include a definition, then the reader
> shouldn't be using the interfaces described in Part II.
> But this is a key distinction from DMA-mapping.txt. ]
>
>
> > Right, that's another option--adding a pci_sync_consistent() call.
>
> yes - something like this would be my preference mostly because it's
> less intrusive to the drivers, less confusing for driver writers,
> and can be a complete NOP on most platforms.
>
> BTW, Jesse, did you look at part II of Documentation/DMA-ABI.txt?

I remember seeing discussion of the new API, but haven't read that doc
yet. Since most drivers still use the pci_* API, we'd have to add a
call there, but we may as well make the two APIs as similar as possible
right?

Jesse

2004-01-08 17:43:49

by Jesse Barnes

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 08:23:49AM -0800, Leonid Grossman wrote:
> > I interpret this to mean:
> > Setting the RO bit in the PCI-X Command Register only enables
> > the device to choose when to set RO Attribute bit when the device
> > generates a PCI-X bus cycle.
>
> Yes, this is exactly how (at least our 10GbE) PCI-X ASICs work.
> If the RO bit is set, the device decides whether the transaction
> requires strong ordering, and sets RO attribute accordingly.

Excellent, a card in the wild that actually does this! :) Ok, now I'll
take off my sn2 tunnel vision glasses--we don't want another readX
variant, but it sounds like we'll need pcix_enable_relaxed() _and_
pci_sync_consistent() to support non-coherent platforms well. How does
that sound?

Thanks,
Jesse

2004-01-08 17:54:41

by Christoph Hellwig

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 08:23:49AM -0800, Leonid Grossman wrote:
> Yes, this is exactly how (at least our 10GbE) PCI-X ASICs work.
> If the RO bit is set, the device decides whether the transaction
> requires strong ordering, and sets RO attribute accordingly.

Do you have a pointer to the driver source? This would probably
make a good reference driver for Jesse's suggestion.

2004-01-08 18:44:56

by Grant Grundler

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 09:36:55AM -0800, Jesse Barnes wrote:
> > BTW, Jesse, did you look at part II of Documentation/DMA-ABI.txt?
>
> I remember seeing discussion of the new API, but haven't read that doc
> yet. Since most drivers still use the pci_* API, we'd have to add a
> call there, but we may as well make the two APIs as similar as possible
> right?

That would be my preference too.

I haven't studied "part II" closely enough to figure out if adding
pci_sync_consistent() would outright replace much of the DMA-API
interface. The main issue is cacheline ownership.

pci_sync_consistent() needs to indicate CPU wants ownership of outstanding
cachelines vs IO device wanting to own them.
SN2 doesn't care about the latter case since it's "mostly coherent".
SN2 just needs to flush in-flight DMA and it's coherent again.
But older non-coherent platforms do care.
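
The ownership question can be sketched as a direction argument on the sync call. Everything below is a toy model with hypothetical names (`enum toy_sync_dir`, `struct toy_platform`, `toy_sync_consistent()`); the counters stand in for real flush, invalidate and writeback operations, and the split between the two platform types is an assumption made for illustration.

```c
/* Who is about to touch the buffer next. */
enum toy_sync_dir {
    TOY_SYNC_FOR_CPU,     /* CPU is about to read what the device wrote */
    TOY_SYNC_FOR_DEVICE   /* device is about to read what the CPU wrote */
};

/* Toy platform description plus counters for the work performed. */
struct toy_platform {
    int coherent_caches;    /* 1: hardware keeps CPU caches coherent with DMA */
    int dma_flushes;        /* times in-flight DMA was pushed to memory */
    int cache_invalidates;  /* times stale CPU cachelines were discarded */
    int cache_writebacks;   /* times dirty CPU cachelines were written back */
};

static void toy_sync_consistent(struct toy_platform *p, enum toy_sync_dir dir)
{
    if (dir == TOY_SYNC_FOR_CPU) {
        p->dma_flushes++;             /* all that a "mostly coherent" box needs */
        if (!p->coherent_caches)
            p->cache_invalidates++;   /* extra work on older non-coherent boxes */
    } else if (!p->coherent_caches) {
        p->cache_writebacks++;        /* only non-coherent platforms care here */
    }
}
```

In this model the "mostly coherent" machine does nothing at all for the device direction, while the non-coherent one has work to do in both.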

I trust James understands this better than I given the fun
he's had with old parisc HW (715/50).

grant

2004-01-08 19:54:28

by Leonid Grossman

Subject: RE: [RFC] Relaxed PIO read vs. DMA write ordering



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Thursday, January 08, 2004 9:54 AM
> To: Leonid Grossman
> Cc: 'Grant Grundler'; 'Jesse Barnes';
> [email protected]; [email protected]; 'Matthew
> Wilcox'; [email protected];
> [email protected]
> Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering
>
>
> On Thu, Jan 08, 2004 at 08:23:49AM -0800, Leonid Grossman wrote:
> > Yes, this is exactly how (at least our 10GbE) PCI-X ASICs work.
> > If the RO bit is set, the device decides whether the transaction
> > requires strong ordering, and sets RO attribute accordingly.
>
> Do you have a pointer to the driver source? This would probably
> make a good reference driver for Jesse's suggestion.
>

Right now the code goes to our OEMs and end-user customers along with
the cards. We are planning to submit the driver to the 2.6 kernel in
about 3 weeks or so. At that point we will also 'unmask' it on the s2io
ftp site for downloads.

Leonid

2004-01-09 07:14:08

by Jeremy Higdon

Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 11:44:06AM -0700, Grant Grundler wrote:
> On Thu, Jan 08, 2004 at 09:36:55AM -0800, Jesse Barnes wrote:
> > > BTW, Jesse, did you look at part II of Documentation/DMA-API.txt?
> >
> > I remember seeing discussion of the new API, but haven't read that doc
> > yet. Since most drivers still use the pci_* API, we'd have to add a
> > call there, but we may as well make the two APIs as similar as possible
> > right?
>
> That would be my preference too.
>
> I haven't studied "part II" closely enough to figure out if adding
> pci_sync_consistent() would outright replace much of the DMA-API
> interface. The main issue is cacheline ownership.
>
> pci_sync_consistent() needs to indicate CPU wants ownership of outstanding
> cachelines vs IO device wanting to own them.
> SN2 doesn't care about the latter case since it's "mostly coherent".
> SN2 just needs to flush in-flight DMA and it's coherent again.
> But older non-coherent platforms do care.


What if the host/bridge sets the RO bit on a PIO read? That would
allow a PIO read response to bypass a DMA write. Now, maybe that
doesn't make much sense with respect to PCI-X. I think it's possible,
though. Or can the RO bit only be set by a device?

In any case, if we can do a PIO read to one address space that flushes
DMA ahead of it or another address space that does not, then you would
need a separate version of readX, rather than an extra call to sync
after the read.

In theory, such a distinction would be useful for any platform that
uses separate paths for PIO read responses and DMA writes. Perhaps
there is only one platform that has that feature?

jeremy

2004-01-09 07:41:35

by Jochen Friedrich

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

Hi Jesse,

> > BTW, Jesse, did you look at part II of Documentation/DMA-API.txt?
>
> I remember seeing discussion of the new API, but haven't read that doc
> yet. Since most drivers still use the pci_* API, we'd have to add a
> call there, but we may as well make the two APIs as similar as possible
> right?

And there are reasons for drivers still using the pci_* API. In tms380tr,
I support both PCI and ISA cards. The pci_* API supports mapping ISA cards
for bus master DMA by passing a NULL for pdev. The new API still fails
because of the BUG_ON(dev->bus != &pci_bus_type). Unfortunately, on 64 bit
platforms like Alpha, the mapping is required to set up the IOMMU.

--jochen

2004-01-09 19:51:57

by Jesse Barnes

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 11:13:47PM -0800, Jeremy Higdon wrote:
> In theory, such a distinction would be useful for any platform that
> uses separate paths for PIO read responses and DMA writes. Perhaps
> there is only one platform that has that feature?

Well, the big MIPS boxes behave this way too. Ralf and Christoph, what
do you think? The current PCI DMA mapping API is insufficient for
Origin and other Bridge/Hub boxes, although I don't think there's any
way to implement pci_sync_consistent() on Origin unless there happens to
be an extra interrupt line available for each PCI slot. That would
allow the arch code to generate a DMA write barrier (by forcing an
interrupt as we do on sn2) that would force previous DMA into the
coherence domain before it completed.

Jesse

2004-01-09 20:02:43

by Grant Grundler

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, Jan 08, 2004 at 11:13:47PM -0800, Jeremy Higdon wrote:
> What if the host/bridge sets the RO bit on a PIO read? That would
> allow a PIO read response to bypass a DMA write.

*Any* bridge that is on the common path from CPU/Mem to PCI-X
can choose to implement RO any way it likes. Even though such a
bridge may not use PCI-X, the entire system must honor the semantics
required by PCI/PCI-X one way or another. RO is optional IMHO.

> Now, maybe that
> doesn't make much sense with respect to PCI-X. I think it's possible,
> though. Or can the RO bit only be set by a device?

The ordering of transactions on the way to the device is not the obvious
problem here. It's the ordering of transactions originated by the device.
A read return is generated by the device on PCI-X since PCI-X supports
split transactions.

> In any case, if we can do a PIO read to one address space that flushes
> DMA ahead of it or another address space that does not, then you would
> need a separate version of readX, rather than an extra call to sync
> after the read.

That would be optimal yes.
But a functional pci_sync_consistent() implementation would consist
of a MMIO Read using the address space that enforces ordering.
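
As a rough illustration of that idea, here is a small userspace model
(not kernel code). All names in it (bridge_model, bridge_read,
model_sync_consistent, WIN_ORDERED, WIN_RELAXED) are hypothetical; it
only assumes a bridge that exposes the same register through two
windows, one of which pushes in-flight DMA ahead of the read response.

```c
#include <stdint.h>

/* Hypothetical model: one device register reachable through two windows. */
enum window { WIN_RELAXED, WIN_ORDERED };

struct bridge_model {
    uint32_t reg;            /* register value returned by a PIO read */
    unsigned pending_dma;    /* DMA writes still in flight */
    unsigned completed_dma;  /* DMA writes visible in system memory */
};

/* A read through the ordered window drains in-flight DMA first;
 * a read through the relaxed window returns without waiting. */
static uint32_t bridge_read(struct bridge_model *b, enum window w)
{
    if (w == WIN_ORDERED) {
        b->completed_dma += b->pending_dma;
        b->pending_dma = 0;
    }
    return b->reg;
}

/* Sketch of a sync call built on top of an ordered read:
 * the value read is discarded, only the flush side effect matters. */
static void model_sync_consistent(struct bridge_model *b)
{
    (void)bridge_read(b, WIN_ORDERED);
}
```

In this model a driver would do its normal reads through WIN_RELAXED
and call model_sync_consistent() only where it needs DMA to be visible.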

As I pointed out earlier, the spec clearly states it's up to the
device to specify the ordering, not the device driver. The device
driver can only choose to enable the feature or not.

grant

2004-01-09 20:27:24

by Grant Grundler

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Fri, Jan 09, 2004 at 08:39:28AM +0100, Jochen Friedrich wrote:
> And there are reasons for drivers still using the pci_* API. In tms380tr,
> i support both PCI and ISA cards. The pci_* API supports mapping ISA cards
> for bus master DMA by passing a NULL for pdev. The new API still fails
> because of the BUG_ON(dev->bus != &pci_bus_type).

I don't think that's a problem of API, rather the implementation.

> Unfortunately, on 64 bit
> platforms like Alpha, the mapping is required to set up the IOMMU.

Not just alpha.
ia64, parisc, x86_64, sparc64, mips, (and a few others) also have IO MMUs.

grant

2004-01-09 22:12:46

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Fri, Jan 09, 2004 at 01:27:18PM -0700, Grant Grundler wrote:
> ia64, parisc, x86_64, sparc64, mips, (and a few others) also have IO MMUs.

Hmm, IOMMU *and* ISA slots? :-)

Ivan.

2004-01-09 23:16:06

by Jesse Barnes

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Fri, Jan 09, 2004 at 11:51:47AM -0800, Jesse Barnes wrote:
> On Thu, Jan 08, 2004 at 11:13:47PM -0800, Jeremy Higdon wrote:
> > In theory, such a distinction would be useful for any platform that
> > uses separate paths for PIO read responses and DMA writes. Perhaps
> > there is only one platform that has that feature?

Let me clarify this a bit more (and remember to Cc Ralf and Christoph
this time): there are two types of ordering we're worried about here,
DMA vs. other DMA ordering and DMA vs. PIO ordering.

Some platforms allow DMA buffers to arrive at their destination out of
order when a barrier bit is unset. SGI machines using Bridge (or Bridge
variant like sn2 or Origin) chips are implemented this way, so if a DMA
transaction from a device to system memory doesn't have the barrier bit
set it's allowed to pass other non-barriered DMA.
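
To make the barrier rule concrete, here is a toy userspace model (not
how the hardware works internally). The names dma_write and
deliver_worst_case are made up for illustration. It assumes that writes
between two barriered writes may arrive in any order, and picks the
reverse of issue order as one legal worst case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical DMA write descriptor: an id plus the barrier bit. */
struct dma_write {
    int  id;
    bool barrier;
};

/* Emit one legal arrival order: every run of non-barriered writes is
 * reversed, while a barriered write arrives only after all writes
 * issued before it and before all writes issued after it. */
static void deliver_worst_case(const struct dma_write *in, size_t n,
                               int *out_ids)
{
    size_t out = 0, run_start = 0;

    for (size_t i = 0; i < n; i++) {
        if (!in[i].barrier)
            continue;
        /* Flush the non-barriered run [run_start, i) in reverse. */
        for (size_t j = i; j > run_start; j--)
            out_ids[out++] = in[j - 1].id;
        out_ids[out++] = in[i].id;   /* then the barriered write */
        run_start = i + 1;
    }
    /* Trailing non-barriered writes, again reversed. */
    for (size_t j = n; j > run_start; j--)
        out_ids[out++] = in[j - 1].id;
}
```

A driver that sets the barrier bit only on its final "done" write gets
the guarantee it needs without ordering every data buffer.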

PIOs are another matter. SGI machines using the above chips enforce
_no_ ordering whatsoever between DMA and PIO traffic. That is, a PIO
read response can 'pass' a DMA write (even a barriered one) and get to
the requesting CPU before the DMA is in the coherence domain. So in
effect, all PIOs on SGI boxes are 'relaxed' by default. For sn2, we've
implemented a special sn_dma_flush() function that we use following a
PIO read in our read() and in() routines so that the driver can be
assured that any DMA that it thinks is done actually is. I'm not sure
this workaround is possible on Origin machines, so certain devices may
be open to data corruption depending on how they interact with the host
system. Our next I/O chip will implement PIO vs. DMA ordering as a
separate address space.

Given the above, a new read_relaxed() routine or an ioremap_relaxed()
routine seems like the best solution for us. Neither is invasive at all
for current drivers (i.e. existing stuff will 'just work'), but it does
add complexity to the API. Adding a pci_enable_relaxed() routine is a
completely separate issue since it will be a noop on our platform (and
it's probably silly to implement until we have real hardware that needs
it).
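
A minimal userspace sketch of the strict vs. relaxed read split follows.
It is a model only, not the sn2 code: model_dev, model_dma_post,
model_dma_flush, model_readl and model_readl_relaxed are hypothetical
names, and the flush simply copies queued data into an array standing
in for system memory.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS 4

/* Hypothetical device model with DMA writes that may still be in flight. */
struct model_dev {
    uint32_t status_reg;             /* register read via PIO */
    uint32_t inflight[MODEL_SLOTS];  /* DMA data not yet coherent */
    size_t   n_inflight;
    uint32_t memory[MODEL_SLOTS];    /* system memory as the CPU sees it */
    size_t   n_landed;
};

/* Device side: queue a DMA write; returns -1 when the model is full. */
static int model_dma_post(struct model_dev *d, uint32_t val)
{
    if (d->n_inflight + d->n_landed >= MODEL_SLOTS)
        return -1;
    d->inflight[d->n_inflight++] = val;
    return 0;
}

/* Stand-in for a platform DMA flush: land everything in flight. */
static void model_dma_flush(struct model_dev *d)
{
    for (size_t i = 0; i < d->n_inflight; i++)
        d->memory[d->n_landed++] = d->inflight[i];
    d->n_inflight = 0;
}

/* Strict read: PIO read followed by a flush, so DMA the device
 * reported as done is visible when the value is returned. */
static uint32_t model_readl(struct model_dev *d)
{
    uint32_t v = d->status_reg;
    model_dma_flush(d);
    return v;
}

/* Relaxed read: no flush, the response may pass in-flight DMA. */
static uint32_t model_readl_relaxed(struct model_dev *d)
{
    return d->status_reg;
}
```

Existing drivers would keep calling the strict variant and see no
change; only drivers that know a read is unrelated to DMA completion
would opt in to the relaxed one.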

Anyway, thanks for your patience. I probably made a mistake trying to
tie this proposal to the PCI-X spec since I'm only guessing at how such
hardware might behave...

Thanks,
Jesse

2004-01-11 14:35:16

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC] Relaxed PIO read vs. DMA write ordering

On Thu, 2004-01-08 at 13:44, Grant Grundler wrote:
> I haven't studied "part II" closely enough to figure out if adding
> pci_sync_consistent() would outright replace much of the DMA-API
> interface. The main issue is cacheline ownership.
>
> pci_sync_consistent() needs to indicate CPU wants ownership of outstanding
> cachelines vs IO device wanting to own them.
> SN2 doesn't care about the latter case since it's "mostly coherent".
> SN2 just needs to flush in-flight DMA and it's coherent again.
> But older non-coherent platforms do care.
>
> I trust James understands this better than I given the fun
> he's had with old parisc HW (715/50).

Sorry for being a bit late...I was travelling and didn't have the time
to go over the whole thread until now.

Let me clarify what Part II of the DMA-API is about: it's for drivers
that may be required to operate both on hardware that has a coherency
domain and hardware that hasn't.

Its design is primarily to be as efficient as possible on coherency
domain hardware.

I think it can do exactly what you want for the RO case, because it was
tailored for almost precisely this problem (guaranteeing mailbox
reads/writes become coherent). I think dma_cache_sync() corresponds
almost exactly to the semantics you would require of
pci_sync_consistent().

Of course, it's not the whole solution because even on hardware without
a coherency domain, PIO reads/writes are still coherent.

James