2007-08-18 00:29:46

by Arthur Kepner

Subject: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64


Define ARCH_DOES_POSTED_DMA, and override the dma_flags_set_dmaflush
function. Also define dma_flags_get_direction, dma_flags_get_dmaflush
to retrieve the direction and "dmaflush attribute" from flags
passed to the sn_dma_map_* functions.


Signed-off-by: Arthur Kepner <[email protected]>
--
arch/ia64/sn/pci/pci_dma.c | 35 ++++++++++++++++++++++++++---------
include/asm-ia64/sn/io.h | 26 ++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d79ddac..754240b 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* @dev: device to map for
* @cpu_addr: kernel virtual address of the region to map
* @size: size of the region
- * @direction: DMA direction
+ * @flags: DMA direction, and arch-specific attributes
*
* Map the region pointed to by @cpu_addr for DMA and return the
* DMA address.
@@ -167,17 +167,23 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* figure out how to save dmamap handle so can use two step.
*/
dma_addr_t sn_dma_map_single(struct device *dev, void *cpu_addr, size_t size,
- int direction)
+ int flags)
{
dma_addr_t dma_addr;
unsigned long phys_addr;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+ int dmaflush = dma_flags_get_dmaflush(flags);

BUG_ON(dev->bus != &pci_bus_type);

phys_addr = __pa(cpu_addr);
- dma_addr = provider->dma_map(pdev, phys_addr, size, SN_DMA_ADDR_PHYS);
+ if (dmaflush)
+ dma_addr = provider->dma_map_consistent(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
+ else
+ dma_addr = provider->dma_map(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
if (!dma_addr) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
return 0;
@@ -240,18 +246,20 @@ EXPORT_SYMBOL(sn_dma_unmap_sg);
* @dev: device to map for
* @sg: scatterlist to map
* @nhwentries: number of entries
- * @direction: direction of the DMA transaction
+ * @flags: direction of the DMA transaction, and arch-specific attributes
*
* Maps each entry of @sg for DMA.
*/
int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
- int direction)
+ int flags)
{
unsigned long phys_addr;
struct scatterlist *saved_sg = sg;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
int i;
+ int dmaflush = dma_flags_get_dmaflush(flags);
+ int direction = dma_flags_get_direction(flags);

BUG_ON(dev->bus != &pci_bus_type);

@@ -259,12 +267,21 @@ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
* Setup a DMA address for each entry in the scatterlist.
*/
for (i = 0; i < nhwentries; i++, sg++) {
+ dma_addr_t dma_addr;
phys_addr = SG_ENT_PHYS_ADDRESS(sg);
- sg->dma_address = provider->dma_map(pdev,
- phys_addr, sg->length,
- SN_DMA_ADDR_PHYS);

- if (!sg->dma_address) {
+ if (dmaflush) {
+ dma_addr = provider->dma_map_consistent(pdev,
+ phys_addr,
+ sg->length,
+ SN_DMA_ADDR_PHYS);
+ } else {
+ dma_addr = provider->dma_map(pdev,
+ phys_addr, sg->length,
+ SN_DMA_ADDR_PHYS);
+ }
+
+ if (!(sg->dma_address = dma_addr)) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);

/*
diff --git a/include/asm-ia64/sn/io.h b/include/asm-ia64/sn/io.h
index 41c73a7..c82eb90 100644
--- a/include/asm-ia64/sn/io.h
+++ b/include/asm-ia64/sn/io.h
@@ -271,4 +271,30 @@ sn_pci_set_vchan(struct pci_dev *pci_dev, unsigned long *addr, int vchan)
return 0;
}

+#define ARCH_DOES_POSTED_DMA
+/* here we steal some upper bits from the "direction" argument to the
+ * dma_map_* routines */
+/* bottom 8 bits for direction, remaining bits for additional "attributes" */
+#define DMA_ATTR_SHIFT 8
+/* For now the only attribute is "flush in-flight dma when writing to
+ * this DMA mapped memory" */
+#define DMA_FLUSH_ATTR 0x1
+#define DMA_DIR_MASK ((1 << DMA_ATTR_SHIFT) - 1)
+#define DMA_ATTR_MASK (~DMA_DIR_MASK)
+
+static inline int
+dma_flags_set_dmaflush(int dir) {
+ return (dir | (DMA_FLUSH_ATTR << DMA_ATTR_SHIFT));
+}
+
+static inline int
+dma_flags_get_direction(int dir) {
+ return (dir & DMA_DIR_MASK);
+}
+
+static inline int
+dma_flags_get_dmaflush(int dir) {
+ return (((dir & DMA_ATTR_MASK) >> DMA_ATTR_SHIFT) & DMA_FLUSH_ATTR);
+}
+
#endif /* _ASM_SN_IO_H */
--
Arthur
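
For readers tracing the bit arithmetic, here is how the three helpers
compose. This is a standalone userspace sketch, not part of the patch;
the enum dma_data_direction values are copied from the kernel headers,
and the printf harness is purely illustrative:

#include <stdio.h>

/* Same encoding as the io.h hunk above: the low 8 bits carry the DMA
 * direction, the bit just above them carries the "dmaflush" attribute. */
#define DMA_ATTR_SHIFT	8
#define DMA_FLUSH_ATTR	0x1
#define DMA_DIR_MASK	((1 << DMA_ATTR_SHIFT) - 1)
#define DMA_ATTR_MASK	(~DMA_DIR_MASK)

enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

static int dma_flags_set_dmaflush(int dir)
{
	return dir | (DMA_FLUSH_ATTR << DMA_ATTR_SHIFT);
}

static int dma_flags_get_direction(int flags)
{
	return flags & DMA_DIR_MASK;
}

static int dma_flags_get_dmaflush(int flags)
{
	return ((flags & DMA_ATTR_MASK) >> DMA_ATTR_SHIFT) & DMA_FLUSH_ATTR;
}

int main(void)
{
	int flags = dma_flags_set_dmaflush(DMA_FROM_DEVICE);

	/* The direction survives unchanged in the low bits ... */
	printf("direction = %d\n", dma_flags_get_direction(flags)); /* 2 */
	/* ... and the flush attribute is recovered from the bits above */
	printf("dmaflush  = %d\n", dma_flags_get_dmaflush(flags));  /* 1 */
	return 0;
}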


2007-08-20 08:26:35

by Jes Sorensen

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

[email protected] wrote:
> Define ARCH_DOES_POSTED_DMA, and override the dma_flags_set_dmaflush
> function. Also define dma_flags_get_direction, dma_flags_get_dmaflush
> to retrieve the direction and "dmaflush attribute" from flags
> passed to the sn_dma_map_* functions.
>

Hi Arthur,

I'm a little concerned about changing the API for the dma_ foo
functions, which are defined cross platform. If you want to change
that, I think it will require updating the documentation explaining
it.

Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or
does it have to be run-time tested? (does it do anything at this
stage). I ask since not all ia64 platforms do posted DMA.

Cheers,
Jes

2007-08-20 16:08:30

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Mon, Aug 20, 2007 at 10:24:53AM +0200, Jes Sorensen wrote:

> I'm a little concerned about changing the API for the dma_ foo
> functions, which are defined cross platform. If you want to change
> that, I think it will require updating the documentation explaining
> it.

Good idea. I'll post a documentation patchette.

>
> Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or
> does it have to be run-time tested? (does it do anything at this
> stage). I ask since not all ia64 platforms do posted DMA.
>

I think a #define is exactly what we want here.

The newly #define-ed function (for now, and maybe forever) does
nothing except on IA64_SGI_SN2, which does posted DMA.

--
Arthur
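
Patch 1/3 (not reproduced in this thread) presumably supplies the
generic fallbacks, so architectures that never define
ARCH_DOES_POSTED_DMA see no change in behavior. A hypothetical sketch
of what those no-op defaults would look like:

#ifndef ARCH_DOES_POSTED_DMA
/* Hypothetical generic versions: without posted DMA the "flags"
 * argument degenerates to a plain enum dma_data_direction value. */
static inline int dma_flags_set_dmaflush(int dir)
{
	return dir;		/* nothing to set */
}

static inline int dma_flags_get_direction(int flags)
{
	return flags;		/* flags carry only the direction */
}

static inline int dma_flags_get_dmaflush(int flags)
{
	return 0;		/* the attribute is never set */
}
#endif /* !ARCH_DOES_POSTED_DMA */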

2007-08-21 19:36:37

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64


> I'm a little concerned about changing the API for the dma_ foo
> functions, which are defined cross platform. If you want to change
> that, I think it will require updating the documentation explaining
> it.....

What do you think of the following? (And is there anyone else
I should be cc-ing for review?)


Document semantics of dma_flags_set_dmaflush()

Signed-off-by: Arthur Kepner <[email protected]>
--
DMA-API.txt | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index cc7a8c3..e117b72 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -392,6 +392,28 @@ Notes: You must do this:

See also dma_map_single().

+int
+dma_flags_set_dmaflush(int dir)
+
+Amend dir (one of the enum dma_data_direction values), with a platform-
+specific "dmaflush" attribute. Unless the platform supports "posted DMA"
+this is a no-op.
+
+On platforms that support posted DMA, dma_flags_set_dmaflush() causes
+device writes to the associated memory region to flush in-flight DMA.
+This can be important, for example, when (DMA) writes to the memory
+region indicate that DMA of data is complete. If DMA of data and DMA of
+the completion indication race, as they can do when the platform supports
+posted DMA, then the completion indication may arrive in host memory
+ahead of some data.
+
+To prevent this, you might map the memory region used for completion
+indications as follows:
+
+ int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
+ .....
+ count = dma_map_sg(dev, sglist, nents, flags);
+

Part II - Advanced dma_ usage
-----------------------------
--
Arthur

2007-08-21 19:59:55

by Randy Dunlap

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, 21 Aug 2007 12:35:22 -0700 [email protected] wrote:

>
> > I'm a little concerned about changing the API for the dma_ foo
> > functions, which are defined cross platform. If you want to change
> > that, I think it will require updating the documentation explaining
> > it.....
>
> What do you think of the following? (And is there anyone else
> I should be cc-ing for review?)

probably the document's author (cc added)


> Document semantics of dma_flags_set_dmaflush()
>
> Signed-off-by: Arthur Kepner <[email protected]>
> --
> DMA-API.txt | 22 ++++++++++++++++++++++
> 1 files changed, 22 insertions(+)
>
> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
> index cc7a8c3..e117b72 100644
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -392,6 +392,28 @@ Notes: You must do this:
>
> See also dma_map_single().
>
> +int
> +dma_flags_set_dmaflush(int dir)
> +
> +Amend dir (one of the enum dma_data_direction values), with a platform-

no comma.

> +specific "dmaflush" attribute. Unless the platform supports "posted DMA"

add comma after "posted DMA" and drop lots of trailing spaces.

> +this is a no-op.
> +
> +On platforms that support posted DMA, dma_flags_set_dmaflush() causes
> +device writes to the associated memory region to flush in-flight DMA.
> +This can be important, for example, when (DMA) writes to the memory
> +region indicate that DMA of data is complete. If DMA of data and DMA of
> +the completion indication race, as they can do when the platform supports
> +posted DMA, then the completion indication may arrive in host memory
> +ahead of some data.
> +
> +To prevent this, you might map the memory region used for completion
> +indications as follows:
> +
> + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
> + .....
> + count = dma_map_sg(dev, sglist, nents, flags);
> +
>
> Part II - Advanced dma_ usage
> -----------------------------
> --
> Arthur


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2007-08-21 20:16:47

by Matthew Wilcox

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, Aug 21, 2007 at 12:35:22PM -0700, [email protected] wrote:
> +int
> +dma_flags_set_dmaflush(int dir)
> +
> +Amend dir (one of the enum dma_data_direction values), with a platform-
> +specific "dmaflush" attribute. Unless the platform supports "posted DMA"
> +this is a no-op.
> +
> +On platforms that support posted DMA, dma_flags_set_dmaflush() causes
> +device writes to the associated memory region to flush in-flight DMA.
> +This can be important, for example, when (DMA) writes to the memory
> +region indicate that DMA of data is complete. If DMA of data and DMA of
> +the completion indication race, as they can do when the platform supports
> +posted DMA, then the completion indication may arrive in host memory
> +ahead of some data.

So, let me try to understand ... your hardware allows writes from the
device to pass other writes from the device? Doesn't that violate the
PCI spec? I'm thinking about this (page 43 of PCI 2.3):

Posted memory writes moving in the same direction through a bridge
will complete on the destination bus in the same order they complete
on the originating bus. Even if a single burst on the originating bus
is terminated with Disconnect on the destination bus so that it is
broken into multiple transactions, those transactions must not allow
the data phases to complete on the destination bus in any order other
than their order on the originating bus.

> +To prevent this, you might map the memory region used for completion
> +indications as follows:
> +
> + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
> + .....
> + count = dma_map_sg(dev, sglist, nents, flags);

So any device driver used on your hardware has to add a call to this new
function, or it'll see data corruption? Not acceptable, IMO.

If this is a performance optimisation, then by all means add a function
drivers can call to say "it's OK, I know about this brokenness, and I
don't depend on it", but safety first.

--
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-08-21 20:55:55

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, 2007-08-21 at 13:05 -0700, Randy Dunlap wrote:
> > diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
> > index cc7a8c3..e117b72 100644
> > --- a/Documentation/DMA-API.txt
> > +++ b/Documentation/DMA-API.txt
> > @@ -392,6 +392,28 @@ Notes: You must do this:
> >
> > See also dma_map_single().
> >
> > +int
> > +dma_flags_set_dmaflush(int dir)
> > +
> > +Amend dir (one of the enum dma_data_direction values), with a platform-
>
> no comma.
>
> > +specific "dmaflush" attribute. Unless the platform supports "posted DMA"
>
> add comma after "posted DMA" and drop lots of trailing spaces.
>
> > +this is a no-op

Almost every platform supports posted DMA ... it's a property of most PCI
bridge chips.

> > +On platforms that support posted DMA, dma_flags_set_dmaflush() causes
> > +device writes to the associated memory region to flush in-flight DMA.
> > +This can be important, for example, when (DMA) writes to the memory
> > +region indicate that DMA of data is complete. If DMA of data and DMA of
> > +the completion indication race, as they can do when the platform supports
> > +posted DMA, then the completion indication may arrive in host memory
> > +ahead of some data.

This isn't possible on most platforms. PCI write posting can only be
flushed by a read transaction on the device (or sometimes any device on
the bridge). Either this interface is misnamed and misdescribed, or it
can't work for most systems.

> > +To prevent this, you might map the memory region used for completion
> > +indications as follows:
> > +
> > + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
> > + .....
> > + count = dma_map_sg(dev, sglist, nents, flags);

James


2007-08-21 21:38:25

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, Aug 21, 2007 at 02:16:32PM -0600, Matthew Wilcox wrote:
>
> So, let me try to understand ... your hardware allows writes from the
> device to pass other writes from the device? Doesn't that violate the
> PCI spec? I'm thinking about this (page 43 of PCI 2.3):
> ....

I should have stated this more carefully, but it's not a PCI reordering
that's being addressed here; it's a reordering that can occur within
the NUMA interconnect.

--
Arthur

2007-08-22 00:36:04

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, Aug 21, 2007 at 03:55:29PM -0500, James Bottomley wrote:

> .....
> Almost every platform supports posted DMA ... it's a property of most PCI
> bridge chips.
>

The term "posted DMA" is used to describe this behavior in the Altix
Device Driver Writer's Guide, but it may be confusing things here.
Maybe a better term will suggest itself if I can clarify....

On Altix, DMA from a device isn't guaranteed to arrive in host memory
in the order it was sent from the device. This reordering can happen
in the NUMA interconnect (it's specifically not a PCI reordering.)

> ......
> This isn't possible on most platforms. PCI write posting can only be
> flushed by a read transaction on the device (or sometimes any device on
> the bridge). Either this interface is misnamed and misdescribed, or it
> can't work for most systems.
>

Clearly it wasn't described adequately...

A read transaction on the device will flush pending writes to the
device. But I'm worried about DMA from the device to host memory.
On Altix, there are two mechanisms that flush all in-flight DMA
to host memory: 1) an interrupt, and 2) a write to a memory region
which has a "barrier" attribute set. Obviously option 1 isn't
viable for performance reasons. This new interface is about making
"option 2" generally available. (As it is now, the only way to get
memory with the "barrier" attribute is to allocate it with
dma_alloc_coherent().)

--
Arthur
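
In code terms the contrast is roughly the following sketch, where
cq_buf and cq_len are hypothetical names and error handling is omitted:

	/* Today: the "barrier" attribute comes only with coherent memory */
	dma_addr_t handle;
	void *cq = dma_alloc_coherent(dev, cq_len, &handle, GFP_KERNEL);

	/* Proposed: ordinary memory (for example a buffer registered
	 * from userspace) gets a streaming mapping that carries the
	 * same barrier attribute */
	int flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
	dma_addr_t mapping = dma_map_single(dev, cq_buf, cq_len, flags);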

2007-08-22 01:14:26

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> On Tue, Aug 21, 2007 at 03:55:29PM -0500, James Bottomley wrote:
>
> > .....
> > Almost every platform supports posted DMA ... it's a property of most PCI
> > bridge chips.
> >
>
> The term "posted DMA" is used to describe this behavior in the Altix
> Device Driver Writer's Guide, but it may be confusing things here.
> Maybe a better term will suggest itself if I can clarify....

OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
the confusion.

> On Altix, DMA from a device isn't guaranteed to arrive in host memory
> in the order it was sent from the device. This reordering can happen
> in the NUMA interconnect (it's specifically not a PCI reordering.)

This is mmiowb and read_relaxed() again, isn't it?

> > ......
> > This isn't possible on most platforms. PCI write posting can only be
> > flushed by a read transaction on the device (or sometimes any device on
> > the bridge). Either this interface is misnamed and misdescribed, or it
> > can't work for most systems.
> >
>
> Clearly it wasn't described adequately...
>
> A read transaction on the device will flush pending writes to the
> device. But I'm worried about DMA from the device to host memory.
> On Altix, there are two mechanisms that flush all in-flight DMA
> to host memory: 1) an interrupt, and 2) a write to a memory region
> which has a "barrier" attribute set. Obviously option 1 isn't
> viable for performance reasons. This new interface is about making
> "option 2" generally available. (As it is now, the only way to get
> memory with the "barrier" attribute is to allocate it with
> dma_alloc_coherent().)

Which sounds exactly like what mmiowb does ... is there a need for a new API;
can't you just use mmiowb()?

James


2007-08-22 07:41:26

by Jes Sorensen

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

James Bottomley wrote:
> On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
>> The term "posted DMA" is used to describe this behavior in the Altix
>> Device Driver Writer's Guide, but it may be confusing things here.
>> Maybe a better term will suggest itself if I can clarify....
>
> OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
> the confusion.

Maybe it would be better to refer to this as 'out of order DMA'?

>> On Altix, DMA from a device isn't guaranteed to arrive in host memory
>> in the order it was sent from the device. This reordering can happen
>> in the NUMA interconnect (it's specifically not a PCI reordering.)
>
> This is mmiowb and read_relaxed() again, isn't it?

I believe it's the same problem, except this time it's when exposing
structures to userland.

Cheers,
Jes

2007-08-22 07:46:24

by Jes Sorensen

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

Matthew Wilcox wrote:
> So, let me try to understand ... your hardware allows writes from the
> device to pass other writes from the device? Doesn't that violate the
> PCI spec? I'm thinking about this (page 43 of PCI 2.3):

Hi Matthew,

Yes, I believe this behavior on sn2 is 'bending' the PCI spec a bit. The
problem is that the NUMA fabric is a routed network in itself so there
can be multiple paths between the device and the physical memory it DMAs
to/from.

Cheers,
Jes

2007-08-22 14:02:54

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote:
> James Bottomley wrote:
> > On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> >> The term "posted DMA" is used to describe this behavior in the Altix
> >> Device Driver Writer's Guide, but it may be confusing things here.
> >> Maybe a better term will suggest itself if I can clarify....
> >
> > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
> > the confusion.
>
> Maybe it would be better to refer to this as 'out of order DMA'?

Or relaxed-ordering DMA ... that's why we have readX_relaxed()?

> >> On Altix, DMA from a device isn't guaranteed to arrive in host memory
> >> in the order it was sent from the device. This reordering can happen
> >> in the NUMA interconnect (it's specifically not a PCI reordering.)
> >
> > This is mmiowb and read_relaxed() again, isn't it?
>
> I believe it's the same problem, except this time it's when exposing
> structures to userland.

Hmm, so how does another kernel API exposing mmiowb in a different way
help with this? Surely, if the driver is exporting something to user
space, it's simply up to the driver to call mmiowb when the user says
it's done?

James


2007-08-22 15:55:42

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Tue, Aug 21, 2007 at 08:14:09PM -0500, James Bottomley wrote:
> On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> .....
> > On Altix, DMA from a device isn't guaranteed to arrive in host memory
> > in the order it was sent from the device. This reordering can happen
> > in the NUMA interconnect (it's specifically not a PCI reordering.)
>
> This is mmiowb and read_relaxed() again, isn't it?
> .....

No, this is different.

The problem here has to do with ordering writes from the device
to host memory. Specifically this problem can be manifested with
Infiniband - when a Completion Queue Entry is written by the IB
device, it indicates that data is available in host memory. But
the write to the Completion Queue can race with DMA of data.

(Completion Queues can be allocated by the kernel or in userspace.
The race described above can only happen when they are allocated
in userspace - kernel allocations of CQs use dma_alloc_coherent()
and so get the "barrier" attribute that's needed to prevent the
race. The proposed new interface would allow CQs, or anything else,
allocated with plain old malloc(), to set the barrier attribute.)

--
Arthur
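
Spelled out as driver pseudo-C (cq, head, data_buf, and consume() are
all made-up names), the race is:

	/* device issues: DMA write of data, then DMA write of CQE */
	while (!cq[head].valid)		/* the CQE can become visible ... */
		cpu_relax();
	consume(data_buf);		/* ... before the data it announces
					 * has reached host memory */

Mapping the CQ with the dmaflush attribute makes the device's CQE write
flush the earlier data DMA, so a valid CQE implies the data is visible.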

2007-08-22 16:03:53

by Jesse Barnes

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wednesday, August 22, 2007 7:02:38 am James Bottomley wrote:
> On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote:
> > James Bottomley wrote:
> > > On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> > >> The term "posted DMA" is used to describe this behavior in the Altix
> > >> Device Driver Writer's Guide, but it may be confusing things here.
> > >> Maybe a better term will suggest itself if I can clarify....
> > >
> > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
> > > the confusion.
> >
> > Maybe it would be better to refer to this as 'out of order DMA'?
>
> Or relaxed-ordering DMA ... that's why we have readX_relaxed()?
>
> > >> On Altix, DMA from a device isn't guaranteed to arrive in host memory
> > >> in the order it was sent from the device. This reordering can happen
> > >> in the NUMA interconnect (it's specifically not a PCI reordering.)
> > >
> > > This is mmiowb and read_relaxed() again, isn't it?
> >
> > I believe it's the same problem, except this time it's when exposing
> > structures to userland.
>
> Hmm, so how does another kernel API exposing mmiowb in a different way
> help with this? Surely, if the driver is exporting something to user
> space, it's simply up to the driver to call mmiowb when the user says
> it's done?

mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's
other mail).

The problem is a DMA write (say to a completion queue) from a device may imply
something about another DMA write from the same device (say the actual data).
If the completion queue write arrives first (which can happen on sn2), the
driver must ensure that the rest of the outstanding DMA is complete prior to
looking at the completion queue status. It can either use a regular PIO read
to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA
address that makes it act as a barrier wrt other DMA, which is what akepner's
patch does (which should be much more efficient than using a PIO read to
guarantee DMA writes have completed).

Jesse
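
Jesse's "option 1", the ordinary non-relaxed PIO read, would look
roughly like this in a driver (dev_mmio and DUMMY_REG are hypothetical
names):

	if (cq[head].valid) {
		/* strictly-ordered MMIO read from the device; on Altix
		 * this flushes the device's in-flight DMA to host memory */
		(void)readl(dev_mmio + DUMMY_REG);
		consume(data_buf);	/* data is now guaranteed visible */
	}

Option 2 attaches the barrier semantics to the CQ mapping itself,
avoiding a PIO read on every completion, which is why it is the faster
path.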

2007-08-22 16:45:42

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, 2007-08-22 at 09:03 -0700, Jesse Barnes wrote:
> On Wednesday, August 22, 2007 7:02:38 am James Bottomley wrote:
> > On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote:
> > > James Bottomley wrote:
> > > > On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> > > >> The term "posted DMA" is used to describe this behavior in the Altix
> > > >> Device Driver Writer's Guide, but it may be confusing things here.
> > > >> Maybe a better term will suggest itself if I can clarify....
> > > >
> > > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
> > > > the confusion.
> > >
> > > Maybe it would be better to refer to this as 'out of order DMA'?
> >
> > Or relaxed-ordering DMA ... that's why we have readX_relaxed()?
> >
> > > >> On Altix, DMA from a device isn't guaranteed to arrive in host memory
> > > >> in the order it was sent from the device. This reordering can happen
> > > >> in the NUMA interconnect (it's specifically not a PCI reordering.)
> > > >
> > > > This is mmiowb and read_relaxed() again, isn't it?
> > >
> > > I believe it's the same problem, except this time it's when exposing
> > > structures to userland.
> >
> > Hmm, so how does another kernel API exposing mmiowb in a different way
> > help with this? Surely, if the driver is exporting something to user
> > space, it's simply up to the driver to call mmiowb when the user says
> > it's done?
>
> mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's
> other mail).
>
> The problem is a DMA write (say to a completion queue) from a device may imply
> something about another DMA write from the same device (say the actual data).
> If the completion queue write arrives first (which can happen on sn2), the
> driver must ensure that the rest of the outstanding DMA is complete prior to
> looking at the completion queue status. It can either use a regular PIO read
> to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA
> address that makes it act as a barrier wrt other DMA, which is what akepner's
> patch does (which should be much more efficient than using a PIO read to
> guarantee DMA writes have completed).

This is a violation of the PCI spec, isn't it, like Matthew pointed out?
The only time a device->host DMA transaction shouldn't follow strict
ordering is when the device sets the relaxed hint in its PCI registers.

James




2007-08-22 16:51:31

by Jesse Barnes

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wednesday, August 22, 2007 9:44:55 am James Bottomley wrote:
> > The problem is a DMA write (say to a completion queue) from a device may
> > imply something about another DMA write from the same device (say the
> > actual data). If the completion queue write arrives first (which can
> > happen on sn2), the driver must ensure that the rest of the outstanding
> > DMA is complete prior to looking at the completion queue status. It can
> > either use a regular PIO read to do this (i.e. a non-relaxed one) or set
> > a flag on the completion queue DMA address that makes it act as a barrier
> > wrt other DMA, which is what akepner's patch does (which should be much
> > more efficient than using a PIO read to guarantee DMA writes have
> > completed).
>
> This is a violation of the PCI spec, isn't it, like Matthew pointed out?
> The only time a device->host DMA transaction shouldn't follow strict
> ordering is when the device sets the relaxed hint in its PCI registers.

Yeah, it is. Whether it's allowed in PCIe depends on how you read the spec
(but either way it would need to be explicitly enabled).

For better or for worse, Altix hardware always behaves this way (well mostly
for the better, since most device protocols don't care as they involve PIO,
and out of order completion is *much* faster on Altix than strict ordering).

Arthur's patch is pretty straightforward though, so unless someone can think of
a better way of hiding this architectural detail in lower level code it's
probably a good thing to add (especially given that future revs of PCIe will
probably allow this behavior, and hopefully less ambiguously than the current
spec).

Jesse

2007-08-22 17:04:48

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, 2007-08-22 at 09:51 -0700, Jesse Barnes wrote:
> On Wednesday, August 22, 2007 9:44:55 am James Bottomley wrote:
> > > The problem is a DMA write (say to a completion queue) from a device may
> > > imply something about another DMA write from the same device (say the
> > > actual data). If the completion queue write arrives first (which can
> > > happen on sn2), the driver must ensure that the rest of the outstanding
> > > DMA is complete prior to looking at the completion queue status. It can
> > > either use a regular PIO read to do this (i.e. a non-relaxed one) or set
> > > a flag on the completion queue DMA address that makes it act as a barrier
> > > wrt other DMA, which is what akepner's patch does (which should be much
> > > more efficient than using a PIO read to guarantee DMA writes have
> > > completed).
> >
> > This is a violation of the PCI spec, isn't it, like Matthew pointed out?
> > The only time a device->host DMA transaction shouldn't follow strict
> > ordering is when the device sets the relaxed hint in its PCI registers.
>
> Yeah, it is. Whether it's allowed in PCIe depends on how you read the spec
> (but either way it would need to be explicitly enabled).
>
> For better or for worse, Altix hardware always behaves this way (well mostly
> for the better, since most device protocols don't care as they involve PIO,
> and out of order completion is *much* faster on Altix than strict ordering).
>
> Arthur's patch is pretty straightforward though, so unless someone can think of
> a better way of hiding this architectural detail in lower level code it's
> probably a good thing to add (especially given that future revs of PCIe will
> probably allow this behavior, and hopefully less ambiguously than the current
> spec).

The spec isn't ambiguous ... it says if the device and bridge agree on
relaxed ordering, then it *may* be observed in the transaction. If
there's a disagreement or neither wishes to support relaxed ordering
then the transaction *must* be strict.

I really don't think a workaround for a PCI spec violation belongs in
the generic DMA code, do you? The correct fix for this should be to set
the device hints to strict ordering, which presumably altix respects?
In which case, it sounds like what needs exposing are access to the PCI
device hints. I believe both PCI-X and PCIe have these hints as
optional specifiers in the bridges, so it should be in a current Rev of
the PCI spec. Or are you proposing adding an additional PCI API that
allows transaction flushes to be inserted into the stream for devices
and bridges that have already negotiated relaxed ordering? ... in which
case we need to take this to the PCI list.

James


2007-08-22 17:11:17

by Jes Sorensen

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

James Bottomley wrote:
> I really don't think a workaround for a PCI spec violation belongs in
> the generic DMA code, do you? The correct fix for this should be to set
> the device hints to strict ordering, which presumably altix respects?
> In which case, it sounds like what needs exposing are access to the PCI
> device hints. I believe both PCI-X and PCIe have these hints as
> optional specifiers in the bridges, so it should be in a current Rev of
> the PCI spec. Or are you proposing adding an additional PCI API that
> allows transaction flushes to be inserted into the stream for devices
> and bridges that have already negotiated relaxed ordering? ... in which
> case we need to take this to the PCI list.

James,

I don't believe it respects those hints - I agree, it's a pita, but
that's the state of the situation. Even if it did, it would make
performance suck as Jesse also pointed out.

As I pointed out in my email to Willy, the NUMA fabric is routed;
there's not one path through the system, which is what makes this
happen.

Jes

2007-08-22 17:17:35

by Jesse Barnes

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wednesday, August 22, 2007 10:04:32 am James Bottomley wrote:
> The spec isn't ambiguous ... it says if the device and bridge agree on
> relaxed ordering, then it *may* be observed in the transaction. If
> there's a disagreement or neither wishes to support relaxed ordering
> then the transaction *must* be strict.

Arg, don't make me dig out my spec, I don't have it handy...

> I really don't think a workaround for a PCI spec violation belongs in
> the generic DMA code, do you? The correct fix for this should be to set
> the device hints to strict ordering, which presumably altix respects?

Well, the Altix hw by itself won't honor device hints (I'm not even sure if
there are devices that honor order hints like you outline above). However,
Altix could track per-device ordering as long as arch code was called from
such a hook.

> In which case, it sounds like what needs exposing are access to the PCI
> device hints. I believe both PCI-X and PCIe have these hints as
> optional specifiers in the bridges, so it should be in a current Rev of
> the PCI spec. Or are you proposing adding an additional PCI API that
> allows transaction flushes to be inserted into the stream for devices
> and bridges that have already negotiated relaxed ordering? ... in which
> case we need to take this to the PCI list.

I think it would have to be the latter, since otherwise it would be hard to
set up just completion queue requests to be ordered.

Jesse


2007-08-22 18:10:27

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, 2007-08-22 at 19:03 +0200, Jes Sorensen wrote:
> James Bottomley wrote:
> > I really don't think a workaround for a PCI spec violation belongs in
> > the generic DMA code, do you? The correct fix for this should be to set
> > the device hints to strict ordering, which presumably altix respects?
> > In which case, it sounds like what needs exposing are access to the PCI
> > device hints. I believe both PCI-X and PCIe have these hints as
> > optional specifiers in the bridges, so it should be in a current Rev of
> > the PCI spec. Or are you proposing adding an additional PCI API that
> > allows transaction flushes to be inserted into the stream for devices
> > and bridges that have already negotiated relaxed ordering? ... in which
> > case we need to take this to the PCI list.
>
> James,
>
> I don't believe it respects those hints - I agree, it's a pita, but
> that's the state of the situation. Even if it did, it would make
> performance suck as Jesse also pointed out.
>
> As I pointed out in my email to Willy, the NUMA fabric is routed;
> there's not one path through the system, which is what makes this
> happen.

Hmm, didn't see the email ... but I'm probably not cc'd on all the
thread. However ... it isn't that you couldn't do it ... it's that you
don't want to do it because it's faster to violate the spec ... like all
those nice ATA devices that lie about having a cache and then let you
power down with uncommitted data still in it ... they work much faster
for HDIO tests ... and who ever switches their box off?

James


2007-08-22 18:13:22

by James Bottomley

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, 2007-08-22 at 10:17 -0700, Jesse Barnes wrote:
> On Wednesday, August 22, 2007 10:04:32 am James Bottomley wrote:
> > The spec isn't ambiguous ... it says if the device and bridge agree on
> > relaxed ordering, then it *may* be observed in the transaction. If
> > there's a disagreement or neither wishes to support relaxed ordering
> > then the transaction *must* be strict.
>
> Arg, don't make me dig out my spec, I don't have it handy...

Well ... I don't have one either. However, Grant Grundler did a
presentation to OLS about relaxed ordering, and I went over it pretty
thoroughly with him a while ago. The bottom line is that the default is
always strict unless both the bridge and the device agree otherwise.

> > I really don't think a workaround for a PCI spec violation belongs in
> > the generic DMA code, do you? The correct fix for this should be to set
> > the device hints to strict ordering, which presumably altix respects?
>
> Well, the Altix hw by itself won't honor device hints (I'm not even sure if
> there are devices that honor order hints like you outline above). However,
> Altix could track per-device ordering as long as arch code was called from
> such a hook.
>
> > In which case, it sounds like what needs exposing are access to the PCI
> > device hints. I believe both PCI-X and PCIe have these hints as
> > optional specifiers in the bridges, so it should be in a current Rev of
> > the PCI spec. Or are you proposing adding an additional PCI API that
> > allows transaction flushes to be inserted into the stream for devices
> > and bridges that have already negotiated relaxed ordering? ... in which
> > case we need to take this to the PCI list.
>
> I think it would have to be the latter, since otherwise it would be hard to
> set up just completion queue requests to be ordered.

OK ... I think this is definitely a PCI specific API ... and probably a
generic one for requesting order flushes in devices that have negotiated
relaxed ordering. Do you want to start a new thread on linux-pci and cc
me?

James


2007-08-22 18:45:39

by Arthur Kepner

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, Aug 22, 2007 at 01:13:04PM -0500, James Bottomley wrote:
> ......
> OK ... I think this is definitely a PCI specific API ... and probably a
> generic one for requesting order flushes in devices that have negotiated
> relaxed ordering. Do you want to start a new thread on linux-pci and cc
> me?
>

I'll do this.

--
Arthur

2007-08-23 05:59:17

by Jeremy Higdon

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

On Wed, Aug 22, 2007 at 09:39:36AM +0200, Jes Sorensen wrote:
> James Bottomley wrote:
> >On Tue, 2007-08-21 at 17:34 -0700, [email protected] wrote:
> >>The term "posted DMA" is used to describe this behavior in the Altix
> >>Device Driver Writer's Guide, but it may be confusing things here.
> >>Maybe a better term will suggest itself if I can clarify....
> >
> >OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
> >the confusion.
>
> Maybe it would be better to refer to this as 'out of order DMA'?
>
> >>On Altix, DMA from a device isn't guaranteed to arrive in host memory
> >>in the order it was sent from the device. This reordering can happen
> >>in the NUMA interconnect (it's specifically not a PCI reordering.)
> >
> >This is mmiowb and read_relaxed() again, isn't it?
>
> I believe it's the same problem, except this time it's when exposing
> structures to userland.

Actually, it's a different problem, but with a similar cause.

mmiowb() is for the case where PIO (or MMIO) write order from two different
CPUs can invert somewhere in the NUMAlink fabric.

read_relaxed() is a performance optimization that skips the flush which is
otherwise necessary to avoid inversion in order between a PIO (or MMIO)
read and a DMA write.

What Arthur's trying to do here is avoid inversion in the order of two
DMA writes.

jeremy
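
For contrast, the mmiowb() case orders MMIO writes issued by different
CPUs, typically around a lock; standard usage of the era looks like
this (dev_lock, dev_mmio, and CMD_REG are hypothetical names):

	spin_lock(&dev_lock);
	writel(cmd, dev_mmio + CMD_REG);	/* posted MMIO write */
	mmiowb();	/* ensure it reaches the device before writes
			 * done by the next CPU to take the lock */
	spin_unlock(&dev_lock);

None of that helps with Arthur's case, where both racing writes
originate from the device.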

2007-08-23 08:47:23

by Jes Sorensen

Subject: Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

James Bottomley wrote:
> Hmm, didn't see the email ... but I'm probably not cc'd on all the
> thread. However ... it isn't that you couldn't do it ... it's that you
> don't want to do it because it's faster to violate the spec ... like all
> those nice ATA devices that lie about having a cache and then let you
> power down with uncommitted data still in it ... they work much faster
> for HDIO tests ... and who ever switches their box off?

James,

I didn't do it, I don't know who did it, but sure we can try and track
them down and line them up outside .....

Point is that this is how the chips were done and they are out there in
numbers. It makes things a *lot* faster to violate the spec in such a
system; yes, it sucks, but that's how things are. It wouldn't be the first
time someone violated the PCI spec in their implementation and I am
pretty sure it won't be the last.

Cheers,
Jes