2008-01-08 02:33:58

by Arthur Kepner

Subject: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines


The following patchset allows additional "attributes" to be
passed to dma_map_*/dma_unmap_* implementations. (The reason
why this is useful/necessary has been mentioned several times,
most recently here:
http://marc.info/?l=linux-kernel&m=119258541412724&w=2.)

This is incomplete in that only ia64 and x86_64 are supported -
the purpose is mainly to give us something specific to discuss.

The approach here is to change the dma_map_* interface so
that the last argument is a u32 which encodes the direction
of the DMA and, optionally, other attributes. Changing the
interface is a bit intrusive, but callers of dma_map_* don't
need to be modified.

There are 3 patches:

[1/3] dma: create linux/dma-direction.h
[2/3] dma: ia64/sn2 allow "attributes" to be used by dma_map_*
[3/3] dma: x86_64 allow "attributes" to be used by dma_map_*

--
Arthur


2008-01-08 16:27:24

by James Bottomley

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Mon, 2008-01-07 at 21:32 -0500, [email protected] wrote:
> The following patchset allows additional "attributes" to be
> passed to dma_map_*/dma_unmap_* implementations. (The reason
> why this is useful/necessary has been mentioned several times,
> most recently here:
> http://marc.info/?l=linux-kernel&m=119258541412724&w=2.)
>
> This is incomplete in that only ia64 and x86_64 are supported -
> the purpose is mainly to give us something specific to discuss.

I'm already on record saying I don't think attributes in the generic
code are the right approach. All of the attributes I can see adding are
bus-specific (even to the extent that PCIe ones wouldn't apply to PCI,
for instance). I really think the right approach is to sort out the
attribute infrastructure and not try to overload the generic DMA API. I
can certainly see that we might like to use it to take advantage of
other PCIe attributes (where available, of course).

Ultimately, it might be the best implementation course to pass the
attributes in to the dma_map routines, but I can't see them being
generic; they'd still be bus-specific. So the correct way is to have a
bus-specific attribute setup routine and pass some sort of opaque
attribute handle (or even just use the existing dma handle).

> The approach here is to change the dma_map_* interface so
> that the last argument is a u32 which encodes the direction
> of the dma and, optionally, other attributes. Changing the
> interface is a bit intrusive, but callers of dma_map_* don't
> need to be modified.
>
> There are 3 patches:
>
> [1/3] dma: create linux/dma-direction.h
> [2/3] dma: ia64/sn2 allow "attributes" to be used by dma_map_*
> [3/3] dma: x86_64 allow "attributes" to be used by dma_map_*

James

P.S. please update my email address: the @steeleye.com one no longer
works now.

2008-01-08 17:43:17

by Roland Dreier

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

> I'm already on record saying I don't think attributes in the generic
> code is the right approach. All of the attributes I can see adding are
> bus specific (even to the extent that PCIe ones wouldn't apply to PCI
> for instance).

I think the case before us that Arthur is dealing with is a
counterexample for this: there's nothing bus-specific about it at all.
The issue is related to reordering of DMAs within the Altix system
fabric, after they've left the PCI world. This issue would be present
no matter what kind of host bridge you hook up to the system fabric,
be it PCI-X, PCIe, ISA, MCA or whatever.

- R.

2008-01-08 17:52:17

by Christoph Hellwig

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Mon, Jan 07, 2008 at 06:32:22PM -0800, [email protected] wrote:
>
> The following patchset allows additional "attributes" to be
> passed to dma_map_*/dma_unmap_* implementations. (The reason
> why this is useful/necessary has been mentioned several times,
> most recently here:
> http://marc.info/?l=linux-kernel&m=119258541412724&w=2.)
>
> This is incomplete in that only ia64 and x86_64 are supported -
> the purpose is mainly to give us something specific to discuss.
>
> The approach here is to change the dma_map_* interface so
> that the last argument is a u32 which encodes the direction
> of the dma and, optionally, other attributes. Changing the
> interface is a bit intrusive, but callers of dma_map_* don't
> need to be modified.

One thing I've missed with these patches is drivers actually using
it. What driver actually needs it, and why don't you send patches
for them?

2008-01-08 17:54:28

by James Bottomley

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines


On Tue, 2008-01-08 at 09:42 -0800, Roland Dreier wrote:
> > I'm already on record saying I don't think attributes in the generic
> > code is the right approach. All of the attributes I can see adding are
> > bus specific (even to the extent that PCIe ones wouldn't apply to PCI
> > for instance).
>
> I think the case before us that Arthur is dealing with is a
> counterexample for this: there's nothing bus-specific about it at all.
> The issue is related to reordering of DMAs within the Altix system
> fabric, after they've left the PCI world. This issue would be present
> no matter what kind of host bridge you hook up to the system fabric,
> be it PCI-X, PCIe, ISA, MCA or whatever.

But it is: for performance reasons, the Altix boxes have a rather
non-standard PCI bridge implementation that gives relaxed ordering on
the PCI bus. This behaviour was later standardised to some degree in
PCIe, so you could argue they actually have an Altix-specific PCI bus
(PCIa, anyone?). Regardless, other manufacturers are probably going to
demand something equivalent to this based on the PCIe standard, so we
should be ready for it, hence the desire for bus-specific attributes.

James

2008-01-08 17:55:45

by Roland Dreier

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

> One thing I've missed with these patches is drivers actually using
> it. What driver actually needs it, and why don't you send patches
> for them?

In a previous patch series, Arthur sent fixes for the mthca IB driver.
Other IB drivers like mlx4 also need this on Altix systems. Basically,
anything where the device DMAs some data into userspace and then DMAs
a "completed" status to another buffer in userspace has a problem: the
"completed" DMA might pass the actual data in the Altix fabric, leaving
the userspace process working with bogus data.

- R.

2008-01-08 18:06:01

by Roland Dreier

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

> > I think the case before us that Arthur is dealing with is a
> > counterexample for this: there's nothing bus-specific about it at all.
> > The issue is related to reordering of DMAs within the Altix system
> > fabric, after they've left the PCI world. This issue would be present
> > no matter what kind of host bridge you hook up to the system fabric,
> > be it PCI-X, PCIe, ISA, MCA or whatever.

> But it is: for performance reasons, the Altix boxes have a rather
> non-standard PCI bridge implementation that gives relaxed ordering on
> the PCI bus.

I don't think this is accurate. As I understand things, the reordering
happens within the Altix system interconnect -- nothing to do with the
PCI bridge hanging off this fabric. It is "platform" behavior, and I
think it is properly handled within the dma_ API, which exists to
abstract platforms.

> This behaviour was later standardised to some degree in PCIe,
> so you could argue they actually have an altix specific PCI bus (PCIa
> anyone?). Regardless, other manufacturers are probably going to demand
> something equivalent to this based on the PCIe standard, so we should be
> ready for it, hence the desire for the bus specific attributes.

But:
a) The Altix has PCI-X, not PCIe, so having something PCIe-specific
is not a solution for this case; and
b) the PCIe behavior is opt-in, in the sense that you have to
specifically ask for looser ordering, while the Altix is loosely
ordered unless you ask for this "flush" property. So I don't
think the same attribute will work for both cases.

- R.

2008-01-08 18:15:22

by Arthur Kepner

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Tue, Jan 08, 2008 at 10:27:08AM -0600, James Bottomley wrote:

> .... All of the attributes I can see adding are
> bus specific (even to the extent that PCIe ones wouldn't apply to PCI
> for instance)....
>

The attribute I want to pass isn't bus-specific, but architecture-
specific.

--
Arthur

2008-01-08 18:22:09

by James Bottomley

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines


On Tue, 2008-01-08 at 10:05 -0800, Roland Dreier wrote:
> > > I think the case before us that Arthur is dealing with is a
> > > counterexample for this: there's nothing bus-specific about it at all.
> > > The issue is related to reordering of DMAs within the Altix system
> > > fabric, after they've left the PCI world. This issue would be present
> > > no matter what kind of host bridge you hook up to the system fabric,
> > > be it PCI-X, PCIe, ISA, MCA or whatever.
>
> > But it is: for performance reasons, the Altix boxes have a rather
> > non-standard PCI bridge implementation that gives relaxed ordering
> > on the PCI bus.
>
> I don't think this is accurate. As I understand things, the reordering
> happens within the Altix system interconnect

Which would be a platform bus, rather like the Runway bus in PA-RISC
systems ...

> -- nothing to do with the
> PCI bridge hanging off this fabric. It is "platform" behavior and I
> think is properly handled within the dma_ API, which exists to
> abstract platforms.


> > This behaviour was later standardised to some degree in PCIe,
> > so you could argue they actually have an altix specific PCI bus (PCIa
> > anyone?). Regardless, other manufacturers are probably going to demand
> > something equivalent to this based on the PCIe standard, so we should be
> > ready for it, hence the desire for the bus specific attributes.
>
> But:
> a) The Altix has PCI-X, not PCIe, so having something PCIe-specific
> is not a solution for this case; and
> b) the PCIe behavior is opt-in, in the sense that you have to
> specifically ask for looser ordering, while the Altix is loosely
> ordered unless you ask for this "flush" property. So I don't
> think the same attribute will work for both cases.

But the point is that the Altix does something non-standard which was
later standardised (in a different way), largely so others could also
benefit from the relaxed-ordering speedup.

I agree the Altix needs something; what I don't agree with is that we
should be overloading the DMA data direction to do it ... especially
given the confetti-like shower of other bus attributes waiting in the
wings.

What it wants to do is set strict ordering for the bus ... well, that is
an attribute in the PCIe standard (it just happens to be the default one
for a standard bus, whereas relaxed is the default for Altix). However,
setting the strict bus attribute would be a simple no-op for a standard
bus (and setting the relaxed attribute a no-op for Altix).

James

2008-01-08 18:24:46

by Arthur Kepner

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Tue, Jan 08, 2008 at 05:50:54PM +0000, Christoph Hellwig wrote:
> ...
> One thing I've missed with these patches is drivers actually using
> it. What driver actually needs it, and why don't you send patches
> for them?

Some IB drivers need this. Here's an example with the ib_mthca
driver.

--

drivers/infiniband/core/umem.c | 10 +++++----
drivers/infiniband/hw/mthca/mthca_provider.c | 16 +++++++++++++-
drivers/infiniband/hw/mthca/mthca_user.h | 9 +++++++-
include/rdma/ib_umem.h | 4 +--
include/rdma/ib_verbs.h | 30 ++++++++++++++++-----------
5 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 4e3128f..709345f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -35,7 +35,7 @@
*/

#include <linux/mm.h>
-#include <linux/dma-mapping.h>
+#include <linux/dma-direction.h>
#include <linux/sched.h>
#include <linux/hugetlb.h>

@@ -72,9 +72,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
* @addr: userspace virtual address to start at
* @size: length of region to pin
* @access: IB_ACCESS_xxx flags for memory being pinned
+ * @dmabarrier: set "dmabarrier" attribute on this memory
*/
struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access)
+ size_t size, int access, int dmabarrier)
{
struct ib_umem *umem;
struct page **page_list;
@@ -87,6 +88,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
int ret;
int off;
int i;
+ u32 flags = dma_flags_set_dir_attr(DMA_BIDIRECTIONAL,
+ dmabarrier ? DMA_ATTR_BARRIER : 0);

if (!can_do_mlock())
return ERR_PTR(-EPERM);
@@ -176,8 +179,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,

chunk->nmap = ib_dma_map_sg(context->device,
&chunk->page_list[0],
- chunk->nents,
- DMA_BIDIRECTIONAL);
+ chunk->nents, flags);
if (chunk->nmap <= 0) {
for (i = 0; i < chunk->nents; ++i)
put_page(sg_page(&chunk->page_list[i]));
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 6bcde1c..7907671 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1017,17 +1017,31 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
struct mthca_dev *dev = to_mdev(pd->device);
struct ib_umem_chunk *chunk;
struct mthca_mr *mr;
+ struct mthca_reg_mr ucmd;
u64 *pages;
int shift, n, len;
int i, j, k;
int err = 0;
int write_mtt_size;

+ if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+ return ERR_PTR(-EFAULT);
+
mr = kmalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);

- mr->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+#define _DEBUG_IT_
+#ifdef _DEBUG_IT_
+ {
+ printk(KERN_CRIT "%s: ucmd.mr_attrs & MTHCA_MR_DMAFLUSH = %d\n",
+ __FUNCTION__, ucmd.mr_attrs & MTHCA_MR_DMAFLUSH);
+ }
+#endif /* _DEBUG_IT_ */
+
+ mr->umem = ib_umem_get(pd->uobject->context, start, length, acc,
+ ucmd.mr_attrs & MTHCA_MR_DMAFLUSH);
+
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
goto err;
diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h
index 02cc0a7..76effe3 100644
--- a/drivers/infiniband/hw/mthca/mthca_user.h
+++ b/drivers/infiniband/hw/mthca/mthca_user.h
@@ -41,7 +41,7 @@
* Increment this value if any changes that break userspace ABI
* compatibility are made.
*/
-#define MTHCA_UVERBS_ABI_VERSION 1
+#define MTHCA_UVERBS_ABI_VERSION 2

/*
* Make sure that all structs defined in this file remain laid out so
@@ -61,6 +61,13 @@ struct mthca_alloc_pd_resp {
__u32 reserved;
};

+struct mthca_reg_mr {
+ __u32 mr_attrs;
+#define MTHCA_MR_DMAFLUSH 0x1 /* flush in-flight DMA on a write to
+ * memory region */
+ __u32 reserved;
+};
+
struct mthca_create_cq {
__u32 lkey;
__u32 pdn;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 2229842..ac3542e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -62,7 +62,7 @@ struct ib_umem_chunk {
#ifdef CONFIG_INFINIBAND_USER_MEM

struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access);
+ size_t size, int access, int dmabarrier);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);

@@ -72,7 +72,7 @@ int ib_umem_page_count(struct ib_umem *umem);

static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
unsigned long addr, size_t size,
- int access) {
+ int access, int dmabarrier) {
return ERR_PTR(-EINVAL);
}
static inline void ib_umem_release(struct ib_umem *umem) { }
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 11f3960..2d7bb9e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1483,11 +1483,12 @@ static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr)
*/
static inline u64 ib_dma_map_single(struct ib_device *dev,
void *cpu_addr, size_t size,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
return dev->dma_ops->map_single(dev, cpu_addr, size, direction);
- return dma_map_single(dev->dma_device, cpu_addr, size, direction);
+ return dma_map_single(dev->dma_device, cpu_addr, size, flags);
}

/**
@@ -1499,12 +1500,13 @@ static inline u64 ib_dma_map_single(struct ib_device *dev,
*/
static inline void ib_dma_unmap_single(struct ib_device *dev,
u64 addr, size_t size,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
dev->dma_ops->unmap_single(dev, addr, size, direction);
else
- dma_unmap_single(dev->dma_device, addr, size, direction);
+ dma_unmap_single(dev->dma_device, addr, size, flags);
}

/**
@@ -1519,11 +1521,12 @@ static inline u64 ib_dma_map_page(struct ib_device *dev,
struct page *page,
unsigned long offset,
size_t size,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
return dev->dma_ops->map_page(dev, page, offset, size, direction);
- return dma_map_page(dev->dma_device, page, offset, size, direction);
+ return dma_map_page(dev->dma_device, page, offset, size, flags);
}

/**
@@ -1535,12 +1538,13 @@ static inline u64 ib_dma_map_page(struct ib_device *dev,
*/
static inline void ib_dma_unmap_page(struct ib_device *dev,
u64 addr, size_t size,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
dev->dma_ops->unmap_page(dev, addr, size, direction);
else
- dma_unmap_page(dev->dma_device, addr, size, direction);
+ dma_unmap_page(dev->dma_device, addr, size, flags);
}

/**
@@ -1552,11 +1556,12 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
*/
static inline int ib_dma_map_sg(struct ib_device *dev,
struct scatterlist *sg, int nents,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
return dev->dma_ops->map_sg(dev, sg, nents, direction);
- return dma_map_sg(dev->dma_device, sg, nents, direction);
+ return dma_map_sg(dev->dma_device, sg, nents, flags);
}

/**
@@ -1568,12 +1573,13 @@ static inline int ib_dma_map_sg(struct ib_device *dev,
*/
static inline void ib_dma_unmap_sg(struct ib_device *dev,
struct scatterlist *sg, int nents,
- enum dma_data_direction direction)
+ u32 flags)
{
+ enum dma_data_direction direction = dma_flags_get_dir(flags);
if (dev->dma_ops)
dev->dma_ops->unmap_sg(dev, sg, nents, direction);
else
- dma_unmap_sg(dev->dma_device, sg, nents, direction);
+ dma_unmap_sg(dev->dma_device, sg, nents, flags);
}

/**


--
Arthur

2008-01-09 01:04:58

by Arthur Kepner

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Tue, Jan 08, 2008 at 12:21:44PM -0600, James Bottomley wrote:
> ...
> But the point is that the Altix does something non-standard but which
> was later standardised (in a different way) largely so others could also
> benefit from the relaxed ordering speedup.
>

When you say that Altix does "something ... which was later
standardized", what is "something"?

The thing that I'm trying to address here is the reordering
that may occur within the NUMA fabric. As far as I'm aware
there's no standard for that.

--
Arthur

2008-01-09 21:00:53

by Roland Dreier

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

> What it wants to do is set strict ordering for the bus ... well, that is
> an attribute in the PCIe standard (it just happens to be the default one
> for a standard bus, whereas relaxed is the default for altix). However,
> set bus attribute strict would be a simple no-op for a standard bus (and
> set attribute relaxed a no-op for altix).

I don't think that this can work. As Arthur and I have said several
times, the Altix issue is within the system NUMA interconnect --
nothing to do with the PCI bus. And as I understand things, to get
good performance, we have to allow reordering within the NUMA
interconnect except that certain transactions need to flush all
earlier writes. So this attribute needs to be set per-mapping, not
per-bus.

If you have a cleaner way to specify this attribute than Arthur's
idea, then it would be very useful to share it. If we were starting
from scratch, then probably adding another "flags" parameter to the
DMA mapping functions would be the way to go, but given the number of
calls to dma_map_xxx all around, changing the signature now doesn't
look very appealing.

The reason this hasn't been an issue until now is that almost all
drivers work correctly if the Altix code just sets the "flush" bit for
memory allocated via the consistent/coherent allocators. However, if
we want the device to write to userspace memory, this doesn't work
(and mapping coherent memory allocated in the kernel into userspace is
a mess on other platforms, because it unnecessarily consumes lowmem
and/or kernel address space).

Also, all of this is quite separate from the PCIe "loose ordering"
stuff. In fact, it's quite conceivable that SGI could hook up a PCIe
3.0 host bridge to the Altix NUMA interconnect, so that you could have
a PCI bus that supported loose ordering and also a system interconnect
that allowed a different type of reordering too.

- R.

2008-01-09 21:07:27

by Arthur Kepner

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Wed, Jan 09, 2008 at 01:00:38PM -0800, Roland Dreier wrote:
> .....
> The reason this hasn't been an issue until now is that almost all
> drivers work correctly if the Altix code just sets the "flush" bit for
> memory allocated via the consistent/coherent allocators. However, if
> we want the device to write to userspace memory, this doesn't work
> (and mapping coherent memory allocated in the kernel into userspace is
> a mess on other platforms, because it unnecessarily consumes lowmem
> and/or kernel address space).
>

And the only way that user-level CQs work on Altix now is
that we apply our own patches to allocate them in the
kernel (with dma_alloc_coherent()) and then mmap() them back
to user space. A bug in one of these patches recently led
to considerable drama with several customers - I'd love
to get a fix into mainline so we can drop our patches and
avoid the possibility of such theatrics in the future ;-)

--
Arthur

2008-01-09 21:31:18

by James Bottomley

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines


On Wed, 2008-01-09 at 13:00 -0800, Roland Dreier wrote:
> > What it wants to do is set strict ordering for the bus ... well, that is
> > an attribute in the PCIe standard (it just happens to be the default one
> > for a standard bus, whereas relaxed is the default for altix). However,
> > set bus attribute strict would be a simple no-op for a standard bus (and
> > set attribute relaxed a no-op for altix).
>
> I don't think that this can work. As Arthur and I have said several
> times, the Altix issue is within the system NUMA interconnect --
> nothing to do with the PCI bus.

Attributes are supposed to be in force all the way from the bus domain
to the CPU domain. That means that all of the strange busses you
traverse are theoretically covered by the same attribute: if you're
spec compliant, the PCI spec doesn't let you re-order just because you
traverse HyperTransport to get into the CPU domain; HyperTransport has
to enforce the same ordering as PCI.

> And as I understand things, to get
> good performance, we have to allow reordering within the NUMA
> interconnect except that certain transactions need to flush all
> earlier writes. So this attribute needs to be set per-mapping, not
> per-bus.

I don't believe I said a bus attribute wouldn't be per mapping. Most
PCI attributes are set per transaction. The correct paradigm for us
looks to be per device, per mapping ... which is correct ... it's from
the CPU to the device regardless of the busses traversed.

> If you have a cleaner way to specify this attribute that Arthur's
> idea, then it would be very useful to share that. If we were starting
> from scratch, then probably adding another "flags" parameter to the
> DMA mapping functions would be the way to go, but given the number of
> calls to dma_map_xxx all around, changing the signature now doesn't
> look very appealing.

My main objection is the direction overloading ... given the number of
potential attributes, we run out of bits fast.

However, given that very few device drivers actually want to modify the
attributes away from the default, I think all you need is an opaque
attribute structure

struct bus_attribute;

methods for initialising and setting it

struct bus_attribute attr = BUS_ATTRIBUTE(blah|blah);

bus_set_attribute(&attr, blah)

and a *new* API for taking it.

dma_map/unmap_sg/single_attr()

That way, we can support virtually any weird and wonderful attribute,
and because it's done via an opaque structure, we can never run out of
bits. This will work both for the Altix bus and the PCIe bus (and any
other future bus type).

Also, there can be a generic implementation of the _attr() calls, so
only architectures actually implementing them need be updated.

> The reason this hasn't been an issue until now is that almost all
> drivers work correctly if the Altix code just sets the "flush" bit for
> memory allocated via the consistent/coherent allocators. However, if
> we want the device to write to userspace memory, this doesn't work
> (and mapping coherent memory allocated in the kernel into userspace is
> a mess on other platforms, because it unnecessarily consumes lowmem
> and/or kernel address space).

Yes, I understand the reasons; I just want the API to suit non-Altix
systems too.

> Also, all of this is quite separate from the PCIe "loose ordering"
> stuff. In fact, it's quite conceivable that SGI could hook up a PCIe
> 3.0 host bridge to the Altix NUMA interconnect, so that you could have
> a PCI bus that supported loose ordering and also a system interconnect
> that allowed a different type of reordering too.

Actually, no ... there'd just be an even weirder attribute, I suspect.
The attributes will be set per *transaction*, not per bus. A transaction
is an operation (DMA, PIO, config space, etc.) from the actual bus to
the CPU. It doesn't matter how many physical bus segments this traverses
or whether they're different bus types; all that matters for the
attributes are the characteristics that are CPU-visible.

James

2008-01-11 18:21:17

by Grant Grundler

Subject: Re: [RFC/PARTIAL PATCH 0/3] dma: passing "attributes" to dma_map_* routines

On Wed, Jan 09, 2008 at 03:30:58PM -0600, James Bottomley wrote:
...
> > Also, all of this is quite separate from the PCIe "loose ordering"
> > stuff. In fact, it's quite conceivable that SGI could hook up a PCIe
> > 3.0 host bridge to the Altix NUMA interconnect, so that you could have
> > a PCI bus that supported loose ordering and also a system interconnect
> > that allowed a different type of reordering too.
>
> Actually, no ... there'd just be an even weirder attribute, I suspect.
> The attributes will be set per *transaction* not per bus. A transaction
> is an operation (DMA, PIO, config space, etc) from the actual bus to the
> CPU. It doesn't matter how many physical bus segments this traverses
> and whether they're different bus types; all that matters for the
> attributes are the characteristics that are CPU visible.

James is right. Setting the PCIe "Relaxed Ordering" bit in config space
allows the device to set "Relaxed Ordering" in specific DMA transactions.
Upstream bridges may choose to ignore the bit, or to follow well-defined
transaction flushing/ordering rules if they do implement "Relaxed
Ordering". The key thing here is that the device decides when to set the
RO bit in each transaction. This is completely different from the
reordering which occurs on the Altix NUMA bus for _all_ (default config)
transactions.

Given that SGI/Altix only cares about a limited number of drivers, and
only a subset of those support RDMA (or something like it), would it be
reasonable to add the new API to the RDMA code instead of the generic
DMA API?

Please don't shoot me for suggesting that...just thinking "out loud" here.

thanks,
grant