2006-11-08 01:54:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: DMA APIs gumble grumble

While looking into some bugs in the powerpc DMA implementation
(regarding DMA masks) I had a hard time figuring out what was the
"right" thing to do in a couple of places...

In fact, I found out that what I though was it didn't seem to match some
bits we have in Documentation/ and in drivers, and I'm starting to
wonder what I missed there ;-)

In addition, it seems that the "PCI" API and the generic "DMA" API have
been diverging in several areas.

So let me first describe my understanding and point out where it seems
to diverge from what is in there:

[ Sorry if the rant is a bit long, near the end I do propose
a few things, mostly asking if people agree it's ok, I'll
happily work on patches if the answers are yes
]

* First the very basic: the DMA good old mask

A driver calls dma_set_mask() providing a mask that represents the range
of addresses the device is capable of addressing for DMA. That means
that the implementation for this function should succeed if it can
guarantee that returned dma addresses from the consistent alloc and
dma_map_* APIs will always fit within that mask.

So far, that's good and that's documented that way.

However, the PCI documentation then adds:

<<<<
Another common scenario is a 64-bit capable device. The approach
here is to try for 64-bit DAC addressing, but back down to a
32-bit mask should that fail. The PCI platform code may fail the
64-bit mask not because the platform is not capable of 64-bit
addressing. Rather, it may fail in this case simply because
32-bit SAC addressing is done more efficiently than DAC addressing.
Sparc64 is one platform which behaves in this way.

Here is how you would handle a 64-bit capable device which can drive
all 64-bits when accessing streaming DMA:

int using_dac;

if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
using_dac = 1;
} else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
using_dac = 0;
} else {
printk(KERN_WARNING
"mydev: No suitable DMA available.\n");
goto ignore_this_device;
}
>>>>

So on one side, the platform should fail if the device mask is too
small, on the other side, it should fail of the device mask is too
big...

I'm sorry Dave, but I really don't like that... it's smells like you
added something unrelated to the semantics of pci_set_dma_mask() which
is basically what your platform would prefer doing. That mixes two
different semantics in one call and I find that confusing at best, but
then, I'm a bit lame so :)

Also, it's not documented that way in the non-PCI "DMA" API
documentation which, however, has this additional call:

<<<<
u64 dma_get_required_mask(struct device *dev)

After setting the mask with dma_set_mask(), this API returns the
actual mask (within that already set) that the platform actually
requires to operate efficiently. Usually this means the returned mask
is the minimum required to cover all of memory. Examining the
required mask gives drivers with variable descriptor sizes the
opportunity to use smaller descriptors as necessary.
>>>>

Which seems to do exactly what your want ... There's a default
implementation in drivers/base/platform.c (yeah, I wouldn't have found
it without grep, what the heck is it doing there ?) but it can be
overriden by the arch.

So should we define a pci_* version of it and change the doc to use that
instead ? What about drivers (there's a few) using the construct
documented above instead of that ? how long should we "support" the old
way if we decide to go that route ?

Also, in general, the whole thing is a mess in the sense that some of
the routines are implemented as pci_* in the PCI core with an ifdef so
arch can replace them, and we have a generic header that archs can use
to implement dma_* on top of pci_*. We also have a header for archs like
powerpc or x86 that implement the base DMA API to define the PCI one on
top of that, but it doesn't expose that call

So now, this new call is done the other way around ... and in a file
that doesn't seem to have anything to do with DMA mappings whatsoever.

Which means that we are in a situation where some calls are implemented
a pci_* in the PCI core (with ugly ifdefs to allow replacing them) and
some are implemented as dma_* in some unrelated files, and there seem to
be no attempt at consistency here.

* Now the new kid in town: The coherent DMA mask

So that's new and for some reason, I missed it's introduction. My fault.
The basic idea looks interesting enough and might even be useful for
embedded powerpc's where coherent memory has to be allocated out of a
special non-cacheable pool.

But first, a problem ! This is only every documented in the PCI DMA API
document, there is no mention of it in the "generic" DMA API. This
second example of divergence between those is annoying especially since
on powerpc, the PCI DMA API is just a wrapper on the DMA API. It's also
not implemented in any of the generic DMA API implementation I've seen
unless I missed something nor by any of the helper headers we have to
implement one of them based on the other. Unless I mised it ...

It's especially non-sensical since the infrastructure to handle that
coherent/consistent DMA mask (pick your preferred terminology, we mix
both in the kernel) is in ... struct device, and thus directly available
to the generic DMA API and in no way PCI specific.

I suppose that's easy to fix and I'll implement it as a generic DMA API
call on powerpc and possibly fix the documentation, and I can even fix
asm-generic/dma-mapping.h for the platforms who implement the generic
DMA API on top of the PCI one (yeah, it sucks).

Now, there are still a few weird things with the implementation there:

First, dma_mask is a pointer in struct device (for some historical
reason that i never understood btw... Why didn't just move the field
there ?) while the coherent mask is directly in the struct device (which
is funny since only a PCI API is exposed for it)

Then we have just added other things like dma_coherent_mem in struct
device, though it's an arch specific data structure. I've been wanting
to have DMA specific bits in there for ages and been hijacking
firmware_data lately, if I only knew I could just add an arch specific
bit there and nobody would complain...

Anyway, we now have this thing and 3 archs making use of it (x86, v32
and frv) though I haven't found where frv actually defines the structure
(and the two others defining it to the exact same thing).

I could go on and on ... It seems to be that we created an absolute
mightmare of a mess here.

Now, the question is how to get out of that situation. First:

- Do we agree that everybody should implement the dma_* API and that
the pci_ one should just be a wrapper on top of the former that takes
pci_dev's as argument ? A quick grep shows that the arch using
asm-generic/dma-mapping.h (that is implementing the dma_* ones on top of
the pci_* ones) are:

- m68knommu
- sparc
- v850

Which looks to me like a ... vast minority :-) Should be easy enough to
fix them unless there is something blocking. Dave, is it ok for sparc ?

I can start making patches I suppose some time next week, maybe earlier.

- Do we agree that dma_mask should be fully moved to struct device once
for all instead of being a pointer in there ?

- If we agree for both things above, then we can remove
pci_set_dma_mask() and pci_set_consistent_dma_mask() from the PCI code
and have them done as dma_*. We can provide default implementations of
them along with dma_get_required_mask() (sitll with an #ifdef mecanism
for overriding them, though I think it's cleaner to use the function
name as a #define rather than some ARCH_HAVE_* thingy) and all of that
be put in drivers/base/dma.c. Fixing up archs that still define their
own pci_* should be fairly easy.

- For platforms like powerpc where I can have multiple busses on
different IOMMU's and with possibly different DMA ops, I really need to
have a per-device data structure for use by the DMA ops (in fact, in my
case, containing the DMA ops themselves). Right now, I defined a notion
of "device extension" (struct device_ext) that my current implementation
puts in device->firmware_data (don't look for it upstream, it's all
patches queuing up for 2.6.20 and about to go into powerpc.git), but
that I'd like to have flat in struct device instead. Would it be agreed
that linux/device.h includes itself an asm/device.h which contains a
definition for a struct device_ext that is within struct device ? That
would also avoid a pointer indirection which is a good thing for DMA
operations

- Since that dma_coherent_mem structure seems to want to be platform
specific, them that struct device_ext I described above would be the
natural place to put it in.

So let me know if you agree with the proposals above, and I'll start
doing patches that, in order, will:

- port m68knommu, sparc and v850 to implement the dma_* API instead of
the pci_* one

- once that's done, rm include/asm-generic/dma-mapping.h (the header
used to implement the dma_* API based on the pci_* one).

- move dma_get_required_mask to a new drivers/base/dma.c and define a
pci_* version in pci-dma-compat.h

- remove pci_set_consistent_dma_mask() and replace it with a dma_*
version in drivers/base/dma.c and update pci-dma-compat.h to define the
pci_* version based on the dma_* version.

- mask dma_mask a normal member of struct device instead of a pointer,
remove pci_set_dma_mask(), make a generic dma_set_dma_mask() in
drivers/base/dma.c that can be replaced by the arch. Fixup all code that
relied on the old bits

- Add a asm/device.h that defines a "struct device_ext" (or device_arch
or whatever other name you guys might prefer. I used "device_ext" on
powerpc but I could change that easily enough. Define it as an empty
structure for all archs except powerpc which has some bits waiting for
it and archs that use the dma_coherent_mem data structure which I'll
them move in there. (and fix up the code accordingly)

Does that look good ?

Cheers,
Ben.



2006-11-08 04:46:53

by David Miller

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

From: Benjamin Herrenschmidt <[email protected]>
Date: Wed, 08 Nov 2006 12:54:37 +1100

> However, the PCI documentation then adds:
>
> <<<<
> Another common scenario is a 64-bit capable device. The approach
> here is to try for 64-bit DAC addressing, but back down to a
> 32-bit mask should that fail. The PCI platform code may fail the
> 64-bit mask not because the platform is not capable of 64-bit
> addressing. Rather, it may fail in this case simply because
> 32-bit SAC addressing is done more efficiently than DAC addressing.
> Sparc64 is one platform which behaves in this way.
>
> Here is how you would handle a 64-bit capable device which can drive
> all 64-bits when accessing streaming DMA:
>
> int using_dac;
>
> if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
> using_dac = 1;
> } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
> using_dac = 0;
> } else {
> printk(KERN_WARNING
> "mydev: No suitable DMA available.\n");
> goto ignore_this_device;
> }
> >>>>
>
> So on one side, the platform should fail if the device mask is too
> small, on the other side, it should fail of the device mask is too
> big...
>
> I'm sorry Dave, but I really don't like that... it's smells like you
> added something unrelated to the semantics of pci_set_dma_mask() which
> is basically what your platform would prefer doing. That mixes two
> different semantics in one call and I find that confusing at best, but
> then, I'm a bit lame so :)

There aren't many sane alternatives. We created even a special set of
DAC interfaces to the PCI DMA API for when "you really really want DAC
addressing."

The only time you really must have it is for things that want to
map all of memory, such as PCI clustering cards like the ones made
by Dolphin. For those things, the limitations of what the IOMMU can
map at once is unacceptable and thus using DAC is mandatory even if
it is slower.

Using DAC cycles on sparc64, when they aren't necessary and SAC works
fine, is a matter of 5 to 10 times worse performance. Absolutely no
caching and prefetching can be performed by the PCI controllers on
sparc64 if you use DAC addresses.

> Also, it's not documented that way in the non-PCI "DMA" API
> documentation which, however, has this additional call:
>
> <<<<
> u64 dma_get_required_mask(struct device *dev)
>
> After setting the mask with dma_set_mask(), this API returns the
> actual mask (within that already set) that the platform actually
> requires to operate efficiently. Usually this means the returned mask
> is the minimum required to cover all of memory. Examining the
> required mask gives drivers with variable descriptor sizes the
> opportunity to use smaller descriptors as necessary.
> >>>>
>
> Which seems to do exactly what your want ... There's a default
> implementation in drivers/base/platform.c (yeah, I wouldn't have found
> it without grep, what the heck is it doing there ?) but it can be
> overriden by the arch.

It's great that the generic DMA api creators decided to do something
different then never bother to suggest adjusting the PCI DMA
interfaces to suit. You cannot blame me for that. :-)

> So should we define a pci_* version of it and change the doc to use that
> instead ? What about drivers (there's a few) using the construct
> documented above instead of that ? how long should we "support" the old
> way if we decide to go that route ?

For this very reason I think we have to keep the PCI DMA interfaces
in this area as they are. You aren't going to be able to rewrite
all the drivers out there already using this stuff, and it took long
enough to get people to use the current scheme properly :)

> Also, in general, the whole thing is a mess in the sense that some of
> the routines are implemented as pci_* in the PCI core with an ifdef so
> arch can replace them, and we have a generic header that archs can use
> to implement dma_* on top of pci_*.

That very header does not work at all, it should be deleted because
it causes bugs.

One such problem that "dma_* in terms of pci_*" header causes is that
it ignored the gfp_t arguments to several routines. This results in
massive memory corruption with the sound/ layer which passes GFP_COMP
down to dma_alloc_consistent() and friends.

I had to fix this on sparc64. :-)

> We also have a header for archs like
> powerpc or x86 that implement the base DMA API to define the PCI one on
> top of that, but it doesn't expose that call

I'm gone away from this direction because the &pci_bus_type comparison
that code does it just rediculious overhead for what should be very
simple and straightforward :-)

> * Now the new kid in town: The coherent DMA mask
>
> So that's new and for some reason, I missed it's introduction. My fault.
> The basic idea looks interesting enough and might even be useful for
> embedded powerpc's where coherent memory has to be allocated out of a
> special non-cacheable pool.
>
> But first, a problem ! This is only every documented in the PCI DMA API
> document, there is no mention of it in the "generic" DMA API.

Something else I didn't personally create. You will notice the
parts I actually designed and implemented are documented accurately.


> Now, the question is how to get out of that situation. First:
>
> - Do we agree that everybody should implement the dma_* API and that
> the pci_ one should just be a wrapper on top of the former that takes
> pci_dev's as argument ? A quick grep shows that the arch using
> asm-generic/dma-mapping.h (that is implementing the dma_* ones on top of
> the pci_* ones) are:
>
> - m68knommu
> - sparc
> - v850

sparc64 used to, but as stated above it was so buggy that I no longer
could do so.

I think what I did for sparc64 is what I would do to make sparc 32-bit
no longer use asm-generic/dma-mapping.h

Really, on both sparc platforms, I have zero reason to implement
the generic DMA API to support anything other than PCI. :-)

> - For platforms like powerpc where I can have multiple busses on
> different IOMMU's and with possibly different DMA ops, I really need to
> have a per-device data structure for use by the DMA ops (in fact, in my
> case, containing the DMA ops themselves). Right now, I defined a notion
> of "device extension" (struct device_ext) that my current implementation
> puts in device->firmware_data (don't look for it upstream, it's all
> patches queuing up for 2.6.20 and about to go into powerpc.git), but
> that I'd like to have flat in struct device instead. Would it be agreed
> that linux/device.h includes itself an asm/device.h which contains a
> definition for a struct device_ext that is within struct device ? That
> would also avoid a pointer indirection which is a good thing for DMA
> operations

Maybe we can put a void * pointer there for "sysdata" like we do
in the pci_dev struct and friends. But there are many many people
who are sensitive to the size of struct device, so expect some
resistence :-)

> - port m68knommu, sparc and v850 to implement the dma_* API instead of
> the pci_* one

Please just mirror what I did on sparc64 for sparc32, see changeset
42f142371e48fbc44956d57b4e506bb6ce673cd7, with followup bug fixes
in 36321426e320c2c6bc2f8a1587d6f4d695fca84c and
7233589d77fdb593b482a8b7ee867e901f54b593.

Thanks.

2006-11-08 05:23:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble


> There aren't many sane alternatives. We created even a special set of
> DAC interfaces to the PCI DMA API for when "you really really want DAC
> addressing."
>
> The only time you really must have it is for things that want to
> map all of memory, such as PCI clustering cards like the ones made
> by Dolphin. For those things, the limitations of what the IOMMU can
> map at once is unacceptable and thus using DAC is mandatory even if
> it is slower.
>
> Using DAC cycles on sparc64, when they aren't necessary and SAC works
> fine, is a matter of 5 to 10 times worse performance. Absolutely no
> caching and prefetching can be performed by the PCI controllers on
> sparc64 if you use DAC addresses.

Ok. That makes sense.

> It's great that the generic DMA api creators decided to do something
> different then never bother to suggest adjusting the PCI DMA
> interfaces to suit. You cannot blame me for that. :-)

I haven't blamed you, did I ? :-) I did indeed suspect that this new API
came up afterward. I'm mostly asking if it would fit your needs and if I
should add "fixing drivers to use new API" on my list of things to do.

> > So should we define a pci_* version of it and change the doc to use that
> > instead ? What about drivers (there's a few) using the construct
> > documented above instead of that ? how long should we "support" the old
> > way if we decide to go that route ?
>
> For this very reason I think we have to keep the PCI DMA interfaces
> in this area as they are. You aren't going to be able to rewrite
> all the drivers out there already using this stuff, and it took long
> enough to get people to use the current scheme properly :)

Well, we can proceed slowly... like defining a pci_* variant of this new
API, making sure it's properly wired up, and slowly converting drivers
as we meet them. That shouldn't cause any problem.

Then, maybe 6 month, maybe 1 year later, we can change archs that use
the "alternate" semantic like sparc64 to no longer fail
pci_set_dma_mask(64bits).

In fact, the only breakage here would be for those archs to have some
drivers start going slowly, though we could expect drivers to have been
fixed by then.... (And we can delay that second part of the change as
long as deemed necessary).

The important thing is that if we agree that the "new" API is
preferrable, then we at least make sure that new code uses it and we
document it, instead of the previous trick.

> > Also, in general, the whole thing is a mess in the sense that some of
> > the routines are implemented as pci_* in the PCI core with an ifdef so
> > arch can replace them, and we have a generic header that archs can use
> > to implement dma_* on top of pci_*.
>
> That very header does not work at all, it should be deleted because
> it causes bugs.

I'm all about deleting that one (see by list of things below).

> One such problem that "dma_* in terms of pci_*" header causes is that
> it ignored the gfp_t arguments to several routines. This results in
> massive memory corruption with the sound/ layer which passes GFP_COMP
> down to dma_alloc_consistent() and friends.
>
> I had to fix this on sparc64. :-)

Yup, but you didn't fix sparc32 :-) I suppose I can try to do it and ask
Anton for help if things go wrong, though I can't be bothered building a
cross toolchain or getting a box on ebay so I'll rely on your for
testing :-)

> > We also have a header for archs like
> > powerpc or x86 that implement the base DMA API to define the PCI one on
> > top of that, but it doesn't expose that call
>
> I'm gone away from this direction because the &pci_bus_type comparison
> that code does it just rediculious overhead for what should be very
> simple and straightforward :-)

Looking at your code, basically, a typical DMA API implentation for
sparc64 is:

BUG_ON(dev->bus != &pci_bus_type);

return pci_iommu_ops->xxxx(to_pci_dev(dev),...);

And the PCI version is:

return pci_iommu_ops->xxxx(pdev, ...);

I don't really have a problem with you open coding both versions as long
as they stay in sync.

On powerpc, I need to support DMA for other bus types (of_platform among
others) and thus I'm taking a different approach which boils down to put
the "ops" callbacks in the struct device. I'm defining that device
extension thingy (that i'd like to have flat in struct device to avoid a
pointer indirection but right now it's a pointer) and it gets filled
with DMA ops for busses where they make sense.

> Something else I didn't personally create. You will notice the
> parts I actually designed and implemented are documented accurately.

I admit I haven't followed precisely who did what but I will trust you
there. One thing is we should probably merge the 2 documentations. That
is, there is no point in keeping 2 documentations for 2 sets of
functions that are supposed to haev the exact same semantics. In fact,
the documentation for the "generic" API also documents the PCI ones, but
is much less complete than the PCI documentation.

I suppose that means one more item to add to my todo list... merging
those docs.

> > Now, the question is how to get out of that situation. First:
> >
> > - Do we agree that everybody should implement the dma_* API and that
> > the pci_ one should just be a wrapper on top of the former that takes
> > pci_dev's as argument ? A quick grep shows that the arch using
> > asm-generic/dma-mapping.h (that is implementing the dma_* ones on top of
> > the pci_* ones) are:
> >
> > - m68knommu
> > - sparc
> > - v850
>
> sparc64 used to, but as stated above it was so buggy that I no longer
> could do so.

Ok.

> I think what I did for sparc64 is what I would do to make sparc 32-bit
> no longer use asm-generic/dma-mapping.h
>
> Really, on both sparc platforms, I have zero reason to implement
> the generic DMA API to support anything other than PCI. :-)

I see, yup !

> Maybe we can put a void * pointer there for "sysdata" like we do
> in the pci_dev struct and friends. But there are many many people
> who are sensitive to the size of struct device, so expect some
> resistence :-)

Well, maybe we can, as I proposed that a while ago in a mail that never
got a reply :-) However, the reason I much prefer being able to just
define a structure that 'expands' in struct device directly is that I
want to avoid yet another pointer indirection in the hot path of DMA
mappings (the ops structure is already two, I don't want three, in fact,
I may even put the ops "flat" in there if I can deal with include
dependencies nightmare and reduce that to one indirection).

Also, look at what we have in there that we could get rid off that way:

void *platform_data; /* Platform specific data, device
core doesn't touch it */
void *firmware_data; /* Firmware specific data (e.g. ACPI,
BIOS data),reserved for device core*/
struct dma_coherent_mem *dma_mem; /* internal for coherent mem

In that list, we have:

- plaform_data : this is used by platform and isa drivers and should
probably be moved in those structures. There might
be a couple of places abusing that field, but easy
enough to fix. In fact, because it's used by the
above, it cannot be used by the platform/arch at
will as one would imagine.

- firmware_data : last I grepped, that was an ACPI thingy. I'm
currently hijacking it for my device extension but
if my idea is accepted, then APCI-using archs can
just put their ACPI bits in their device extension
and that field is gone.

- dma_mem : this seem to be an arch specific data structure for
use by the few archs that implement the separate
dma coherent thing. It would naturally fit in the
device extension for those archs.

Thus, that is 3 pointers gone for archs who don't use these, and the ability
to put things like your dma ops in every struct device.

> > - port m68knommu, sparc and v850 to implement the dma_* API instead of
> > the pci_* one
>
> Please just mirror what I did on sparc64 for sparc32, see changeset
> 42f142371e48fbc44956d57b4e506bb6ce673cd7, with followup bug fixes
> in 36321426e320c2c6bc2f8a1587d6f4d695fca84c and
> 7233589d77fdb593b482a8b7ee867e901f54b593.

Will do.

Thanks for your comments,

Cheers,
Ben.

2006-11-08 05:29:37

by David Miller

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

From: Benjamin Herrenschmidt <[email protected]>
Date: Wed, 08 Nov 2006 16:23:40 +1100

> Then, maybe 6 month, maybe 1 year later, we can change archs that use
> the "alternate" semantic like sparc64 to no longer fail
> pci_set_dma_mask(64bits).
>
> In fact, the only breakage here would be for those archs to have some
> drivers start going slowly, though we could expect drivers to have been
> fixed by then.... (And we can delay that second part of the change as
> long as deemed necessary).

The arch implementations of pci_map_*() et al. might start
failing since they were written assuming that DAC never got
enabled.

> Yup, but you didn't fix sparc32 :-) I suppose I can try to do it and ask
> Anton for help if things go wrong, though I can't be bothered building a
> cross toolchain or getting a box on ebay so I'll rely on your for
> testing :-)

I only do sparc32 build testing, which you can do on a sparc64
box and Al Viro has great recipies for cross tool building and
usage.

> Thus, that is 3 pointers gone for archs who don't use these, and the ability
> to put things like your dma ops in every struct device.

How exactly does your device struct extension work? I ask because
struct device is embedded into other structs, such as pci_dev,
so it has to be fixed in size unless you have some clever trick. :)

2006-11-08 05:44:57

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble


> > Then, maybe 6 month, maybe 1 year later, we can change archs that use
> > the "alternate" semantic like sparc64 to no longer fail
> > pci_set_dma_mask(64bits).
> >
> > In fact, the only breakage here would be for those archs to have some
> > drivers start going slowly, though we could expect drivers to have been
> > fixed by then.... (And we can delay that second part of the change as
> > long as deemed necessary).
>
> The arch implementations of pci_map_*() et al. might start
> failing since they were written assuming that DAC never got
> enabled.

True. However, as I said, we don't have to deprecate the old technique
right away, we have time to get the drivers fixed, provided we agree
that this is the way to go.

> > Yup, but you didn't fix sparc32 :-) I suppose I can try to do it and ask
> > Anton for help if things go wrong, though I can't be bothered building a
> > cross toolchain or getting a box on ebay so I'll rely on your for
> > testing :-)
>
> I only do sparc32 build testing, which you can do on a sparc64
> box and Al Viro has great recipies for cross tool building and
> usage.

Yup, I might have a look. (Or maybe can you give accounts on a box I can
use ? That would be even easier)

> > Thus, that is 3 pointers gone for archs who don't use these, and the ability
> > to put things like your dma ops in every struct device.
>
> How exactly does your device struct extension work? I ask because
> struct device is embedded into other structs, such as pci_dev,
> so it has to be fixed in size unless you have some clever trick. :)

Nah, my extension is fixed, it's just that it's defined by the arch. So
archs who don't care don't get the bloat.

Right now, my implementation just hijacks firmare_data, so it's a
pointer (and thus potentially could be variable size) but I want to have
it "flat" in for performances.

My current device_ext on powerpc is:

struct device_ext {
/* Optional pointer to an OF device node */
struct device_node *of_node;

/* DMA operations on that device */
struct dma_mapping_ops *dma_ops;
void *dma_data;

/* NUMA node if applicable */
int numa_node;
};

If we remove plaform_data and firmware_data from struct device, then the
size difference is one pointer and one int, which isn't -that- much (for
powerpc, I consider that acceptable).

The idea is just to have asm/device.h do

struct device_ext {
};

That is, define an empty struct, for all archs that don't care about it,
though I want to move the dma_cohrerent_map thingy into the extension
for the 3 archs that seem to use it (x86, frv and m32.

Cheers,
Ben


2006-11-08 08:25:40

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, Nov 08, 2006 at 12:54:37PM +1100, Benjamin Herrenschmidt wrote:

> - For platforms like powerpc where I can have multiple busses on
> different IOMMU's and with possibly different DMA ops, I really need to
> have a per-device data structure for use by the DMA ops (in fact, in my
> case, containing the DMA ops themselves). Right now, I defined a notion
> of "device extension" (struct device_ext) that my current implementation
> puts in device->firmware_data (don't look for it upstream, it's all
> patches queuing up for 2.6.20 and about to go into powerpc.git), but
> that I'd like to have flat in struct device instead. Would it be agreed
> that linux/device.h includes itself an asm/device.h which contains a
> definition for a struct device_ext that is within struct device ? That
> would also avoid a pointer indirection which is a good thing for DMA
> operations

I want multiple dma_ops for Calgary on x86-64, so strong thumbs up for
doing this in a generic manner. device_ext kinda sucks as a name,
though... if it's used for just dma_ops, how about device_dma_ops?

I agree with the rest of your suggestions too, FWIW.

Cheers,
Muli

2006-11-08 08:47:51

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, 2006-11-08 at 10:25 +0200, Muli Ben-Yehuda wrote:
> On Wed, Nov 08, 2006 at 12:54:37PM +1100, Benjamin Herrenschmidt wrote:
>
> > - For platforms like powerpc where I can have multiple busses on
> > different IOMMU's and with possibly different DMA ops, I really need to
> > have a per-device data structure for use by the DMA ops (in fact, in my
> > case, containing the DMA ops themselves). Right now, I defined a notion
> > of "device extension" (struct device_ext) that my current implementation
> > puts in device->firmware_data (don't look for it upstream, it's all
> > patches queuing up for 2.6.20 and about to go into powerpc.git), but
> > that I'd like to have flat in struct device instead. Would it be agreed
> > that linux/device.h includes itself an asm/device.h which contains a
> > definition for a struct device_ext that is within struct device ? That
> > would also avoid a pointer indirection which is a good thing for DMA
> > operations
>
> I want multiple dma_ops for Calgary on x86-64, so strong thumbs up for
> doing this in a generic manner. device_ext kinda sucks as a name,
> though... if it's used for just dma_ops, how about device_dma_ops?
>
> I agree with the rest of your suggestions too, FWIW.

Yes, I need multiple dma_ops for powerpc too and additional void * data
that go with them (iommu instance).

I use it for more than dma ops though. I posted the actual structure
content in another reply to Dave.

I agree, though, device_ext sucks as a name, you are welcome to propose
something better, I'm no good at finding names :-)

Ben.


2006-11-08 09:21:54

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, Nov 08, 2006 at 07:47:33PM +1100, Benjamin Herrenschmidt wrote:

> I agree, though, device_ext sucks as a name, you are welcome to propose
> something better, I'm no good at finding names :-)

Seems a lot like `pci_sysdata' on x86-64 and i386 with Jeff's PCI
domains support. dev_sysdata? naming is not my strong suit :-)

struct pci_sysdata {
int domain; /* PCI domain */
int node; /* NUMA node */
void* iommu; /* IOMMU private data */
};

2006-11-08 10:00:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, 2006-11-08 at 11:21 +0200, Muli Ben-Yehuda wrote:
> On Wed, Nov 08, 2006 at 07:47:33PM +1100, Benjamin Herrenschmidt wrote:
>
> > I agree, though, device_ext sucks as a name, you are welcome to propose
> > something better, I'm no good at finding names :-)
>
> Seems a lot like `pci_sysdata' on x86-64 and i386 with Jeff's PCI
> domains support. dev_sysdata? naming is not my strong suit :-)
>
> struct pci_sysdata {
> int domain; /* PCI domain */
> int node; /* NUMA node */
> void* iommu; /* IOMMU private data */
> };

Interesting... It looks a _lot_ like my device_ext :-)

The reason we don't use PCI sysdata is that currently, our PCI layer
already hijacks it with something else (ugh, I know, it sucks). I'm
pondering consolidating all of this though, but since I need that for
more than PCI, it cannot be just sysdata. Also, as I explained, I'd
prefer having it directly in struct device to avoid one more pointer
indirection in the DMA hot path.

But it looks like if we get that, x86 might be able to move their
sysdata to there.

You might be bad at naming, but I'm worse. Your dev_sysdata wins over my
device_ext imho :-)

So unless I get some better proposal, I'll do a patch introducing that
as an empty struct on all archs, possibly tonmorrow, and then start
migrating things to it.

Cheers,
Ben.


2006-11-08 22:56:30

by Russell King

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, Nov 08, 2006 at 07:47:33PM +1100, Benjamin Herrenschmidt wrote:
> Yes, I need multiple dma_ops for powerpc too

Ditto for ARM.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-08 23:42:11

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, 2006-11-08 at 22:56 +0000, Russell King wrote:
> On Wed, Nov 08, 2006 at 07:47:33PM +1100, Benjamin Herrenschmidt wrote:
> > Yes, I need multiple dma_ops for powerpc too
>
> Ditto for ARM.

Ok, so there is some interest in having the dma_ops in struct device
beyond powerpc.

I'll put together today a patch doing:

- add #include <asm/device.h> to linux/device.h
- add a device.h file in asm/* that does:

struct dev_sysdata {
};

- add a struct dev_sysdata sysdata; field to struct device

That patch alone is 0 overhead and allows archs to start adding things.
I'll then modify my pending patches for 2.6.20 to use that instead of my
current device_ext thing.

Is that ok with everybody for 2.6.20 ?

Then, we can do, in no special order:

- on x86, put the acpi data in there and remove firmware_data from
struct device

- on x86, m32, frv, put the dma_coherent_mem pointer in there too and
remove it from struct device

- you can use it on ARM to put your dma_operations pointer as I'm
doing in for powerpc

- x86 can do the same

etc...

Cheers,
Ben.


2006-11-09 00:44:11

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

On Wed, 2006-11-08 at 22:56 +0000, Russell King wrote:
> On Wed, Nov 08, 2006 at 07:47:33PM +1100, Benjamin Herrenschmidt wrote:
> > Yes, I need multiple dma_ops for powerpc too
>

Ok, I'm sending something under a new thread with a trimmed CC list for
the basic dev_sysdata bits. Look for [PATCH 0/2] Add dev_sysdata and use
it for ACPI

Cheers,
Ben

2006-11-10 01:06:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble


>
> Please just mirror what I did on sparc64 for sparc32, see changeset
> 42f142371e48fbc44956d57b4e506bb6ce673cd7, with followup bug fixes
> in 36321426e320c2c6bc2f8a1587d6f4d695fca84c and
> 7233589d77fdb593b482a8b7ee867e901f54b593.

Question about sparc. It's implementation of pci_alloc_consistent(),
unlike the other ones from before we had a GFP mask massed, does
GFP_KERNEL allocations and not GFP_ATOMIC. Thus it's never expected to
be called in atomic context. In fact, it does various other things like
calling allocate_resource which is not something you ever want to be
called from interrupt context.

I'm splitting it into a pci_do_alloc_consistent that takes a gfp arg,
and a pair of pci_alloc_consistent & dma_alloc_consistent wrappers.

Do you think I should have the former pass GFP_KERNEL like the current
implementation does or switch it to GFP_ATOMIC like everybody does ? In
this case, should I also change the kmalloc done in there to allocate a
struct resource to use the gfp argument ? (It's currently doing
GFP_KERNEL).

Cheers,
Ben.


2006-11-10 02:50:28

by David Miller

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 10 Nov 2006 12:02:04 +1100

> > Please just mirror what I did on sparc64 for sparc32, see changeset
> > 42f142371e48fbc44956d57b4e506bb6ce673cd7, with followup bug fixes
> > in 36321426e320c2c6bc2f8a1587d6f4d695fca84c and
> > 7233589d77fdb593b482a8b7ee867e901f54b593.
>
> Question about sparc. It's implementation of pci_alloc_consistent(),
> unlike the other ones from before we had a GFP mask massed, does
> GFP_KERNEL allocations and not GFP_ATOMIC. Thus it's never expected to
> be called in atomic context. In fact, it does various other things like
> calling allocate_resource which is not something you ever want to be
> called from interrupt context.

pci_alloc_consistent() is not allowed from atomic contexts.

> I'm splitting it into a pci_do_alloc_consistent that takes a gfp arg,
> and a pair of pci_alloc_consistent & dma_alloc_consistent wrappers.
>
> Do you think I should have the former pass GFP_KERNEL like the current
> implementation does or switch it to GFP_ATOMIC like everybody does ? In
> this case, should I also change the kmalloc done in there to allocate a
> struct resource to use the gfp argument ? (It's currently doing
> GFP_KERNEL).

pci_alloc_consistent() really cannot be allowed to use GFP_ATOMIC.

2006-11-10 02:55:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble


> pci_alloc_consistent() is not allowed from atomic contexts.

Yes, but some drivers did it anyway, though I can't remember under which
circumstances (IDE probe possibly ? It's a usual culprit for that sort
of thing). This is why most implementations use GFP_ATOMIC (including
sparc64 :-)

> > I'm splitting it into a pci_do_alloc_consistent that takes a gfp arg,
> > and a pair of pci_alloc_consistent & dma_alloc_consistent wrappers.
> >
> > Do you think I should have the former pass GFP_KERNEL like the current
> > implementation does or switch it to GFP_ATOMIC like everybody does ? In
> > this case, should I also change the kmalloc done in there to allocate a
> > struct resource to use the gfp argument ? (It's currently doing
> > GFP_KERNEL).
>
> pci_alloc_consistent() really cannot be allowed to use GFP_ATOMIC.

Oh well, I have no problem with leaving sparc32 do GFP_KERNEL indeed, I
can't remember for sure the reason why we have most architectures do
GFP_ATOMIC, but it probably never hit sparc32.

Ben.


2006-11-10 03:01:46

by David Miller

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble

From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 10 Nov 2006 13:55:27 +1100

> > pci_alloc_consistent() is not allowed from atomic contexts.
>
> Yes, but some drivers did it anyway, though I can't remember under which
> circumstances (IDE probe possibly ? It's a usual culprit for that sort
> of thing). This is why most implementations use GFP_ATOMIC (including
> sparc64 :-)

Ok, I see.

> Oh well, I have no problem with leaving sparc32 do GFP_KERNEL indeed, I
> can't remember for sure the reason why we have most architectures do
> GFP_ATOMIC, but it probably never hit sparc32.

I wish sparc32 hadn't used alloc_resource() as a poor-man's bitmap
allocator to keep track of IOMMU mappings. That's where the
GFP_KERNEL requirement comes from.

Just use GFP_KERNEL for now, and someone might find the strength
to remove this problem some day :-)

2006-11-10 04:07:48

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: DMA APIs gumble grumble


> I wish sparc32 hadn't used alloc_resource() as a poor-man's bitmap
> allocator to keep track of IOMMU mappings. That's where the
> GFP_KERNEL requirement comes from.
>
> Just use GFP_KERNEL for now, and someone might find the strength
> to remove this problem some day :-)

Well, Anton has a pile of sparc's in a closet... :-)

Ben.