2011-04-21 19:29:20

by Arnd Bergmann

Subject: [RFC] ARM DMA mapping TODO, v1

I think the recent discussions on linaro-mm-sig and the BoF last week
at ELC have been quite productive, and at least my understanding
of the missing pieces has improved quite a bit. This is a list of
things that I think need to be done in the kernel. Please complain
if any of these still seem controversial:

1. Fix the arm version of dma_alloc_coherent. It's in use today and
is broken on modern CPUs because it results in both cached and
uncached mappings. Rebecca suggested different approaches how to
get there.

2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
that this is needed, and it currently is not implemented, with
an outdated comment explaining why it used to not be possible
to do it.

3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
both IOMMU and direct mapped DMA on some machines.

4. Implement an architecture independent version of dma_map_ops
based on the iommu.h API. As Joerg mentioned, this has been
missing for some time, and it would be better to do it once
than for each IOMMU separately. This is probably a lot of work.

5. Find a way to define per-device IOMMUs, if that is not actually
possible already. We had conflicting statements for this.

6. Implement iommu_ops for each of the ARM platforms that have
an IOMMU. Needs some modifications for MSM and a rewrite for
OMAP. An implementation for Samsung is in progress.

7. Extend the dma_map_ops to have a way for mapping a buffer
from dma_alloc_{non,}coherent into user space. We have not
discussed that yet, but after thinking about this for some time, I
believe this would be the right approach to map buffers into
user space from code that doesn't care about the underlying
hardware (a rough sketch follows below).
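
For illustration only, one possible shape of such an extension, assuming
dma_map_ops grows a new mmap callback; the member name and signature are a
sketch, not an existing interface:

struct dma_map_ops {
        /* ... existing alloc/free/map/sync callbacks ... */

        /* new: map a dma_alloc_{non,}coherent buffer into user space */
        int (*mmap)(struct device *dev, struct vm_area_struct *vma,
                    void *cpu_addr, dma_addr_t dma_addr, size_t size);
};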

After all these are in place, building anything on top of
dma_alloc_{non,}coherent should be much easier. The question
of passing buffers between V4L and DRM is still completely
unsolved as far as I can tell, but that discussion might become
more focused if we can agree on the above points and assume
that it will be done.

I expect that I will have to update the list above as people
point out mistakes in my assumptions.

Arnd


2011-04-21 20:09:39

by Jesse Barnes

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, 21 Apr 2011 21:29:16 +0200
Arnd Bergmann <[email protected]> wrote:

> I think the recent discussions on linaro-mm-sig and the BoF last week
> at ELC have been quite productive, and at least my understanding
> of the missing pieces has improved quite a bit. This is a list of
> things that I think need to be done in the kernel. Please complain
> if any of these still seem controversial:
>
> 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> is broken on modern CPUs because it results in both cached and
> uncached mappings. Rebecca suggested different approaches how to
> get there.
>
> 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> that this is needed, and it currently is not implemented, with
> an outdated comment explaining why it used to not be possible
> to do it.
>
> 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> both IOMMU and direct mapped DMA on some machines.

I don't think the DMA mapping and allocation APIs are sufficient for
high performance graphics at least. It's fairly common to allocate a
bunch of buffers necessary to render a scene, build up a command buffer
that references them, then hand the whole thing off to the kernel to
execute at once on the GPU. That allows for a lot of extra efficiency,
since it allows you to batch the MMU binding until execution occurs (or
even put it off entirely until the page is referenced by the GPU in the
case of faulting support). It's also necessary to avoid livelocks
between two clients trying to render; if mapping is incremental on both
sides, it's possible that neither will be able to make forward
progress due to IOMMU space exhaustion.
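
A pseudo-C sketch of the flow described above; every type and function below
is made up purely to illustrate the deferred binding, not an existing driver
interface:

static int render_scene(struct gpu_device *gpu, struct scene *s)
{
        struct gpu_bo *bo[MAX_SCENE_BUFFERS];
        struct gpu_cmdbuf *cb;
        int i;

        /* allocate backing pages only; nothing is bound in the GPU MMU yet */
        for (i = 0; i < s->nbuf; i++)
                bo[i] = gpu_bo_alloc(gpu, s->size[i]);

        /* the command buffer references the buffers by handle */
        cb = gpu_cmdbuf_build(s, bo, s->nbuf);

        /* all MMU bindings happen in one batch at submit time (or lazily on
         * a GPU fault), rather than incrementally per buffer */
        return gpu_submit(gpu, cb);
}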

So that argues for separating allocation from mapping both on the user
side (which I think everyone agrees on) as well as on the kernel side,
both for CPU access (which some drivers won't need) and for GPU access.

--
Jesse Barnes, Intel Open Source Technology Center

2011-04-21 21:52:26

by Zach Pfeffer

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

> Arnd Bergmann <[email protected]> wrote:
>
>> I think the recent discussions on linaro-mm-sig and the BoF last week
>> at ELC have been quite productive, and at least my understanding
>> of the missing pieces has improved quite a bit. This is a list of
>> things that I think need to be done in the kernel. Please complain
>> if any of these still seem controversial:
>>
>> 1. Fix the arm version of dma_alloc_coherent. It's in use today and
>>    is broken on modern CPUs because it results in both cached and
>>    uncached mappings. Rebecca suggested different approaches how to
>>    get there.
>>
>> 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
>>    that this is needed, and it currently is not implemented, with
>>    an outdated comment explaining why it used to not be possible
>>    to do it.
>>
>> 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
>>    both IOMMU and direct mapped DMA on some machines.
>
> I don't think the DMA mapping and allocation APIs are sufficient for
> high performance graphics at least. It's fairly common to allocate a
> bunch of buffers necessary to render a scene, build up a command buffer
> that references them, then hand the whole thing off to the kernel to
> execute at once on the GPU. That allows for a lot of extra efficiency,
> since it allows you to batch the MMU binding until execution occurs (or
> even put it off entirely until the page is referenced by the GPU in the
> case of faulting support). It's also necessary to avoid livelocks
> between two clients trying to render; if mapping is incremental on both
> sides, it's possible that neither will be able to make forward
> progress due to IOMMU space exhaustion.
>
> So that argues for separating allocation from mapping both on the user
> side (which I think everyone agrees on) as well as on the kernel side,
> both for CPU access (which some drivers won't need) and for GPU access.

I agree with Jesse that the separation of mapping from allocation is
central to the current usage models. I realize most people didn't like
VCMM, but it provided an abstraction for this - if software can handle
the multiple mapper approach in a rational way across ARM then we can
solve a lot of problems with all the current map and unmap solutions
and we don't have to hack in coherency.

>
> --
> Jesse Barnes, Intel Open Source Technology Center
>

2011-04-22 00:34:04

by Cho KyongHo

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 22, 2011 at 6:52 AM, Zach Pfeffer <[email protected]> wrote:
>
> I agree with Jesse that the separation of mapping from allocation is
> central to the current usage models. I realize most people didn't like
> VCMM, but it provided an abstraction for this - if software can handle
> the multiple mapper approach in a rational way across ARM then we can
> solve a lot of problems with all the current map and unmap solutions
> and we don't have to hack in coherency.
>

Hi.
I've also found VCMM to be a reasonable approach for IOMMU mappings.
We often deal with physical memory blocks that need to be mapped in
multiple ways. Allocation of the physical memory itself is also important
for some peripheral devices, because getting larger page frames is
beneficial for their performance.

IOMMU api does not provide virtual memory management.
DMA api is not flexible for all our use-cases.

Regards,
KyongHo.

2011-04-26 14:26:27

by Arnd Bergmann

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thursday 21 April 2011, Jesse Barnes wrote:
> On Thu, 21 Apr 2011 21:29:16 +0200
> Arnd Bergmann <[email protected]> wrote:
>
> > I think the recent discussions on linaro-mm-sig and the BoF last week
> > at ELC have been quite productive, and at least my understanding
> > of the missing pieces has improved quite a bit. This is a list of
> > things that I think need to be done in the kernel. Please complain
> > if any of these still seem controversial:
> >
> > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > is broken on modern CPUs because it results in both cached and
> > uncached mappings. Rebecca suggested different approaches how to
> > get there.
> >
> > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > that this is needed, and it currently is not implemented, with
> > an outdated comment explaining why it used to not be possible
> > to do it.
> >
> > 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> > both IOMMU and direct mapped DMA on some machines.
>
> I don't think the DMA mapping and allocation APIs are sufficient for
> high performance graphics at least. It's fairly common to allocate a
> bunch of buffers necessary to render a scene, build up a command buffer
> that references them, then hand the whole thing off to the kernel to
> execute at once on the GPU. That allows for a lot of extra efficiency,
> since it allows you to batch the MMU binding until execution occurs (or
> even put it off entirely until the page is referenced by the GPU in the
> case of faulting support). It's also necessary to avoid livelocks
> between two clients trying to render; if mapping is incremental on both
> sides, it's possible that neither will be able to make forward
> progress due to IOMMU space exhaustion.
>
> So that argues for separating allocation from mapping both on the user
> side (which I think everyone agrees on) as well as on the kernel side,
> both for CPU access (which some drivers won't need) and for GPU access.

I don't think that this argument has anything to do with what the
underlying API should be, right? I can see this built on top of either
the dma-mapping headers with extensions to map potentially uncached
pages, or the iommu API. Neither way would however save us from
implementing the three items listed above.

It's certainly a good point to note that we should have a way to
allocate pages for a device without mapping them into any address
space right away. My feeling is still that the dma mapping API is
the right place for this, because it is the only part of the kernel
that has knowledge about whether a device needs uncached memory for
coherent access, under what constraints it can map noncontiguous
memory into its own address space, and what its addressing capabilities
are (dma mask).
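
As a concrete example, part of that per-device knowledge is already expressed
through existing DMA API calls, e.g. the addressing capabilities (the
platform_device used here is only an example):

        /* driver declaring that its device can only address 32 bits */
        if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) ||
            dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32)))
                dev_warn(&pdev->dev, "no suitable DMA mask available\n");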

Arnd

2011-04-26 14:28:46

by Arnd Bergmann

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thursday 21 April 2011, Zach Pfeffer wrote:

> I agree with Jesse that the separation of mapping from allocation is
> central to the current usage models. I realize most people didn't like
> VCMM, but it provided an abstraction for this - if software can handle
> the multiple mapper approach in a rational way across ARM then we can
> solve a lot of problems with all the current map and unmap solutions
> and we don't have to hack in coherency.

Any solution we come up with needs to work not only across ARM, but
also across other architectures. Fortunately, most have less weird
constraints; for instance, some architectures have no concept of uncached
mappings (and don't need them), or there might be DMA ordering settings
that we have not yet seen on ARM.

Arnd

2011-04-26 14:30:04

by Arnd Bergmann

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Friday 22 April 2011, KyongHo Cho wrote:
> IOMMU api does not provide virtual memory management.
> DMA api is not flexible for all our use-cases.

We can fix either problem by changing the existing interfaces,
which is much more maintainable in the long run than adding
a third one.

Arnd

2011-04-26 15:39:27

by Jesse Barnes

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Tue, 26 Apr 2011 16:26:19 +0200
Arnd Bergmann <[email protected]> wrote:
> I don't think that this argument has anything to do with what the
> underlying API should be, right? I can see this built on top of either
> the dma-mapping headers with extensions to map potentially uncached
> pages, or the iommu API. Neither way would however save us from
> implementing the three items listed above.

Or simply extending the DMA mapping API to allow for allocations
without mapping. I was just worried you had a more traditional driver
model in mind (e.g. coherent alloc on the ring buffer, single mappings
for data buffers, all mapped in the kernel driver at allocation time).

The DMA API does have some advantages, in that arches already support
it, there's some infrastructure for handling per-bus mapping, etc., so
building on top of it is probably a good idea.

> It's certainly a good point to note that we should have a way to
> allocate pages for a device without mapping them into any address
> space right away. My feeling is still that the dma mapping API is
> the right place for this, because it is the only part of the kernel
> that has knowledge about whether a device needs uncached memory for
> coherent access, under what constraints it can map noncontiguous
> memory into its own address space, and what its addressing capabilities
> are (dma mask).

Right. Sometimes a device or platform can handle either cached or
uncached though, and we need userspace to decide on the best type for
performance reasons.

--
Jesse Barnes, Intel Open Source Technology Center

2011-04-27 08:03:37

by Russell King - ARM Linux

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> is broken on modern CPUs because it results in both cached and
> uncached mappings. Rebecca suggested different approaches how to
> get there.

I also suggested various approaches and produced patches, which I'm slowly
feeding in. However, I think whatever we do, we'll end up breaking
something along the line - especially as various places assume that
dma_alloc_coherent() is ultimately backed by memory with a struct page.
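
An example of the kind of assumption meant here (sketch): code that converts
a coherent buffer back to a struct page, which only works while the buffer
comes straight out of the kernel linear mapping:

        void *cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
        struct page *page = virt_to_page(cpu_addr); /* breaks for remapped buffers */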

> 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> that this is needed, and it currently is not implemented, with
> an outdated comment explaining why it used to not be possible
> to do it.

dma_alloc_noncoherent is an entirely pointless API afaics.

> 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> both IOMMU and direct mapped DMA on some machines.
>
> 4. Implement an architecture independent version of dma_map_ops
> based on the iommu.h API. As Joerg mentioned, this has been
> missing for some time, and it would be better to do it once
> than for each IOMMU separately. This is probably a lot of work.

dma_map_ops design is broken - we can't have the entire DMA API indirected
through that structure. Whether you have an IOMMU or not is completely
independent of whether you have to do DMA cache handling. Moreover, with
dmabounce, having the DMA cache handling in place doesn't make sense.

So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
work like that.

I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable
for ARM.

2011-04-27 08:56:57

by Arnd Bergmann

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:
> > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > that this is needed, and it currently is not implemented, with
> > an outdated comment explaining why it used to not be possible
> > to do it.
>
> dma_alloc_noncoherent is an entirely pointless API afaics.

The main use case that I can see for dma_alloc_noncoherent is being
able to allocate a large cacheable memory chunk that is mapped
contiguously into both kernel virtual and bus virtual space, but is not
necessarily contiguous in physical memory.

Without an IOMMU, I agree that it is pointless, because the only
sensible implementation would be alloc_pages_exact + dma_map_single.
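
Roughly what that would boil down to (sketch, error handling omitted):

        void *buf = alloc_pages_exact(size, GFP_KERNEL);
        dma_addr_t handle = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);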

> > 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> > both IOMMU and direct mapped DMA on some machines.
> >
> > 4. Implement an architecture independent version of dma_map_ops
> > based on the iommu.h API. As Joerg mentioned, this has been
> > missing for some time, and it would be better to do it once
> > than for each IOMMU separately. This is probably a lot of work.
>
> dma_map_ops design is broken - we can't have the entire DMA API indirected
> through that structure. Whether you have an IOMMU or not is completely
> independent of whether you have to do DMA cache handling. Moreover, with
> dmabounce, having the DMA cache handling in place doesn't make sense.
>
> So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
> for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> work like that.
>
> I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable
> for ARM.

We probably still need to handle both the coherent and noncoherent case
in each dma_map_ops implementation, at least for those combinations where
they matter (definitely the linear mapping). However, I think that using
dma_mapping_common.h would let us use an architecture-independent dma_map_ops
for the generic iommu code that Marek wants to introduce now.

I still don't understand how dmabounce works, but if it's similar to
swiotlb, we can have at least three different dma_map_ops: linear, dmabounce
and iommu.

Without the common iommu abstraction, there would be a bigger incentive
to go with dma_map_ops, because then we would need one operations structure
per IOMMU implementation, as some other architectures (x86, powerpc,
ia64, ...) have. If we only need to distinguish between the common linear
mapping code and the common iommu code, then you are right and we are likely
better off adding some more conditionals to the existing code to handle
the iommu case in addition to the ones we handle today.

Arnd

2011-04-27 09:09:34

by Russell King - ARM Linux

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:
> We probably still need to handle both the coherent and noncoherent case
> in each dma_map_ops implementation, at least for those combinations where
> they matter (definitely the linear mapping). However, I think that using
> dma_mapping_common.h would let us use an architecture-independent dma_map_ops
> for the generic iommu code that Marek wants to introduce now.

The 'do we have an iommu or not' question and the 'do we need to do cache
coherency' question are two independent questions which are unrelated to
each other. There are four unique but equally valid combinations.

Pushing the cache coherency question down into the iommu stuff will mean
that we'll constantly be fighting against the 'but this iommu works on x86'
shite that we've fought with over block device crap for years. I have
no desire to go there.

What we need is a proper abstraction where the DMA ops can say whether
they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but
default to DMA cache handling being the norm - and the DMA cache handling
performed in the level above the DMA ops indirection.
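
One way to express that (sketch only; the handles_cache flag, the wrapper and
the use of get_dma_ops() on ARM are all hypothetical here):

static dma_addr_t generic_map_page(struct device *dev, struct page *page,
                unsigned long offset, size_t size, enum dma_data_direction dir)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);

        /* cache handling stays above the indirection, unless the ops
         * (swiotlb, dmabounce, ...) declare that they make it unnecessary */
        if (!ops->handles_cache)                /* hypothetical flag */
                __dma_page_cpu_to_dev(page, offset, size, dir);

        return ops->map_page(dev, page, offset, size, dir, NULL);
}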

Anything else is asking for an endless stream of shite iommu stuff
getting DMA cache handling wrong.

2011-04-27 09:52:27

by Catalin Marinas

Subject: Re: [RFC] ARM DMA mapping TODO, v1

Arnd,

On 21 April 2011 20:29, Arnd Bergmann <[email protected]> wrote:
> I think the recent discussions on linaro-mm-sig and the BoF last week
> at ELC have been quite productive, and at least my understanding
> of the missing pieces has improved quite a bit. This is a list of
> things that I think need to be done in the kernel. Please complain
> if any of these still seem controversial:
>
> 1. Fix the arm version of dma_alloc_coherent. It's in use today and
>   is broken on modern CPUs because it results in both cached and
>   uncached mappings. Rebecca suggested different approaches how to
>   get there.

It's not broken since we moved to using Normal non-cacheable memory
for the coherent DMA buffers (as long as you flush the cacheable alias
before using the buffer, as we already do). The ARM ARM currently says
unpredictable for such situations but this is being clarified in
future updates and the Normal non-cacheable vs cacheable aliases can
be used (given correct cache maintenance before using the buffer).

> 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
>   that this is needed, and it currently is not implemented, with
>   an outdated comment explaining why it used to not be possible
>   to do it.

As Russell pointed out, there are 4 main combinations with iommu and
some coherency support (i.e. being able to snoop the CPU caches). But
in an SoC you can have different devices with different iommu and
coherency configurations. Some of them may even be able to see the L2
cache but not the L1 (in which case it would help if we can get an
inner non-cacheable outer cacheable mapping).

Anyway, we end up with different DMA ops per device via dev_archdata.

--
Catalin

2011-04-27 10:43:23

by Arnd Bergmann

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011, Catalin Marinas wrote:
> Arnd,
>
> On 21 April 2011 20:29, Arnd Bergmann <[email protected]> wrote:
> > I think the recent discussions on linaro-mm-sig and the BoF last week
> > at ELC have been quite productive, and at least my understanding
> > of the missing pieces has improved quite a bit. This is a list of
> > things that I think need to be done in the kernel. Please complain
> > if any of these still seem controversial:
> >
> > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > is broken on modern CPUs because it results in both cached and
> > uncached mappings. Rebecca suggested different approaches how to
> > get there.
>
> It's not broken since we moved to using Normal non-cacheable memory
> for the coherent DMA buffers (as long as you flush the cacheable alias
> before using the buffer, as we already do). The ARM ARM currently says
> unpredictable for such situations but this is being clarified in
> future updates and the Normal non-cacheable vs cacheable aliases can
> be used (given correct cache maintenance before using the buffer).

Thanks for that information, I believe a number of people in the
previous discussions were relying on the information from the
documentation. Are you sure that this is not only correct for the
cores made by ARM ltd but also for the other implementations that
may have relied on documentation?

As I mentioned before, there are other architectures where having
conflicting cache settings in TLB entries for the same physical page
immediately checkstops the CPU, and I guess that this was also allowed
by the current version of the ARM ARM.

> > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > that this is needed, and it currently is not implemented, with
> > an outdated comment explaining why it used to not be possible
> > to do it.
>
> As Russell pointed out, there are 4 main combinations with iommu and
> some coherency support (i.e. being able to snoop the CPU caches). But
> in an SoC you can have different devices with different iommu and
> coherency configurations. Some of them may even be able to see the L2
> cache but not the L1 (in which case it would help if we can get an
> inner non-cacheable outer cacheable mapping).
>
> Anyway, we end up with different DMA ops per device via dev_archdata.

Having different DMA ops per device was the solution that I was suggesting
with dma_mapping_common.h, but Russell pointed out that it may not be
the best option.

The alternative would be to have just one set of dma_mapping functions
as we do today, but to extend the functions to also cover the iommu
case, for instance (example, don't take literally):

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
                size_t size, enum dma_data_direction dir)
{
        dma_addr_t dma_addr;

#ifdef CONFIG_DMABOUNCE
        if (dev->archdata.dmabounce)
                return dmabounce_map_single(dev, cpu_addr, size, dir);
#endif

#ifdef CONFIG_IOMMU
        if (dev->archdata.iommu)
                dma_addr = iommu_map_single(dev, cpu_addr, size, dir);
        else
#endif
                dma_addr = virt_to_dma(dev, cpu_addr);

        dma_sync_single_for_device(dev, dma_addr, size, dir);

        return dma_addr;
}

This would not even conflict with having a common implementation
for iommu based dma_map_ops -- we would just call the iommu functions
directly when needed rather than having an indirect function call.

Arnd

2011-04-27 10:52:06

by Marek Szyprowski

Subject: RE: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

Hello,

On Wednesday, April 27, 2011 10:57 AM Arnd Bergmann wrote:

> On Wednesday 27 April 2011, Russell King - ARM Linux wrote:
> > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > that this is needed, and it currently is not implemented, with
> > > an outdated comment explaining why it used to not be possible
> > > to do it.
> >
> > dma_alloc_noncoherent is an entirely pointless API afaics.
>
> The main use case that I can see for dma_alloc_noncoherent is being
> able to allocate a large cacheable memory chunk that is mapped
> contiguously into both kernel virtual and bus virtual space, but is not
> necessarily contiguous in physical memory.
>
> Without an IOMMU, I agree that it is pointless, because the only
> sensible implementation would be alloc_pages_exact + dma_map_single.

Still it might be reasonable to use it in the drivers that will work
on different platforms - one with iommu and one without.

> > > 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> > > both IOMMU and direct mapped DMA on some machines.
> > >
> > > 4. Implement an architecture independent version of dma_map_ops
> > > based on the iommu.h API. As Joerg mentioned, this has been
> > > missing for some time, and it would be better to do it once
> > > than for each IOMMU separately. This is probably a lot of work.
> >
> > dma_map_ops design is broken - we can't have the entire DMA API
> indirected
> > through that structure. Whether you have an IOMMU or not is completely
> > independent of whether you have to do DMA cache handling. Moreover, with
> > dmabounce, having the DMA cache handling in place doesn't make sense.
> >
> > So you can't have a dma_map_ops for the cache handling bits, a
> dma_map_ops
> > for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> > work like that.
> >
> > I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable
> > for ARM.
>
> We probably still need to handle both the coherent and noncoherent case
> in each dma_map_ops implementation, at least for those combinations where
> they matter (definitely the linear mapping). However, I think that using
> dma_mapping_common.h would let us use an architecture-independent
> dma_map_ops
> for the generic iommu code that Marek wants to introduce now.
>
> I still don't understand how dmabounce works, but if it's similar to
> swiotlb, we can have at least three different dma_map_ops: linear,
> dmabounce and iommu.

That's exactly what I want to make in the initial version of my patches.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

2011-04-27 11:02:57

by Arnd Bergmann

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011, Russell King - ARM Linux wrote:
> On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:
> > We probably still need to handle both the coherent and noncoherent case
> > in each dma_map_ops implementation, at least for those combinations where
> > they matter (definitely the linear mapping). However, I think that using
> > dma_mapping_common.h would let us use an architecture-independent dma_map_ops
> > for the generic iommu code that Marek wants to introduce now.
>
> The 'do we have an iommu or not' question and the 'do we need to do cache
> coherency' question are two independent questions which are unrelated to
> each other. There are four unique but equally valid combinations.
>
> Pushing the cache coherency question down into the iommu stuff will mean
> that we'll constantly be fighting against the 'but this iommu works on x86'
> shite that we've fought with over block device crap for years. I have
> no desire to go there.

Ok, I see. I believe we could avoid having to fight with the people that
only care about coherent architectures if we just have two separate
implementations of dma_map_ops in the iommu code, one for coherent
and one for noncoherent DMA. Any architecture that only needs one
of them would then only enable the Kconfig options for that implementation
and not care about the other one.

> What we need is a proper abstraction where the DMA ops can say whether
> they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but
> default to DMA cache handling being the norm - and the DMA cache handling
> performed in the level above the DMA ops indirection.

Yes, that sounds definitely possible. I guess it could be as simple
as having a flag somewhere in struct device if we want to make it
architecture independent.

As for making the default being to do cache handling, I'm not completely
sure how that would work on architectures where most devices are coherent.
If I understood the DRM people correctly, some x86 machine have noncoherent
DMA in their GPUs while everything else is coherent.

Maybe we can default to arch_is_coherent() and allow a device to override
that when it knows better.

Arnd

2011-04-27 11:08:35

by Catalin Marinas

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 2011-04-27 at 11:43 +0100, Arnd Bergmann wrote:
> On Wednesday 27 April 2011, Catalin Marinas wrote:
> > On 21 April 2011 20:29, Arnd Bergmann <[email protected]> wrote:
> > > I think the recent discussions on linaro-mm-sig and the BoF last week
> > > at ELC have been quite productive, and at least my understanding
> > > of the missing pieces has improved quite a bit. This is a list of
> > > things that I think need to be done in the kernel. Please complain
> > > if any of these still seem controversial:
> > >
> > > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > > is broken on modern CPUs because it results in both cached and
> > > uncached mappings. Rebecca suggested different approaches how to
> > > get there.
> >
> > It's not broken since we moved to using Normal non-cacheable memory
> > for the coherent DMA buffers (as long as you flush the cacheable alias
> > before using the buffer, as we already do). The ARM ARM currently says
> > unpredictable for such situations but this is being clarified in
> > future updates and the Normal non-cacheable vs cacheable aliases can
> > be used (given correct cache maintenance before using the buffer).
>
> Thanks for that information, I believe a number of people in the
> previous discussions were relying on the information from the
> documentation. Are you sure that this is not only correct for the
> cores made by ARM ltd but also for the other implementations that
> may have relied on documentation?

It is a clarification in the ARM ARM so it covers all the cores made by
architecture licensees, not just ARM Ltd. It basically makes the
"unpredictable" part more predictable to allow certain types of aliases
(e.g. Strongly Ordered vs Normal memory would still be disallowed).

All the current implementations are safe with Normal memory aliases
(cacheable vs non-cacheable) but of course, there may be some
performance benefits in not having any alias.

> As I mentioned before, there are other architectures where having
> conflicting cache settings in TLB entries for the same physical page
> immediately checkstops the CPU, and I guess that this was also allowed
> by the current version of the ARM ARM.

The current version of the ARM ARM says "unpredictable". But this
general definition of "unpredictable" does not allow it to deadlock
(hardware) or have security implications. It is however allowed to
corrupt data.

> > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > that this is needed, and it currently is not implemented, with
> > > an outdated comment explaining why it used to not be possible
> > > to do it.
> >
> > As Russell pointed out, there are 4 main combinations with iommu and
> > some coherency support (i.e. being able to snoop the CPU caches). But
> > in an SoC you can have different devices with different iommu and
> > coherency configurations. Some of them may even be able to see the L2
> > cache but not the L1 (in which case it would help if we can get an
> > inner non-cacheable outer cacheable mapping).
> >
> > Anyway, we end up with different DMA ops per device via dev_archdata.
>
> Having different DMA ops per device was the solution that I was suggesting
> with dma_mapping_common.h, but Russell pointed out that it may not be
> the best option.

IMHO, that's the most flexible option. I can't say for sure whether
we'll need such flexibility in the future.

> The alternative would be to have just one set of dma_mapping functions
> as we do today, but to extend the functions to also cover the iommu
> case, for instance (example, don't take literally):
>
> static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
>                 size_t size, enum dma_data_direction dir)
> {
>         dma_addr_t dma_addr;
>
> #ifdef CONFIG_DMABOUNCE
>         if (dev->archdata.dmabounce)
>                 return dmabounce_map_single(dev, cpu_addr, size, dir);
> #endif
>
> #ifdef CONFIG_IOMMU
>         if (dev->archdata.iommu)
>                 dma_addr = iommu_map_single(dev, cpu_addr, size, dir);
>         else
> #endif
>                 dma_addr = virt_to_dma(dev, cpu_addr);
>
>         dma_sync_single_for_device(dev, dma_addr, size, dir);
>
>         return dma_addr;
> }
>
> This would not even conflict with having a common implementation
> for iommu based dma_map_ops -- we would just call the iommu functions
> directly when needed rather than having an indirect function call.

I don't particularly like having lots of #ifdef's (but we could probably
have some macros checking archdata.* to make this cleaner).

We also need a way to specify a coherency level, as we are getting
platforms with devices connected to something like the ACP (Accelerator
Coherency Port).

--
Catalin

2011-04-27 14:06:51

by FUJITA Tomonori

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 27 Apr 2011 10:52:25 +0100
Catalin Marinas <[email protected]> wrote:

> Anyway, we end up with different DMA ops per device via dev_archdata.

Several architectures already do. What's wrong with the approach for
arm?

2011-04-27 14:29:40

by Catalin Marinas

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 2011-04-27 at 15:06 +0100, FUJITA Tomonori wrote:
> On Wed, 27 Apr 2011 10:52:25 +0100
> Catalin Marinas <[email protected]> wrote:
>
> > Anyway, we end up with different DMA ops per device via dev_archdata.
>
> Several architectures already do. What's wrong with the approach for
> arm?

Nothing wrong IMHO, but it depends on how you group the DMA ops, as it may
not be feasible to have all the dmabounce/iommu/coherency combinations.
I think the main combinations would be:

1. standard (no-iommu) + non-coherent
2. standard (no-iommu) + coherent
3. iommu + non-coherent
4. iommu + coherent
5. dmabounce + non-coherent
6. dmabounce + coherent

I think dmabounce and iommu can be exclusive (unless the iommu cannot
access the whole RAM). If that's the case, we can have three types of DMA
ops:

1. standard
2. iommu
3. dmabounce

with an additional flag via dev_archdata for cache coherency level (a
device may be able to snoop the L1 or L2 cache etc.)
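
A sketch of how that could look on the device side (the field names here are
illustrative only):

struct dev_archdata {
        const struct dma_map_ops *dma_ops;      /* standard, iommu or dmabounce */
        unsigned int dma_coherency;             /* e.g. none / L2 only / full snoop */
};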

--
Catalin

2011-04-27 14:34:40

by FUJITA Tomonori

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 27 Apr 2011 15:29:30 +0100
Catalin Marinas <[email protected]> wrote:

> On Wed, 2011-04-27 at 15:06 +0100, FUJITA Tomonori wrote:
> > On Wed, 27 Apr 2011 10:52:25 +0100
> > Catalin Marinas <[email protected]> wrote:
> >
> > > Anyway, we end up with different DMA ops per device via dev_archdata.
> >
> > Several architectures already do. What's wrong with the approach for
> > arm?
>
> Nothing wrong IMHO, but it depends on how you group the DMA ops, as it may
> not be feasible to have all the dmabounce/iommu/coherency combinations.
> I think the main combinations would be:
>
> 1. standard (no-iommu) + non-coherent
> 2. standard (no-iommu) + coherent
> 3. iommu + non-coherent
> 4. iommu + coherent
> 5. dmabounce + non-coherent
> 6. dmabounce + coherent
>
> I think dmabounce and iommu can be exclusive (unless the iommu cannot
> access the whole RAM). If that's the case, we can have three types of DMA
> ops:
>
> 1. standard
> 2. iommu
> 3. dmabounce
>
> with an additional flag via dev_archdata for cache coherency level (a
> device may be able to snoop the L1 or L2 cache etc.)

Nothing sounds wrong to me either. I'd like to see the ARM people switch
from dmabounce to swiotlb though.

2011-04-27 16:16:51

by Alex Deucher

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 7:02 AM, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 27 April 2011, Russell King - ARM Linux wrote:
>> On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:
>> > We probably still need to handle both the coherent and noncoherent case
>> > in each dma_map_ops implementation, at least for those combinations where
>> > they matter (definitely the linear mapping). However, I think that using
>> > dma_mapping_common.h would let us use an architecture-independent dma_map_ops
>> > for the generic iommu code that Marek wants to introduce now.
>>
>> The 'do we have an iommu or not' question and the 'do we need to do cache
>> coherency' question are two independent questions which are unrelated to
>> each other. There are four unique but equally valid combinations.
>>
>> Pushing the cache coherency question down into the iommu stuff will mean
>> that we'll constantly be fighting against the 'but this iommu works on x86'
>> shite that we've fought with over block device crap for years. I have
>> no desire to go there.
>
> Ok, I see. I believe we could avoid having to fight with the people that
> only care about coherent architectures if we just have two separate
> implementations of dma_map_ops in the iommu code, one for coherent
> and one for noncoherent DMA. Any architecture that only needs one
> of them would then only enable the Kconfig options for that implementation
> and not care about the other one.
>
>> What we need is a proper abstraction where the DMA ops can say whether
>> they can avoid DMA cache handling (eg, swiotlb or dmabounce stuff) but
>> default to DMA cache handling being the norm - and the DMA cache handling
>> performed in the level above the DMA ops indirection.
>
> Yes, that sounds definitely possible. I guess it could be as simple
> as having a flag somewhere in struct device if we want to make it
> architecture independent.
>
> As for making the default being to do cache handling, I'm not completely
> sure how that would work on architectures where most devices are coherent.
> If I understood the DRM people correctly, some x86 machine have noncoherent
> DMA in their GPUs while everything else is coherent.

On radeon hardware at least the on chip gart mechanism supports both
snooped cache coherent pages and uncached, non-snooped pages.

Alex

>
> Maybe we can default to arch_is_coherent() and allow a device to override
> that when it knows better.
>
>        Arnd
>

2011-04-27 17:44:40

by Anca Emanuel

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

@Russell: contact Linaro. They need you.
Make sure their effort is the right thing to do.

2011-04-27 20:16:23

by Russell King - ARM Linux

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 01:02:43PM +0200, Arnd Bergmann wrote:
> On Wednesday 27 April 2011, Russell King - ARM Linux wrote:
> > On Wed, Apr 27, 2011 at 10:56:49AM +0200, Arnd Bergmann wrote:
> > > We probably still need to handle both the coherent and noncoherent case
> > > in each dma_map_ops implementation, at least for those combinations where
> > > they matter (definitely the linear mapping). However, I think that using
> > > dma_mapping_common.h would let us use an architecture-independent dma_map_ops
> > > for the generic iommu code that Marek wants to introduce now.
> >
> > The 'do we have an iommu or not' question and the 'do we need to do cache
> > coherency' question are two independent questions which are unrelated to
> > each other. There are four unique but equally valid combinations.
> >
> > Pushing the cache coherency question down into the iommu stuff will mean
> > that we'll constantly be fighting against the 'but this iommu works on x86'
> > shite that we've fought with over block device crap for years. I have
> > no desire to go there.
>
> Ok, I see. I believe we could avoid having to fight with the people that
> only care about coherent architectures if we just have two separate
> implementations of dma_map_ops in the iommu code, one for coherent
> and one for noncoherent DMA. Any architecture that only needs one
> of them would then only enable the Kconfig options for that implementation
> and not care about the other one.

But then we have to invent yet another whole new API to deal with the
cache coherency issues - which makes for more documentation, and eventually
more abuse because it won't quite do what architectures want it to do,
etc.

> Yes, that sounds definitely possible. I guess it could be as simple
> as having a flag somewhere in struct device if we want to make it
> architecture independent.

I was referring to a flag in the dma_ops to say whether the DMA ops
implementation requires DMA cache coherency. In the case of swiotlb,
performing full DMA cache coherency is a pure waste of CPU cycles -
and probably makes DMA much more expensive than merely switching back
to using PIO.

I'm really not interested in producing "generic" interfaces which end up
throwing the baby out with the bath water when we already have a better
implementation in place - even if the hardware sucks. That's not
forward progress as far as I'm concerned.

> As for making the default being to do cache handling, I'm not completely
> sure how that would work on architectures where most devices are coherent.
> If I understood the DRM people correctly, some x86 machine have noncoherent
> DMA in their GPUs while everything else is coherent.

Well, it sounds like struct device needs a flag to indicate whether it is
coherent or not - but exactly how this gets set seems to be architecture
dependent. I don't see bus or driver code being able to make the necessary
decisions - eg, tulip driver on x86 would be coherent, but tulip driver on
ARM would be non-coherent.

Nevertheless, doing it on a per-device basis is definitely the right
answer.

2011-04-27 20:22:04

by Arnd Bergmann

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011 22:16:05 Russell King - ARM Linux wrote:
> > As for making the default being to do cache handling, I'm not completely
> > sure how that would work on architectures where most devices are coherent.
> > If I understood the DRM people correctly, some x86 machine have noncoherent
> > DMA in their GPUs while everything else is coherent.
>
> Well, it sounds like struct device needs a flag to indicate whether it is
> coherent or not - but exactly how this gets set seems to be architecture
> dependent. I don't see bus or driver code being able to make the necessary
> decisions - eg, tulip driver on x86 would be coherent, but tulip driver on
> ARM would be non-coherent.
>
> Nevertheless, doing it on a per-device basis is definitely the right
> answer.

The flag would not get set by the driver that uses the device but
the driver that found it, e.g. the PCI bus or the platform code,
which should know about these things and also install the appropriate
iommu or mapping operations.

Arnd

2011-04-27 20:26:15

by Russell King - ARM Linux

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 10:21:48PM +0200, Arnd Bergmann wrote:
> On Wednesday 27 April 2011 22:16:05 Russell King - ARM Linux wrote:
> > > As for making the default being to do cache handling, I'm not completely
> > > sure how that would work on architectures where most devices are coherent.
> > > If I understood the DRM people correctly, some x86 machine have noncoherent
> > > DMA in their GPUs while everything else is coherent.
> >
> > Well, it sounds like struct device needs a flag to indicate whether it is
> > coherent or not - but exactly how this gets set seems to be architecture
> > dependent. I don't see bus or driver code being able to make the necessary
> > decisions - eg, tulip driver on x86 would be coherent, but tulip driver on
> > ARM would be non-coherent.
> >
> > Nevertheless, doing it on a per-device basis is definitely the right
> > answer.
>
> The flag would not get set by the driver that uses the device but
> the driver that found it, e.g. the PCI bus or the platform code,
> which should know about these things and also install the appropriate
> iommu or mapping operations.

As I said above, I don't think bus code can do it. Take my example
above of a tulip pci device on x86 and a tulip pci device on ARM. Both
use the same PCI code.

Maybe something in asm/pci.h - but that invites having lots of bus
specific header files in asm/.

A better solution imho would be to have an architecture callback for
struct device which gets registered, which can inspect the type of
the device, and set the flag depending on where it appears in the
tree.
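
One shape such a callback could take is a bus notifier; in the sketch below
the archdata field and the coherency test are made up, only the notifier
mechanism itself is existing infrastructure:

static int arm_dma_notify(struct notifier_block *nb, unsigned long action,
                void *data)
{
        struct device *dev = data;

        if (action == BUS_NOTIFY_ADD_DEVICE)
                dev->archdata.dma_coherent =            /* hypothetical flag */
                        arch_device_is_coherent(dev);   /* made-up helper */

        return NOTIFY_OK;
}

static struct notifier_block arm_dma_nb = { .notifier_call = arm_dma_notify };
/* platform code would then do something like:
 *      bus_register_notifier(&platform_bus_type, &arm_dma_nb);
 */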

2011-04-27 20:27:31

by Russell King - ARM Linux

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 08:44:38PM +0300, Anca Emanuel wrote:
> @Russell: contact Linaro. They need you.
> Make sure their effort is the right thing to do.

Sorry, don't understand your message. I don't do brief twitter-like
impossible to comprehend messages.

2011-04-27 20:30:09

by Russell King - ARM Linux

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 11:06:00PM +0900, FUJITA Tomonori wrote:
> On Wed, 27 Apr 2011 10:52:25 +0100
> Catalin Marinas <[email protected]> wrote:
>
> > Anyway, we end up with different DMA ops per device via dev_archdata.
>
> Several architectures already do. What's wrong with the approach for
> arm?

Please read the rest of the thread, where I've already explained the
issue.

2011-04-27 20:48:20

by Arnd Bergmann

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011 22:26:03 Russell King - ARM Linux wrote:
> Maybe something in asm/pci.h - but that invites having lots of bus
> specific header files in asm/.
>
> A better solution imho would be to have an architecture callback for
> struct device which gets registered, which can inspect the type of
> the device, and set the flag depending on where it appears in the
> tree.

Ah, I was under the assumption that there was already a callback
for this. We have a dma_set_coherent_mask() implementation in
some pci hosts (ixp4xx and it8152), but that's not a proper
callback that can be overridden per host and it does not
actually do what we were talking about here. I guess the callback
should live in struct hw_pci in case of ARM, and set a new field
in struct device_dma_parameters.

Maybe we don't even need a new flag if we just set
device->coherent_dma_mask to zero.

Arnd

2011-04-27 21:36:27

by Benjamin Herrenschmidt

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, 2011-04-21 at 21:29 +0200, Arnd Bergmann wrote:
>
> 7. Extend the dma_map_ops to have a way for mapping a buffer
> from dma_alloc_{non,}coherent into user space. We have not
> discussed that yet, but after thinking this for some time, I
> believe this would be the right approach to map buffers into
> user space from code that doesn't care about the underlying
> hardware.

Yes. There is a dma_mmap_coherent() call that's not part of the "Real"
API but is implemented by some archs and used by Alsa (I added support
for it on powerpc recently).

Maybe that should go into the dma ops.
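
For reference, the way such ALSA-style code uses that call looks roughly like
this (sketch; the foo_* driver structure is made up, and dma_mmap_coherent()
only exists on some architectures at this point):

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct foo_buffer *buf = file->private_data;

        return dma_mmap_coherent(buf->dev, vma, buf->cpu_addr, buf->dma_handle,
                                 vma->vm_end - vma->vm_start);
}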

The question remains, if we ever want to do more complex demand-paged
operations, should we also expose a lower level set of functions to get
struct page out of a dma_alloc_coherent() allocation and to get the
pgprot for the user dma mapping ?

> After all these are in place, building anything on top of
> dma_alloc_{non,}coherent should be much easier. The question
> of passing buffers between V4L and DRM is still completely
> unsolved as far as I can tell, but that discussion might become
> more focused if we can agree on the above points and assume
> that it will be done.

My gut feeling is that it should be done by having V4L use DRM buffers
in the first place...

> I expect that I will have to update the list above as people
> point out mistakes in my assumptions.

Cheers,
Ben.

2011-04-27 21:38:14

by Benjamin Herrenschmidt

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > is broken on modern CPUs because it results in both cached and
> > uncached mappings. Rebecca suggested different approaches how to
> > get there.
>
> I also suggested various approaches and produced patches, which I'm slowly
> feeding in. However, I think whatever we do, we'll end up breaking
> something along the line - especially as various places assume that
> dma_alloc_coherent() is ultimately backed by memory with a struct page.

Our implementation for embedded ppc has a similar problem. It currently
uses a pool of memory and does virtual mappings on it, which means there
is no struct page that is easy to get to. How do you do it on your side ?
A fixed-size pool that you take out of the linear mapping ? Or do you
allocate pages in the linear mapping and "unmap" them ? The problem I
have with some embedded ppc's is that the linear map is mapped in chunks
of 256M or so....

> > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > that this is needed, and it currently is not implemented, with
> > an outdated comment explaining why it used to not be possible
> > to do it.
>
> dma_alloc_noncoherent is an entirely pointless API afaics.

I was about to ask what the point is ... (what is the expected
semantic ? Memory that is reachable but not necessarily cache
coherent ?)

> > 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> > both IOMMU and direct mapped DMA on some machines.
> >
> > 4. Implement an architecture independent version of dma_map_ops
> > based on the iommu.h API. As Joerg mentioned, this has been
> > missing for some time, and it would be better to do it once
> > than for each IOMMU separately. This is probably a lot of work.
>
> dma_map_ops design is broken - we can't have the entire DMA API indirected
> through that structure.

Why not ? In my experience, that's the only way we can deal with multiple
types of different iommu's etc... at runtime in a single kernel. We used
to have more or less global function pointers long ago, but we moved
to per-device ops instead to cope with multiple DMA paths within a given
system, and it works fine.

> Whether you have an IOMMU or not is completely
> independent of whether you have to do DMA cache handling. Moreover, with
> dmabounce, having the DMA cache handling in place doesn't make sense.

Right. For now I don't have that problem on ppc, as my iommu archs are
also fully coherent, so it's a bit more tricky that way, but I suppose it
can be handled by having the cache management be library functions based
on flags added to the struct device.

> So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
> for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> work like that.

Well, the dmabounce and cache handling is one implementation that's just
on/off with parameters no ?. iommu is different implementations. So the
ops should be for the iommu backends. The dmabounce & cache handling is
then done by those backends based on flags you stick in struct device
for example.

> I believe the dma_map_ops stuff in asm-generic to be entirely unsuitable
> for ARM.

I don't think it is :-)

Cheers,
Ben.


2011-04-27 21:41:25

by Benjamin Herrenschmidt

Subject: Re: [RFC] ARM DMA mapping TODO, v1


> As I said above, I don't think bus code can do it. Take my example
> above of a tulip pci device on x86 and a tulip pci device on ARM. Both
> use the same PCI code.
>
> Maybe something in asm/pci.h - but that invites having lots of bus
> specific header files in asm/.
>
> A better solution imho would be to have an architecture callback for
> struct device which gets registered, which can inspect the type of
> the device, and set the flag depending on where it appears in the
> tree.

Now -that's gross :-)

For PCI you can have the flag propagate from the PHB down, for busses
without a bus type (platform) then whoever instanciate them (the
platform code) can set that appropriately.

Ben.

2011-04-27 21:45:58

by Benjamin Herrenschmidt

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 2011-04-27 at 10:52 +0100, Catalin Marinas wrote:
>
> It's not broken since we moved to using Normal non-cacheable memory
> for the coherent DMA buffers (as long as you flush the cacheable alias
> before using the buffer, as we already do). The ARM ARM currently says
> unpredictable for such situations but this is being clarified in
> future updates and the Normal non-cacheable vs cacheable aliases can
> be used (given correct cache maintenance before using the buffer).

Don't you have a risk where speculative loads or prefetches might bring
back some stuff into the cache via the cachable mapping ? Is that an
issue ? As long as it's non-dirty and the cachable mapping isn't
otherwise used, I suppose it might be a non-issue, tho I've seen in
powerpc land cases of processors that can checkstop if a subsequent non
cachable access "hits" the stuff that was loaded in the cache.

Cheers,
Ben.

2011-04-28 00:16:17

by Valdis Klētnieks

Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:

> The current version of the ARM ARM says "unpredictable". But this
> general definition of "unpredictable" does not allow it to deadlock
> (hardware) or have security implications. It is however allowed to
> corrupt data.

Not allowed to have security implications, but is allowed to corrupt data.

*boggle* :)

(The problem being, of course, that if the attacker is able to predict/control
what gets corrupted, it can easily end up leveraged into a security implication.)



2011-04-28 06:40:16

by Arnd Bergmann

Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011 23:37:51 Benjamin Herrenschmidt wrote:
> On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:
> > On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > that this is needed, and it currently is not implemented, with
> > > an outdated comment explaining why it used to not be possible
> > > to do it.
> >
> > dma_alloc_noncoherent is an entirely pointless API afaics.
>
> I was about to ask what the point is ... (what is the expected
> semantic ? Memory that is reachable but not necessarily cache
> coherent ?)

Drivers use this when they explicitly want to manage the caching
themselves. I think this is most interesting on big NUMA systems,
where you really want to use fast (local cached) memory and
then flush it explicitly to do dma. Very few drivers use this:

arnd@wuerfel:~/linux-2.6$ git grep dma_alloc_noncoherent drivers/
drivers/base/dma-mapping.c: vaddr = dma_alloc_noncoherent(dev, size, dma_handle, gfp);
drivers/net/au1000_eth.c: aup->vaddr = (u32)dma_alloc_noncoherent(NULL, MAX_BUF_SIZE *
drivers/net/lasi_82596.c:#define DMA_ALLOC dma_alloc_noncoherent
drivers/net/sgiseeq.c: sr = dma_alloc_noncoherent(&pdev->dev, sizeof(*sp->srings),
drivers/scsi/53c700.c: memory = dma_alloc_noncoherent(hostdata->dev, TOTAL_MEM_SIZE,
drivers/scsi/sgiwd93.c: hdata->cpu = dma_alloc_noncoherent(&pdev->dev, HPC_DMA_SIZE,
drivers/tty/serial/mpsc.c: } else if ((pi->dma_region = dma_alloc_noncoherent(pi->port.dev,
drivers/video/au1200fb.c: fbdev->fb_mem = dma_alloc_noncoherent(&dev->dev,
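
For reference, the pattern these drivers follow is roughly the one
DMA-API.txt pairs with this allocator (just a sketch; 'dev' and 'len'
are placeholders and error handling is omitted):

	void *cpu;
	dma_addr_t handle;

	cpu = dma_alloc_noncoherent(dev, len, &handle, GFP_KERNEL);

	/* CPU fills the buffer through its (possibly cached) mapping ... */
	memset(cpu, 0, len);

	/* ... and explicitly pushes it out before the device looks at it. */
	dma_cache_sync(dev, cpu, len, DMA_TO_DEVICE);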

> > So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
> > for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> > work like that.
>
> Well, the dmabounce and cache handling is one implementation that's just
> on/off with parameters no ?. iommu is different implementations. So the
> ops should be for the iommu backends. The dmabounce & cache handling is
> then done by those backends based on flags you stick in struct device
> for example.

Well, what we are currently discussing is to have a common implementation
for IOMMUs that provide the generic iommu_ops that the KVM people introduced.
Once we get there, we only need a single dma_map_ops structure for all
IOMMUs.

Arnd

2011-04-28 06:46:44

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, 28 Apr 2011 08:40:08 +0200
Arnd Bergmann <[email protected]> wrote:

> On Wednesday 27 April 2011 23:37:51 Benjamin Herrenschmidt wrote:
> > On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:
> > > On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> > > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > > that this is needed, and it currently is not implemented, with
> > > > an outdated comment explaining why it used to not be possible
> > > > to do it.
> > >
> > > dma_alloc_noncoherent is an entirely pointless API afaics.
> >
> > I was about to ask what the point is ... (what is the expected
> > semantic ? Memory that is reachable but not necessarily cache
> > coherent ?)
>
> Drivers use this when they explicitly want to manage the caching
> themselves.

Not "want to manage". The API is for drivers that "have to" manage the
cache because of architectures that can't allocate coherent memory.

> I think this is most interesting on big NUMA systems,
> where you really want to use fast (local cached) memory and
> then flush it explicitly to do dma. Very few drivers use this:

2011-04-28 07:24:32

by Cho KyongHo

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 6:45 AM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> Don't you have a risk where speculative loads or prefetches might bring
> back some stuff into the cache via the cachable mapping ? Is that an
> issue ? As long as it's non-dirty and the cachable mapping isn't
> otherwise used, I suppose it might be a non-issue, tho I've seen in
> powerpc land cases of processors that can checkstop if a subsequent non
> cachable access "hits" the stuff that was loaded in the cache.
>
> Cheers,
> Ben.
>
As far as I know, ARM processors do not have the capability to detect a
non-cacheable access hitting data that is already in the cache.

IMHO, speculative prefetch becomes a problem when a coherent buffer
(which is non-cacheable on ARM) is modified by a DMA transaction while
old data is already loaded into the cache via another cacheable mapping
of the buffer, even though the CPU never touched it. We could avoid this
problem if the kernel removed the 'executable' property from the
cacheable mapping, but it is not possible to modify page table entries
in the direct mapping area.

Regards,
KyongHo

2011-04-28 08:27:50

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, 2011-04-28 at 01:15 +0100, [email protected] wrote:
> On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:
>
> > The current version of the ARM ARM says "unpredictable". But this
> > general definition of "unpredictable" does not allow it to deadlock
> > (hardware) or have security implications. It is however allowed to
> > corrupt data.
>
> Not allowed to have security implications, but is allowed to corrupt data.

By security I was referring to TrustZone extensions. IOW, unpredictable
in normal (non-secure) world should not cause data corruption in the
secure world.

--
Catalin

2011-04-28 08:31:21

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, 2011-04-27 at 22:45 +0100, Benjamin Herrenschmidt wrote:
> On Wed, 2011-04-27 at 10:52 +0100, Catalin Marinas wrote:
> > It's not broken since we moved to using Normal non-cacheable memory
> > for the coherent DMA buffers (as long as you flush the cacheable alias
> > before using the buffer, as we already do). The ARM ARM currently says
> > unpredictable for such situations but this is being clarified in
> > future updates and the Normal non-cacheable vs cacheable aliases can
> > be used (given correct cache maintenance before using the buffer).
>
> Don't you have a risk where speculative loads or prefetches might bring
> back some stuff into the cache via the cachable mapping ? Is that an
> issue ? As long as it's non-dirty and the cachable mapping isn't
> otherwise used, I suppose it might be a non-issue, tho I've seen in
> powerpc land cases of processors that can checkstop if a subsequent non
> cachable access "hits" the stuff that was loaded in the cache.

At the CPU cache level, an unexpected cache hit is treated as a miss; IOW
non-cacheable memory accesses ignore the cache lines that may have been
speculatively loaded.

--
Catalin

2011-04-28 09:30:55

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 07:41:07AM +1000, Benjamin Herrenschmidt wrote:
>
> > As I said above, I don't think bus code can do it. Take my example
> > above of a tulip pci device on x86 and a tulip pci device on ARM. Both
> > use the same PCI code.
> >
> > Maybe something in asm/pci.h - but that invites having lots of bus
> > specific header files in asm/.
> >
> > A better solution imho would be to have an architecture callback for
> > struct device which gets registered, which can inspect the type of
> > the device, and set the flag depending on where it appears in the
> > tree.
>
> Now -that's gross :-)
>
> For PCI you can have the flag propagate from the PHB down, for busses
> without a bus type (platform) then whoever instanciate them (the
> platform code) can set that appropriately.

How can you do that when it changes mid-bus hierarchy? I'm thinking
of the situation where the DRM stuff is on a child bus below the
root bus, and the root bus has DMA coherent devices on it but the DRM
stuff doesn't.

Your solution doesn't allow that - and I believe that's what Arnd is
talking about.

2011-04-28 09:37:55

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 07:37:51AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2011-04-27 at 08:35 +0100, Russell King - ARM Linux wrote:
> > On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> > > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > > is broken on modern CPUs because it results in both cached and
> > > uncached mappings. Rebecca suggested different approaches how to
> > > get there.
> >
> > I also suggested various approaches and produced patches, which I'm slowly
> > feeding in. However, I think whatever we do, we'll end up breaking
> > something along the line - especially as various places assume that
> > dma_alloc_coherent() is ultimately backed by memory with a struct page.
>
> Our implementation for embedded ppc has a similar problem. It currently
> uses a pool of memory and does virtual mappings on it which means no
> struct page easy to get to. How do you do on your side ? A fixed size
> pool that you take out of the linear mapping ? Or you allocate pages in
> the linear mapping and "unmap" them ? The problem I have with some
> embedded ppc's is that the linear map is mapped in chunks of 256M or
> so....

We don't - what I was referring to was people taking the DMA cookie and
treating it as a physical address, converting it to a PFN and then doing
pfn_to_page() on that. (Yes, it's been tried.)

There have been some subsystems (eg ALSA) which also tried to use
virt_to_page() on dma_alloc_coherent(), but I think those got fixed to
use our dma_mmap_coherent() stuff when building on ARM.

> > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > that this is needed, and it currently is not implemented, with
> > > an outdated comment explaining why it used to not be possible
> > > to do it.
> >
> > dma_alloc_noncoherent is an entirely pointless API afaics.
>
> I was about to ask what the point is ... (what is the expected
> semantic ? Memory that is reachable but not necessarily cache
> coherent ?)

As far as I can see, dma_alloc_noncoherent() should just be a wrapper
around the normal page allocation function. I don't see it ever needing
to do anything special - and the advantage of just being the normal
page allocation function is that its properties are well known and
architecture independent.
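
i.e. something along these lines (only a sketch, ignoring the dma mask
and error paths, and assuming no IOMMU so the DMA address is simply the
physical address):

void *dma_alloc_noncoherent(struct device *dev, size_t size,
			    dma_addr_t *handle, gfp_t gfp)
{
	struct page *page = alloc_pages(gfp, get_order(size));

	if (!page)
		return NULL;
	*handle = page_to_phys(page);	/* no IOMMU assumed in this sketch */
	return page_address(page);
}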

> > > 3. Convert ARM to use asm-generic/dma-mapping-common.h. We need
> > > both IOMMU and direct mapped DMA on some machines.
> > >
> > > 4. Implement an architecture independent version of dma_map_ops
> > > based on the iommu.h API. As Joerg mentioned, this has been
> > > missing for some time, and it would be better to do it once
> > > than for each IOMMU separately. This is probably a lot of work.
> >
> > dma_map_ops design is broken - we can't have the entire DMA API indirected
> > through that structure.
>
> Why not ? That's the only way we can deal in my experience with multiple
> type of different iommu's etc... at runtime in a single kernel. We used
> to more/less have global function pointers in a long past but we moved
> to per device ops instead to cope with multiple DMA path within a given
> system and it works fine.
>
> > Whether you have an IOMMU or not is completely
> > independent of whether you have to do DMA cache handling. Moreover, with
> > dmabounce, having the DMA cache handling in place doesn't make sense.

Here I've answered your question above.

> Right. For now I don't have that problem on ppc as my iommu archs are
> also fully coherent, so it's a bit more tricky that way but can be
> handled I suppose by having the cache mgmnt be lib functions based on
> flags added to the struct device.

Think about stuffing all the iommu drivers with DMA cache management for
ARM, and think about the maintainability for that when other folk come
along and change the iommu drivers. I've no desire to keep going to fix
them each time someone breaks the DMA cache management because everyone
elses cache is DMA coherent.

Keep that in the arch code, out of the dma_ops and it doesn't have to be
thought about by each and every iommu driver.

> > So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
> > for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> > work like that.
>
> Well, the dmabounce and cache handling is one implementation that's just
> on/off with parameters no ?. iommu is different implementations. So the
> ops should be for the iommu backends. The dmabounce & cache handling is
> then done by those backends based on flags you stick in struct device
> for example.

You've completely missed the point.

2011-04-28 09:42:25

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 07:31:06AM +1000, Benjamin Herrenschmidt wrote:
> The question remains, if we ever want to do more complex demand-paged
> operations, should we also expose a lower level set of functions to get
> struct page out of a dma_alloc_coherent() allocation and to get the
> pgprot for the user dma mapping ?

I don't think so - that places the requirement that dma_alloc_coherent()
must be backed by memory with a set of struct page, which may not always
be the case.

Think about dma_alloc_coherent() with dma_declare_coherent_memory() used
with memory which is not part of system RAM.

2011-04-28 10:27:24

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> 4. Implement an architecture independent version of dma_map_ops
> based on the iommu.h API. As Joerg mentioned, this has been
> missing for some time, and it would be better to do it once
> than for each IOMMU separately. This is probably a lot of work.

Yes, that's been missing for a long time. It will also need some changes to
the IOMMU-API but that should be doable. The best would be to extend the
IOMMU-API so that it also supports GART-like IOMMUs. This way every
dma_ops implementation on all the architectures providing such an IOMMU
could be covered with the architecture independent dma_ops
implementation. This would only leave the low-level hardware access in
the IOMMU drivers.
I think this also requires to change the current semantics of the
existing IOMMU-API implementations. I will prepare a write-up of my
ideas for discussion.

Regards,

Joerg

2011-04-28 10:32:46

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

Hello,

On Thursday, April 28, 2011 11:38 AM Russell King - ARM Linux wrote:

> > > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > > that this is needed, and it currently is not implemented, with
> > > > an outdated comment explaining why it used to not be possible
> > > > to do it.
> > >
> > > dma_alloc_noncoherent is an entirely pointless API afaics.
> >
> > I was about to ask what the point is ... (what is the expected
> > semantic ? Memory that is reachable but not necessarily cache
> > coherent ?)
>
> As far as I can see, dma_alloc_noncoherent() should just be a wrapper
> around the normal page allocation function. I don't see it ever needing
> to do anything special - and the advantage of just being the normal
> page allocation function is that its properties are well known and
> architecture independent.

If there is an IOMMU chip that supports pages larger than 4KiB, then
dma_alloc_noncoherent() might try to allocate such larger pages, which
will result in faster access to the buffer (a lower iommu tlb miss ratio).
For large buffers even 64KiB 'pages' give a significant performance
improvement.
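
For illustration, backing a buffer with 64KiB chunks would look roughly
like this ('domain' and 'iova' come from whatever IOMMU backend is in
use, and the order-based iommu_map() form of the current IOMMU-API is
assumed):

	struct page *chunk = alloc_pages(GFP_KERNEL, get_order(SZ_64K));

	if (chunk)
		iommu_map(domain, iova, page_to_phys(chunk),
			  get_order(SZ_64K), IOMMU_READ | IOMMU_WRITE);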

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

2011-04-28 10:41:46

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Wed, Apr 27, 2011 at 08:35:14AM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:

> dma_map_ops design is broken - we can't have the entire DMA API indirected
> through that structure. Whether you have an IOMMU or not is completely
> independent of whether you have to do DMA cache handling. Moreover, with
> dmabounce, having the DMA cache handling in place doesn't make sense.
>
> So you can't have a dma_map_ops for the cache handling bits, a dma_map_ops
> for IOMMU, and a dma_map_ops for the dmabounce stuff. It just doesn't
> work like that.

Nobody says that the complete feature-set of the dma_ops needs to be
provided through the IOMMU-API. The different APIs are there to solve
different problems:

The IOMMU-API provides low-level access to IOMMU hardware and to map io
addresses to physical addresses (which can be chosen by the caller). The
IOMMU-API does not care about address space layout or cache management.

The DMA-API cares about address management. Every dma_ops implementation
using an IOMMU has an address allocator for io addresses implemented.
the DMA-API also cares about cache-management.

So if we can abstract the different IOMMUs on all architectures in the
IOMMU-API I see no reason why we can't have a common dma_ops
implementation. The dma-buffer ownership management (cpu<->device) can
be put into architectural call-backs so that architectures that need it
just implement them and everything should work.

Or I am too naive to believe that (which is possible because of my
limited ARM knowledge). In this case please correct me :)

Regards,

Joerg

2011-04-28 10:51:48

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 12:32:32PM +0200, Marek Szyprowski wrote:
> On Thursday, April 28, 2011 11:38 AM Russell King - ARM Linux wrote:
> > > > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > > > > that this is needed, and it currently is not implemented, with
> > > > > an outdated comment explaining why it used to not be possible
> > > > > to do it.
> > > >
> > > > dma_alloc_noncoherent is an entirely pointless API afaics.
> > >
> > > I was about to ask what the point is ... (what is the expected
> > > semantic ? Memory that is reachable but not necessarily cache
> > > coherent ?)
> >
> > As far as I can see, dma_alloc_noncoherent() should just be a wrapper
> > around the normal page allocation function. I don't see it ever needing
> > to do anything special - and the advantage of just being the normal
> > page allocation function is that its properties are well known and
> > architecture independent.
>
> If there is IOMMU chip that supports pages larger than 4KiB then
> dma_alloc_noncoherent() might try to allocate such larger pages what will
> result in faster access to the buffer (lower iommu tlb miss ratio).
> For large buffers even 64KiB 'pages' gives a significant performance
> improvement.

The memory allocated by dma_alloc_noncoherent() (and dma_alloc_coherent())
has to be virtually contiguous, and DMA contiguous. It is assumed by all
drivers that:

virt = dma_alloc_foo(size, &dma);

cpuaddr = virt + offset;
dmaaddr = dma + offset;

results in cpuaddr and dmaaddr ultimately referring to the same underlying
memory for 0 <= offset < size.

The standard alloc_pages() also ensures that if you ask for an order-N
page, you'll end up with that allocation being contiguous - so there's
no difference there.

What I'd suggest is that dma_alloc_noncoherent() should be architecture
independent, and should call into whatever iommu support the device has
to set up an appropriate iommu mapping. IOW, I don't see any need for
every architecture to provide its own dma_alloc_noncoherent() allocation
function - or indeed for every iommu implementation to deal with the
allocation issues either.

2011-04-28 11:01:38

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 12:41:44PM +0200, Joerg Roedel wrote:
> So if we can abstract the different IOMMUs on all architectures in the
> IOMMU-API I see no reason why we can't have a common dma_ops
> implementation. The dma-buffer ownership management (cpu<->device) can
> be put into archtectural call-backs so that architectures that need it
> just implement them and everything should work.

That is precisely what I'm arguing for. The DMA cache management is
architecture specific and should stay in the architecture specific code.
The IOMMU level stuff should bolt into that at the architecture specific
level.

So, eg, for ARM:

dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset,
			size_t size, enum dma_data_direction dir)
{
	struct dma_map_ops *ops = get_dma_ops(dev);
	dma_addr_t addr;

	BUG_ON(!valid_dma_direction(dir));
	if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent)
		__dma_page_cpu_to_dev(page, offset, size, dir);
	addr = ops->map_page(dev, page, offset, size, dir, NULL);
	debug_dma_map_page(dev, page, offset, size, dir, addr, false);

	return addr;
}

Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in
ops->flags, but real iommus and the standard no-iommu implementations
would be required to set it to ensure that data is visible in memory
for CPUs which have DMA incoherent caches.

Maybe renaming DMA_MANAGE_CACHE to DMA_DATA_DEVICE_VISIBLE or something
like that would be more explicit as to its function.

dev->dma_cache_coherent serves to cover the case mentioned by the DRM
folk.

2011-04-28 12:12:53

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Catalin Marinas wrote:
> On Thu, 2011-04-28 at 01:15 +0100, [email protected] wrote:
> > On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:
> >
> > > The current version of the ARM ARM says "unpredictable". But this
> > > general definition of "unpredictable" does not allow it to deadlock
> > > (hardware) or have security implications. It is however allowed to
> > > corrupt data.
> >
> > Not allowed to have security implications, but is allowed to corrupt data.
>
> By security I was referring to TrustZone extensions. IOW, unpredictable
> in normal (non-secure) world should not cause data corruption in the
> secure world.

That definition is rather useless for operating systems that don't use
Trustzone then, right?

Arnd

2011-04-28 12:16:03

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Joerg Roedel wrote:
> On Thu, Apr 21, 2011 at 09:29:16PM +0200, Arnd Bergmann wrote:
> > 4. Implement an architecture independent version of dma_map_ops
> > based on the iommu.h API. As Joerg mentioned, this has been
> > missing for some time, and it would be better to do it once
> > than for each IOMMU separately. This is probably a lot of work.
>
> Yes, thats missing for a long time. It will also need some changes to
> the IOMMU-API but that should be doable. The best would be to extend the
> IOMMU-API so that it also supports GART-like IOMMUs. This way every
> dma_ops implementation on all the architectures providing such an IOMMU
> could be covered with the architecture independent dma_ops
> implementation. This would only leave the low-level hardware access in
> the IOMMU drivers.
> I think this also requires to change the current semantics of the
> existing IOMMU-API implementations. I will prepare a write-up of my
> ideas for discussion.

Ok, thanks!

Please include Marek in this, he said he has already started with an
implementation. Any insight from you will certainly help.

Arnd

2011-04-28 12:25:12

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 12:01:29PM +0100, Russell King - ARM Linux wrote:

> dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset,
> size_t size, enum dma_data_direction dir)
> {
> struct dma_map_ops *ops = get_dma_ops(dev);
> dma_addr_t addr;
>
> BUG_ON(!valid_dma_direction(dir));
> if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent)
> __dma_page_cpu_to_dev(page, offset, size, dir);
> addr = ops->map_page(dev, page, offset, size, dir, NULL);
> debug_dma_map_page(dev, page, offset, size, dir, addr, false);
>
> return addr;
> }
>
> Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in
> ops->flags, but real iommus and the standard no-iommu implementations
> would be required to set it to ensure that data is visible in memory
> for CPUs which have DMA incoherent caches.

Do we need flags for that? A flag is necessary if the cache-management
differs between IOMMU implementations on the same platform. If
cache-management is only specific to the platform (or architecture) then
it does make more sense to just call the function without flag checking
and every platform with coherent DMA just implements these as static
inline noops.
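
i.e. the hook could look like this (illustrative only, the name
arch_dma_page_to_device() is made up):

/* coherent platform: compiles away completely */
static inline void arch_dma_page_to_device(struct page *page,
		unsigned long off, size_t size,
		enum dma_data_direction dir) { }

/* a platform like ARM would supply a real body instead, for example one
 * forwarding to its existing __dma_page_cpu_to_dev() helper. */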

Regards,

Joerg

2011-04-28 12:29:14

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> What I'd suggest is that dma_alloc_noncoherent() should be architecture
> independent, and should call into whatever iommu support the device has
> to setup an approprite iommu mapping. IOW, I don't see any need for
> every architecture to provide its own dma_alloc_noncoherent() allocation
> function - or indeed every iommu implementation to deal with the
> allocation issues either.

Almost all architectures today define dma_alloc_noncoherent to
dma_alloc_coherent, which is totally fine on architectures
where cacheable coherent mappings are the default or where
we don't need to flush individual cache lines for dma_sync_*.
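
(In most asm/dma-mapping.h headers that definition is literally just a
macro along the lines of:)

#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
#define dma_free_noncoherent(d, s, v, h)  dma_free_coherent(d, s, v, h)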

The problem with backing either of the two with alloc_pages or
alloc_pages_exact is that you cannot do large allocation when
physical memory is fragmented, even if you have an IOMMU.

IMHO the allocation for both dma_alloc_coherent and
dma_alloc_noncoherent should therefore depend on whether you
have an IOMMU. If you do, you can easily allocate megabytes,
e.g. for use as a frame buffer.

Arnd

2011-04-28 12:36:44

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Catalin Marinas wrote:
> > On Thu, 2011-04-28 at 01:15 +0100, [email protected] wrote:
> > > On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:
> > >
> > > > The current version of the ARM ARM says "unpredictable". But this
> > > > general definition of "unpredictable" does not allow it to deadlock
> > > > (hardware) or have security implications. It is however allowed to
> > > > corrupt data.
> > >
> > > Not allowed to have security implications, but is allowed to corrupt data.
> >
> > By security I was referring to TrustZone extensions. IOW, unpredictable
> > in normal (non-secure) world should not cause data corruption in the
> > secure world.
>
> That definition is rather useless for operating systems that don't use
> Trustzone then, right?

I'm not sure what you're implying. By running on a device with Trustzone
extensions, Linux is using them whether it knows it or not.

Linux on ARM's evaluation boards runs on the secure side of the Trustzone
dividing line. Linux on OMAP SoCs runs on the insecure side of that,
and has to make secure monitor calls to manipulate certain registers
(eg, to enable workarounds for errata etc). As SMC calls are highly
implementation specific, there is and can be no "trustzone" driver.

2011-04-28 12:42:51

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 02:25:09PM +0200, Joerg Roedel wrote:
> On Thu, Apr 28, 2011 at 12:01:29PM +0100, Russell King - ARM Linux wrote:
>
> > dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset,
> > size_t size, enum dma_data_direction dir)
> > {
> > struct dma_map_ops *ops = get_dma_ops(dev);
> > dma_addr_t addr;
> >
> > BUG_ON(!valid_dma_direction(dir));
> > if (ops->flags & DMA_MANAGE_CACHE || !dev->dma_cache_coherent)
> > __dma_page_cpu_to_dev(page, offset, size, dir);
> > addr = ops->map_page(dev, page, offset, size, dir, NULL);
> > debug_dma_map_page(dev, page, offset, size, dir, addr, false);
> >
> > return addr;
> > }
> >
> > Things like swiotlb and dmabounce would not set DMA_MANAGE_CACHE in
> > ops->flags, but real iommus and the standard no-iommu implementations
> > would be required to set it to ensure that data is visible in memory
> > for CPUs which have DMA incoherent caches.
>
> Do we need flags for that? A flag is necessary if the cache-management
> differs between IOMMU implementations on the same platform. If
> cache-management is only specific to the platform (or architecture) then
> it does make more sense to just call the function without flag checking
> and every platform with coherent DMA just implements these as static
> inline noops.

Sigh. You're not seeing the point.

There is _no_ point doing the cache management _if_ we're using something
like dmabounce or swiotlb, as we'll be using memcpy() at some point with
the buffer. Moreover, dmabounce or swiotlb may have to do its own cache
management _after_ that memcpy() to ensure that the page cache requirements
are met.

Doing DMA cache management for dmabounce or swiotlb will result in
unnecessary overhead - and as we can see from the MMC discussions,
it has a _significant_ performance impact.

Think about it. If you're using dmabounce, but still do the cache
management:

1. you flush the data out of the CPU cache back to memory.
2. you allocate new memory using dma_alloc_coherent() for the DMA buffer
which is accessible to the device.
3. you memcpy() the data out of the buffer you just flushed into the
DMA buffer - this re-fills the cache, evicting entries which may
otherwise be hot due to the cache fill policy.

Step 1 is entirely unnecessary and is just a complete and utter waste of
CPU resources.

2011-04-28 12:48:32

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:
> > On Thursday 28 April 2011, Catalin Marinas wrote:
> > > On Thu, 2011-04-28 at 01:15 +0100, [email protected] wrote:
> > > > On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:
> > > >
> > > > > The current version of the ARM ARM says "unpredictable". But this
> > > > > general definition of "unpredictable" does not allow it to deadlock
> > > > > (hardware) or have security implications. It is however allowed to
> > > > > corrupt data.
> > > >
> > > > Not allowed to have security implications, but is allowed to corrupt data.
> > >
> > > By security I was referring to TrustZone extensions. IOW, unpredictable
> > > in normal (non-secure) world should not cause data corruption in the
> > > secure world.
> >
> > That definition is rather useless for operating systems that don't use
> > Trustzone then, right?
>
> I'm not sure what you're implying. By running on a device with Trustzone
> extensions, Linux is using them whether it knows it or not.
>
> Linux on ARMs evaluation boards runs on the secure size of the Trustzone
> dividing line. Linux on OMAP SoCs runs on the insecure size of that,
> and has to make secure monitor calls to manipulate certain registers
> (eg, to enable workarounds for errata etc). As SMC calls are highly
> implementation specific, there is and can be no "trustzone" driver.

My point was that when Linux runs in the secure partition (ok, I didn't
know we did that, but still), anything that corrupts Linux data has
security implications. If Linux runs outside of Trustzone, you can
also corrupt Linux and the security is completely pointless because
after Linux is gone, you have nothing left that drives your devices
or runs user processes.

The only case where TrustZone would help is when you have an operating
system running in the secure partition as some sort of microkernel
(a.k.a. hypervisor) and have the "unpredictable" behavior isolated
in nonessential parts of the system.

Arnd

2011-04-28 12:59:26

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 01:42:42PM +0100, Russell King - ARM Linux wrote:

> Sigh. You're not seeing the point.
>
> There is _no_ point doing the cache management _if_ we're using something
> like dmabounce or swiotlb, as we'll be using memcpy() at some point with
> the buffer. Moreover, dmabounce or swiotlb may have to do its own cache
> management _after_ that memcpy() to ensure that the page cache requirements
> are met.

Well, I was talking about a generic dma_ops implementation based on the
iommu-api so that every system that has iommu hardware can use a common
code-set.
If you have to dma-bounce you don't have iommu hardware and thus you
don't use this common implementation of dma_ops (but probably the
swiotlb implementation which is already mostly generic).

> Doing DMA cache management for dmabounce or swiotlb will result in
> unnecessary overhead - and as we can see from the MMC discussions,
> it has a _significant_ performance impact.

Yeah, I see that from your explanation below. But as I said, swiotlb
backend is not a target use-case for a common iommu-api-bound dma_ops
implementation.

> Think about it. If you're using dmabounce, but still do the cache
> management:
>
> 1. you flush the data out of the CPU cache back to memory.
> 2. you allocate new memory using dma_alloc_coherent() for the DMA buffer
> which is accessible to the device.
> 3. you memcpy() the data out of the buffer you just flushed into the
> DMA buffer - this re-fills the cache, evicting entries which may
> otherwise be hot due to the cache fill policy.
>
> Step 1 is entirely unnecessary and is just a complete and utter waste of
> CPU resources.

Thanks for the explanation.

Regards,

Joerg

2011-04-28 13:02:23

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> >
> > Do we need flags for that? A flag is necessary if the cache-management
> > differs between IOMMU implementations on the same platform. If
> > cache-management is only specific to the platform (or architecture) then
> > it does make more sense to just call the function without flag checking
> > and every platform with coherent DMA just implements these as static
> > inline noops.
>
> Sigh. You're not seeing the point.
>
> There is no point doing the cache management if we're using something
> like dmabounce or swiotlb, as we'll be using memcpy() at some point with
> the buffer. Moreover, dmabounce or swiotlb may have to do its own cache
> management after that memcpy() to ensure that the page cache requirements
> are met.

I think the misunderstanding is that you are saying we need the flag
in dma_map_ops because you prefer to keep the cache management outside
of the individual dma_map_ops implementations.

What I guess Jörg is thinking of is to have the generic IOMMU version
of dma_map_ops call into the architecture specific code to manage the
caches on architectures that need it. That implementation would of
course not require the flag in dma_map_ops because the architecture
specific callback would use other ways (hardcoded for an architecture,
or looking at the individual device) to determine if this is ever needed.
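
Concretely, such a generic map_page could look something like this (just
a sketch; dev_to_iommu_domain(), iommu_alloc_iova() and
arch_dma_page_to_device() are invented names, and only the single-page
case is shown):

static dma_addr_t generic_iommu_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir, struct dma_attrs *attrs)
{
	struct iommu_domain *domain = dev_to_iommu_domain(dev);
	dma_addr_t iova = iommu_alloc_iova(domain, PAGE_SIZE);

	/* no-op on coherent architectures, cache maintenance on ARM etc. */
	arch_dma_page_to_device(page, offset, size, dir);

	iommu_map(domain, iova, page_to_phys(page), 0,
		  IOMMU_READ | IOMMU_WRITE);
	return iova + offset;
}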

That is also what I had in mind earlier, but you argued against it
on the base that putting the logic into the common code would lead
to a higher risk of people accidentally breaking it when they only
care about coherent architectures.

Arnd

2011-04-28 13:16:11

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 02:28:56PM +0200, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> > What I'd suggest is that dma_alloc_noncoherent() should be architecture
> > independent, and should call into whatever iommu support the device has
> > to setup an approprite iommu mapping. IOW, I don't see any need for
> > every architecture to provide its own dma_alloc_noncoherent() allocation
> > function - or indeed every iommu implementation to deal with the
> > allocation issues either.
>
> Almost all architectures today define dma_alloc_noncoherent to
> dma_alloc_coherent, which is totally fine on architectures
> where cacheable coherent mappings are the default or where
> we don't need to flush individual cache lines for dma_sync_*.

However, dma_alloc_coherent() memory can't be used with the dma_sync_*
API as its return address (unlike on other architectures) is not in the
kernel direct-mapped memory range.

The only thing valid for dma_sync_* are buffers which have been passed
to the dma_map_* APIs.

Instead, I think what you're referring to is dma_cache_sync(), which is
the API to be used with dma_alloc_noncoherent(), which we don't
implement.

As we have problems with some SMP implementations, and the noncoherent
API doesn't have the idea of buffer ownership, it's rather hard to deal
with the DMA cache implications with the existing API, especially with
the issues of speculative prefetching. The current usage (looking at
drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the
dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA
actually happening.

So all in all, I think the noncoherent API is broken as currently
designed - and until we have devices on ARM which use it, I don't see
much point into trying to fix the current thing especially as we'd be
unable to test.
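
Spelled out, the problematic ordering looks like this (a generic sketch,
not the actual 53c700 code; start_dma() and the completion are stand-ins):

	dma_cache_sync(dev, buf, len, DMA_FROM_DEVICE);	/* invalidated now */
	start_dma(dev, handle, len);		/* device writes much later */
	/*
	 * On a CPU that prefetches speculatively, clean lines covering
	 * 'buf' can be pulled back into the cache in this window.
	 */
	wait_for_completion(&done);
	val = *(u32 *)buf;	/* may observe the stale, prefetched line */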

2011-04-28 13:19:38

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 03:02:16PM +0200, Arnd Bergmann wrote:
> I think the misunderstanding is that you are saying we need the flag
> in dma_map_ops because you prefer to keep the cache management outside
> of the individual dma_map_ops implementations.
>
> What I guess Jörg is thinking of is to have the generic IOMMU version
> of dma_map_ops call into the architecture specific code to manage the
> caches on architectures that need it. That implementation would of
> course not require the flag in dma_map_ops because the architecture
> specific callback would use other ways (hardcoded for an architecture,
> or looking at the individual device) to determine if this is ever needed.
>
> That is also what I had in mind earlier, but you argued against it
> on the base that putting the logic into the common code would lead
> to a higher risk of people accidentally breaking it when they only
> care about coherent architectures.

You still need this same cache handling code even when you don't have
an iommu. I don't see the point in having a dma_ops level of indirection
followed by a separate iommu_ops level of indirection - it seems to me to
be a waste of code and CPU time, and I don't see why it's even necessary
when there's a much simpler way to deal with it (as I illustrated).

2011-04-28 13:56:24

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 02:19:28PM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 28, 2011 at 03:02:16PM +0200, Arnd Bergmann wrote:

> You still need this same cache handling code even when you don't have
> an iommu.

You can reference the same code from different places.

> I don't see the point in having a dma_ops level of indirection
> followed by a separate iommu_ops level of indirection - it seems to me
> to be a waste of code and CPU time, and I don't see why its even
> necessary when there's a much simpler way to deal with it (as I
> illustrated).

There is no waste of code, just the opposite. Most of the dma_ops
implementations that use an IOMMU today have a lot of similarities in
their code. All this code (on x86, alpha, sparc, ia64, ...) can
be unified to a generic solution that fits all (by abstracting the
differences between iommus into the iommu-api). So the current situation
is a much bigger code waste than having this unified. The ARM platforms
supporting iommu hardware will benefit from this as well. It simply
doesn't make sense to have one dma_ops implementation for each iommu
hardware around.

Regards,

Joerg

2011-04-28 14:30:11

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> On Thu, Apr 28, 2011 at 02:28:56PM +0200, Arnd Bergmann wrote:
> > On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> > > What I'd suggest is that dma_alloc_noncoherent() should be architecture
> > > independent, and should call into whatever iommu support the device has
> > > to setup an approprite iommu mapping. IOW, I don't see any need for
> > > every architecture to provide its own dma_alloc_noncoherent() allocation
> > > function - or indeed every iommu implementation to deal with the
> > > allocation issues either.
> >
> > Almost all architectures today define dma_alloc_noncoherent to
> > dma_alloc_coherent, which is totally fine on architectures
> > where cacheable coherent mappings are the default or where
> > we don't need to flush individual cache lines for dma_sync_*.
>
> However, dma_alloc_coherent() memory can't be used with the dma_sync_*
> API as its return address (unlike other architectures) is not in the
> kernel direct mapped memory range.

Right, because ARM does not fit in the two categories I listed
above: the regular DMA is not cache coherent and we need to flush
the cache lines for the data we want to access in dma_sync_*.

> The only thing valid for dma_sync_* are buffers which have been passed
> to the dma_map_* APIs.
>
> Instead, I think what you're referring to is dma_cache_sync(), which is
> the API to be used with dma_alloc_noncoherent(), which we don't
> implement.
>
> As we have problems with some SMP implementations, and the noncoherent
> API doesn't have the idea of buffer ownership, it's rather hard to deal
> with the DMA cache implications with the existing API, especially with
> the issues of speculative prefetching. The current usage (looking at
> drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the
> dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA
> actually happening.
>
> So all in all, I think the noncoherent API is broken as currently
> designed - and until we have devices on ARM which use it, I don't see
> much point into trying to fix the current thing especially as we'd be
> unable to test.

I agree that dma_cache_sync() is totally unusable on ARM, I thought we
had killed that off and replaced it with dma_sync_*. Unfortunately,
I was mistaken there: all drivers that use dma_alloc_noncoherent
either use dma_cache_sync() or they do something that is more broken,
but they don't do dma_sync_*.

Given that people still want to have an interface that does what I
thought this one did, I guess we have two options:

* Kill off dma_cache_sync and replace it with calls to dma_sync_*
so we can start using dma_alloc_noncoherent on ARM

* Introduce a new interface

Arnd

2011-04-28 14:30:30

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 03:56:21PM +0200, Joerg Roedel wrote:
> There is no waste of code, just the opposite. Most of the dma_ops
> implementations that use an IOMMU today have a lot of similiarities in
> their code. All this code (on x86, alpha, sparc, ia64, ...) can
> be unified to a generic solution that fits all (by abstracting the
> differences between iommus into the iommu-api). So the current situation
> is a much bigger code waste than having this unified. The ARM platforms
> supporting iommu hardware will benefit from this as well. It simply
> doesn't make sense to have one dma_ops implementation for each iommu
> hardware around.

I'll defer until there's patches available then - I don't think there's
much value continuing to discuss this until that time.

2011-04-28 14:35:13

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:
> Given that people still want to have an interface that does what I
> though this one did, I guess we have two options:
>
> * Kill off dma_cache_sync and replace it with calls to dma_sync_*
> so we can start using dma_alloc_noncoherent on ARM

I don't think this is an option as dma_sync_*() is part of the streaming
DMA mapping API (dma_map_*) which participates in the idea of buffer
ownership, which the noncoherent API doesn't appear to.

2011-04-28 14:38:44

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

I'm busy at a conference so I've not read the whole thread yet..

On Thu, 28 Apr 2011 16:29:52 +0200
Arnd Bergmann <[email protected]> wrote:

> I was mistaken there: all drivers that use dma_alloc_noncoherent
> either use dma_cache_sync() or they do something that is more broken,
> but they don't do dma_sync_*.

As the DMA-API.txt says, dma_alloc_noncoherent should be used with
dma_cache_sync(). You shouldn't use the dma_sync* API with memory
returned by dma_alloc_noncoherent().

2011-04-28 14:40:18

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:
> > Given that people still want to have an interface that does what I
> > though this one did, I guess we have two options:
> >
> > * Kill off dma_cache_sync and replace it with calls to dma_sync_*
> > so we can start using dma_alloc_noncoherent on ARM
>
> I don't think this is an option as dma_sync_*() is part of the streaming
> DMA mapping API (dma_map_*) which participates in the idea of buffer
> ownership, which the noncoherent API doesn't appear to.

I thought the problem was in fact that the noncoherent API cannot be
implemented on architectures like ARM specifically because there is
no concept of buffer ownership. The obvious way to fix that would
be to redefine the API. What am I missing?

Arnd

2011-04-28 14:58:41

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 04:39:59PM +0200, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Russell King - ARM Linux wrote:
> > On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:
> > > Given that people still want to have an interface that does what I
> > > though this one did, I guess we have two options:
> > >
> > > * Kill off dma_cache_sync and replace it with calls to dma_sync_*
> > > so we can start using dma_alloc_noncoherent on ARM
> >
> > I don't think this is an option as dma_sync_*() is part of the streaming
> > DMA mapping API (dma_map_*) which participates in the idea of buffer
> > ownership, which the noncoherent API doesn't appear to.
>
> I thought the problem was in fact that the noncoherent API cannot be
> implemented on architectures like ARM specifically because there is
> no concept of buffer ownership. The obvious way to fix that would
> be to redefine the API. What am I missing?

You are partially correct. With the streaming interface, we're fairly
strict with the buffer ownership stuff, as the most effective way to
implement it across all our CPUs is to deal with the mapping, sync and
unmapping in terms of buffers being passed from CPU control to DMA device
control and back again.

With the noncoherent interface, there is less of a buffer ownership idea.
For instance, to read from a noncoherent buffer, the following is required
(in order, I'm not considering the effects of weakly ordered stuff):

/* dma happens, signalled complete */
dma_cache_invalidate(buffer, size);
/* cpu can now see up to date data */
message = *buffer;

Unlike the streaming API, we don't need to hand the buffer back to the
device before the CPU can repeat the above code sequence.

If we want to write to a noncoherent buffer, then we need:

*buffer = value;
dma_cache_writeback(buffer, size);
/* dma can only now see new value */

and again, the same thing applies.

There is an additional problem lurking in amongst this though - a buffer
which is both read and written by the CPU has to be extremely careful of
cache writebacks - this for instance would not be legal:

*buffer = value;
...
/* dma from device */
dma_cache_invalidate(buffer, size);
message = *buffer;

as it is not predictable whether we'll see 'value' or the DMA data - that
depends on the relative ordering of the DMA writing to RAM vs the cache
eviction of the CPU write.

So, there is a kind of buffer ownership here:

/* cpu owns */
dma_cache_writeback(buffer, size);
/* dma owns */
dma_cache_invalidate(buffer, size);
/* cpu owns */

but as shown above it doesn't need to be as strict as the streaming API.

Also note that there's a problem lurking here with DMA cache line size:

| int
| dma_get_cache_alignment(void)
|
| Returns the processor cache alignment. This is the absolute minimum
| alignment *and* width that you must observe when either mapping
| memory or doing partial flushes.
|
| Notes: This API may return a number *larger* than the actual cache
| line, but it will guarantee that one or more cache lines fit exactly
| into the width returned by this call. It will also always be a power
| of two for easy alignment.

$ grep -L dma_get_cache_alignment $(grep dma_alloc_noncoherent drivers/ -lr)
drivers/base/dma-mapping.c
drivers/scsi/sgiwd93.c
drivers/scsi/53c700.c
drivers/net/au1000_eth.c
drivers/net/sgiseeq.c
drivers/net/lasi_82596.c
drivers/video/au1200fb.c

so we have a bunch of drivers which presumably don't take any notice of
the DMA cache line size, which may be very important. 53c700 for instance
aligns its buffers using L1_CACHE_ALIGN(), which may be smaller than
what's actually required...
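
What those drivers would need instead is something like this (sketch
only; 'struct my_desc' and 'nslots' are made up):

	int align = dma_get_cache_alignment();
	size_t slot = ALIGN(sizeof(struct my_desc), align);

	/* size and align each DMA-able element to the reported value,
	 * rather than to L1_CACHE_BYTES via L1_CACHE_ALIGN() */
	buf = dma_alloc_noncoherent(dev, nslots * slot, &handle, GFP_KERNEL);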

2011-04-28 19:37:04

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 10:34 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Thu, Apr 28, 2011 at 04:29:52PM +0200, Arnd Bergmann wrote:
>> Given that people still want to have an interface that does what I
>> though this one did, I guess we have two options:
>>
>> * Kill off dma_cache_sync and replace it with calls to dma_sync_*
>>   so we can start using dma_alloc_noncoherent on ARM
>
> I don't think this is an option as dma_sync_*() is part of the streaming
> DMA mapping API (dma_map_*) which participates in the idea of buffer
> ownership, which the noncoherent API doesn't appear to.

Sorry to jump in like that, but to me it seems that this whole
discussion is going toward having the cache-attribute decision made
inside the dma_* functions, so that a driver asking for uncached memory
might get cached memory if an IOMMU or some other component provides
cache coherency.

As Jesse pointed out already, for performance reasons it's a lot better
if you let the driver decide, even if you have an iommu capable of
handling coherency for you. My understanding is that each time coherency
is asked for, it triggers bus activity of some kind (I think snoop is
the term used for pci), and this traffic can slow down both the cpu and
the device. For graphics drivers we have a lot of write-once,
use-(once-or-more) buffers, and it makes a lot of sense to have those
buffers allocated from uncached memory so we can tell the device (in
the case of a drm driver) that there is no need to trigger snoop
activity for coherency. So I believe the decision should ultimately be
on the driver side.

Jesse also pointed out space exhaustion inside the iommu, and I believe
this should also be considered. This is why I believe the dma_* api is
not well suited. In DRM/TTM we use pci_dma_mapping* and we also play
with the page set_page*_uc|wc|wb helpers.

So I believe a better API might look like:

	struct dma_alloc_unit {
		bool contiguous;
		uint dmamask;
	};

	struct dma_buffer {
		struct dma_alloc_unit *unit;
		/* ... */
	};

'contiguous' says whether this dma unit needs a contiguous allocation or
not. If it needs a contiguous allocation and there is an iommu, then the
allocator might allocate non-contiguous pages/memory and later program
the iommu to make things look contiguous to the device. If
contiguous==false then the allocator might allocate one page at a time,
but should rather allocate bunches of contiguous pages to help minimize
tlb misses if the device allows such things (maybe adding a flag here
might make sense).

 - dma_buffer dma_alloc_(uc|wc|wb)(dma_alloc_unit, size): allocate memory
   according to the constraints defined by dma_alloc_unit
 - dma_buffer_update(dma_buffer, offset, size): lets dmabounce & swiotlb
   know what needs to be updated
 - dma_bus_map(dma_buffer): map the buffer onto the bus; for dmabounce
   that means copying to the bounce buffer, for an iommu that means
   binding it, and with no iommu it does nothing
 - dma_bus_unmap(dma_buffer): the implementation might not necessarily
   unmap the buffer if there is plenty of room in the iommu

So usage would look like:

	mydma_buffer = dma_alloc_uc(N);
	cpuptr = dma_cpu_ptr(mydma_buffer);
	// write to the buffer
	// tell dma which data needs to be updated; depending on the platform
	// that means iommu work, dmabounce copying, cache flushing ...
	dma_buffer_update(mydma_buffer, offset, size);
	dma_bus_map(mydma_buffer);
	// let the device use the buffer
	...
	// the buffer isn't used anymore by the device
	dma_bus_unmap(mydma_buffer);

It hides things like the iommu or dmabounce from the device driver but
still allows the device driver to ask for the most optimal way. A
platform may decide not to support dma_alloc_uc|wc (ie non-coherent) if
it has an iommu that can handle coherency, or some other way to handle
it like flushing. But if a platform wants better performance it should
try to provide non-coherent allocations (through highmem or by changing
kernel mapping properties ...).

Maybe i am completely missing the point.

Cheers,
Jerome

2011-04-28 21:07:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, 2011-04-28 at 10:30 +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 28, 2011 at 07:41:07AM +1000, Benjamin Herrenschmidt wrote:
> >
> > > As I said above, I don't think bus code can do it. Take my example
> > > above of a tulip pci device on x86 and a tulip pci device on ARM. Both
> > > use the same PCI code.
> > >
> > > Maybe something in asm/pci.h - but that invites having lots of bus
> > > specific header files in asm/.
> > >
> > > A better solution imho would be to have an architecture callback for
> > > struct device which gets registered, which can inspect the type of
> > > the device, and set the flag depending on where it appears in the
> > > tree.
> >
> > Now -that's gross :-)
> >
> > For PCI you can have the flag propagate from the PHB down, for busses
> > without a bus type (platform) then whoever instanciate them (the
> > platform code) can set that appropriately.
>
> How can you do that when it changes mid-bus heirarchy? I'm thinking
> of the situation where the DRM stuff is on a child bus below the
> root bus, and the root bus has DMA coherent devices on it but the DRM
> stuff doesn't.

But that's not PCI right ? IE. with PCI, coherency is a property of the
PHB...

> Your solution doesn't allow that - and I believe that's what Arnd is
> talking about.

Well, for the rest I'm thinking just bolt it into the platform until you
can put the property in the DT :-)

Cheers,
Ben.

2011-04-29 00:26:51

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1


> However, dma_alloc_coherent() memory can't be used with the dma_sync_*
> API as its return address (unlike other architectures) is not in the
> kernel direct mapped memory range.

Well, on non-coherent architectures, dma_sync_* are cache flushes, I
don't see the point of doing those on a non-cachable mapping anyways.

> The only thing valid for dma_sync_* are buffers which have been passed
> to the dma_map_* APIs.

Right, at least that's our expectation on powerpc as well.

> Instead, I think what you're referring to is dma_cache_sync(), which is
> the API to be used with dma_alloc_noncoherent(), which we don't
> implement.

Too many confusing APIs....

Ben.

> As we have problems with some SMP implementations, and the noncoherent
> API doesn't have the idea of buffer ownership, it's rather hard to deal
> with the DMA cache implications with the existing API, especially with
> the issues of speculative prefetching. The current usage (looking at
> drivers/scsi/53c700.c) doesn't cater for speculative prefetch as the
> dma_cache_sync(,,,DMA_FROM_DEVICE) is done well in advance of the DMA
> actually happening.
>
> So all in all, I think the noncoherent API is broken as currently
> designed - and until we have devices on ARM which use it, I don't see
> much point into trying to fix the current thing especially as we'd be
> unable to test.

2011-04-29 00:29:56

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, 2011-04-28 at 15:37 -0400, Jerome Glisse wrote:
> Jesse also pointed out space exhaustion inside the iommu and i believe
> this should also be considered. This is why i believe the dma_* api is
> not well suited. In DRM/TTM we use pci_dma_mapping* and we also play
> with with page set_page*_uc|wc|wb.

Which are yet another set of completely x86-centric APIs that have not
been thought out in the context of other architectures and are probably
mostly unimplementable on half of them :-)

Cheers,
Ben.

2011-04-29 05:51:15

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On 04/29/2011 02:29 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-04-28 at 15:37 -0400, Jerome Glisse wrote:
>
>> Jesse also pointed out space exhaustion inside the iommu and i believe
>> this should also be considered. This is why i believe the dma_* api is
>> not well suited. In DRM/TTM we use pci_dma_mapping* and we also play
>> with with page set_page*_uc|wc|wb.
>>
> Which are yet another set of completely x86-centric APIs that have not
> been thought in the context of other architectures and are probably
> mostly unimplementables on half of them :-)
>
> Cheers,
> Ben.
>
>
>

I've been doing some thinking over the years on how we could extend that
functionality to other architectures. The reason we need those is
because some x86 processors (early AMDs and, I think, the VIA C3) dislike
multiple mappings of the same pages with conflicting caching attributes.

What we really want to be able to do is to unmap pages from the linear
kernel map, to avoid having to transition the linear kernel map every
time we change other mappings.

The reason we need to do this in the first place is that AGP and modern
GPUs have a fast mode where snooping is turned off.

However, we should be able to construct a completely generic api around
these operations, and for architectures that don't support them we need
to determine

a) Whether we want to support them anyway (IIRC the problem with PPC is
that the linear kernel map has huge tlb entries that are very
inefficient to break up?)

b) Whether they are needed at all on the particular architecture. The
Intel x86 spec is, (according to AMD), supposed to forbid conflicting
caching attributes, but the Intel graphics guys use them for GEM. PPC
appears not to need it.

c) If neither of the above applies, we might be able to either use
explicit cache flushes (which will require a TTM cache sync API), or
require the device to use snooping mode. The architecture may also
perhaps have a pool of write-combined pages that we can use. This should
be indicated by defines in the api header.

/Thomas
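
As an illustration of the kind of interface described above, here is a
minimal sketch loosely modeled on the x86 set_pages_array_uc()/
set_pages_array_wc()/set_pages_array_wb() helpers that TTM already uses
there; the Kconfig symbol and the TTM_HAS_PAGE_CACHING_CONTROL define
are invented names for the example, not existing interfaces:

#include <linux/errno.h>
#include <linux/mm_types.h>

#ifdef CONFIG_ARCH_HAS_PAGE_CACHING_CONTROL   /* hypothetical Kconfig symbol */
/* Provided by the architecture; today only x86 has these helpers. */
int set_pages_array_wc(struct page **pages, int nr_pages);
int set_pages_array_uc(struct page **pages, int nr_pages);
int set_pages_array_wb(struct page **pages, int nr_pages);
#define TTM_HAS_PAGE_CACHING_CONTROL 1        /* hypothetical capability define */
#endif

/* Sketch of a TTM-side fallback for architectures that cannot change
 * the attributes of pages in the linear map. */
static int ttm_pages_make_uncached(struct page **pages, int nr_pages)
{
#ifdef TTM_HAS_PAGE_CACHING_CONTROL
        return set_pages_array_uc(pages, nr_pages);
#else
        /* Cannot change attributes: the caller has to fall back to
         * snooped (coherent) DMA or to an explicit cache-sync API. */
        return -ENOTSUPP;
#endif
}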





2011-04-29 07:35:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1


> I've been doing some thinking over the years on how we could extend that
> functionality to other architectures. The reason we need those is
> because some x86 processors (early AMDs and, I think VIA c3) dislike
> multiple mappings of the same pages with conflicting caching attributes.
>
> What we really want to be able to do is to unmap pages from the linear
> kernel map, to avoid having to transition the linear kernel map every
> time we change other mappings.
>
> The reason we need to do this in the first place is that AGP and modern
> GPUs has a fast mode where snooping is turned off.

Right. Unfortunately, unmapping pages from the linear mapping is
precisely what I cannot give you on powerpc :-(

This is due to our tendency to map it using the largest page size
available. That translates to things like:

- On hash based ppc64, I use 16M pages. I can't "break them up" due to
the limitation of the processor of having a single page size per segment
(and we use 1T segments nowadays). I could break the whole thing down to
4K but that would very seriously affect system performance.

- On embedded, I map it using 1G pages. I suppose I could break it up
since it's SW loaded but here too, system performance would suffer. In
addition, we rely on ppc32 embedded to have the first 768M of the linear
mapping and on ppc64 embedded, the first 1G, mapped using bolted TLB
entries, which we can really only do using very large entries
(respectively 256M and 1G) that can't be broken up.

So you need to make sure whatever APIs you come up with will work on
architectures where memory -has- to be cachable and coherent and you
cannot play with the linear mapping. But that won't help with our
non-coherent embedded systems :-(

Maybe with future chips we'll have more flexibility here but not at this
point.

> However, we should be able to construct a completely generic api around
> these operations, and for architectures that don't support them we need
> to determine
>
> a) Whether we want to support them anyway (IIRC the problem with PPC is
> that the linear kernel map has huge tlb entries that are very
> inefficient to break up?)

Depends on the PPC variant / type of MMU. Inefficiency is part of the
problem. The need to have things bolted is another part. 4xx/BookE for
example needs to have lowmem bolted in the TLB. If it's broken up,
you'll quickly use up the TLB with bolted entries.

We could relax that to a certain extent so that only the kernel
text/data/bss needs to be bolted, though that would come at the expense
of the TLB miss handlers, which would then have to walk the page tables.
We'd also need to make sure we don't hand out to your API the memory
that is within the bolted entries that cover the kernel.

IE. If the kernel is large (32M ?) then the smallest entry I can use on
some CPUs will be 256M. So I'll need to have a way to allocate outside
of the first 256M. The Linux allocators today don't allow for that sort
of restriction.
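
As a rough illustration of that restriction, a platform could in
principle carve out a pool above the bolted region at boot and hand it
out from a gen_pool; the addresses, sizes and function names below are
made up for the example:

#include <linux/mm.h>
#include <linux/memblock.h>
#include <linux/genalloc.h>

#define GPU_POOL_BASE   0x10000000UL    /* just above a 256M bolted entry (example) */
#define GPU_POOL_SIZE   0x04000000UL    /* 64M, example value */

static struct gen_pool *gpu_pool;

/* Reserve the region early, before the page allocator claims it. */
static void __init gpu_pool_reserve(void)
{
        memblock_reserve(GPU_POOL_BASE, GPU_POOL_SIZE);
}

/* Manage the carveout with a gen_pool afterwards, since the normal page
 * allocator has no way to say "only memory above address X, please". */
static int __init gpu_pool_init(void)
{
        gpu_pool = gen_pool_create(PAGE_SHIFT, -1);
        if (!gpu_pool)
                return -ENOMEM;
        return gen_pool_add(gpu_pool, GPU_POOL_BASE, GPU_POOL_SIZE, -1);
}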

> b) Whether they are needed at all on the particular architecture. The
> Intel x86 spec is, (according to AMD), supposed to forbid conflicting
> caching attributes, but the Intel graphics guys use them for GEM. PPC
> appears not to need it.

We have problems with AGP and Macs; we chose to mostly ignore them and
things have been working so-so ... with the old DRM. With DRI2 being
much more aggressive at mapping/unmapping things, things became a lot
less stable, and it could be in part related to that. IE. Aliases are
similarly forbidden but we create them anyways.

> c) If neither of the above applies, we might be able to either use
> explicit cache flushes (which will require a TTM cache sync API), or
> require the device to use snooping mode. The architecture may also
> perhaps have a pool of write-combined pages that we can use. This should
> be indicated by defines in the api header.

Right. We should still shoot HW designers who give up coherency for the
sake of 3D benchmarks. It's insanely stupid.

Cheers,
Ben.

> /Thomas
>
>
>
>

2011-04-29 08:00:26

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:
> However, we should be able to construct a completely generic api around
> these operations, and for architectures that don't support them we need
> to determine
>
> a) Whether we want to support them anyway (IIRC the problem with PPC is
> that the linear kernel map has huge tlb entries that are very
> inefficient to break up?)

That same issue applies to ARM too - you'd need to stop the entire
machine, rewrite all processes page tables, flush tlbs, and only
then restart. Otherwise there's the possibility of ending up with
conflicting types of TLB entries, and I'm not sure what the effect
of having two matching TLB entries for the same address would be.

> b) Whether they are needed at all on the particular architecture. The
> Intel x86 spec is, (according to AMD), supposed to forbid conflicting
> caching attributes, but the Intel graphics guys use them for GEM. PPC
> appears not to need it.

Some versions of the architecture manual say that having multiple
mappings with differing attributes is unpredictable.

2011-04-29 10:57:04

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On 04/29/2011 09:35 AM, Benjamin Herrenschmidt wrote:
>
> We have problems with AGP and macs, we chose to mostly ignore them and
> things have been working so-so ... with the old DRM. With DRI2 being
> much more aggressive at mapping/unmapping things, things became a lot
> less stable and it could be in part related to that. IE. Aliases are
> similarily forbidden but we create them anyways.
>
>

Do you have any idea how other OS's solve this AGP issue on Macs?
Using a fixed pool of write-combined pages?

>> c) If neither of the above applies, we might be able to either use
>> explicit cache flushes (which will require a TTM cache sync API), or
>> require the device to use snooping mode. The architecture may also
>> perhaps have a pool of write-combined pages that we can use. This should
>> be indicated by defines in the api header.
>>
> Right. We should still shoot HW designers who give up coherency for the
> sake of 3D benchmarks. It's insanely stupid.
>

I agree. From a driver writer's perspective, having the GPU always
snooping the system pages would be a dream. On the GPUs that do support
snooping that I have looked at, the internal MMU usually supports both
modes, but the snooping mode is way slower (we're talking 50-70% or so
slower texturing operations), and often buggy, causing crashes or scanout
timing issues since system designers apparently don't really count on it
being used. I've found it usable for device-to-system memory blits.

In addition memcpy to device is usually way faster if the destination is
write-combined. Probably due to cache thrashing effects.

/Thomas

> Cheers,
> Ben.
>
>
>> /Thomas
>>
>>
>>
>>
>>
>
>

2011-04-29 11:21:34

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Friday 29 April 2011, Benjamin Herrenschmidt wrote:
>
> > However, dma_alloc_coherent() memory can't be used with the dma_sync_*
> > API as its return address (unlike other architectures) is not in the
> > kernel direct mapped memory range.
>
> Well, on non-coherent architectures, dma_sync_* are cache flushes, I
> don't see the point of doing those on a non-cachable mapping anyways.
>
The point was that you cannot do

#define dma_alloc_coherent dma_alloc_noncoherent

on ARM, as some other architectures do.

Arnd
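
For context, the aliasing shortcut that fully coherent architectures can
take looks roughly like the following (a sketch of what such an
asm/dma-mapping.h can contain); it is exactly this collapse of the two
calls that ARM cannot do, because its dma_alloc_coherent() returns a
remapped, non-cacheable address:

/* What coherent architectures can get away with (sketch): */
#define dma_alloc_noncoherent(d, s, h, f)  dma_alloc_coherent(d, s, h, f)
#define dma_free_noncoherent(d, s, v, h)   dma_free_coherent(d, s, v, h)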

2011-04-29 11:26:31

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:
> > >
> > > For PCI you can have the flag propagate from the PHB down, for busses
> > > without a bus type (platform) then whoever instanciate them (the
> > > platform code) can set that appropriately.
> >
> > How can you do that when it changes mid-bus heirarchy? I'm thinking
> > of the situation where the DRM stuff is on a child bus below the
> > root bus, and the root bus has DMA coherent devices on it but the DRM
> > stuff doesn't.
>
> But that's not PCI right ? IE. with PCI, coherency is a property of the
> PHB...

That is my understanding at least, but I'd like to have a confirmation
from the DRM folks.

I believe that the PC graphics cards that have noncoherent DMA mappings
are all of the unified memory (integrated into the northbridge) kind,
so they are not on the same host bridge as all regular PCI devices,
even if they appear as a PCI device.

Arnd

2011-04-29 11:47:44

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 13:26 +0200, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:
> > > >
> > > > For PCI you can have the flag propagate from the PHB down, for busses
> > > > without a bus type (platform) then whoever instanciate them (the
> > > > platform code) can set that appropriately.
> > >
> > > How can you do that when it changes mid-bus heirarchy? I'm thinking
> > > of the situation where the DRM stuff is on a child bus below the
> > > root bus, and the root bus has DMA coherent devices on it but the DRM
> > > stuff doesn't.
> >
> > But that's not PCI right ? IE. with PCI, coherency is a property of the
> > PHB...
>
> That is my understanding at least, but I'd like to have a confirmation
> from the DRM folks.
>
> I believe that the PC graphics cards that have noncoherent DMA mappings
> are all of the unified memory (integrated into the northbridge) kind,
> so they are not on the same host bridge as all regular PCI devices,
> even if they appear as a PCI device.

Hrm... beware with x86, they love playing tricks :-) Since too many
BIOSes don't understand PCI domains, they make devices on separate
bridges look like siblings on the same segment and that sort of thing.

Cheers,
Ben.

2011-04-29 11:55:27

by Alan

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

> I believe that the PC graphics cards that have noncoherent DMA mappings
> are all of the unified memory (integrated into the northbridge) kind,
> so they are not on the same host bridge as all regular PCI devices,
> even if they appear as a PCI device.

The AGP GART is not coherent on a lot of systems - not necessarily
unified memory though, it can be a plug-in AGP card too.
The GART is basically an IOMMU (and indeed in the later AMD case it is
used exactly as that)

2011-04-29 12:07:25

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On 04/29/2011 01:26 PM, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:
>
>>>> For PCI you can have the flag propagate from the PHB down, for busses
>>>> without a bus type (platform) then whoever instanciate them (the
>>>> platform code) can set that appropriately.
>>>>
>>> How can you do that when it changes mid-bus heirarchy? I'm thinking
>>> of the situation where the DRM stuff is on a child bus below the
>>> root bus, and the root bus has DMA coherent devices on it but the DRM
>>> stuff doesn't.
>>>
>> But that's not PCI right ? IE. with PCI, coherency is a property of the
>> PHB...
>>
> That is my understanding at least, but I'd like to have a confirmation
> from the DRM folks.
>
> I believe that the PC graphics cards that have noncoherent DMA mappings
> are all of the unified memory (integrated into the northbridge) kind,
> so they are not on the same host bridge as all regular PCI devices,
> even if they appear as a PCI device.
>

I think Jerome has mentioned at one point that the Radeon graphics cards
support non-coherent mappings.

Fwiw, the PowerVR SGX MMU also supports this mode of operation, although
whether it is functional depends, I guess, on the system implementation.

/Thomas



> Arnd
>

2011-04-29 13:34:09

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 29, 2011 at 8:06 AM, Thomas Hellstrom <[email protected]> wrote:
> On 04/29/2011 01:26 PM, Arnd Bergmann wrote:
>>
>> On Thursday 28 April 2011, Benjamin Herrenschmidt wrote:
>>
>>>>>
>>>>> For PCI you can have the flag propagate from the PHB down, for busses
>>>>> without a bus type (platform) then whoever instanciate them (the
>>>>> platform code) can set that appropriately.
>>>>>
>>>>
>>>> How can you do that when it changes mid-bus heirarchy? I'm thinking
>>>> of the situation where the DRM stuff is on a child bus below the
>>>> root bus, and the root bus has DMA coherent devices on it but the DRM
>>>> stuff doesn't.
>>>>
>>>
>>> But that's not PCI right ? IE. with PCI, coherency is a property of the
>>> PHB...
>>>
>>
>> That is my understanding at least, but I'd like to have a confirmation
>> from the DRM folks.
>>
>> I believe that the PC graphics cards that have noncoherent DMA mappings
>> are all of the unified memory (integrated into the northbridge) kind,
>> so they are not on the same host bridge as all regular PCI devices,
>> even if they appear as a PCI device.
>>
>
> I think Jerome has mentioned at one point that the Radeon graphics cards
> support
> non-coherent mappings.
>
> Fwiw, the PowerVR SGX MMU also supports this mode of operation, although it
> being functional I guess depends on the system implementation.
>
> /Thomas
>

The Radeon memory controller can do non-snooped PCI transactions. As far
as I have tested, most x86 PCI bridges don't try to be coherent in that
case, i.e. they don't inspect the PCI DMA and ask for a CPU flush, they
just perform the request (and I guess that's what all bridges will do),
so it ends up being non-coherent. I haven't benchmarked how much faster
it is for the GPU when it's not snooping, but I guess it can give a 50%
boost as it likely drastically reduces PCI transaction overhead.

I am talking here about devices that you plug into any PCI or PCIe slot,
so it's not an IGP integrated into the northbridge or into the CPU.

Cheers,
Jerome

2011-04-29 13:42:34

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:
> As Jesse pointed out already, for performance reasons it's lot better
> if you let the driver decide even if you have an iommu capable of
> handling coherency for you. My understanding is that each time
> coherency is asked for it trigger bus activities of some kind (i think
> snoop is the term used for pci) this traffic can slow down both the
> cpu and the device. For graphic driver we have a lot of write once and
> use (once or more) buffer and it makes a lot of sense to have those
> buffer allocated using uncached memory so we can tell the device (in
> case of drm driver) that there is no need to trigger snoop activities
> for coherency. So i believe the decision should ultimately be in the
> driver side.

Stupid question: Couldn't these write-once-read-often buffers just stay
in the memory of the GPU instead of refetching them every time from main
memory? Or is that necessary because of the limited space on some GPUs?

Regards,

Joerg

2011-04-29 14:20:00

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 29, 2011 at 9:42 AM, Joerg Roedel <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:
>> As Jesse pointed out already, for performance reasons it's lot better
>> if you let the driver decide even if you have an iommu capable of
>> handling coherency for you. My understanding is that each time
>> coherency is asked for it trigger bus activities of some kind (i think
>> snoop is the term used for pci) this traffic can slow down both the
>> cpu and the device. For graphic driver we have a lot of write once and
>> use (once or more) buffer and it makes a lot of sense to have those
>> buffer allocated using uncached memory so we can tell the device (in
>> case of drm driver) that there is no need to trigger snoop activities
>> for coherency. So i believe the decision should ultimately be in the
>> driver side.
>
> Stupid question: Couldn't these write-once-read-often buffers just stay
> in the memory of the GPU instead of refetching them every time from main
> memory? Or is that necessary because of the limited space on some GPUs?
>
> Regards,
>
>        Joerg
>

We might be talking about several GB of data, so using system memory is
not uncommon. Also, when uploading data to GPU VRAM it is better to let
the GPU do DMA from system memory rather than having the CPU do a memcpy.

Cheers,
Jerome

2011-04-29 15:37:47

by Jordan Crouse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On 04/29/2011 07:42 AM, Joerg Roedel wrote:
> On Thu, Apr 28, 2011 at 03:37:00PM -0400, Jerome Glisse wrote:
>> As Jesse pointed out already, for performance reasons it's lot better
>> if you let the driver decide even if you have an iommu capable of
>> handling coherency for you. My understanding is that each time
>> coherency is asked for it trigger bus activities of some kind (i think
>> snoop is the term used for pci) this traffic can slow down both the
>> cpu and the device. For graphic driver we have a lot of write once and
>> use (once or more) buffer and it makes a lot of sense to have those
>> buffer allocated using uncached memory so we can tell the device (in
>> case of drm driver) that there is no need to trigger snoop activities
>> for coherency. So i believe the decision should ultimately be in the
>> driver side.
>
> Stupid question: Couldn't these write-once-read-often buffers just stay
> in the memory of the GPU instead of refetching them every time from main
> memory? Or is that necessary because of the limited space on some GPUs?

Not all embedded GPUs have their own dedicated memory. On the MSM architecture
the devices and the CPU share the same physical pool.

Jordan

2011-04-29 15:41:31

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011, Catalin Marinas wrote:
> > >
> > > It's not broken since we moved to using Normal non-cacheable memory
> > > for the coherent DMA buffers (as long as you flush the cacheable alias
> > > before using the buffer, as we already do). The ARM ARM currently says
> > > unpredictable for such situations but this is being clarified in
> > > future updates and the Normal non-cacheable vs cacheable aliases can
> > > be used (given correct cache maintenance before using the buffer).
> >
> > Thanks for that information, I believe a number of people in the
> > previous discussions were relying on the information from the
> > documentation. Are you sure that this is not only correct for the
> > cores made by ARM ltd but also for the other implementations that
> > may have relied on documentation?
>
> It is a clarification in the ARM ARM so it covers all the cores made by
> architecture licensees, not just ARM Ltd. It basically makes the
> "unpredictable" part more predictable to allow certain types of aliases
> (e.g. Strongly Ordered vs Normal memory would still be disallowed).
>
> All the current implementations are safe with Normal memory aliases
> (cacheable vs non-cacheable) but of course, there may be some
> performance benefits in not having any alias.

A lot of the discussions we are about to have in Budapest will be
around solving the problem of having only valid combinations of
mappings, so we really need to have a clear statement in specification
form about what is actually valid.

Would it be possible to have an updated version of the relevant
section of the ARM ARM by next week so we can use that as the
base for our discussions?

Arnd

2011-04-29 16:27:21

by Jesse Barnes

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 29 Apr 2011 17:35:23 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

>
> > I've been doing some thinking over the years on how we could extend that
> > functionality to other architectures. The reason we need those is
> > because some x86 processors (early AMDs and, I think VIA c3) dislike
> > multiple mappings of the same pages with conflicting caching attributes.
> >
> > What we really want to be able to do is to unmap pages from the linear
> > kernel map, to avoid having to transition the linear kernel map every
> > time we change other mappings.
> >
> > The reason we need to do this in the first place is that AGP and modern
> > GPUs has a fast mode where snooping is turned off.
>
> Right. Unfortunately, unmapping pages from the linear mapping is
> precisely what I cannot give you on powerpc :-(
>
> This is due to our tendency to map it using the largest page size
> available. That translates to things like:
>
> - On hash based ppc64, I use 16M pages. I can't "break them up" due to
> the limitation of the processor of having a single page size per segment
> (and we use 1T segments nowadays). I could break the whole thing down to
> 4K but that would very seriously affect system performances.
>
> - On embedded, I map it using 1G pages. I suppose I could break it up
> since it's SW loaded but here too, system performance would suffer. In
> addition, we rely on ppc32 embedded to have the first 768M of the linear
> mapping and on ppc64 embedded, the first 1G, mapped using bolted TLB
> entries, which we can really only do using very large entries
> (respectively 256M and 1G) that can't be broken up.
>
> So you need to make sure whatever APIs you come up with will work on
> architectures where memory -has- to be cachable and coherent and you
> cannot play with the linear mapping. But that won't help with our
> non-coherent embedded systems :-(

You must be making it sound worse than it really is, otherwise how
would an embedded platform like the above deal with a display engine
that needed a large, contiguous chunk of uncached memory for the
display buffer? If the CPU is actively speculating into it and
overwriting blits etc it would never work... Or do you do such
reservations up front at 1G granularity??

> Right. We should still shoot HW designers who give up coherency for the
> sake of 3D benchmarks. It's insanely stupid.

Ah if it were that simple. :) There are big costs to implementing full
coherency for all your devices, as you well know, so it's just not a
question of benchmark optimization.

--
Jesse Barnes, Intel Open Source Technology Center

2011-04-29 16:32:15

by Jesse Barnes

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 29 Apr 2011 08:59:58 +0100
Russell King - ARM Linux <[email protected]> wrote:

> On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:
> > However, we should be able to construct a completely generic api around
> > these operations, and for architectures that don't support them we need
> > to determine
> >
> > a) Whether we want to support them anyway (IIRC the problem with PPC is
> > that the linear kernel map has huge tlb entries that are very
> > inefficient to break up?)
>
> That same issue applies to ARM too - you'd need to stop the entire
> machine, rewrite all processes page tables, flush tlbs, and only
> then restart. Otherwise there's the possibility of ending up with
> conflicting types of TLB entries, and I'm not sure what the effect
> of having two matching TLB entries for the same address would be.

Right, I don't think anyone wants to see this sort of thing happen with
any frequency. So either a large, uncached region can be set up at boot
time for allocations, or infrequent, large requests and conversions can
be made on demand, with memory being freed back to the main, coherent
pool under pressure.

> > b) Whether they are needed at all on the particular architecture. The
> > Intel x86 spec is, (according to AMD), supposed to forbid conflicting
> > caching attributes, but the Intel graphics guys use them for GEM. PPC
> > appears not to need it.
>
> Some versions of the architecture manual say that having multiple
> mappings with differing attributes is unpredictable.

Yes, there's a bit of abuse going on there. We've received a guarantee
that if the CPU speculates a line into the cache, as long as it's not
modified through the cacheable mapping the CPU won't write it back to
memory; it'll discard the line as needed instead (iirc AMD CPUs will
actually write back clean lines, so GEM wouldn't work the same way
there).

But even with GEM, there is a large performance penalty for having to
allocate a new buffer object the first time. Even though we don't have
to change mappings by stopping the machine etc, we still have to flush
out everything from the CPU relating to the object (since some lines
may be dirty), and then flush the memory controller buffers before
accessing it through the uncached mapping. So at least currently,
we're all in the same boat when it comes to new object allocations:
they will be expensive unless you already have some uncached mappings
you can re-use.

--
Jesse Barnes, Intel Open Source Technology Center
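
As a sketch of the flush-then-remap sequence described above, using
helpers that exist on x86 (clflush_cache_range(), set_pages_array_uc());
this is illustrative only and not the actual GEM code:

#include <linux/highmem.h>
#include <asm/cacheflush.h>

/* Flush the cacheable alias of every page in the object, then switch
 * the kernel mapping of those pages to uncached before the new alias
 * is used. */
static int make_object_uncached(struct page **pages, int nr_pages)
{
        int i;

        for (i = 0; i < nr_pages; i++) {
                void *vaddr = kmap(pages[i]);

                /* write back and invalidate any (possibly dirty) lines */
                clflush_cache_range(vaddr, PAGE_SIZE);
                kunmap(pages[i]);
        }

        /* x86-only today: change the linear-map attribute of the pages */
        return set_pages_array_uc(pages, nr_pages);
}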

2011-04-29 16:43:02

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Friday, 29 April 2011, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 27 April 2011, Catalin Marinas wrote:
>> > >
>> > > It's not broken since we moved to using Normal non-cacheable memory
>> > > for the coherent DMA buffers (as long as you flush the cacheable alias
>> > > before using the buffer, as we already do). The ARM ARM currently says
>> > > unpredictable for such situations but this is being clarified in
>> > > future updates and the Normal non-cacheable vs cacheable aliases can
>> > > be used (given correct cache maintenance before using the buffer).
>> >
>> > Thanks for that information, I believe a number of people in the
>> > previous discussions were relying on the information from the
>> > documentation. Are you sure that this is not only correct for the
>> > cores made by ARM ltd but also for the other implementations that
>> > may have relied on documentation?
>>
>> It is a clarification in the ARM ARM so it covers all the cores made by
>> architecture licensees, not just ARM Ltd. It basically makes the
>> "unpredictable" part more predictable to allow certain types of aliases
>> (e.g. Strongly Ordered vs Normal memory would still be disallowed).
>>
>> All the current implementations are safe with Normal memory aliases
>> (cacheable vs non-cacheable) but of course, there may be some
>> performance benefits in not having any alias.
>
> A lot of the discussions we are about to have in Budapest will be
> around solving the problem of having only valid combinations of
> mappings, so we really need to have a clear statement in specification
> form about what is actually valid.
>
> Would it be possible to have an updated version of the relevant
> section of the ARM ARM by next week so we can use that as the
> base for our discussions?

I'll ask the architecture people here in ARM and get back to you
(there is holiday until Tuesday next week in the UK).

--
Catalin

2011-04-29 18:30:27

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Friday 29 April 2011 18:32:09 Jesse Barnes wrote:
> On Fri, 29 Apr 2011 08:59:58 +0100
> Russell King - ARM Linux <[email protected]> wrote:
>
> > On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:
> > > However, we should be able to construct a completely generic api around
> > > these operations, and for architectures that don't support them we need
> > > to determine
> > >
> > > a) Whether we want to support them anyway (IIRC the problem with PPC is
> > > that the linear kernel map has huge tlb entries that are very
> > > inefficient to break up?)
> >
> > That same issue applies to ARM too - you'd need to stop the entire
> > machine, rewrite all processes page tables, flush tlbs, and only
> > then restart. Otherwise there's the possibility of ending up with
> > conflicting types of TLB entries, and I'm not sure what the effect
> > of having two matching TLB entries for the same address would be.
>
> Right, I don't think anyone wants to see this sort of thing happen with
> any frequency. So either a large, uncached region can be set up a boot
> time for allocations, or infrequent, large requests and conversions can
> be made on demand, with memory being freed back to the main, coherent
> pool under pressure.

I'd like to first have an official confirmation from the CPU designers
if there is actually a problem with mapping a single page both cacheable
and noncacheable. Based on what Catalin said, it's probably allowed and
the current spec is just being more paranoid than it needs to be. Also,
KyongHo Cho said that it might only be relevant for pages that are mapped
executable.

If that is the case, we can probably work around this by turning the entire
linear mapping (except for the kernel binary) into nonexecutable mode,
if we don't do that already.
This is desirable for security purposes anyway.

Arnd

2011-04-29 22:16:34

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 29, 2011 at 08:29:50PM +0200, Arnd Bergmann wrote:
> On Friday 29 April 2011 18:32:09 Jesse Barnes wrote:
> > On Fri, 29 Apr 2011 08:59:58 +0100
> > Russell King - ARM Linux <[email protected]> wrote:
> >
> > > On Fri, Apr 29, 2011 at 07:50:12AM +0200, Thomas Hellstrom wrote:
> > > > However, we should be able to construct a completely generic api around
> > > > these operations, and for architectures that don't support them we need
> > > > to determine
> > > >
> > > > a) Whether we want to support them anyway (IIRC the problem with PPC is
> > > > that the linear kernel map has huge tlb entries that are very
> > > > inefficient to break up?)
> > >
> > > That same issue applies to ARM too - you'd need to stop the entire
> > > machine, rewrite all processes page tables, flush tlbs, and only
> > > then restart. Otherwise there's the possibility of ending up with
> > > conflicting types of TLB entries, and I'm not sure what the effect
> > > of having two matching TLB entries for the same address would be.
> >
> > Right, I don't think anyone wants to see this sort of thing happen with
> > any frequency. So either a large, uncached region can be set up a boot
> > time for allocations, or infrequent, large requests and conversions can
> > be made on demand, with memory being freed back to the main, coherent
> > pool under pressure.
>
> I'd like to first have an official confirmation from the CPU designers
> if there is actually a problem with mapping a single page both cacheable
> and noncacheable.

Everytime this gets discussed, someone says that because they don't
believe what I say. OMAP folk confirmed it last time around.

I'm getting tired of this. I'm going to give up with answering any
further Linux questions until next week and I'll delete my entire
mailbox this weekend as I really can't be bothered to catch up with all
the crap that's happened over easter. I'm really getting pissed off at
all the shite crap that's flying around at the moment that I'm really
starting to not care one ounce about Linux, either on ARM or on this
utterly shite and broken x86 hardware.

Let ARM rot in mainline. I really don't care anymore.

2011-04-29 22:38:01

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 20:29 +0200, Arnd Bergmann wrote:
>
> If that is the case, we can probably work around this by turning the
> entire
> linear mapping (except for the kernel binary) into nonexecutable mode,
> if we don't do that already.
> This is desirable for security purposes anyway

You'd still have an "edge" problem if you use large pages for the linear
mapping: you obviously can't make part of the kernel text NX, and you'd
have to make sure you 'exclude' from those GPU allocations whatever
overlaps with your last executable large page.

In a way, it's a similar problem I have with bolted memory on BookE
where I can't restrict GPU allocations to memory that isn't bolted :-)

Cheers,
Ben.

2011-04-29 22:47:15

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 09:27 -0700, Jesse Barnes wrote:

> You must be making it sound worse than it really is, otherwise how
> would an embedded platform like the above deal with a display engine
> that needed a large, contiguous chunk of uncached memory for the
> display buffer? If the CPU is actively speculating into it and
> overwriting blits etc it would never work... Or do you do such
> reservations up front at 1G granularity??

Such embedded platforms have not been used with GPUs so far and our only
implementation of 64-bit BookE is fortunately also completely cache
coherent :-)

The good thing on ppc is that so far there is no new design coming from
us or FSL that isn't cache coherent. The bad thing is that people seem
to still try to pump out things using the old 44x, which isn't, and they
somewhat seem to also want to use GPUs on them :-)

The 44x is a case where I have a small (64 entries) SW loaded TLB and I
bolt the first 768M of the linear mapping (lowmem) using 3x256M entries.
What "saves" it is that it's also an ancient design with essentially a
busted prefetch engine that will thus cope with aliases as long as we
don't explicitly access the cached and non-cached aliases
simultaneously.

The nasty cases I have never really dealt with properly are the Apple
machines and their non coherent AGP. Those processors were really not
designed with the idea that one would do non-coherent DMA, especially
the 970 (G5) and our Linux code really don't like it.

Things tend to "work" with DRI 1 because we allocate the AGP memory once
in one big chunk (it's pages but they are allocated together and thus
tend to be contiguous) so the possible issues with prefetch are so rare,
I think we end up being lucky. With DRI 2 dynamically mapping things
in/out, we have a bigger problem and I don't know how to solve it other
than forcing the DRM to allocate graphic objects in reserved areas of
memory made of 16M pools that I unmap from the linear mapping.... (since
I use 16M pages to map the linear mapping).

For ppc32 laptops it's even worse as I use 256MB BATs (block address
translation, kind of special registers to create large static mappings)
to map the linear mapping, which brings me back to the 44x case to some
extent. I can't really do without at the moment, at the very least I
require the kernel text / data / bss to be covered by BATs.

> > Right. We should still shoot HW designers who give up coherency for the
> > sake of 3D benchmarks. It's insanely stupid.
>
> Ah if it were that simple. :) There are big costs to implementing full
> coherency for all your devices, as you well know, so it's just not a
> question of benchmark optimization.

But it -is- that simple.

You do have to deal with coherency anyways for your PHB unless you start
advocating that we should make everything else non coherent as well. So
you have the logic. Just make your GPU operate on the same protocol.

It's really only a perf tradeoff I believe. And a bad one.

Cheers,
Ben.

2011-04-29 22:51:23

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 12:55 +0200, Thomas Hellstrom wrote:
> On 04/29/2011 09:35 AM, Benjamin Herrenschmidt wrote:
> >
> > We have problems with AGP and macs, we chose to mostly ignore them and
> > things have been working so-so ... with the old DRM. With DRI2 being
> > much more aggressive at mapping/unmapping things, things became a lot
> > less stable and it could be in part related to that. IE. Aliases are
> > similarily forbidden but we create them anyways.
> >
> Do you have any idea how other OS's solve this AGP issue on Macs?
> Using a fixed pool of write-combined pages?

Write-combine is a different business, it's a matter of not mapping with
the G bit, but no, the way MacOS works I think is that they don't
actually use large pages at all, and I don't even think they have a
linear mapping of all memory. On the other hand they are slow :-)

> >> c) If neither of the above applies, we might be able to either use
> >> explicit cache flushes (which will require a TTM cache sync API), or
> >> require the device to use snooping mode. The architecture may also
> >> perhaps have a pool of write-combined pages that we can use. This should
> >> be indicated by defines in the api header.
> >>
> > Right. We should still shoot HW designers who give up coherency for the
> > sake of 3D benchmarks. It's insanely stupid.
> >
>
> I agree. From a driver writer's perspective having the GPU always
> snooping the system pages would be a dream. On the GPUs that do support
> snooping that I have looked at, its internal MMU usually support both
> modes, but the snooping mode is way slower (we're talking 50-70% or so
> slower texturing operations), and often buggy causing crashes or scanout
> timing issues since system designers apparently don't really count on it
> being used. I've found it usable for device-to-system memory blits.
>
> In addition memcpy to device is usually way faster if the destination is
> write-combined. Probably due to cache thrashing effects.

Possibly. It's a matter of the HW folks actually spending some time to
make it work properly. It can be done :-) It's just that they don't
bother. Look at the performance one can get out of fully coherent PCIe
nowadays; it's largely enough for a simple scanout :-)

Cheers,
Ben.

> /Thomas
>
> > Cheers,
> > Ben.
> >
> >
> >> /Thomas
> >>
> >>
> >>
> >>
> >>

2011-04-29 22:52:16

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 12:56 +0100, Alan Cox wrote:
> > I believe that the PC graphics cards that have noncoherent DMA mappings
> > are all of the unified memory (integrated into the northbridge) kind,
> > so they are not on the same host bridge as all regular PCI devices,
> > even if they appear as a PCI device.
>
> The AGP GART is not coherent on a lot of systems - not necessarily
> unified memory though, it can be a plug in AGP card too.
> The GART is basically an IOMMU (and indeed in the later AMD case used
> exactly as that)

Right. Actually there's also the ability for PCIe devices to set a "no
snoop" bit on transactions and thus behave in a non-coherent manner.
Hopefully most sane PHBs ignore that bit ...

Cheers,
Ben.
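
As a small illustration of the device-side knob mentioned above, a driver
can clear the "No Snoop Enable" bit in the PCIe Device Control register
so the device never sets the attribute in the first place; whether the
host bridge would have honoured it anyway is platform specific. The
helper name is invented:

#include <linux/pci.h>
#include <linux/pci_regs.h>

static int force_snooped_dma(struct pci_dev *pdev)
{
        int pos;
        u16 devctl;

        pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
        if (!pos)
                return -ENODEV;         /* not a PCI Express device */

        pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL, &devctl);
        devctl &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
        pci_write_config_word(pdev, pos + PCI_EXP_DEVCTL, devctl);
        return 0;
}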

2011-04-29 22:53:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 14:06 +0200, Thomas Hellstrom wrote:
> I think Jerome has mentioned at one point that the Radeon graphics
> cards support non-coherent mappings.

If the card is PCI/PCI-X/PCIe then coherency is not its business, it's
the business of the host bridge. However, on PCIe at least, the card can
indeed set a "no snoop" attribute on DMA transactions to request "no
coherency". At least the systems have the latitude to just ignore that
bit (like we do on all ppc afaik) :-)

> Fwiw, the PowerVR SGX MMU also supports this mode of operation,
> although it being functional I guess depends on the system
> implementation.

Right, it's not a GPU thing, it's really a system design thing.

Cheers,
Ben.

2011-04-29 22:55:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, 2011-04-29 at 09:34 -0400, Jerome Glisse wrote:
>
> Radeon memory controller can do non snooped pci transaction, as far as
> i have tested most of the x86 pci bridge don't try to be coherent then
> ie they don't analyze pci dma and ask for cpu flush they just perform
> the request (and i guess it's what all bridge will do), so it endup
> being noncoherent. I haven't done any benchmark of how faster it's for
> the GPU when it's not snooping but i guess it can give 50% boost as it
> likely drastictly reduce pci transaction overhead.
>
> I am talking here about device that you plug into any pci or pcie
> slot, so it's not igp integrated into northbridge or into the cpu.

Right, the card has nothing to do with the snooping process, it's purely
a feature of the bridge, based on a flag optionally set by the card. As
I said earlier, bridges have the freedom to ignore it, which we do on
ppc, so that's a non issue.

Cheers,
Ben.

2011-04-30 02:45:27

by Jesse Barnes

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Sat, 30 Apr 2011 08:46:54 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> > Ah if it were that simple. :) There are big costs to implementing full
> > coherency for all your devices, as you well know, so it's just not a
> > question of benchmark optimization.
>
> But it -is- that simple.
>
> You do have to deal with coherency anyways for your PHB unless you start
> advocating that we should make everything else non coherent as well. So
> you have the logic. Just make your GPU operate on the same protocol.
>
> It's really only a perf tradeoff I believe. And a bad one.

OK, so I was the one oversimplifying. :) Yes, it's definitely doable to
make a cache-coherent PHB, and it is awfully nice from a perf and
programming perspective.

But as you say, to make a high performance one for things like gfx, or
even to handle things like atomic ops, adds a lot of expense (in the
case of graphics, a whole lot unless you can integrate with the CPU,
and even then display can be tough to deal with).

I don't see even good coherent implementations being good enough for
high perf graphics in the near term (though at least on relatively high
power designs like Sandy Bridge we're getting close) so we'll have to
solve the uncached and simultaneous mapping issue both for today's
hardware and the near future.

--
Jesse Barnes, Intel Open Source Technology Center

2011-05-02 04:42:12

by David Brown

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Fri, Apr 29 2011, Russell King - ARM Linux wrote:

> On Fri, Apr 29, 2011 at 08:29:50PM +0200, Arnd Bergmann wrote:

>> I'd like to first have an official confirmation from the CPU designers
>> if there is actually a problem with mapping a single page both cacheable
>> and noncacheable.
>
> Everytime this gets discussed, someone says that because they don't
> believe what I say. OMAP folk confirmed it last time around.

I'll confirm this from the Qualcomm side as well. You cannot have
multiple inconsistent mappings of the same page without running into
difficult-to-find problems.

The spec clarifications appear to give ways of dealing with it if it
happens, and bounds on what can go wrong, but I wouldn't call it
something we want to do normally. Corrupt data is arguably less of a
problem than nasal demons, but still a problem.

David

> I'm getting tired of this. I'm going to give up with answering any
> further Linux questions until next week and I'll delete my entire
> mailbox this weekend as I really can't be bothered to catch up with all
> the crap that's happened over easter. I'm really getting pissed off at
> all the shite crap that's flying around at the moment that I'm really
> starting to not care one ounce about Linux, either on ARM or on this
> utterly shite and broken x86 hardware.
>
> Let ARM rot in mainline. I really don't care anymore.
>

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2011-05-02 11:26:43

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Monday 02 May 2011, David Brown wrote:
> I'll confirm this from the Qualcomm side as well. You cannot have
> multiple inconsistent mappings of the same page without having difficult
> to find problems.

I believe Catalin was referring to the case where you have only
one nonconsistent (cacheable) mapping plus multiple consistent
(noncacheable) mappings. I don't think anyone has suggested doing
DMA to a page that has multiple nonconsistent mappings with
virtually indexed caches.

> The spec clarifications appear to give ways of dealing with it if it
> happens, and bounds on what can go wrong, but I wouldn't call it
> something we want to do normally. Corrupt data is arguably less of a
> problem than nasal demons, but still a problem.

Anything that has a theoretical chance of corrupting data is not an
option, but I'd really like to see what the clarified spec says
about this. Even if there is a way to legally leave a page for
dma_alloc_coherent in the linear mapping, it might turn out to
be harder to do than using highmem pages or unmapping supersections
at run time as was suggested.

Arnd
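
A rough sketch of the "highmem pages" option mentioned above, under the
assumption that pages allocated from highmem have no permanent cacheable
alias in the linear map; cache maintenance on the freshly allocated pages
is omitted here and this is not the proposed ARM implementation:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>

static void *alloc_uncached_buffer(size_t size, struct page ***ret_pages)
{
        unsigned int i, count = PAGE_ALIGN(size) >> PAGE_SHIFT;
        struct page **pages;
        void *vaddr;

        pages = kcalloc(count, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        for (i = 0; i < count; i++) {
                /* highmem pages: no cacheable lowmem alias to conflict with */
                pages[i] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
                if (!pages[i])
                        goto err;
        }

        /* create the one and only kernel mapping, non-cacheable */
        vaddr = vmap(pages, count, VM_MAP, pgprot_noncached(PAGE_KERNEL));
        if (!vaddr)
                goto err;

        *ret_pages = pages;
        return vaddr;
err:
        while (i--)
                __free_page(pages[i]);
        kfree(pages);
        return NULL;
}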

2011-05-03 14:45:16

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC] ARM DMA mapping TODO, v1

On Thu, Apr 28, 2011 at 02:12:40PM +0200, Arnd Bergmann wrote:
> On Thursday 28 April 2011, Catalin Marinas wrote:
> > On Thu, 2011-04-28 at 01:15 +0100, [email protected] wrote:
> > > On Wed, 27 Apr 2011 12:08:28 BST, Catalin Marinas said:
> > >
> > > > The current version of the ARM ARM says "unpredictable". But this
> > > > general definition of "unpredictable" does not allow it to deadlock
> > > > (hardware) or have security implications. It is however allowed to
> > > > corrupt data.
> > >
> > > Not allowed to have security implications, but is allowed to corrupt data.
> >
> > By security I was referring to TrustZone extensions. IOW, unpredictable
> > in normal (non-secure) world should not cause data corruption in the
> > secure world.
>
> That definition is rather useless for operating systems that don't use
> Trustzone then, right?

IIUC, the restriction on unpredictable behaviour is basically that the processor
can't do anything which would result in or otherwise imply an escalation of
privilege.

TrustZone is one kind of privilege, but there are plenty of other operations
implying privilege (entering privileged mode from user mode, masking or
intercepting interrupts or exceptions, bypassing or reconfiguring MMU permissions
etc.) "Unpredictable" behaviours are not allowed to have any such consequences
IIRC. Without that restriction you wouldn't really have any OS security at all.

In the kernel, we do have to be careful about avoiding unpredictable behaviours,
since we're already running at maximum privilege (not including TZ) -- so the
damage which unpredictable behaviours can wreak is much greater, by running
invalid code, misconfiguring the MMU, allowing caches to get out of sync etc.
But that's not fundamentally different from the general need to avoid kernel bugs
-- the scope of _any_ kernel code to do damage is greater than for userspace code,
whether it involves architecturally unpredictable behaviour, or just plain
ordinary bugs or security holes in the C code.

---Dave

>
> Arnd
>

2011-05-03 15:05:12

by Laurent Pinchart

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Wednesday 27 April 2011 12:43:16 Arnd Bergmann wrote:
> On Wednesday 27 April 2011, Catalin Marinas wrote:
> > On 21 April 2011 20:29, Arnd Bergmann <[email protected]> wrote:
> > > I think the recent discussions on linaro-mm-sig and the BoF last week
> > > at ELC have been quite productive, and at least my understanding
> > > of the missing pieces has improved quite a bit. This is a list of
> > > things that I think need to be done in the kernel. Please complain
> > > if any of these still seem controversial:
> > >
> > > 1. Fix the arm version of dma_alloc_coherent. It's in use today and
> > >
> > > is broken on modern CPUs because it results in both cached and
> > > uncached mappings. Rebecca suggested different approaches how to
> > > get there.
> >
> > It's not broken since we moved to using Normal non-cacheable memory
> > for the coherent DMA buffers (as long as you flush the cacheable alias
> > before using the buffer, as we already do). The ARM ARM currently says
> > unpredictable for such situations but this is being clarified in
> > future updates and the Normal non-cacheable vs cacheable aliases can
> > be used (given correct cache maintenance before using the buffer).
>
> Thanks for that information, I believe a number of people in the
> previous discussions were relying on the information from the
> documentation. Are you sure that this is not only correct for the
> cores made by ARM ltd but also for the other implementations that
> may have relied on documentation?
>
> As I mentioned before, there are other architectures, where having
> conflicting cache settings in TLB entries for the same pysical page
> immediately checkstops the CPU, and I guess that this was also allowed
> by the current version of the ARM ARM.
>
> > > 2. Implement dma_alloc_noncoherent on ARM. Marek pointed out
> > >
> > > that this is needed, and it currently is not implemented, with
> > > an outdated comment explaining why it used to not be possible
> > > to do it.
> >
> > As Russell pointed out, there are 4 main combinations with iommu and
> > some coherency support (i.e. being able to snoop the CPU caches). But
> > in an SoC you can have different devices with different iommu and
> > coherency configurations. Some of them may even be able to see the L2
> > cache but not the L1 (in which case it would help if we can get an
> > inner non-cacheable outer cacheable mapping).
> >
> > Anyway, we end up with different DMA ops per device via dev_archdata.
>
> Having different DMA ops per device was the solution that I was suggesting
> with dma_mapping_common.h, but Russell pointed out that it may not be
> the best option.
>
> The alternative would be to have just one set of dma_mapping functions
> as we do today, but to extend the functions to also cover the iommu
> case, for instance (example, don't take literally):
>
> static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
> size_t size, enum dma_data_direction dir)
> {
> dma_addr_t ret;
>
> #ifdef CONFIG_DMABOUNCE
> if (dev->archdata.dmabounce)
> return dmabounce_map_single(dev, cpu_addr, size, dir);
> #endif
>
> #ifdef CONFIG_IOMMU
> if (dev->archdata.iommu)
> ret = iommu_map_single(dev, cpu_addr, size, dir);
> else
> #endif

I wish it was that simple.

The OMAP4 ISS (Imaging Subsystem) has no IOMMU, but it can use the OMAP4 DMM
(Dynamic Memory Manager) which acts as a memory remapper. Basically (if my
understanding is correct), the ISS is configured to read/write from/to
physical addresses. If those physical addresses are in the DMM address range,
the DMM translates the accesses to physical accesses, acting as an IOMMU.

The ISS can thus write to physically contiguous memory directly, or to
scattered physical pages through the DMM. Whether an IOMMU (or, to be correct
in this case, the IOMMU-like DMM) needs to handle the DMA is a per-buffer
decision, not a per-device decision.

> dma_addr = virt_to_dma(dev, ptr);
>
> dma_sync_single_for_device(dev, dma_addr, size, dir);
> }
>
> This would not even conflict with having a common implementation
> for iommu based dma_map_ops -- we would just call the iommu functions
> directly when needed rather than having an indirect function call.

--
Regards,

Laurent Pinchart

2011-05-03 15:31:36

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

On Tuesday 03 May 2011, Laurent Pinchart wrote:
> I wish it was that simple.
>
> The OMAP4 ISS (Imaging Subsystem) has no IOMMU, but it can use the OMAP4 DMM
> (Dynamic Memory Manager) which acts as a memory remapper. Basically (if my
> understanding is correct), the ISS is configured to read/write from/to
> physical addresses. If those physical addresses are in the DMM address range,
> the DMM translates the accesses to physical accesses, acting as an IOMMU.
>
> The ISS can thus write to physically contiguous memory directly, or to
> scattered physical pages through the DMM. Whether an IOMMU (or, to be correct
> in this case, the IOMMU-like DMM) needs to handle the DMA is a per-buffer
> decision, not a per-device decision.

This doesn't sound too unusual for IOMMU implementations. A lot of the
time you can access e.g. low memory using a direct mapping but you need the
IOMMU code for highmem. I've also seen a machine where a linear mapping
exists for all the memory in strict ordering, while you can use relaxed
DMA ordering when you go through the IOMMU address range. If we manage
to come up with a common dma-mapping API implementation for all IOMMUs,
it certainly needs to handle that case as well.

Arnd
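
To illustrate the per-buffer decision described above, a purely
hypothetical sketch for a device like the OMAP4 ISS behind the DMM;
struct iss_buffer and dmm_map_pages() are invented names, not existing
kernel interfaces:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Invented placeholder for whatever the remapper driver would provide. */
int dmm_map_pages(struct device *dev, struct page **pages,
                  unsigned int nr_pages, dma_addr_t *dma_addr);

struct iss_buffer {
        struct page     **pages;
        unsigned int    nr_pages;
        bool            contiguous;     /* physically contiguous allocation? */
        dma_addr_t      dma_addr;       /* address programmed into the device */
};

/* Per-buffer decision: contiguous buffers go to the device directly,
 * scattered ones are remapped by the DMM. */
static int iss_buffer_map(struct device *dev, struct iss_buffer *buf)
{
        if (buf->contiguous) {
                buf->dma_addr = PFN_PHYS(page_to_pfn(buf->pages[0]));
                return 0;
        }

        return dmm_map_pages(dev, buf->pages, buf->nr_pages,
                             &buf->dma_addr);
}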