2015-02-02 16:52:49

by Daniel Vetter

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Thu, Jan 29, 2015 at 05:18:33PM -0500, Rob Clark wrote:
> On Thu, Jan 29, 2015 at 2:26 PM, Russell King - ARM Linux
> <[email protected]> wrote:
> > On Thu, Jan 29, 2015 at 01:52:09PM -0500, Rob Clark wrote:
> >> Quite possibly for some of these edge some of cases, some of the
> >> dma-buf exporters are going to need to get more clever (ie. hand off
> >> different scatterlists to different clients). Although I think by far
> >> the two common cases will be "I can support anything via an iommu/mmu"
> >> and "I need phys contig".
> >>
> >> But that isn't an issue w/ dma-buf itself, so much as it is an issue
> >> w/ drivers. I guess there would be more interest in fixing up drivers
> >> when actual hw comes along that needs it..
> >
> > However, validating the attachments is the business of dma-buf. This
> > is actual infrastructure, which should ensure some kind of sanity such
> > as the issues I've raised.
> >
>
> My initial thought is for dma-buf to not try to prevent something than
> an exporter can actually do.. I think the scenario you describe could
> be handled by two sg-lists, if the exporter was clever enough.

That's already needed, each attachment has it's own sg-list. After all
there's no array of dma_addr_t in the sg tables, so you can't use one sg
for more than one mapping. And due to different iommu different devices
can easily end up with different addresses.

> That all said, I think probably all the existing exporters cache the
> sg-list. And I can't think of any actual hw which would hit this
> problem that can be solved by multiple sg-lists for the same physical
> memory. (And the constraint calculation kind of assumes the end
> result will be a single sg-list.) So it seems reasonable to me to
> check that max_segment_count * max_segment_size is not smaller than
> the buffer.
>
> If it was a less theoretical problem, I think I'd more inclined for a
> way that the exporter could override the checks, or something along
> those lines.
>
> otoh, if the attachment is just not possible because the buffer has
> been already allocated and mapped by someone with more relaxed
> constraints.. then I think the driver should be the one returning the
> error since dma-buf doesn't know this.

Importers currently cache the mapped sg list aggressively (i915) or
outright pin it for as long as possible (everyone else). So any kind of
moving stuff around is pretty much impossible with current drivers.

The even worse violation of the dma-buf spec is that all the ttm drivers
don't use the sg table correctly at all. They assume that each physical
page has exactly one sg table entry, and then fish out the struct page *
pointer from that to build up their own bo management stuff and ignore
everything else.

> > The whole "we can push it onto our users" is really on - what that
> > results in is the users ignoring most of the requirements and just doing
> > their own thing, which ultimately ends up with the whole thing turning
> > into a disgusting mess - one which becomes very difficult to fix later.
>
> Ideally at some point, dma-mapping or some helpers would support
> allocations matching constraints.. I think only actual gpu drivers
> want to do crazy enough things that they'd want to bypass dma-mapping.
> If everyone else can use dma-mapping and/or helpers then we make it
> harder for drivers to do the wrong thing than to do the right thing.
>
> > Now, if we're going to do the "more clever" thing you mention above,
> > that rather negates the point of this two-part patch set, which is to
> > provide the union of the DMA capabilities of all users. A union in
> > that case is no longer sane as we'd be tailoring the SG lists to each
> > user.
>
> It doesn't really negate.. a different sg list representing the same
> physical memory cannot suddenly make the buffer physically contiguous
> (from the perspective of memory)..
>
> (unless we are not on the same page here, so to speak)

Or someone was not chip and put a decent iommu in front of the same IP
block. E.g. the raspi gpu needs contiguous memory for rendering, but the
same block is used elsewhere but then with an iommu.

But thinking about all this I wonder whether we really should start with
some kind of constraint solving. It feels fairly leaky compared to the
encapsulation the dma api provides, and so isn't really better for
upstream than just using ion (which completely gives up on this problem
and relies on userspace allocating correct buffers).

And if we step away for a bit there's already a bunch of things that the
current dma api fails at, and which is just a bit a worse problem with
dma-buf sharing: There's not really a generic way to map an sg table
zero-copy, i.e. there's no generic way to avoid bounce buffers. And that's
already hurting all the existing gpu drivers: ttm essentially does
page-sized allocs for everything and then has it's own dma allocator on
top of that page-cache. i915 has some other hacks to at least not fail the
bounce buffer allocator too badly. Grepping for SWIOTLB in drm is fairly
interesting.

So imo if our goal is to keep the abstraction provided by the dma api
somewhat intact we should first figure out to map an sg table without
copying any data. If we have that any exporter can then easily check
whether an attachment works out by test-mapping stuff. A bit inefficient,
but at least possible (and exporters could cache the mapped sg table if so
desired). And we could rip out a pile of hacks from drm drivers.

The other problem is is coherency management. Even in the single-device
usecase current dma-buf isn't up to things since on many socs the same
device can use both coherent and non-coherent transactions. And if you map
the same memory multiple times we don't really want to flush cpu caches
multiple times (ppl alreay have massive caches and stuff to avoid the
cpu cache flush when there's just one device using the buffer object).
Again this probably needs some core dma api extensions. And again we could
probably throw away a bunch of driver code (at least in i915,
unfortunately there's no intel v4l driver to test non-coherent sharing on
x86).

With all that resolved somehow (and all these issues already affect
single-device dma api usage by gpu drivers) the bit left would be figuring
out where to allocate things. And I'm not even sure whether we should
really bother to implement this in the kernel (no matter how I slice it it
always looks like we're leaking the dma api abstractions) but just with a
few tries in userspace:

- Allocate shared buffers from the scanout engine (or v4l, though that
doesn't yet support exporting).

- If you can't import, try it the other way round. This isn't perfect for
cross-process + cross-device bo sharing, but then pretty much all
compositors have a gpu rendering fallback anyway because this problem is
a bit too complex to solve perfectly. Of course for the next frame
compositor can provide a new buffer which hopefully works out better.

- Userspace just knows where to allocate. Imo that's not actually
unreasonable since if you really have that tricky requirements you
probably also have insane buffer layouts and then all bets for generic
code are off anyway.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


2015-02-02 20:30:24

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
>> My initial thought is for dma-buf to not try to prevent something than
>> an exporter can actually do.. I think the scenario you describe could
>> be handled by two sg-lists, if the exporter was clever enough.
>
> That's already needed, each attachment has it's own sg-list. After all
> there's no array of dma_addr_t in the sg tables, so you can't use one sg
> for more than one mapping. And due to different iommu different devices
> can easily end up with different addresses.


Well, to be fair it may not be explicitly stated, but currently one
should assume the dma_addr_t's in the dmabuf sglist are bogus. With
gpu's that implement per-process/context page tables, I'm not really
sure that there is a sane way to actually do anything else..

BR,
-R

2015-02-02 21:46:32

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> >> My initial thought is for dma-buf to not try to prevent something than
> >> an exporter can actually do.. I think the scenario you describe could
> >> be handled by two sg-lists, if the exporter was clever enough.
> >
> > That's already needed, each attachment has it's own sg-list. After all
> > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> > for more than one mapping. And due to different iommu different devices
> > can easily end up with different addresses.
>
>
> Well, to be fair it may not be explicitly stated, but currently one
> should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> gpu's that implement per-process/context page tables, I'm not really
> sure that there is a sane way to actually do anything else..

That's incorrect - and goes dead against the design of scatterlists.

Not only that, but it is entirely possible that you may get handed
memory via dmabufs for which there are no struct page's associated
with that memory - think about display systems which have their own
video memory which is accessible to the GPU, but it isn't system
memory.

In those circumstances, you have to use the dma_addr_t's and not the
pages.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-02 22:36:14

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 2, 2015 at 4:46 PM, Russell King - ARM Linux
<[email protected]> wrote:
> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
>> On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
>> >> My initial thought is for dma-buf to not try to prevent something than
>> >> an exporter can actually do.. I think the scenario you describe could
>> >> be handled by two sg-lists, if the exporter was clever enough.
>> >
>> > That's already needed, each attachment has it's own sg-list. After all
>> > there's no array of dma_addr_t in the sg tables, so you can't use one sg
>> > for more than one mapping. And due to different iommu different devices
>> > can easily end up with different addresses.
>>
>>
>> Well, to be fair it may not be explicitly stated, but currently one
>> should assume the dma_addr_t's in the dmabuf sglist are bogus. With
>> gpu's that implement per-process/context page tables, I'm not really
>> sure that there is a sane way to actually do anything else..
>
> That's incorrect - and goes dead against the design of scatterlists.

yeah, a bit of an abuse, although I'm not sure I see a much better way
when a device vaddr depends on user context..

> Not only that, but it is entirely possible that you may get handed
> memory via dmabufs for which there are no struct page's associated
> with that memory - think about display systems which have their own
> video memory which is accessible to the GPU, but it isn't system
> memory.

well, I guess anyways when it comes to sharing buffers, it won't be
the vram placement of the bo that gets shared ;-)

BR,
-R

> In those circumstances, you have to use the dma_addr_t's and not the
> pages.
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

2015-02-03 07:45:32

by Daniel Vetter

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 02, 2015 at 09:46:16PM +0000, Russell King - ARM Linux wrote:
> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> > On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> > >> My initial thought is for dma-buf to not try to prevent something than
> > >> an exporter can actually do.. I think the scenario you describe could
> > >> be handled by two sg-lists, if the exporter was clever enough.
> > >
> > > That's already needed, each attachment has it's own sg-list. After all
> > > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> > > for more than one mapping. And due to different iommu different devices
> > > can easily end up with different addresses.
> >
> >
> > Well, to be fair it may not be explicitly stated, but currently one
> > should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> > gpu's that implement per-process/context page tables, I'm not really
> > sure that there is a sane way to actually do anything else..
>
> That's incorrect - and goes dead against the design of scatterlists.
>
> Not only that, but it is entirely possible that you may get handed
> memory via dmabufs for which there are no struct page's associated
> with that memory - think about display systems which have their own
> video memory which is accessible to the GPU, but it isn't system
> memory.
>
> In those circumstances, you have to use the dma_addr_t's and not the
> pages.

Yeah exactly. At least with i915 we'd really want to be able to share
stolen memory in some cases, and since that's stolen there's no struct
pages for them. On top of that any cpu access is also blocked to that
range in the memory controller, so the dma_addr_t is really the _only_
thing you can use to get at those memory ranges. And afaik the camera pipe
on intel soc can get there - unfortunately that one doesn't have an
upstream driver :(

And just to clarify: All current dma-buf exporter that I've seen implement
the sg mapping correctly and _do_ map the sg table into device address
space with dma_map_sg. In other words: The dma_addr_t are all valid, it's
just that e.g. with ttm no one has bothered to teach ttm a dma-buf
correctly. The internal abstraction is all there, ttm-internal buffer
object interface match what dma-buf exposes fairly closes (hey I didn't do
shit when designing those interfaces ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 07:47:37

by Daniel Vetter

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> >> My initial thought is for dma-buf to not try to prevent something than
> >> an exporter can actually do.. I think the scenario you describe could
> >> be handled by two sg-lists, if the exporter was clever enough.
> >
> > That's already needed, each attachment has it's own sg-list. After all
> > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> > for more than one mapping. And due to different iommu different devices
> > can easily end up with different addresses.
>
>
> Well, to be fair it may not be explicitly stated, but currently one
> should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> gpu's that implement per-process/context page tables, I'm not really
> sure that there is a sane way to actually do anything else..

Hm, what does per-process/context page tables have to do here? At least on
i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled
internally.

Since atm the dma api doesn't have any context of contexts or different
pagetables, I don't see who you could use that at all.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 07:48:54

by Daniel Vetter

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Mon, Feb 02, 2015 at 05:36:10PM -0500, Rob Clark wrote:
> well, I guess anyways when it comes to sharing buffers, it won't be
> the vram placement of the bo that gets shared ;-)

Actually that's pretty much what I'd like to be able to do for i915.
Except that vram is just a specially protected chunk of main memory.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 12:28:37

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> > On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> > >> My initial thought is for dma-buf to not try to prevent something than
> > >> an exporter can actually do.. I think the scenario you describe could
> > >> be handled by two sg-lists, if the exporter was clever enough.
> > >
> > > That's already needed, each attachment has it's own sg-list. After all
> > > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> > > for more than one mapping. And due to different iommu different devices
> > > can easily end up with different addresses.
> >
> >
> > Well, to be fair it may not be explicitly stated, but currently one
> > should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> > gpu's that implement per-process/context page tables, I'm not really
> > sure that there is a sane way to actually do anything else..
>
> Hm, what does per-process/context page tables have to do here? At least on
> i915 we have a two levels of page tables:
> - first level for vm/device isolation, used through dma api
> - 2nd level for per-gpu-context isolation and context switching, handled
> internally.
>
> Since atm the dma api doesn't have any context of contexts or different
> pagetables, I don't see who you could use that at all.

What I've found with *my* etnaviv drm implementation (not Christian's - I
found it impossible to work with Christian, especially with the endless
"msm doesn't do it that way, so we shouldn't" responses and his attitude
towards cherry-picking my development work [*]) is that it's much easier to
keep the GPU MMU local to the GPU and under the control of the DRM MM code,
rather than attaching the IOMMU to the DMA API and handling it that way.

There are several reasons for that:

1. DRM has a better idea about when the memory needs to be mapped to the
GPU, and it can more effectively manage the GPU MMU.

2. The GPU MMU may have TLBs which can only be flushed via a command in
the GPU command stream, so it's fundamentally necessary for the MMU to
be managed by the GPU driver so that it knows when (and how) to insert
the flushes.


* - as a direct result of that, I've stopped all further development of
etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver
as the etnaviv drm API which Christian wants is completely incompatible
with the non-etnaviv drm, and that just creates far too much pain in the
DDX driver.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 12:59:32

by Daniel Vetter

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 12:28:14PM +0000, Russell King - ARM Linux wrote:
> On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
> > On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> > > On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> > > >> My initial thought is for dma-buf to not try to prevent something than
> > > >> an exporter can actually do.. I think the scenario you describe could
> > > >> be handled by two sg-lists, if the exporter was clever enough.
> > > >
> > > > That's already needed, each attachment has it's own sg-list. After all
> > > > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> > > > for more than one mapping. And due to different iommu different devices
> > > > can easily end up with different addresses.
> > >
> > >
> > > Well, to be fair it may not be explicitly stated, but currently one
> > > should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> > > gpu's that implement per-process/context page tables, I'm not really
> > > sure that there is a sane way to actually do anything else..
> >
> > Hm, what does per-process/context page tables have to do here? At least on
> > i915 we have a two levels of page tables:
> > - first level for vm/device isolation, used through dma api
> > - 2nd level for per-gpu-context isolation and context switching, handled
> > internally.
> >
> > Since atm the dma api doesn't have any context of contexts or different
> > pagetables, I don't see who you could use that at all.
>
> What I've found with *my* etnaviv drm implementation (not Christian's - I
> found it impossible to work with Christian, especially with the endless
> "msm doesn't do it that way, so we shouldn't" responses and his attitude
> towards cherry-picking my development work [*]) is that it's much easier to
> keep the GPU MMU local to the GPU and under the control of the DRM MM code,
> rather than attaching the IOMMU to the DMA API and handling it that way.
>
> There are several reasons for that:
>
> 1. DRM has a better idea about when the memory needs to be mapped to the
> GPU, and it can more effectively manage the GPU MMU.
>
> 2. The GPU MMU may have TLBs which can only be flushed via a command in
> the GPU command stream, so it's fundamentally necessary for the MMU to
> be managed by the GPU driver so that it knows when (and how) to insert
> the flushes.

3. Switching between different address spaces (for per gpu context
isolation) often requires intricate knowledge about the gpu and close
coordination. Well maybe just a part of 2 really, but an important one.

Fully agree and tbh I'm not sure whether the current push in arm to expose
all gpu mmus as iommus is solid. Even for pasid (per-context iommu tables)
which is a big official standard there's still a lot of open questions
about how to do it properly. And it requires strict hw support so that the
hw always knows which pasid it should use for a given dma access.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 13:28:51

by Christian Gmeiner

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

2015-02-03 13:28 GMT+01:00 Russell King - ARM Linux <[email protected]>:
> On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
>> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
>> > On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
>> > >> My initial thought is for dma-buf to not try to prevent something than
>> > >> an exporter can actually do.. I think the scenario you describe could
>> > >> be handled by two sg-lists, if the exporter was clever enough.
>> > >
>> > > That's already needed, each attachment has it's own sg-list. After all
>> > > there's no array of dma_addr_t in the sg tables, so you can't use one sg
>> > > for more than one mapping. And due to different iommu different devices
>> > > can easily end up with different addresses.
>> >
>> >
>> > Well, to be fair it may not be explicitly stated, but currently one
>> > should assume the dma_addr_t's in the dmabuf sglist are bogus. With
>> > gpu's that implement per-process/context page tables, I'm not really
>> > sure that there is a sane way to actually do anything else..
>>
>> Hm, what does per-process/context page tables have to do here? At least on
>> i915 we have a two levels of page tables:
>> - first level for vm/device isolation, used through dma api
>> - 2nd level for per-gpu-context isolation and context switching, handled
>> internally.
>>
>> Since atm the dma api doesn't have any context of contexts or different
>> pagetables, I don't see who you could use that at all.
>
> What I've found with *my* etnaviv drm implementation (not Christian's - I
> found it impossible to work with Christian, especially with the endless
> "msm doesn't do it that way, so we shouldn't" responses and his attitude
> towards cherry-picking my development work [*]) is that it's much easier to
> keep the GPU MMU local to the GPU and under the control of the DRM MM code,
> rather than attaching the IOMMU to the DMA API and handling it that way.
>

Keep in mind that I tried to reach you several times via mail and irc
and you simply
ignored me. Did you know that took almost all of your patches (with
small changes)?
And I needed to cherry pick you patches as they were a) wrong, b) solved in a
different way or c) had "hack" in the subject. I am quite sorry that I
ended that
way, but it is not only my fault!

> There are several reasons for that:
>
> 1. DRM has a better idea about when the memory needs to be mapped to the
> GPU, and it can more effectively manage the GPU MMU.
>
> 2. The GPU MMU may have TLBs which can only be flushed via a command in
> the GPU command stream, so it's fundamentally necessary for the MMU to
> be managed by the GPU driver so that it knows when (and how) to insert
> the flushes.
>
>
> * - as a direct result of that, I've stopped all further development of
> etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver
> as the etnaviv drm API which Christian wants is completely incompatible
> with the non-etnaviv drm, and that just creates far too much pain in the
> DDX driver.
>

That is bad, but life moves on.

greets
--
Christian Gmeiner, MSc

https://soundcloud.com/christian-gmeiner

2015-02-03 14:04:07

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 2:48 AM, Daniel Vetter <[email protected]> wrote:
> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
>> On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
>> >> My initial thought is for dma-buf to not try to prevent something than
>> >> an exporter can actually do.. I think the scenario you describe could
>> >> be handled by two sg-lists, if the exporter was clever enough.
>> >
>> > That's already needed, each attachment has it's own sg-list. After all
>> > there's no array of dma_addr_t in the sg tables, so you can't use one sg
>> > for more than one mapping. And due to different iommu different devices
>> > can easily end up with different addresses.
>>
>>
>> Well, to be fair it may not be explicitly stated, but currently one
>> should assume the dma_addr_t's in the dmabuf sglist are bogus. With
>> gpu's that implement per-process/context page tables, I'm not really
>> sure that there is a sane way to actually do anything else..
>
> Hm, what does per-process/context page tables have to do here? At least on
> i915 we have a two levels of page tables:
> - first level for vm/device isolation, used through dma api
> - 2nd level for per-gpu-context isolation and context switching, handled
> internally.
>
> Since atm the dma api doesn't have any context of contexts or different
> pagetables, I don't see who you could use that at all.

Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
drop use of dma-mapping entirely (incl the current call to dma_map_sg,
which I just need until we can use drm_cflush on arm), and
attach/detach iommu domains directly to implement context switches.
At that point, dma_addr_t really has no sensible meaning for me.

BR,
-R

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 14:20:29

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
> On Tue, Feb 3, 2015 at 2:48 AM, Daniel Vetter <[email protected]> wrote:
> > On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
> >> On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
> >> >> My initial thought is for dma-buf to not try to prevent something than
> >> >> an exporter can actually do.. I think the scenario you describe could
> >> >> be handled by two sg-lists, if the exporter was clever enough.
> >> >
> >> > That's already needed, each attachment has it's own sg-list. After all
> >> > there's no array of dma_addr_t in the sg tables, so you can't use one sg
> >> > for more than one mapping. And due to different iommu different devices
> >> > can easily end up with different addresses.
> >>
> >>
> >> Well, to be fair it may not be explicitly stated, but currently one
> >> should assume the dma_addr_t's in the dmabuf sglist are bogus. With
> >> gpu's that implement per-process/context page tables, I'm not really
> >> sure that there is a sane way to actually do anything else..
> >
> > Hm, what does per-process/context page tables have to do here? At least on
> > i915 we have a two levels of page tables:
> > - first level for vm/device isolation, used through dma api
> > - 2nd level for per-gpu-context isolation and context switching, handled
> > internally.
> >
> > Since atm the dma api doesn't have any context of contexts or different
> > pagetables, I don't see who you could use that at all.
>
> Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
> drop use of dma-mapping entirely (incl the current call to dma_map_sg,
> which I just need until we can use drm_cflush on arm), and
> attach/detach iommu domains directly to implement context switches.
> At that point, dma_addr_t really has no sensible meaning for me.

I think what you see here is a quite common hardware setup and we really
lack the right abstraction for it at the moment. Everybody seems to
work around it with a mix of the dma-mapping API and the iommu API.
These are doing different things, and even though the dma-mapping API
can be implemented on top of the iommu API, they are not really compatible.

The drm_clflush helpers don't seem like the right solution to me,
because all other devices outside of drm will face the same issue,
and I suspect we should fill the missing gaps in the API in a
more generic way.

Arnd

2015-02-03 14:25:36

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 7:28 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
>> On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
>> > On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter <[email protected]> wrote:
>> > >> My initial thought is for dma-buf to not try to prevent something than
>> > >> an exporter can actually do.. I think the scenario you describe could
>> > >> be handled by two sg-lists, if the exporter was clever enough.
>> > >
>> > > That's already needed, each attachment has it's own sg-list. After all
>> > > there's no array of dma_addr_t in the sg tables, so you can't use one sg
>> > > for more than one mapping. And due to different iommu different devices
>> > > can easily end up with different addresses.
>> >
>> >
>> > Well, to be fair it may not be explicitly stated, but currently one
>> > should assume the dma_addr_t's in the dmabuf sglist are bogus. With
>> > gpu's that implement per-process/context page tables, I'm not really
>> > sure that there is a sane way to actually do anything else..
>>
>> Hm, what does per-process/context page tables have to do here? At least on
>> i915 we have a two levels of page tables:
>> - first level for vm/device isolation, used through dma api
>> - 2nd level for per-gpu-context isolation and context switching, handled
>> internally.
>>
>> Since atm the dma api doesn't have any context of contexts or different
>> pagetables, I don't see who you could use that at all.
>
> What I've found with *my* etnaviv drm implementation (not Christian's - I
> found it impossible to work with Christian, especially with the endless
> "msm doesn't do it that way, so we shouldn't" responses and his attitude
> towards cherry-picking my development work [*]) is that it's much easier to
> keep the GPU MMU local to the GPU and under the control of the DRM MM code,
> rather than attaching the IOMMU to the DMA API and handling it that way.
>
> There are several reasons for that:
>
> 1. DRM has a better idea about when the memory needs to be mapped to the
> GPU, and it can more effectively manage the GPU MMU.
>
> 2. The GPU MMU may have TLBs which can only be flushed via a command in
> the GPU command stream, so it's fundamentally necessary for the MMU to
> be managed by the GPU driver so that it knows when (and how) to insert
> the flushes.
>

If gpu mmu needs some/all updates to happen from command-stream then
probably better to handle it internally..

That is a slightly different scenario from msm, where we have many
instances of the same iommu[*] scattered through the SoC in front of
various different devices.

BR,
-R

[*] at least from iommu register layout, same driver is used for all
instances.. but maybe the tlb+walker are maybe more tightly integrated
to the gpu, but that is just speculation on implementation details
based on some paper I found along the way

>
> * - as a direct result of that, I've stopped all further development of
> etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver
> as the etnaviv drm API which Christian wants is completely incompatible
> with the non-etnaviv drm, and that just creates far too much pain in the
> DDX driver.
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

2015-02-03 14:33:20

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 02:28:26PM +0100, Christian Gmeiner wrote:
> 2015-02-03 13:28 GMT+01:00 Russell King - ARM Linux <[email protected]>:
> > What I've found with *my* etnaviv drm implementation (not Christian's - I
> > found it impossible to work with Christian, especially with the endless
> > "msm doesn't do it that way, so we shouldn't" responses and his attitude
> > towards cherry-picking my development work [*]) is that it's much easier to
> > keep the GPU MMU local to the GPU and under the control of the DRM MM code,
> > rather than attaching the IOMMU to the DMA API and handling it that way.
> >
>
> Keep in mind that I tried to reach you several times via mail and irc
> and you simply
> ignored me. Did you know that took almost all of your patches (with
> small changes)?
> And I needed to cherry pick you patches as they were a) wrong, b) solved in a
> different way or c) had "hack" in the subject. I am quite sorry that I
> ended that
> way, but it is not only my fault!

Exactly - you *took* every patch that I published whether I was ready
for you to take it or not.

That's not how kernel development works. Kernel development works by
people working on the code, and *pushing* patches or git trees upstream.
It doesn't work by having people running around *taking* patches from
people just because they feel like it.

I asked you several times not to do that which means the only way I can
control you is by *not* publishing my changes, thereby denying other
people the ability to test my changes.

Another result of you *taking* patches from me is that you totally
*screwed* my ability to work with you. If you make this stuff
unnecessary hard, you can expect people to walk away, and that's
precisely what I've done.

There's also the issue of you *taking* my patches and then applying
them to your tree with your own modifications, again *screwing* my tree,
and screwing my ability - again - to work with you.

Many of my patches in your repository are also marked as you being the
author of them... which _really_ is not nice.

Your "review" comments of "based 1:1 on the MSM driver" were really crazy.
Just because one DRM driver does something one way does not make it the
only way, nor does it make it suitable for use everywhere, even if you
modelled your driver on MSM. It certainly doesn't mean that the way the
MSM driver does it is correct either.

And frankly, you calling my patches "wrong" is laughable. I have a stable
fully functional Xorg DDX driver here which works with my version of your
etnaviv DRM across several Vivante GPUs - GC660 on Dove, and two revisions
of GC320 on iMX6 (which are notoriously buggy). No kernel oopses, no GPU
lockups. I've had machines with uptimes of a month here with it, with the
driver being exercised frequently during that period.

You refused to take things such as the DMA address monitoring for GPU
hangups to stop it mis-firing. This _is_ a bug fix. Without that, your
driver spits out random GPU hangups when there isn't actually any hangup
at all. *Reliably* so. Your excuse for not taking it was "The current
vivante driver doesn't do that." While that's true, the V2 Vivante
drivers _do_ do it in their "guard thread" - and they do it because -
as I already explained to you - it's entirely possible for the GPU to
take a long time, longer than your hangcheck timeout, to render a
series of 1080p operations. And again, not everything that the Vivante
drivers do is correct either. Jon and myself know that very well having
spent a long time debugging their GPL'd offerings.

Even the "hack" patch was mostly correct - the reason that it is labelled
as a "hack" is because - as the commit log said - it should really be
done by the MMUv1 code, but that requires your entire IOMMU _bodge_ to be
rewritten properly. Even the Vivante drivers use that region when they
can.

Then there's also the compatibility with the etnaviv library - which is
an important thing if you want people to make use of your driver. You
applied the patches for and then reverted which completely screws the
Xorg DDX driver, making it impossible to support both etnaviv and
etnadrm without having two almost identical copies of the same code. I
don't want to maintain almost identical copies of that same code, and
no one in their right mind would want that.

Having some level of sane user compatibility between etnaviv and
etnaviv drm will _gain_ you uses as it will allow people to write code
which works on both platforms - and it's really not difficult to do.
(In fact, I've proven it by writing a shim layer between the etnaviv
API and the DRM interfaces in the DDX driver.)

Then there's the round-robin IOMMU address space allocation, which is
required to avoid corrupted pixmaps (which is something that _your_
driver does very very well - in fact, so well that it corrupts system
memory), and the reaping of the IOMMU space when we run out of IOMMU
space to map.

Now, on to things that you do wrong.

There's your bodge with the component helper too which I'm disgusted with,
and I'll tell you now - your use of the component helper is wrong.

In your latest repository, there's this reserved-memory thing which you've
invented - to work around iMX6 with 2GB of RAM allocating the command
ring buffer at physical addresses >= 0x80000000. That's absolutely not
necessary, the GPU has physical offset registers which can be used to
program the lower 2G of MMUv1 space at the appropriate offset - and
there's good reasons to do that - it prevents the GPU being able to
access 0x00000000-0x10000000 which are where the peripheral registers
are on the iMX6 - it prevents the GPU being able to scribble into (eg)
the SDRAM controller registers etc potentially taking the system down.
Yes, it won't work if we see 3G on iMX6, but that's something which can
be addressed in other ways (such as passing appropriate GFP_* flags.)

Here's a reminder of the kind of "high quality" review comments you
provided:

* staging: etnaviv: ensure cleanup of reservation object
commit 729b31a8dd07d5db744cc5cb32c28ebf2e8cadb5

This is based 1:1 on msm drm driver. I will talk with Rob
about it and will postpone this patch.

* staging: etnaviv: implement MMU reaping
commit 0c7f0736cc02ba83dea15d73e9fa98277839ca67

I will _NOT_ take this one.

* staging: etnaviv: stop the hangcheck timer mis-firing
commit f0c3d8f2bf81774aaf19284ec3287e343a772e63

Too hacky, current vivante drivers do not have this hack.
I will _NOT_ take this one.

* staging: etnaviv: increase iommu page table size to 512KiB
commit 610da4c465732f1c20495f8d7d9504ec25665bb0

I will _NOT_ take this one. I will postpone it.

* staging: etnaviv: fix fence wrapping for gem objects
commit 218e8ffce1d57a797e7f3234cd8ffd62fc4dab71

This is based 1:1 on msm drm driver. I will talk with Rob
about it and will postpone this patch.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 14:37:31

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
> Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
> drop use of dma-mapping entirely (incl the current call to dma_map_sg,
> which I just need until we can use drm_cflush on arm), and
> attach/detach iommu domains directly to implement context switches.
> At that point, dma_addr_t really has no sensible meaning for me.

So how do you intend to import from a subsystem which only gives you
the dma_addr_t?

If you aren't passing system memory, you have no struct page. You can't
fake up a struct page. What this means is that struct scatterlist can't
represent it any other way.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 14:41:28

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
> On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
> > Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
> > drop use of dma-mapping entirely (incl the current call to dma_map_sg,
> > which I just need until we can use drm_cflush on arm), and
> > attach/detach iommu domains directly to implement context switches.
> > At that point, dma_addr_t really has no sensible meaning for me.
>
> I think what you see here is a quite common hardware setup and we really
> lack the right abstraction for it at the moment. Everybody seems to
> work around it with a mix of the dma-mapping API and the iommu API.
> These are doing different things, and even though the dma-mapping API
> can be implemented on top of the iommu API, they are not really compatible.

I'd go as far as saying that the "DMA API on top of IOMMU" is more
intended to be for a system IOMMU for the bus in question, rather
than a device-level IOMMU.

If an IOMMU is part of a device, then the device should handle it
(maybe via an abstraction) and not via the DMA API. The DMA API should
be handing the bus addresses to the device driver which the device's
IOMMU would need to generate. (In other words, in this circumstance,
the DMA API shouldn't give you the device internal address.)

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 14:45:00

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 9:37 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
>> Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
>> drop use of dma-mapping entirely (incl the current call to dma_map_sg,
>> which I just need until we can use drm_cflush on arm), and
>> attach/detach iommu domains directly to implement context switches.
>> At that point, dma_addr_t really has no sensible meaning for me.
>
> So how do you intend to import from a subsystem which only gives you
> the dma_addr_t?
>
> If you aren't passing system memory, you have no struct page. You can't
> fake up a struct page. What this means is that struct scatterlist can't
> represent it any other way.

Tell the exporter to stop using carveouts, and give me proper memory
instead.. ;-)

Well, at least on these SoC's, I think the only valid use for carveout
memory is the bootloader splashscreen. And I was planning on just
hanging on to that for myself for fbdev scanout buffer or other
internal (non shared) usage..

BR,
-R

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

2015-02-03 14:53:29

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
> On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
> > On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
> > > Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
> > > drop use of dma-mapping entirely (incl the current call to dma_map_sg,
> > > which I just need until we can use drm_cflush on arm), and
> > > attach/detach iommu domains directly to implement context switches.
> > > At that point, dma_addr_t really has no sensible meaning for me.
> >
> > I think what you see here is a quite common hardware setup and we really
> > lack the right abstraction for it at the moment. Everybody seems to
> > work around it with a mix of the dma-mapping API and the iommu API.
> > These are doing different things, and even though the dma-mapping API
> > can be implemented on top of the iommu API, they are not really compatible.
>
> I'd go as far as saying that the "DMA API on top of IOMMU" is more
> intended to be for a system IOMMU for the bus in question, rather
> than a device-level IOMMU.
>
> If an IOMMU is part of a device, then the device should handle it
> (maybe via an abstraction) and not via the DMA API. The DMA API should
> be handing the bus addresses to the device driver which the device's
> IOMMU would need to generate. (In other words, in this circumstance,
> the DMA API shouldn't give you the device internal address.)

Exactly. And the abstraction that people choose at the moment is the
iommu API, for better or worse. It makes a lot of sense to use this
API if the same iommu is used for other devices as well (which is
the case on Tegra and probably a lot of others). Unfortunately the
iommu API lacks support for cache management, and probably other things
as well, because this was not an issue for the original use case
(device assignment on KVM/x86).

This could be done by adding explicit or implied cache management
to the IOMMU mapping interfaces, or by extending the dma-mapping
interfaces in a way that covers the use case of the device managing
its own address space, in addition to the existing coherent and
streaming interfaces.

Arnd

2015-02-03 14:58:33

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 09:44:57AM -0500, Rob Clark wrote:
> On Tue, Feb 3, 2015 at 9:37 AM, Russell King - ARM Linux
> <[email protected]> wrote:
> > On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
> >> Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
> >> drop use of dma-mapping entirely (incl the current call to dma_map_sg,
> >> which I just need until we can use drm_cflush on arm), and
> >> attach/detach iommu domains directly to implement context switches.
> >> At that point, dma_addr_t really has no sensible meaning for me.
> >
> > So how do you intend to import from a subsystem which only gives you
> > the dma_addr_t?
> >
> > If you aren't passing system memory, you have no struct page. You can't
> > fake up a struct page. What this means is that struct scatterlist can't
> > represent it any other way.
>
> Tell the exporter to stop using carveouts, and give me proper memory
> instead.. ;-)
>
> Well, at least on these SoC's, I think the only valid use for carveout
> memory is the bootloader splashscreen. And I was planning on just
> hanging on to that for myself for fbdev scanout buffer or other
> internal (non shared) usage..

I wasn't thinking about carveouts - as I already mentioned earlier in this
thread, it may be memory which couldn't possibly ever be system memory -
for example, a separate chunk of memory which is tightly coupled to the
graphics system but not so to the CPU.

In such a case, we wouldn't want to use that as normal system memory, but
we would want to allocate framebuffers and the like from it, and maybe
pass them around.

While it may not be appropriate for MSM, it's still something that needs
to be considered, because there may be (and I know there are) dmabuf
users which do pass memory this way.

So, what I'm saying is that for the purposes of the dmabuf API, we can't
mandate that the scatterlists will contain a valid struct page pointer.
It'd probably be a good idea for the importer to validate the scatterlist
at import time if it has this requirement.

However, thinking about this more, I think that from a generic design
point of view, we really should limit the "struct page" usage to a
special MSM-ism - something which should definitely not be copied by
other drivers. As has been mentioned previously, if there is a system
MMU which needs to be programmed to map system memory onto the bus, the
struct page becomes absolutely useless, and the only thing that gives
you the correct "handle" to that memory is the dma_addr_t.

Finally, note that n_mapped = dma_map_sg(dev, sg, n_ent, dir) - n_mapped
can be less than n_ent when there's the presence of an IOMMU, since an
IOMMU is permitted to coalesce individual scatterlist entries if it
so chooses, and when walking the scatterlist for DMA purposes, the
scatterlist sg_dma_*() accessors should be used, and it should be
iterated over from 0 to n_mapped, not 0 to n_ent. It's important to
realise that in driver code, sg->length may not be the same as
sg_dma_len(sg) for exactly this reason:

#ifdef CONFIG_NEED_SG_DMA_LENGTH
#define sg_dma_len(sg) ((sg)->dma_length)
#else
#define sg_dma_len(sg) ((sg)->length)
#endif

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 15:19:40

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 9:41 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
>> On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
>> > Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
>> > drop use of dma-mapping entirely (incl the current call to dma_map_sg,
>> > which I just need until we can use drm_cflush on arm), and
>> > attach/detach iommu domains directly to implement context switches.
>> > At that point, dma_addr_t really has no sensible meaning for me.
>>
>> I think what you see here is a quite common hardware setup and we really
>> lack the right abstraction for it at the moment. Everybody seems to
>> work around it with a mix of the dma-mapping API and the iommu API.
>> These are doing different things, and even though the dma-mapping API
>> can be implemented on top of the iommu API, they are not really compatible.
>
> I'd go as far as saying that the "DMA API on top of IOMMU" is more
> intended to be for a system IOMMU for the bus in question, rather
> than a device-level IOMMU.
>
> If an IOMMU is part of a device, then the device should handle it
> (maybe via an abstraction) and not via the DMA API. The DMA API should
> be handing the bus addresses to the device driver which the device's
> IOMMU would need to generate. (In other words, in this circumstance,
> the DMA API shouldn't give you the device internal address.)

if the dma_addr_t becomes the address upstream of the iommu (in
practice, the phys addr), that would, I think, address my concerns
about dma_addr_t

BR,
-R

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

2015-02-03 15:22:30

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 03:52:48PM +0100, Arnd Bergmann wrote:
> On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
> > I'd go as far as saying that the "DMA API on top of IOMMU" is more
> > intended to be for a system IOMMU for the bus in question, rather
> > than a device-level IOMMU.
> >
> > If an IOMMU is part of a device, then the device should handle it
> > (maybe via an abstraction) and not via the DMA API. The DMA API should
> > be handing the bus addresses to the device driver which the device's
> > IOMMU would need to generate. (In other words, in this circumstance,
> > the DMA API shouldn't give you the device internal address.)
>
> Exactly. And the abstraction that people choose at the moment is the
> iommu API, for better or worse. It makes a lot of sense to use this
> API if the same iommu is used for other devices as well (which is
> the case on Tegra and probably a lot of others). Unfortunately the
> iommu API lacks support for cache management, and probably other things
> as well, because this was not an issue for the original use case
> (device assignment on KVM/x86).
>
> This could be done by adding explicit or implied cache management
> to the IOMMU mapping interfaces, or by extending the dma-mapping
> interfaces in a way that covers the use case of the device managing
> its own address space, in addition to the existing coherent and
> streaming interfaces.

Don't we already have those in the DMA API? dma_sync_*() ?

dma_map_sg() - sets up the system MMU and deals with initial cache
coherency handling. Device IOMMU being the responsibility of the
GPU driver.

The GPU can then do dma_sync_*() on the scatterlist as is necessary
to synchronise the cache coherency (while respecting the ownership
rules - which are very important on ARM to follow as some sync()s are
destructive to any dirty data in the CPU cache.)

dma_unmap_sg() tears down the system MMU and deals with the final cache
handling.

Why do we need more DMA API interfaces?

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 15:25:55

by Rob Clark

[permalink] [raw]
Subject: Re: [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 9:52 AM, Arnd Bergmann <[email protected]> wrote:
> On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
>> On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
>> > On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
>> > > Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to
>> > > drop use of dma-mapping entirely (incl the current call to dma_map_sg,
>> > > which I just need until we can use drm_cflush on arm), and
>> > > attach/detach iommu domains directly to implement context switches.
>> > > At that point, dma_addr_t really has no sensible meaning for me.
>> >
>> > I think what you see here is a quite common hardware setup and we really
>> > lack the right abstraction for it at the moment. Everybody seems to
>> > work around it with a mix of the dma-mapping API and the iommu API.
>> > These are doing different things, and even though the dma-mapping API
>> > can be implemented on top of the iommu API, they are not really compatible.
>>
>> I'd go as far as saying that the "DMA API on top of IOMMU" is more
>> intended to be for a system IOMMU for the bus in question, rather
>> than a device-level IOMMU.
>>
>> If an IOMMU is part of a device, then the device should handle it
>> (maybe via an abstraction) and not via the DMA API. The DMA API should
>> be handing the bus addresses to the device driver which the device's
>> IOMMU would need to generate. (In other words, in this circumstance,
>> the DMA API shouldn't give you the device internal address.)
>
> Exactly. And the abstraction that people choose at the moment is the
> iommu API, for better or worse. It makes a lot of sense to use this
> API if the same iommu is used for other devices as well (which is
> the case on Tegra and probably a lot of others). Unfortunately the
> iommu API lacks support for cache management, and probably other things
> as well, because this was not an issue for the original use case
> (device assignment on KVM/x86).
>
> This could be done by adding explicit or implied cache management
> to the IOMMU mapping interfaces, or by extending the dma-mapping
> interfaces in a way that covers the use case of the device managing
> its own address space, in addition to the existing coherent and
> streaming interfaces.

I think for gpu's, we'd prefer explicit and less abstraction.. which
is probably opposite of what every other driver would want

In the end, my eventual goal is explicit control of tlb flush, and
control of my address space. And in fact in some cases we are going
to want to use the gpu to bang on iommu registers to do context
switches and tlb flushes. (Which is obviously not the first step..
and something that is fairly difficult to get right/secure.. but the
performance win seems significant so I'm not sure we can avoid it.)

BR,
-R

>
> Arnd

2015-02-03 15:31:26

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 15:22:05 Russell King - ARM Linux wrote:
> On Tue, Feb 03, 2015 at 03:52:48PM +0100, Arnd Bergmann wrote:
> > On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
> > > I'd go as far as saying that the "DMA API on top of IOMMU" is more
> > > intended to be for a system IOMMU for the bus in question, rather
> > > than a device-level IOMMU.
> > >
> > > If an IOMMU is part of a device, then the device should handle it
> > > (maybe via an abstraction) and not via the DMA API. The DMA API should
> > > be handing the bus addresses to the device driver which the device's
> > > IOMMU would need to generate. (In other words, in this circumstance,
> > > the DMA API shouldn't give you the device internal address.)
> >
> > Exactly. And the abstraction that people choose at the moment is the
> > iommu API, for better or worse. It makes a lot of sense to use this
> > API if the same iommu is used for other devices as well (which is
> > the case on Tegra and probably a lot of others). Unfortunately the
> > iommu API lacks support for cache management, and probably other things
> > as well, because this was not an issue for the original use case
> > (device assignment on KVM/x86).
> >
> > This could be done by adding explicit or implied cache management
> > to the IOMMU mapping interfaces, or by extending the dma-mapping
> > interfaces in a way that covers the use case of the device managing
> > its own address space, in addition to the existing coherent and
> > streaming interfaces.
>
> Don't we already have those in the DMA API? dma_sync_*() ?
>
> dma_map_sg() - sets up the system MMU and deals with initial cache
> coherency handling. Device IOMMU being the responsibility of the
> GPU driver.

dma_sync_*() works with whatever comes out of dma_map_*(), true,
but this is not what they want to do here.

> The GPU can then do dma_sync_*() on the scatterlist as is necessary
> to synchronise the cache coherency (while respecting the ownership
> rules - which are very important on ARM to follow as some sync()s are
> destructive to any dirty data in the CPU cache.)
>
> dma_unmap_sg() tears down the system MMU and deals with the final cache
> handling.
>
> Why do we need more DMA API interfaces?

The dma_map_* interfaces assign the virtual addresses internally,
using typically either a global address space for all devices, or one
address space per device.

There are multiple things that this cannot do, and that is why the
drivers use the iommu API directly:

- use one address space per 'struct mm'
- map user memory with bus_address == user_address
- map memory into the GPU without having a permanent kernel mapping
- map memory first, and do the initial cache flushes later

Arnd

2015-02-03 15:54:31

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
> On Tuesday 03 February 2015 15:22:05 Russell King - ARM Linux wrote:
> > Don't we already have those in the DMA API? dma_sync_*() ?
> >
> > dma_map_sg() - sets up the system MMU and deals with initial cache
> > coherency handling. Device IOMMU being the responsibility of the
> > GPU driver.
>
> dma_sync_*() works with whatever comes out of dma_map_*(), true,
> but this is not what they want to do here.
>
> > The GPU can then do dma_sync_*() on the scatterlist as is necessary
> > to synchronise the cache coherency (while respecting the ownership
> > rules - which are very important on ARM to follow as some sync()s are
> > destructive to any dirty data in the CPU cache.)
> >
> > dma_unmap_sg() tears down the system MMU and deals with the final cache
> > handling.
> >
> > Why do we need more DMA API interfaces?
>
> The dma_map_* interfaces assign the virtual addresses internally,
> using typically either a global address space for all devices, or one
> address space per device.

We shouldn't be doing one address space per device for precisely this
reason. We should be doing one address space per *bus*. I did have
a nice diagram to illustrate the point in my previous email, but I
deleted it, I wish I hadn't... briefly:

Fig. 1.
+------------------+
|+-----+ device |
CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> |
|+-----+ |
+------------------+

Fig.1 represents what I'd call the "GPU" issue that we're talking about
in this thread.

Fig. 2.
CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device

The DMA API should be responsible (at the very least) for everything on
the left of "<iobus>" in and should be providing a dma_addr_t which is
representative of what the device (in Fig.1) as a whole sees. That's
the "system" part.

I believe this is the approach which is taken by x86 and similar platforms,
simply because they tend not to have an IOMMU on individual devices (and
if they did, eg, on a PCI card, it's clearly the responsibility of the
device driver.)

Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable.
For fig.2, it is entirely possible that the same device could appear
without an IOMMU, and in that scenario, you would want the IOMMU to be
handled transparently.

However, by doing so for everything, you run into exactly the problem
which is being discussed here - the need to separate out the cache
coherency from the IOMMU aspects. You probably also have a setup very
similar to fig.1 (which is certainly true of Vivante GPUs.)

If you have the need to separately control both, then using the DMA API
to encapsulate both does not make sense - at which point, the DMA API
should be responsible for the minimum only - in other words, everything
to the left of <iobus> (so including the system MMU.) The control of
the device IOMMU should be the responsibility of device driver in this
case.

So, dma_map_sg() would be responsible for dealing with the CPU cache
coherency issues, and setting up the system MMU. dma_sync_*() would
be responsible for the CPU cache coherency issues, and dma_unmap_sg()
would (again) deal with the CPU cache and tear down the system MMU
mappings.

Meanwhile, the device driver has ultimate control over its IOMMU, the
creation and destruction of mappings and context switches at the
appropriate times.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 16:13:08

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 15:54:04 Russell King - ARM Linux wrote:
> On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
> > The dma_map_* interfaces assign the virtual addresses internally,
> > using typically either a global address space for all devices, or one
> > address space per device.
>
> We shouldn't be doing one address space per device for precisely this
> reason. We should be doing one address space per *bus*. I did have
> a nice diagram to illustrate the point in my previous email, but I
> deleted it, I wish I hadn't... briefly:
>
> Fig. 1.
> +------------------+
> |+-----+ device |
> CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> |
> |+-----+ |
> +------------------+
>
> Fig.1 represents what I'd call the "GPU" issue that we're talking about
> in this thread.
>
> Fig. 2.
> CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
>
> The DMA API should be responsible (at the very least) for everything on
> the left of "<iobus>" in and should be providing a dma_addr_t which is
> representative of what the device (in Fig.1) as a whole sees. That's
> the "system" part.
>
> I believe this is the approach which is taken by x86 and similar platforms,
> simply because they tend not to have an IOMMU on individual devices (and
> if they did, eg, on a PCI card, it's clearly the responsibility of the
> device driver.)
>
> Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable.
> For fig.2, it is entirely possible that the same device could appear
> without an IOMMU, and in that scenario, you would want the IOMMU to be
> handled transparently.
>
> However, by doing so for everything, you run into exactly the problem
> which is being discussed here - the need to separate out the cache
> coherency from the IOMMU aspects. You probably also have a setup very
> similar to fig.1 (which is certainly true of Vivante GPUs.)
>
> If you have the need to separately control both, then using the DMA API
> to encapsulate both does not make sense - at which point, the DMA API
> should be responsible for the minimum only - in other words, everything
> to the left of <iobus> (so including the system MMU.) The control of
> the device IOMMU should be the responsibility of device driver in this
> case.
>
> So, dma_map_sg() would be responsible for dealing with the CPU cache
> coherency issues, and setting up the system MMU. dma_sync_*() would
> be responsible for the CPU cache coherency issues, and dma_unmap_sg()
> would (again) deal with the CPU cache and tear down the system MMU
> mappings.
>
> Meanwhile, the device driver has ultimate control over its IOMMU, the
> creation and destruction of mappings and context switches at the
> appropriate times.

I agree for the case you are describing here. From what I understood
from Rob was that he is looking at something more like:

Fig 3
CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device

where the IOMMU controls one or more contexts per device, and is
shared across GPU and non-GPU devices. Here, we need to use the
dmap-mapping interface to set up the IO page table for any device
that is unable to address all of system RAM, and we can use it
for purposes like isolation of the devices. There are also cases
where using the IOMMU is not optional.

So unlike the scenario you describe, the driver cannot at the
same time control the cache (using the dma-mapping API) and
the I/O page tables (using the iommu API or some internal
functions).

Arnd

2015-02-03 16:22:06

by Rob Clark

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann <[email protected]> wrote:
> I agree for the case you are describing here. From what I understood
> from Rob was that he is looking at something more like:
>
> Fig 3
> CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
>
> where the IOMMU controls one or more contexts per device, and is
> shared across GPU and non-GPU devices. Here, we need to use the
> dmap-mapping interface to set up the IO page table for any device
> that is unable to address all of system RAM, and we can use it
> for purposes like isolation of the devices. There are also cases
> where using the IOMMU is not optional.


Actually, just to clarify, the IOMMU instance is specific to the GPU..
not shared with other devices. Otherwise managing multiple contexts
would go quite badly..

But other devices have their own instance of the same IOMMU.. so same
driver could be used.

BR,
-R

2015-02-03 16:37:59

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 11:22:01 Rob Clark wrote:
> On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann <[email protected]> wrote:
> > I agree for the case you are describing here. From what I understood
> > from Rob was that he is looking at something more like:
> >
> > Fig 3
> > CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
> >
> > where the IOMMU controls one or more contexts per device, and is
> > shared across GPU and non-GPU devices. Here, we need to use the
> > dmap-mapping interface to set up the IO page table for any device
> > that is unable to address all of system RAM, and we can use it
> > for purposes like isolation of the devices. There are also cases
> > where using the IOMMU is not optional.
>
>
> Actually, just to clarify, the IOMMU instance is specific to the GPU..
> not shared with other devices. Otherwise managing multiple contexts
> would go quite badly..
>
> But other devices have their own instance of the same IOMMU.. so same
> driver could be used.

I think from the driver perspective, I'd view those two cases as
identical. Not sure if Russell agrees with that.

Arnd

2015-02-03 16:58:46

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 05:12:40PM +0100, Arnd Bergmann wrote:
> On Tuesday 03 February 2015 15:54:04 Russell King - ARM Linux wrote:
> > On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
> > > The dma_map_* interfaces assign the virtual addresses internally,
> > > using typically either a global address space for all devices, or one
> > > address space per device.
> >
> > We shouldn't be doing one address space per device for precisely this
> > reason. We should be doing one address space per *bus*. I did have
> > a nice diagram to illustrate the point in my previous email, but I
> > deleted it, I wish I hadn't... briefly:
> >
> > Fig. 1.
> > +------------------+
> > |+-----+ device |
> > CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> |
> > |+-----+ |
> > +------------------+
> >
> > Fig.1 represents what I'd call the "GPU" issue that we're talking about
> > in this thread.
> >
> > Fig. 2.
> > CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
> >
> > The DMA API should be responsible (at the very least) for everything on
> > the left of "<iobus>" in and should be providing a dma_addr_t which is
> > representative of what the device (in Fig.1) as a whole sees. That's
> > the "system" part.
> >
> > I believe this is the approach which is taken by x86 and similar platforms,
> > simply because they tend not to have an IOMMU on individual devices (and
> > if they did, eg, on a PCI card, it's clearly the responsibility of the
> > device driver.)
> >
> > Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable.
> > For fig.2, it is entirely possible that the same device could appear
> > without an IOMMU, and in that scenario, you would want the IOMMU to be
> > handled transparently.
> >
> > However, by doing so for everything, you run into exactly the problem
> > which is being discussed here - the need to separate out the cache
> > coherency from the IOMMU aspects. You probably also have a setup very
> > similar to fig.1 (which is certainly true of Vivante GPUs.)
> >
> > If you have the need to separately control both, then using the DMA API
> > to encapsulate both does not make sense - at which point, the DMA API
> > should be responsible for the minimum only - in other words, everything
> > to the left of <iobus> (so including the system MMU.) The control of
> > the device IOMMU should be the responsibility of device driver in this
> > case.
> >
> > So, dma_map_sg() would be responsible for dealing with the CPU cache
> > coherency issues, and setting up the system MMU. dma_sync_*() would
> > be responsible for the CPU cache coherency issues, and dma_unmap_sg()
> > would (again) deal with the CPU cache and tear down the system MMU
> > mappings.
> >
> > Meanwhile, the device driver has ultimate control over its IOMMU, the
> > creation and destruction of mappings and context switches at the
> > appropriate times.
>
> I agree for the case you are describing here. From what I understood
> from Rob was that he is looking at something more like:
>
> Fig 3
> CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
>
> where the IOMMU controls one or more contexts per device, and is
> shared across GPU and non-GPU devices. Here, we need to use the
> dmap-mapping interface to set up the IO page table for any device
> that is unable to address all of system RAM, and we can use it
> for purposes like isolation of the devices. There are also cases
> where using the IOMMU is not optional.

Okay, but switching contexts is not something which the DMA API has
any knowledge of (so it can't know which context to associate with
which mapping.) While it knows which device, it has no knowledge
(nor is there any way for it to gain knowledge) about contexts.

My personal view is that extending the DMA API in this way feels quite
dirty - it's a violation of the DMA API design, which is to (a) demark
the buffer ownership between CPU and DMA agent, and (b) to translate
buffer locations into a cookie which device drivers can use to instruct
their device to access that memory. To see why, consider... that you
map a buffer to a device in context A, and then you switch to context B,
which means the dma_addr_t given previously is no longer valid. You
then try to unmap it... which is normally done using the (now no longer
valid) dma_addr_t.

It seems to me that to support this at DMA API level, we would need to
completely revamp the DMA API, which IMHO isn't going to be nice. (It
would mean that we end up with three APIs - the original PCI DMA API,
the existing DMA API, and some new DMA API.)

Do we have any views on how common this feature is?

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 17:02:03

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 11:22:01AM -0500, Rob Clark wrote:
> On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann <[email protected]> wrote:
> > I agree for the case you are describing here. From what I understood
> > from Rob was that he is looking at something more like:
> >
> > Fig 3
> > CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
> >
> > where the IOMMU controls one or more contexts per device, and is
> > shared across GPU and non-GPU devices. Here, we need to use the
> > dmap-mapping interface to set up the IO page table for any device
> > that is unable to address all of system RAM, and we can use it
> > for purposes like isolation of the devices. There are also cases
> > where using the IOMMU is not optional.
>
>
> Actually, just to clarify, the IOMMU instance is specific to the GPU..
> not shared with other devices. Otherwise managing multiple contexts
> would go quite badly..
>
> But other devices have their own instance of the same IOMMU.. so same
> driver could be used.

Okay, so that is my Fig.2 case, and we don't have to worry about Fig.3.

One thing I forgot in Fig.1/2 which my original did have were to mark
the system MMU as optional. (Think an ARM64 with SMMU into a 32-bit
peripheral bus.) Do we support stacked MMUs in the DMA API? We may
need to if we keep IOMMUs in the DMA API.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-02-03 17:35:36

by Rob Clark

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 11:58 AM, Russell King - ARM Linux
<[email protected]> wrote:
>
> Okay, but switching contexts is not something which the DMA API has
> any knowledge of (so it can't know which context to associate with
> which mapping.) While it knows which device, it has no knowledge
> (nor is there any way for it to gain knowledge) about contexts.
>
> My personal view is that extending the DMA API in this way feels quite
> dirty - it's a violation of the DMA API design, which is to (a) demark
> the buffer ownership between CPU and DMA agent, and (b) to translate
> buffer locations into a cookie which device drivers can use to instruct
> their device to access that memory. To see why, consider... that you
> map a buffer to a device in context A, and then you switch to context B,
> which means the dma_addr_t given previously is no longer valid. You
> then try to unmap it... which is normally done using the (now no longer
> valid) dma_addr_t.
>
> It seems to me that to support this at DMA API level, we would need to
> completely revamp the DMA API, which IMHO isn't going to be nice. (It
> would mean that we end up with three APIs - the original PCI DMA API,
> the existing DMA API, and some new DMA API.)
>
> Do we have any views on how common this feature is?
>

I can't think of cases outside of GPU's.. if it were more common I'd
be in favor of teaching dma api about multiple contexts, but right now
I think that would just amount to forcing a lot of churn on everyone
else for the benefit of GPU's.

IMHO it makes more sense for GPU drivers to bypass the dma api if they
need to. Plus, sooner or later, someone will discover that with some
trick or optimization they can get moar fps, but the extra layer of
abstraction will just be getting in the way.

BR,
-R

2015-02-03 20:03:31

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 05:36:59PM +0100, Arnd Bergmann wrote:
> On Tuesday 03 February 2015 11:22:01 Rob Clark wrote:
> > On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann <[email protected]> wrote:
> > > I agree for the case you are describing here. From what I understood
> > > from Rob was that he is looking at something more like:
> > >
> > > Fig 3
> > > CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
> > >
> > > where the IOMMU controls one or more contexts per device, and is
> > > shared across GPU and non-GPU devices. Here, we need to use the
> > > dmap-mapping interface to set up the IO page table for any device
> > > that is unable to address all of system RAM, and we can use it
> > > for purposes like isolation of the devices. There are also cases
> > > where using the IOMMU is not optional.
> >
> >
> > Actually, just to clarify, the IOMMU instance is specific to the GPU..
> > not shared with other devices. Otherwise managing multiple contexts
> > would go quite badly..
> >
> > But other devices have their own instance of the same IOMMU.. so same
> > driver could be used.
>
> I think from the driver perspective, I'd view those two cases as
> identical. Not sure if Russell agrees with that.

Imo whether the iommu is private to the device and required for gpu
functionality like context switching or shared across a bunch of devices
is fairly important. Assuming I understand this discussion correctly we
have two different things pulling in opposite directions:

- From a gpu functionality perspective we want to give the gpu driver full
control over the device-private iommu, pushing it out of the control of
the dma api. dma_map_sg would just map to whatever bus addresses that
iommu would need to use for generating access cycles.

This is the design used by every gpu driver we have in upstream thus far
(where you always have some on-gpu iommu/pagetable walker thing), on top
of whatever system iommu that might be there or not (which is then
managed by the dma apis).

- On many soc people love to reuse iommus with the same or similar
interface all over the place. The solution thus far adopted on arm
platforms is to write an iommu driver for those and then implement the
dma-api on top of this iommu.

But if we unconditionally do this then we rob the gpu driver's ability
to control its private iommu like it wants to, because a lot of the
functionality is lost behind the dma api abstraction.

Again assuming I'm not confused can't we just solve this by pushing the
dma api abstraction down one layer for just the gpu, and let it use its
private iommmu directly? Steps for binding a buffer would be:
1. dma_map_sg
2. Noodle the dma_addr_t out of the sg table and feed those into a 2nd
level mapping set up through the iommu api for the gpu-private mmu.

Again, this is what i915 and all the ttm based drivers already do, except
that we don't use the generic iommu interfaces but have our own (i915 has
its interface in i915_gem_gtt.c, ttm just calls them tt for translation
tables ...).

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 20:07:31

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 12:35:34PM -0500, Rob Clark wrote:
> On Tue, Feb 3, 2015 at 11:58 AM, Russell King - ARM Linux
> <[email protected]> wrote:
> >
> > Okay, but switching contexts is not something which the DMA API has
> > any knowledge of (so it can't know which context to associate with
> > which mapping.) While it knows which device, it has no knowledge
> > (nor is there any way for it to gain knowledge) about contexts.
> >
> > My personal view is that extending the DMA API in this way feels quite
> > dirty - it's a violation of the DMA API design, which is to (a) demark
> > the buffer ownership between CPU and DMA agent, and (b) to translate
> > buffer locations into a cookie which device drivers can use to instruct
> > their device to access that memory. To see why, consider... that you
> > map a buffer to a device in context A, and then you switch to context B,
> > which means the dma_addr_t given previously is no longer valid. You
> > then try to unmap it... which is normally done using the (now no longer
> > valid) dma_addr_t.
> >
> > It seems to me that to support this at DMA API level, we would need to
> > completely revamp the DMA API, which IMHO isn't going to be nice. (It
> > would mean that we end up with three APIs - the original PCI DMA API,
> > the existing DMA API, and some new DMA API.)
> >
> > Do we have any views on how common this feature is?
> >
>
> I can't think of cases outside of GPU's.. if it were more common I'd
> be in favor of teaching dma api about multiple contexts, but right now
> I think that would just amount to forcing a lot of churn on everyone
> else for the benefit of GPU's.
>
> IMHO it makes more sense for GPU drivers to bypass the dma api if they
> need to. Plus, sooner or later, someone will discover that with some
> trick or optimization they can get moar fps, but the extra layer of
> abstraction will just be getting in the way.

See my other reply, but all existing full-blown drivers don't bypass the
dma api. Instead it's just a two-level scheme:
1. First level is dma api. Might or might not contain a system iommu.
2. 2nd level is the gpu-private iommu which is also used for per context
address spaces. Thus far all drivers just rolled their own drivers for
this (it's kinda fused to the chips on x86 hw anyway), but it looks like
using the iommu api gives us a somewhat suitable abstraction for code
sharing.

Imo you need both, otherwise we start leaking stuff like cpu cache
flushing all over the place. Looking at i915 (where the dma api assumes
that everything is coherent, which is kinda not the case) that won't be
pretty. And there's still the issue that you might nest a system iommu
and a 2nd level iommu for per-context pagetables (this is real and what's
going on right now on intel hw).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-03 21:43:32

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 21:04:35 Daniel Vetter wrote:

> - On many soc people love to reuse iommus with the same or similar
> interface all over the place. The solution thus far adopted on arm
> platforms is to write an iommu driver for those and then implement the
> dma-api on top of this iommu.
>
> But if we unconditionally do this then we rob the gpu driver's ability
> to control its private iommu like it wants to, because a lot of the
> functionality is lost behind the dma api abstraction.

I believe in case of rockchips, msm, exynos and tegra, the same
iommu is used directly by the DRM driver while it is used
implicitly by the dma-mapping API. We have run into some problems
with this in 3.19, but they should be solvable.

> Again assuming I'm not confused can't we just solve this by pushing the
> dma api abstraction down one layer for just the gpu, and let it use its
> private iommmu directly? Steps for binding a buffer would be:
> 1. dma_map_sg
> 2. Noodle the dma_addr_t out of the sg table and feed those into a 2nd
> level mapping set up through the iommu api for the gpu-private mmu.

If you want to do that, you run into the problem of telling the driver
core about it. We associate the device with an iommu in the device
tree, describing there how it is wired up.

The driver core creates a platform_device for this and checks if it
an iommu mapping is required or wanted for the device, which is then
set up. When the device driver wants to create its own iommu mapping,
this conflicts with the one that is already there. We can't just
skip the iommu setup for all devices because it may be needed sometimes,
and I don't really want to see hacks where the driver core knows which
devices are GPUs and skips the mapping for them, which would be a
layering violation.

> Again, this is what i915 and all the ttm based drivers already do, except
> that we don't use the generic iommu interfaces but have our own (i915 has
> its interface in i915_gem_gtt.c, ttm just calls them tt for translation
> tables ...).

Right, if you have a private iommu, there is no problem. The tricky part
is using a single driver for the system-level translation and the gpu
private mappings when there is only one type of iommu in the system.

Arnd

2015-02-03 21:45:44

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tuesday 03 February 2015 12:35:34 Rob Clark wrote:
>
> I can't think of cases outside of GPU's.. if it were more common I'd
> be in favor of teaching dma api about multiple contexts, but right now
> I think that would just amount to forcing a lot of churn on everyone
> else for the benefit of GPU's.

We have a couple of users of the iommu API at the moment outside of
GPUs:

* drivers/media/platform/omap3isp/isp.c
* drivers/remoteproc/remoteproc_core.c
* drivers/infiniband/hw/usnic/usnic_uiom.c
* drivers/vfio/

I assume we will see a few more over time. The vfio case is the most
important one here, since that is what the iommu API was designed for.

Arnd

2015-02-03 22:07:08

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 3, 2015 at 10:42 PM, Arnd Bergmann <[email protected]> wrote:
>> Again assuming I'm not confused can't we just solve this by pushing the
>> dma api abstraction down one layer for just the gpu, and let it use its
>> private iommmu directly? Steps for binding a buffer would be:
>> 1. dma_map_sg
>> 2. Noodle the dma_addr_t out of the sg table and feed those into a 2nd
>> level mapping set up through the iommu api for the gpu-private mmu.
>
> If you want to do that, you run into the problem of telling the driver
> core about it. We associate the device with an iommu in the device
> tree, describing there how it is wired up.
>
> The driver core creates a platform_device for this and checks if it
> an iommu mapping is required or wanted for the device, which is then
> set up. When the device driver wants to create its own iommu mapping,
> this conflicts with the one that is already there. We can't just
> skip the iommu setup for all devices because it may be needed sometimes,
> and I don't really want to see hacks where the driver core knows which
> devices are GPUs and skips the mapping for them, which would be a
> layering violation.

I don't think you get a choice but to make gpus a special case.
There's a bunch of cases why the iommu private to the gpu is special:
- If there's gpu-private iommu at all you have a nice security
problem, and you must scan your cmd stream to make sure no gpu access
goes to arbitrary system memory. We kinda consider isolation between
clients optional, but isolation to everything else is mandatory. And
scanning the cmd stream in software has such big implications on the
design of your driver that you essentially need 2 different drivers.
Even if the IP block otherwise matches.
- If your iommu supports multiple address space then the gpu must
know. We've already covered this case.

So trying to wedge the dma api between the gpu and its private iommu
is imo the layering violation here. Imo the dma api only should
control an iommu for the gpu if:
- the iommu is shared (so can't be used for isolation and you need the
full blwon cmd scanner)
- it's a 2nd level iommu (e.g. what we have on i915) and there is
another private iommu.

Note that with private I only mean no other device can use it, I don't
mean whether it's on the same IP block or not (we even have an iommu
abstraction in i915 because the pagetable walkers are pretty much
separate from everything else and evolve mostly independently).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2015-02-04 00:14:57

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

On Tue, Feb 03, 2015 at 10:42:26PM +0100, Arnd Bergmann wrote:
> Right, if you have a private iommu, there is no problem. The tricky part
> is using a single driver for the system-level translation and the gpu
> private mappings when there is only one type of iommu in the system.

You've got a problem anyway with this approach. If you look at my
figure 2 and apply it to this scenario, you have two MMUs stacked
on top of each other. That's something that (afaik) we don't support,
but it's entirely possible that will come along with ARM64.

It may not be nice to have to treat GPUs specially, but I think we
really do need to, and forget the idea that the GPU's IOMMU (as
opposed to a system MMU) should appear in a generic form in DT.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.