LinuxLists.cc - Re: [RFC] [PATCH] DRM TTM Memory Manager patch

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

Eric Anholt wrote:

>On Thu, 2007-04-26 at 16:55 +1000, Dave Airlie wrote:
>
>
>>Hi,
>>
>>The patch is too big to fit on the list and I've no idea how we could
>>break it down further, it just happens to be a lot of new code..
>>
>>http://people.freedesktop.org/~airlied/ttm/0001-drm-Implement-TTM-Memory-manager-core-functionality.txt
>>
>>The patch header and diffstat are below,
>>
>>This isn't for integration yet but we'd like an initial review by
>>anyone with the spare time and inclination, there is a lot stuff
>>relying on getting this code bet into shape and into the kernel but
>>any cleanups people can suggest now especially to the user interfaces
>>would be appreciated as once we set that stuff in stone it'll be a
>>pain to change... also it doesn't have any driver side code, this is
>>just the generic pieces. I'll post the intel 915 side code later but
>>there isn't that much to it..
>>
>>It applies on top of my drm-2.6 git tree drm-mm branch....
>>
>>-----------------------------------------------------------------------------------------------------
>>
>>This patch brings in the TTM (Translation Table Maps) memory
>>management
>>system from Thomas Hellstrom at Tungsten Graphics.
>>
>>This patch only covers the core functionality and changes to the drm
>>core.
>>
>>The TTM memory manager enables dynamic mapping of memory objects in
>>and
>>out of the graphic card accessible memory (e.g. AGP), this implements
>>the AGP backend for TTM to be used by the i915 driver.
>>
>>
>
>
>I've been slow responding, but we've been talking a lot on IRC and at
>Intel about the TTM interface recently, and trying to come up with a
>concensus between us as to what we'd like to see.
>
>1) Multiplexed ioctls hurt
>The first issue I have with this version is the userland interface.
>You've got two ioctls for buffer management and once for fence
>management, yet these 3 ioctls are actually just attempting to be
>generic interfaces for around 25 actual functions you want to call
>(except for the unimplemented ones, drm_bo_fence and drm_bo_ref_fence).
>So there are quasi-generic arguments to these ioctls, where most of the
>members are ignored by any given function, but it's not obvious to the
>caller which ones. There's no comments or anything as to what the
>arguments to these functions are or what exactly they do. We've got 100
>generic ioctl numbers allocated and unused still, so I don't think we
>should be shy about having separate ioctls for separate functions, if
>this is the interface we expect to use going forward.
>
>
>
Right. This interface was in its infancy when there were only (without
looking to deeply) three generic IOCTLS left.

this is definitely a good point and I agree completely.

>2) Early microoptimizations
>There's also apparently an unused way to chain these function calls in a
>single ioctl call for the buffer object ioctl. This is one of a couple
>of microoptimizations at the expense of code clarity which have bothered
>me while reviewing the TTM code, when I'm guessing no testing was done
>to see if it was actually a bottleneck.
>
>
Yes. The function chaining is currently only used to validate buffer
lists. I still believe it is
needed for that functionality, a bit depending on what we want to be
able to change when a buffer is validated. But I can currently not see
any other use for it in the future.

>3) Fencing and flushing troubles
>I'm definitely concerned by the fencing interface. Right now, the
>driver is flushing caches and fencing every buffer with EXE and its
>driver-specific FLUSHED flag in dispatching command buffers. We almost
>surely don't want to be flushing for every batch buffer just in case
>someone wants to do CPU reads from something. However, with the current
>mechanism, if I fence my operation with just EXE and no flush, then
>somebody goes to map that buffer, they'll wait for the fence, but no
>flush will be emitted. The interface we've been imagining wouldn't have
>driver-specific fence flags, but instead be only a marker of when
>command execution has passed a certain point (which is what fencing is
>about, anyway). In validating buffers, you would pass whether they're
>for READ or WRITE as we currently do, and you'd put it on the unfenced
>read/write lists as appropriate. Add one buffer object function for
>emitting the flush, which would then determine whether the next
>fence-all-unfenced call would cover just the list of unfenced reads or
>the list of both unfenced reads and unfenced writes. Then, in mapping,
>check if it's on the unfenced-writes list and emit the flush and fence,
>and then wait for a fence on the buffer before continuing with the
>mapping.
>
>
>
Right. This functionality is actually available in the current code,
except that we have only one unfenced list and the fence flags indicate
what type of flushes are needed. There's even an implementation of the
intel sync flush mechanism in the fencing code. If the batch-buffer
flush is omitted the driver specific flush flag is not needed and the
fence mechanism will do a sync flush whenever needed and signal all
previous r/w fences.

Although very nice in theory this did not work well. Particularly fbo
applications with render-to-texture followed by a subimage update lost a
lot of performance, and the Intel sync flush mechanism carries quite
some latency making apps that use it very cpu-intensive, since it is
polling-only.

Emitting a flush to the ring when needed would carry with it a lot of
latency, although the cpu-intensity would not be that bad.

The current flush-after-all buffers default variant is simply there
because it performs best and with the least amount of CPU usage of all
variants I've tested. "gears" is a good example. You can take away the
after-batchbuffer-flush and remove the DRM_I915_FENCE_FLAG_FLUSHED flag
from intel_batchbuffer.c There's very little difference in rendering speed.

That said, gpus and different engines are very different in their need
for flushing. I believe the current fence type interface can do what you
are after on intel without limiting us to Intel-only 3D functionality.

Leaving Intel for a moment and looking, for example, at the Unichrome
video- and mpeg engines, they require the fence and then a wait for
video idle or mpeg idle (using polling) respectively.

That would work very well with the current interface and two
driver-specific fence types. But with your model we would need two
driver-specific unfenced list, and I'd love to get rid of the unfenced
list for good.

So if your main concern with fencing is with rendering performance and
not with the interface, I think the best thing is to implement and test
your strategy with the current interface. It's definitively feasible,
and if you see a big performance boost, lets go for it but lets also
think of a way that suits other hardware.

Personally I'd like to focus on how we can do fencing yet still support
nouveau-type drivers that like to do everything from user-space, and how
to report engine errors using fence objects. These are areas where the
current implementation is not really up to what's needed.

>4) Locking and deadlocking
>Right now, the main rendering path I'm seeing is something like this:
>
>lock_hardware() # I'm not really sure why this lock is here
>map(texture)
>write texture data
>unmap(texture)
>unlock_hardware()
>
>map(some_buffer) # GL lets you have buffers mapped for a long time
>write to some_buffer
>map(some_buffer_2)
>write to some_buffer_2
>
>map(batchbuffer) # start some actual rendering
>map(vertices)
>write vertices
>write commands
>unmap(vertices)
>unmap(batchbuffer)
>
>unmap(some_buffer) # the GL client called this
>unmap(some_buffer_2)
>
>lock_hardware()
>validate(batchbuffer)
>validate(vertices)
>validate(texture)
>validate(some_buffer)
>validate(some_buffer_2)
>validate(backbuffer)
>map_while_validated(batchbuffer)
>write relocations in batchbuffer
>unmap(batchbuffer)
>emit(batchbuffer)
>fence_unfenced()
>unlock_hardware()
>
>There's also a fallback path:
>lock_hardware()
>map(backbuffer)
>map(frontbuffer)
>map(vertices)
>map(some_buffer)
>rendering with cpu
>unmaps(...)
>unlock_hardware()
>
>Note: map blocks on fencing, and if you don't pass the magic
>"while-validated" flag, then validate blocks on unmap.
>
>If we've got GL buffer objects shared between clients, we've got easy
>deadlocks available:
>
>client A map buffer 1
> client B locks
> client B validates buffer 2
>client A map buffer 2 (block)
> client B validates buffer 1 (block)
>
>I'm not sure to what extent GL allows buffer object sharing, but with
>EXT_tfp we're going to need to be dealing with this for the
>X Server versus GL interactions.
>
>5) Fixing locking by not doing it
>So, it looks like the existing hardware lock isn't cutting it for
>avoiding deadlocks. Additionally, we'd like to get to a point where I
>can have some random app running things on my graphics card with my GUI
>server clueless that it's happening (more access control would be
>needed, but allowing lock pushdown is the main design issue here, I
>think). That pretty much means having a hardware lock that an app can
>hold for an arbitrary amount of time has to go away.
>
>Since we've got the lock_hardware() around any series of validates for
>rendering, we've been thinking about pushing that whole series of
>operations into one device-specific (probably?) ioctl:
>
>submit_buffers(buffer_list, relocation_list, bool full_state)
>
>There's a comment suggesting that this is a good idea on
>intel_batchbuffers already. This submit_buffers would do:
>
>choose offsets to validate into
>perform relocations
>validate the buffers with the given flags
>emit flush if indicated
>emit the fence
>
>
>
>For multi-context hardware, the kernel then gets to choose which
>rendering context this batchbuffer will be run on. If it can't give you
>a context with the same state, it returns EBUSY or whatever and you
>resubmit with full state upload and the flag saying so. Additionally,
>the kernel can then do reasonable simulation of multi-context hardware
>on hardware with context save/restore, which we haven't been doing
>before.
>
>Also, since the kernel has all the buffers needed for validate at once,
>it can be smarter about placing the allocations, as right now you could
>possibly get aperture fragmentation from early validates which prevent
>later buffer validations from succeeding.
>
>With this and removing what appears to be an unnecessary hardware lock
>from the texture write, we've got userland locking out of the hardware
>rendering path, we've got a nice system for multi-context hardware and
>the ability for the kernel to simulate multiple contexts if not, and we
>can remove most of the userland interface, getting us down to:
>
>bo_create
>bo_map
>bo_unmap
>bo_ref
>bo_unref
>submit_buffers
>fence_reference
>fence_unreference
>fence_wait
>
>
>
Yes. This or something similar is definitely the right way to go.
The submit super-ioctl is something that has been on the todo list for
quite some time. If it then also means that we can remove much of the
interface code, even better!

>That's a lot less than 25 entrypoints, and I like that. Plus, with this
>and the DRM modesetting work, adding access control and batchbuffer
>validation is "all" that's needed to get the kernel scheduling arbitrary
>programs doing random things on the card with your gui unaware of it.
>
>But this still leaves the map versus validate deadlocks. The
>closest we've come up with to a solution is: "Maps never block on maps,
>validate blocks on unmap, and map blocks on (write-flushed if necessary)
>fence completion. It is up to threads performing submit_buffers and
>maps on shared objects to arrange external synchronization to avoid
>deadlock." This has the advantage that those threads likely have some
>sort of protocol requirements for rendering consistency between those
>objects, and are going to be doing some sort of synchronization anyway.
>I'm just not sure yet on how feasible it is to bend GL into this model.
>
>
>
Hmm, yes, that's an intricate question. I suspect that the texture lock
is actually a workaround for the deadlock you are illustrating in the
shared texture case.

It might be possible to find schemes that work around this. One way
could possibly be to have a buffer mapping -and validate order for
shared buffers.

If a client tries to map shared buffers out-of-order, drm could
automatically do an unmap - remap of previously mapped buffers that
would violate that order.

/Thomas

2007-05-04 04:32:57

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellström wrote:

> It might be possible to find schemes that work around this. One way
> could possibly be to have a buffer mapping -and validate order for
> shared buffers.

If mapping never blocks on anything other than the fence, then there
isn't any dead lock possibility. What this says is that ordering of
rendering between clients is *not DRMs problem*. I think that's a good
solution though; I want to let multiple apps work on DRM-able memory
with their own CPU without contention.

I don't recall if Eric layed out the proposed rules, but:

1) Map never blocks on map. Clients interested in dealing with this
are on their own.

2) Submit blocks on map. You must unmap all buffers before submitting
them. Doing the relocations in the kernel makes this all possible.

3) Map blocks on the fence from submit. We can play with pending the
flush until the app asks for the buffer back, or we can play with
figuring out when flushes are useful automatically. Doesn't matter
if the policy is in the kernel.

I'm interested in making deadlock avoidence trivial and eliminating any
map-map contention.

--
[email protected]

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-05-04 08:24:16

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

Keith Packard wrote:
> On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellström wrote:
>
>
>> It might be possible to find schemes that work around this. One way
>> could possibly be to have a buffer mapping -and validate order for
>> shared buffers.
>>
>
> If mapping never blocks on anything other than the fence, then there
> isn't any dead lock possibility. What this says is that ordering of
> rendering between clients is *not DRMs problem*. I think that's a good
> solution though; I want to let multiple apps work on DRM-able memory
> with their own CPU without contention.
>
> I don't recall if Eric layed out the proposed rules, but:
>
> 1) Map never blocks on map. Clients interested in dealing with this
> are on their own.
>
> 2) Submit blocks on map. You must unmap all buffers before submitting
> them. Doing the relocations in the kernel makes this all possible.
>
> 3) Map blocks on the fence from submit. We can play with pending the
> flush until the app asks for the buffer back, or we can play with
> figuring out when flushes are useful automatically. Doesn't matter
> if the policy is in the kernel.
>
> I'm interested in making deadlock avoidence trivial and eliminating any
> map-map contention.
>
>
It's rare to have two clients access the same buffer at the same time.
In what situation will this occur?

If we think of map / unmap and validation / fence as taking a buffer
mutex either for the CPU or for the GPU, that's the way implementation
is done today. The CPU side of the mutex should IIRC be per-client
recursive. OTOH, the TTM implementation won't stop the CPU from
accessing the buffer when it is unmapped, but then you're on your own.
"Mutexes" need to be taken in the correct order, otherwise a deadlock
will occur, and GL will, as outlined in Eric's illustration, more or
less encourage us to take buffers in the "incorrect" order.

In essence what you propose is to eliminate the deadlock problem by just
avoiding taking the buffer mutex unless we know the GPU has it. I see
two problems with this:

* It will encourage different DRI clients to simultaneously access
the same buffer.
* Inter-client and GPU data coherence can be guaranteed if we issue
a mb() / write-combining flush with the unmap operation (which,
BTW, I'm not sure is done today). Otherwise it is up to the
clients, and very easy to forget.

I'm a bit afraid we might in the future regret taking the easy way out?

OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
buffers in the correct order might not be the best one either. It will
certainly mean some CPU overhead and what if we have to do the same with
buffer validation? (Yes for some operations with thousands and thousands
of relocations, the user space validation might need to stay).

Personally, I'm slightly biased towards having DRM resolve the deadlock,
but I think any solution will do as long as the implications and why we
choose a certain solution are totally clear.

For item 3) above the kernel must have a way to issue a flush when
needed for buffer eviction.
The current implementation also requires the buffer to be completely
flushed before mapping.
Other than that the flushing policy is currently completely up to the
DRM drivers.

/Thomas

2007-05-04 08:49:08

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
> Keith Packard wrote:
> > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellstr?m wrote:
> >
> >
> >> It might be possible to find schemes that work around this. One way
> >> could possibly be to have a buffer mapping -and validate order for
> >> shared buffers.
> >>
> >
> > If mapping never blocks on anything other than the fence, then there
> > isn't any dead lock possibility. What this says is that ordering of
> > rendering between clients is *not DRMs problem*. I think that's a good
> > solution though; I want to let multiple apps work on DRM-able memory
> > with their own CPU without contention.
> >
> > I don't recall if Eric layed out the proposed rules, but:
> >
> > 1) Map never blocks on map. Clients interested in dealing with this
> > are on their own.
> >
> > 2) Submit blocks on map. You must unmap all buffers before submitting
> > them. Doing the relocations in the kernel makes this all possible.
> >
> > 3) Map blocks on the fence from submit. We can play with pending the
> > flush until the app asks for the buffer back, or we can play with
> > figuring out when flushes are useful automatically. Doesn't matter
> > if the policy is in the kernel.
> >
> > I'm interested in making deadlock avoidence trivial and eliminating any
> > map-map contention.
> >
> >
> It's rare to have two clients access the same buffer at the same time.
> In what situation will this occur?
>
> If we think of map / unmap and validation / fence as taking a buffer
> mutex either for the CPU or for the GPU, that's the way implementation
> is done today. The CPU side of the mutex should IIRC be per-client
> recursive. OTOH, the TTM implementation won't stop the CPU from
> accessing the buffer when it is unmapped, but then you're on your own.
> "Mutexes" need to be taken in the correct order, otherwise a deadlock
> will occur, and GL will, as outlined in Eric's illustration, more or
> less encourage us to take buffers in the "incorrect" order.
>
> In essence what you propose is to eliminate the deadlock problem by just
> avoiding taking the buffer mutex unless we know the GPU has it. I see
> two problems with this:
>
> * It will encourage different DRI clients to simultaneously access
> the same buffer.
> * Inter-client and GPU data coherence can be guaranteed if we issue
> a mb() / write-combining flush with the unmap operation (which,
> BTW, I'm not sure is done today). Otherwise it is up to the
> clients, and very easy to forget.
>
> I'm a bit afraid we might in the future regret taking the easy way out?
>
> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
> buffers in the correct order might not be the best one either. It will
> certainly mean some CPU overhead and what if we have to do the same with
> buffer validation? (Yes for some operations with thousands and thousands
> of relocations, the user space validation might need to stay).
>
> Personally, I'm slightly biased towards having DRM resolve the deadlock,
> but I think any solution will do as long as the implications and why we
> choose a certain solution are totally clear.
>
> For item 3) above the kernel must have a way to issue a flush when
> needed for buffer eviction.
> The current implementation also requires the buffer to be completely
> flushed before mapping.
> Other than that the flushing policy is currently completely up to the
> DRM drivers.
>
> /Thomas

I might say stupid things as i don't think i fully understand all
the input to this problem. Anyway here is my thought on all this:

1) First client map never block (as in Keith layout) except on
fence from drm side (point 3 in Keith layout)
2) Client should always unmap buffer before submitting (as in Keith layout)
3) In drm side you always acquire buffer lock in a give order for
instance each buffer got and id and you lock from smaller id to bigger
one (with a clever implementation the cost for that will be small)
4) We got 2 gpu queue:
- one with pending apps ask in which we do all stuff necessary
to be done before submitting (locking buffer, validation, ...)
for instance we might wait here for each buffer that are still
mapped by some other apps in user space
- one run queue in which we add each apps ask that are now
ready to be submited to the gpu

Of course in this scheme we keep the fencing stuff so user space can
know when it safe to use previously submited buffer again. The outcome
of having two seperate queue in drm is that if two apps lockup each other
other apps can still use the gpu so only the apps fighting for a buffer will
suffer.

And for user space synchronization i believe it a user space problem i.e.
it's up to user space to add proper synch. For instance as map doesn't
block for any client in user space two apps can mess with same buffer
it's up to user to have a policy to exclude each other (i believe this will
be a dri or xorg problem to synch btw consumer).

I believe in this scheme you can only have deadlock btw two buggy
apps (like one forgeting to unmap a buffer) or any apps that forgot
to unmap buffer before submitting won't cause damage to other
apps. I hope i properly explain my thought (hey my brain is in a sleepy
state right now).

best,
Jerome Glisse

2007-05-04 09:40:10

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On 5/4/07, Jerome Glisse <[email protected]> wrote:
> On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
> > Keith Packard wrote:
> > > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellstr?m wrote:
> > >
> > >
> > >> It might be possible to find schemes that work around this. One way
> > >> could possibly be to have a buffer mapping -and validate order for
> > >> shared buffers.
> > >>
> > >
> > > If mapping never blocks on anything other than the fence, then there
> > > isn't any dead lock possibility. What this says is that ordering of
> > > rendering between clients is *not DRMs problem*. I think that's a good
> > > solution though; I want to let multiple apps work on DRM-able memory
> > > with their own CPU without contention.
> > >
> > > I don't recall if Eric layed out the proposed rules, but:
> > >
> > > 1) Map never blocks on map. Clients interested in dealing with this
> > > are on their own.
> > >
> > > 2) Submit blocks on map. You must unmap all buffers before submitting
> > > them. Doing the relocations in the kernel makes this all possible.
> > >
> > > 3) Map blocks on the fence from submit. We can play with pending the
> > > flush until the app asks for the buffer back, or we can play with
> > > figuring out when flushes are useful automatically. Doesn't matter
> > > if the policy is in the kernel.
> > >
> > > I'm interested in making deadlock avoidence trivial and eliminating any
> > > map-map contention.
> > >
> > >
> > It's rare to have two clients access the same buffer at the same time.
> > In what situation will this occur?
> >
> > If we think of map / unmap and validation / fence as taking a buffer
> > mutex either for the CPU or for the GPU, that's the way implementation
> > is done today. The CPU side of the mutex should IIRC be per-client
> > recursive. OTOH, the TTM implementation won't stop the CPU from
> > accessing the buffer when it is unmapped, but then you're on your own.
> > "Mutexes" need to be taken in the correct order, otherwise a deadlock
> > will occur, and GL will, as outlined in Eric's illustration, more or
> > less encourage us to take buffers in the "incorrect" order.
> >
> > In essence what you propose is to eliminate the deadlock problem by just
> > avoiding taking the buffer mutex unless we know the GPU has it. I see
> > two problems with this:
> >
> > * It will encourage different DRI clients to simultaneously access
> > the same buffer.
> > * Inter-client and GPU data coherence can be guaranteed if we issue
> > a mb() / write-combining flush with the unmap operation (which,
> > BTW, I'm not sure is done today). Otherwise it is up to the
> > clients, and very easy to forget.
> >
> > I'm a bit afraid we might in the future regret taking the easy way out?
> >
> > OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
> > buffers in the correct order might not be the best one either. It will
> > certainly mean some CPU overhead and what if we have to do the same with
> > buffer validation? (Yes for some operations with thousands and thousands
> > of relocations, the user space validation might need to stay).
> >
> > Personally, I'm slightly biased towards having DRM resolve the deadlock,
> > but I think any solution will do as long as the implications and why we
> > choose a certain solution are totally clear.
> >
> > For item 3) above the kernel must have a way to issue a flush when
> > needed for buffer eviction.
> > The current implementation also requires the buffer to be completely
> > flushed before mapping.
> > Other than that the flushing policy is currently completely up to the
> > DRM drivers.
> >
> > /Thomas
>
> I might say stupid things as i don't think i fully understand all
> the input to this problem. Anyway here is my thought on all this:
>
> 1) First client map never block (as in Keith layout) except on
> fence from drm side (point 3 in Keith layout)
> 2) Client should always unmap buffer before submitting (as in Keith layout)
> 3) In drm side you always acquire buffer lock in a give order for
> instance each buffer got and id and you lock from smaller id to bigger
> one (with a clever implementation the cost for that will be small)
> 4) We got 2 gpu queue:
> - one with pending apps ask in which we do all stuff necessary
> to be done before submitting (locking buffer, validation, ...)
> for instance we might wait here for each buffer that are still
> mapped by some other apps in user space
> - one run queue in which we add each apps ask that are now
> ready to be submited to the gpu
>
> Of course in this scheme we keep the fencing stuff so user space can
> know when it safe to use previously submited buffer again. The outcome
> of having two seperate queue in drm is that if two apps lockup each other
> other apps can still use the gpu so only the apps fighting for a buffer will
> suffer.
>
> And for user space synchronization i believe it a user space problem i.e.
> it's up to user space to add proper synch. For instance as map doesn't
> block for any client in user space two apps can mess with same buffer
> it's up to user to have a policy to exclude each other (i believe this will
> be a dri or xorg problem to synch btw consumer).
>
> I believe in this scheme you can only have deadlock btw two buggy
> apps (like one forgeting to unmap a buffer) or any apps that forgot
> to unmap buffer before submitting won't cause damage to other
> apps. I hope i properly explain my thought (hey my brain is in a sleepy
> state right now).
>
> best,
> Jerome Glisse
>

Did i say i was sleepy :) ? I forgot to tell why we should always
acquire lock in same order in pending queue, we should do this
because several kernel thread will work on the pending queue.
While the running queue should only be processed by one kernel
thread; thus we do not need anymore a big lock for the GPU
only this single thread will really talk to the gpu (in fact we might
still need the lock if xorg ddx do mmio gpu regs access).

On a side note i think this scheme also fit well with gpu having
several context and which doesn't need big validation (read
nv gpu).

best,
Jerome Glisse

2007-05-04 11:04:05

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

Jerome Glisse wrote:
> On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
>> Keith Packard wrote:
>> > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellstr?m wrote:
>> >
>> >
>> >> It might be possible to find schemes that work around this. One way
>> >> could possibly be to have a buffer mapping -and validate order for
>> >> shared buffers.
>> >>
>> >
>> > If mapping never blocks on anything other than the fence, then there
>> > isn't any dead lock possibility. What this says is that ordering of
>> > rendering between clients is *not DRMs problem*. I think that's a good
>> > solution though; I want to let multiple apps work on DRM-able memory
>> > with their own CPU without contention.
>> >
>> > I don't recall if Eric layed out the proposed rules, but:
>> >
>> > 1) Map never blocks on map. Clients interested in dealing with this
>> > are on their own.
>> >
>> > 2) Submit blocks on map. You must unmap all buffers before submitting
>> > them. Doing the relocations in the kernel makes this all possible.
>> >
>> > 3) Map blocks on the fence from submit. We can play with pending the
>> > flush until the app asks for the buffer back, or we can play with
>> > figuring out when flushes are useful automatically. Doesn't matter
>> > if the policy is in the kernel.
>> >
>> > I'm interested in making deadlock avoidence trivial and eliminating
>> any
>> > map-map contention.
>> >
>> >
>> It's rare to have two clients access the same buffer at the same time.
>> In what situation will this occur?
>>
>> If we think of map / unmap and validation / fence as taking a buffer
>> mutex either for the CPU or for the GPU, that's the way implementation
>> is done today. The CPU side of the mutex should IIRC be per-client
>> recursive. OTOH, the TTM implementation won't stop the CPU from
>> accessing the buffer when it is unmapped, but then you're on your own.
>> "Mutexes" need to be taken in the correct order, otherwise a deadlock
>> will occur, and GL will, as outlined in Eric's illustration, more or
>> less encourage us to take buffers in the "incorrect" order.
>>
>> In essence what you propose is to eliminate the deadlock problem by just
>> avoiding taking the buffer mutex unless we know the GPU has it. I see
>> two problems with this:
>>
>> * It will encourage different DRI clients to simultaneously access
>> the same buffer.
>> * Inter-client and GPU data coherence can be guaranteed if we issue
>> a mb() / write-combining flush with the unmap operation (which,
>> BTW, I'm not sure is done today). Otherwise it is up to the
>> clients, and very easy to forget.
>>
>> I'm a bit afraid we might in the future regret taking the easy way out?
>>
>> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
>> buffers in the correct order might not be the best one either. It will
>> certainly mean some CPU overhead and what if we have to do the same with
>> buffer validation? (Yes for some operations with thousands and thousands
>> of relocations, the user space validation might need to stay).
>>
>> Personally, I'm slightly biased towards having DRM resolve the deadlock,
>> but I think any solution will do as long as the implications and why we
>> choose a certain solution are totally clear.
>>
>> For item 3) above the kernel must have a way to issue a flush when
>> needed for buffer eviction.
>> The current implementation also requires the buffer to be completely
>> flushed before mapping.
>> Other than that the flushing policy is currently completely up to the
>> DRM drivers.
>>
>> /Thomas
>
> I might say stupid things as i don't think i fully understand all
> the input to this problem. Anyway here is my thought on all this:
>
> 1) First client map never block (as in Keith layout) except on
> fence from drm side (point 3 in Keith layout)
>
But is there really a need for this except to avoid the above-mentioned
deadlock?
As I'm not too up to date with all the possibilities the servers and GL
clients may be using shared buffers,
I need some enlightenment :). Could we have an example, please?

> 4) We got 2 gpu queue:
> - one with pending apps ask in which we do all stuff
> necessary
> to be done before submitting (locking buffer,
> validation, ...)
> for instance we might wait here for each buffer that are
> still
> mapped by some other apps in user space
> - one run queue in which we add each apps ask that are now
> ready to be submited to the gpu

This is getting closer and closer to a GPU scheduler, an interesting
topic indeed.
Perhaps we should have a separate discussion on the needs and
requirements for such a thing?

Regards,
/Thomas

2007-05-04 11:57:21

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
> Jerome Glisse wrote:
> > On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
> >> Keith Packard wrote:
> >> > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellstr?m wrote:
> >> >
> >> >
> >> >> It might be possible to find schemes that work around this. One way
> >> >> could possibly be to have a buffer mapping -and validate order for
> >> >> shared buffers.
> >> >>
> >> >
> >> > If mapping never blocks on anything other than the fence, then there
> >> > isn't any dead lock possibility. What this says is that ordering of
> >> > rendering between clients is *not DRMs problem*. I think that's a good
> >> > solution though; I want to let multiple apps work on DRM-able memory
> >> > with their own CPU without contention.
> >> >
> >> > I don't recall if Eric layed out the proposed rules, but:
> >> >
> >> > 1) Map never blocks on map. Clients interested in dealing with this
> >> > are on their own.
> >> >
> >> > 2) Submit blocks on map. You must unmap all buffers before submitting
> >> > them. Doing the relocations in the kernel makes this all possible.
> >> >
> >> > 3) Map blocks on the fence from submit. We can play with pending the
> >> > flush until the app asks for the buffer back, or we can play with
> >> > figuring out when flushes are useful automatically. Doesn't matter
> >> > if the policy is in the kernel.
> >> >
> >> > I'm interested in making deadlock avoidence trivial and eliminating
> >> any
> >> > map-map contention.
> >> >
> >> >
> >> It's rare to have two clients access the same buffer at the same time.
> >> In what situation will this occur?
> >>
> >> If we think of map / unmap and validation / fence as taking a buffer
> >> mutex either for the CPU or for the GPU, that's the way implementation
> >> is done today. The CPU side of the mutex should IIRC be per-client
> >> recursive. OTOH, the TTM implementation won't stop the CPU from
> >> accessing the buffer when it is unmapped, but then you're on your own.
> >> "Mutexes" need to be taken in the correct order, otherwise a deadlock
> >> will occur, and GL will, as outlined in Eric's illustration, more or
> >> less encourage us to take buffers in the "incorrect" order.
> >>
> >> In essence what you propose is to eliminate the deadlock problem by just
> >> avoiding taking the buffer mutex unless we know the GPU has it. I see
> >> two problems with this:
> >>
> >> * It will encourage different DRI clients to simultaneously access
> >> the same buffer.
> >> * Inter-client and GPU data coherence can be guaranteed if we issue
> >> a mb() / write-combining flush with the unmap operation (which,
> >> BTW, I'm not sure is done today). Otherwise it is up to the
> >> clients, and very easy to forget.
> >>
> >> I'm a bit afraid we might in the future regret taking the easy way out?
> >>
> >> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
> >> buffers in the correct order might not be the best one either. It will
> >> certainly mean some CPU overhead and what if we have to do the same with
> >> buffer validation? (Yes for some operations with thousands and thousands
> >> of relocations, the user space validation might need to stay).
> >>
> >> Personally, I'm slightly biased towards having DRM resolve the deadlock,
> >> but I think any solution will do as long as the implications and why we
> >> choose a certain solution are totally clear.
> >>
> >> For item 3) above the kernel must have a way to issue a flush when
> >> needed for buffer eviction.
> >> The current implementation also requires the buffer to be completely
> >> flushed before mapping.
> >> Other than that the flushing policy is currently completely up to the
> >> DRM drivers.
> >>
> >> /Thomas
> >
> > I might say stupid things as i don't think i fully understand all
> > the input to this problem. Anyway here is my thought on all this:
> >
> > 1) First client map never block (as in Keith layout) except on
> > fence from drm side (point 3 in Keith layout)
> >
> But is there really a need for this except to avoid the above-mentioned
> deadlock?
> As I'm not too up to date with all the possibilities the servers and GL
> clients may be using shared buffers,
> I need some enlightenment :). Could we have an example, please?

I think the current main consumer would be compiz or any other
compositor which use TextureFromPixmap, i really think the we
might see further use of sharing graphical data among applications,
i got example here at my work of such use case even thought this
doesn't use GL at all but another indoor protocol. Another possible
case where such buffer sharing might occur is inside same application
with two or more GL context (i am ready to bet that we already have
some where example of such application).

> > 4) We got 2 gpu queue:
> > - one with pending apps ask in which we do all stuff
> > necessary
> > to be done before submitting (locking buffer,
> > validation, ...)
> > for instance we might wait here for each buffer that are
> > still
> > mapped by some other apps in user space
> > - one run queue in which we add each apps ask that are now
> > ready to be submited to the gpu
>
> This is getting closer and closer to a GPU scheduler, an interesting
> topic indeed.
> Perhaps we should have a separate discussion on the needs and
> requirements for such a thing?
>
> Regards,
> /Thomas

Hey ! I tried to hide that this was a dumb scheduler ;), i believe that
we need such things along memory management. A functionality that
is worth having is the possibility for the kernel to preempt the pending
queue to get exclusive access to the GPU (for instance when printing
kernel failure).

Maybe we should first discuss this in dri-devel and come back to
our lkml friends :) once we are happy with design. Even thought i
am sure there is people on lkml that definitely can help with design
considering their experiences on scheduler and memory management
and the whole synchronization problem.

best,
Jerome Glisse

2007-05-04 12:32:33

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

Jerome Glisse wrote:
> On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
>> Jerome Glisse wrote:
>> > On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
>> >> Keith Packard wrote:
>> >> > On Thu, 2007-05-03 at 01:01 +0200, Thomas Hellstr?m wrote:
>> >> >
>> >> >
>> >> >> It might be possible to find schemes that work around this. One
>> way
>> >> >> could possibly be to have a buffer mapping -and validate order for
>> >> >> shared buffers.
>> >> >>
>> >> >
>> >> > If mapping never blocks on anything other than the fence, then
>> there
>> >> > isn't any dead lock possibility. What this says is that ordering of
>> >> > rendering between clients is *not DRMs problem*. I think that's
>> a good
>> >> > solution though; I want to let multiple apps work on DRM-able
>> memory
>> >> > with their own CPU without contention.
>> >> >
>> >> > I don't recall if Eric layed out the proposed rules, but:
>> >> >
>> >> > 1) Map never blocks on map. Clients interested in dealing with
>> this
>> >> > are on their own.
>> >> >
>> >> > 2) Submit blocks on map. You must unmap all buffers before
>> submitting
>> >> > them. Doing the relocations in the kernel makes this all
>> possible.
>> >> >
>> >> > 3) Map blocks on the fence from submit. We can play with
>> pending the
>> >> > flush until the app asks for the buffer back, or we can play
>> with
>> >> > figuring out when flushes are useful automatically. Doesn't
>> matter
>> >> > if the policy is in the kernel.
>> >> >
>> >> > I'm interested in making deadlock avoidence trivial and eliminating
>> >> any
>> >> > map-map contention.
>> >> >
>> >> >
>> >> It's rare to have two clients access the same buffer at the same
>> time.
>> >> In what situation will this occur?
>> >>
>> >> If we think of map / unmap and validation / fence as taking a buffer
>> >> mutex either for the CPU or for the GPU, that's the way
>> implementation
>> >> is done today. The CPU side of the mutex should IIRC be per-client
>> >> recursive. OTOH, the TTM implementation won't stop the CPU from
>> >> accessing the buffer when it is unmapped, but then you're on your
>> own.
>> >> "Mutexes" need to be taken in the correct order, otherwise a deadlock
>> >> will occur, and GL will, as outlined in Eric's illustration, more or
>> >> less encourage us to take buffers in the "incorrect" order.
>> >>
>> >> In essence what you propose is to eliminate the deadlock problem
>> by just
>> >> avoiding taking the buffer mutex unless we know the GPU has it. I see
>> >> two problems with this:
>> >>
>> >> * It will encourage different DRI clients to simultaneously
>> access
>> >> the same buffer.
>> >> * Inter-client and GPU data coherence can be guaranteed if we
>> issue
>> >> a mb() / write-combining flush with the unmap operation (which,
>> >> BTW, I'm not sure is done today). Otherwise it is up to the
>> >> clients, and very easy to forget.
>> >>
>> >> I'm a bit afraid we might in the future regret taking the easy way
>> out?
>> >>
>> >> OTOH, letting DRM resolve the deadlock by unmapping and remapping
>> shared
>> >> buffers in the correct order might not be the best one either. It
>> will
>> >> certainly mean some CPU overhead and what if we have to do the
>> same with
>> >> buffer validation? (Yes for some operations with thousands and
>> thousands
>> >> of relocations, the user space validation might need to stay).
>> >>
>> >> Personally, I'm slightly biased towards having DRM resolve the
>> deadlock,
>> >> but I think any solution will do as long as the implications and
>> why we
>> >> choose a certain solution are totally clear.
>> >>
>> >> For item 3) above the kernel must have a way to issue a flush when
>> >> needed for buffer eviction.
>> >> The current implementation also requires the buffer to be completely
>> >> flushed before mapping.
>> >> Other than that the flushing policy is currently completely up to the
>> >> DRM drivers.
>> >>
>> >> /Thomas
>> >
>> > I might say stupid things as i don't think i fully understand all
>> > the input to this problem. Anyway here is my thought on all this:
>> >
>> > 1) First client map never block (as in Keith layout) except on
>> > fence from drm side (point 3 in Keith layout)
>> >
>> But is there really a need for this except to avoid the above-mentioned
>> deadlock?
>> As I'm not too up to date with all the possibilities the servers and GL
>> clients may be using shared buffers,
>> I need some enlightenment :). Could we have an example, please?
>
> I think the current main consumer would be compiz or any other
> compositor which use TextureFromPixmap, i really think the we
> might see further use of sharing graphical data among applications,
> i got example here at my work of such use case even thought this
> doesn't use GL at all but another indoor protocol. Another possible
> case where such buffer sharing might occur is inside same application
> with two or more GL context (i am ready to bet that we already have
> some where example of such application).
>
I was actually referring to an example where two clients need to have a
buffer mapped and access it at exactly the same time.
If there is such a situation, we have no other choice than to drop the
buffer locking on map. If there isn't we can at least consider other
alternatives that resolve the deadlock issue but that also will help
clients synchronize and keep data coherent.

/Thomas

2007-05-04 12:52:30

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On 5/4/07, Thomas Hellstr?m <[email protected]> wrote:
> I was actually referring to an example where two clients need to have a
> buffer mapped and access it at exactly the same time.
> If there is such a situation, we have no other choice than to drop the
> buffer locking on map. If there isn't we can at least consider other
> alternatives that resolve the deadlock issue but that also will help
> clients synchronize and keep data coherent.
>
> /Thomas

One might be a texture where a portion is updated by one thread
and another portion update by another one, i believe the application
will know better than us if such concurrent access will conflict or not.
If this both thread access different pixel it make sense to let them
work together at the same time on the texture. If they are writing
to same pixel then they will have to sync between them so they
don't do somethings stupid.

My point is that the user space will know better if sync is needed
or not and how to sync access to the same buffer. Moreover we
can still add a locking mechanism in user space (in libdrm for
instance).

There is very likely others use case for such concurrent access which
i can't think off right now.

best,
Jerome Glisse

2007-05-04 15:15:26

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On Fri, 2007-05-04 at 10:07 +0200, Thomas Hellström wrote:
>
> It's rare to have two clients access the same buffer at the same time.
> In what situation will this occur?

Right, what I'm trying to avoid is having any contention for
applications *not* sharing the same objects.

If there is any locking for mapping, we can either attempt to define a
locking order, or we can have a single global lock. The former leaves us
prone to deadlocks, the latter eliminates the ability for un-contended
parallel access.

> * It will encourage different DRI clients to simultaneously access
> the same buffer.

Sure. Separate 'DRI' from 'GL' and this may be a sensible plan. If you
want to prevent this *that's not DRI's problem*.

> * Inter-client and GPU data coherence can be guaranteed if we issue
> a mb() / write-combining flush with the unmap operation (which,
> BTW, I'm not sure is done today). Otherwise it is up to the
> clients, and very easy to forget.

CPU-GPU coherence is ensured by the mutual exclusion between mapping and
submitting. You may either have data available to the CPU or to the GPU.
I think that's a basic requirement for any solution in this space.
Keying the submit and map as to whether writing will occur means that
appropriate flushing and fencing can be automatically applied within the
kernel.

> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
> buffers in the correct order might not be the best one either. It will
> certainly mean some CPU overhead and what if we have to do the same with
> buffer validation? (Yes for some operations with thousands and thousands
> of relocations, the user space validation might need to stay).

I do not want to do relocations in user space. I don't see why doing
thousands of these requires moving this operation out of the kernel.

--
[email protected]

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-05-04 15:28:40

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On Fri, 2007-05-04 at 11:40 +0200, Jerome Glisse wrote:

> On a side note i think this scheme also fit well with gpu having
> several context and which doesn't need big validation (read
> nv gpu).

Yeah, I want to make sure we have a simple model that supports
multi-context hardware while also avoiding failing over badly when we
have more users than hardware contexts.

--
[email protected]

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-05-04 15:32:28

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

On Fri, 2007-05-04 at 14:32 +0200, Thomas Hellström wrote:
> If there isn't we can at least consider other
> alternatives that resolve the deadlock issue but that also will help
> clients synchronize and keep data coherent.

If clients want coherence, they're welcome to implement their own
locking. Let's make sure we separate the semantics required for GPU
operation from semantics required by DRM users.

--
[email protected]

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-05-04 16:17:03

by Keith Whitwell

[permalink] [raw]

Subject: Re: [RFC] [PATCH] DRM TTM Memory Manager patch

Keith Packard wrote:

>> OTOH, letting DRM resolve the deadlock by unmapping and remapping shared
>> buffers in the correct order might not be the best one either. It will
>> certainly mean some CPU overhead and what if we have to do the same with
>> buffer validation? (Yes for some operations with thousands and thousands
>> of relocations, the user space validation might need to stay).
>
> I do not want to do relocations in user space. I don't see why doing
> thousands of these requires moving this operation out of the kernel.

Agreed. The original conception for this was to have valdiation plus
relocations be a single operation, and by implication in the kernel.
Although the code as it stands doesn't do this, I think that should
still be the approach.

The issue with thousands of relocations from my point of view isn't a
problem - that's just a matter of getting appropriate data structures in
place.

Where things get a bit more interesting is with hardware where you are
required to submit a whole scene's worth of rendering before the
hardware will kick off, and with the expectation that the texture
placement will remain unchanged throughout the scene. This is a very
easy way to hit any upper limit on texture memory - the agp aperture
size in the case of integrated chipsets.

That's a special case of a the general problem of what do you do when a
client submits any validation list that can't be satisfied. Failing to
render isn't really an option, either the client or the memory manager
has to either prevent it happening in the first place or have some
mechanism for chopping up the dma buffer into segments which are
satisfiable... Neither of which I can see an absolutely reliable way to
do.

I think that any memory manager we can propose will have flaws of some
sort - either it is prone to failures that aren't really allowed by the
API, is excessively complex or somewhat pessimistic. We've chosen a
design that is simple, optimistic, but can potentially say "no"
unexpectedly. It would then be up to the client to somehow pick up the
pieces & potentially submit a smaller list. So far we just haven't
touched on how that might work.

The way to get around this is to mandate that hardware supports paged
virtual memory... But that seems to be a difficult trick.

Keith

2007-05-04 16:26:34