Hi,
Thomas, I attach a reworked page pool allocator based on Dave's work;
this one should be ok with ttm cache status tracking. It definitely
helps on AGP systems; now the bottleneck is in Mesa's vertex dma
allocation.
Cheers,
Jerome
Jerome Glisse wrote:
> Thomas, I attach a reworked page pool allocator based on Dave's work;
> this one should be ok with ttm cache status tracking. [...]
>
Hi, Jerome!
In general it looks very good. Some things that need fixing:
1) We must have a way to hand back pages. I still don't quite understand
how the shrink callbacks work and whether they are applicable. Another
scheme would be to free, say, 1MB when we have at least 2MB available
(see the sketch after this mail).
2) We should avoid including AGP headers if AGP is not configured.
Either reimplement unmap_page_from_agp or map_page_into_agp or move them
out of the AGP headers. We've had complaints before from people with
AGP-free systems that the code doesn't compile.
3) Since we're allocating (and freeing) in batches we should use the
set_pages_array() interface to avoid a global tlb flush per page.
4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since
it will always allocate cached pages and then transition them.
5) Use TTM_PFX in printouts.
/Thomas
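
A minimal sketch of how points 1) and 3) could combine: free a 1MB chunk
once the pool holds at least 2MB, transitioning the whole chunk back to
write-back with one batched set_pages_array_wb() call. The pool layout
and all names here are illustrative, not the patch's actual interface:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>     /* set_pages_array_wb() on x86 */

#define POOL_SHRINK_THRESHOLD   ((2 * 1024 * 1024) / PAGE_SIZE) /* 2MB */
#define POOL_SHRINK_CHUNK       ((1 * 1024 * 1024) / PAGE_SIZE) /* 1MB */

struct ttm_page_pool {
        spinlock_t lock;
        struct list_head pages;  /* pooled pages, linked through page->lru */
        unsigned count;          /* pages currently in the pool */
};

static void ttm_page_pool_shrink(struct ttm_page_pool *pool)
{
        /* static only to keep 256 pointers off the stack; a real
         * version would allocate this or serialize shrinkers. */
        static struct page *chunk[POOL_SHRINK_CHUNK];
        unsigned i, n = 0;

        spin_lock(&pool->lock);
        if (pool->count < POOL_SHRINK_THRESHOLD) {
                spin_unlock(&pool->lock);
                return;
        }
        while (n < POOL_SHRINK_CHUNK && !list_empty(&pool->pages)) {
                struct page *p = list_first_entry(&pool->pages,
                                                  struct page, lru);
                list_del(&p->lru);
                chunk[n++] = p;
                pool->count--;
        }
        spin_unlock(&pool->lock);

        set_pages_array_wb(chunk, n);   /* one global TLB flush, not n */
        for (i = 0; i < n; i++)
                __free_page(chunk[i]);
}
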
On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<[email protected]> wrote:
> Thomas, I attach a reworked page pool allocator based on Dave's work;
> this one should be ok with ttm cache status tracking. [...]
>
My original version kept a list of wb pages as well; this proved to be
quite a useful optimisation on my test systems when I implemented it.
Without it I was spending ~20% of my CPU in getting free pages, though
granted I always used WB pages on PCIE/IGP systems.
Another optimisation I made at the time was around the populate call
(not sure if this is what still happens):
Allocate a 64K local BO for DMA object.
Write into the first 5 pages from userspace - get WB pages.
Bind to GART, swap those 5 pages to WC + flush.
Then populate the rest with WC pages from the list.
Granted, I think allocating WC in the first place from the pool might
work just as well, since most of the DMA buffers are write only (see
the sketch after this mail).
Dave.
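
A rough sketch of the sequence Dave describes, under the assumption of
4K pages (a 64K BO is 16 pages); struct dma_bo and pool_take_wc_pages()
are hypothetical stand-ins, not real TTM/radeon interfaces:

#include <linux/mm_types.h>
#include <asm/cacheflush.h>     /* set_pages_array_uc() on x86 */

#define DMA_BO_PAGES 16         /* 64K BO with 4K pages */

struct dma_bo {
        struct page *pages[DMA_BO_PAGES];
};

/* Hypothetical pool hook: hands out pages already in WC/UC state. */
int pool_take_wc_pages(struct page **pages, int n);

static int dma_bo_bind(struct dma_bo *bo, int touched)
{
        /* Userspace already wrote the first 'touched' pages through a
         * cached map, so only those need a transition at bind time;
         * one batched call means one flush instead of one per page. */
        set_pages_array_uc(bo->pages, touched);

        /* The rest come from the pool already in the right state. */
        return pool_take_wc_pages(&bo->pages[touched],
                                  DMA_BO_PAGES - touched);
}
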
Dave Airlie wrote:
> [...]
> Granted, I think allocating WC in the first place from the pool might
> work just as well, since most of the DMA buffers are write only.
>
Yes, I think in the latter case the user-space driver should take care
to specify WC from the beginning, when the BO is allocated.
BTW is there any DMA buffer verification taking place on WC buffers on
Radeon?
/Thomas
On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
> My original version kept a list of wb pages as well; this proved to be
> quite a useful optimisation on my test systems when I implemented it.
> Without it I was spending ~20% of my CPU in getting free pages, though
> granted I always used WB pages on PCIE/IGP systems.
> [...]
I think it's better to fix userspace to not allocate as many buffers per
frame as it does now rather than having a pool of wb pages. I removed
it because on my 64M box memory was getting tight; we need to compute
the number of pages we keep based on available memory (see the sketch
after this mail). Also I think it's ok to assume that page allocation
is fast enough.
I am reworking the patch with Thomas' latest comments and will post a
new one after a bit of testing.
Cheers,
Jerome
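
A minimal sketch of sizing the pool from available memory, as Jerome
suggests; the 1/32 divisor is an arbitrary example, not a value from
the patch:

#include <linux/kernel.h>
#include <linux/mm.h>

static unsigned long ttm_page_pool_max_pages(void)
{
        struct sysinfo si;

        si_meminfo(&si);
        /* Keep at most a small fraction of total RAM pooled, so a
         * 64M box caches ~2MB while bigger machines can hold more. */
        return si.totalram / 32;
}
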
On Fri, 2009-06-26 at 08:31 +0200, Thomas Hellström wrote:
> [...]
> BTW is there any DMA buffer verification taking place on WC buffers on
> Radeon?
Command buffers submitted from userspace are never bo objects; they are
normal memory, so they are cached. The kernel copies their contents into
a bo buffer which has the proper cache status.
Cheers,
Jerome
Jerome Glisse wrote:
> I think it's better to fix userspace to not allocate as many buffers per
> frame as it does now rather than having a pool of wb pages. [...] Also I
> think it's ok to assume that page allocation is fast enough.
Yup, we could try decreasing the DMA buffer size. I think the DMA buffer
would be a lot better suited to suballocation; doing individual bo allocs
like the code used to was a major hit. Assuming page alloc is fast enough
isn't valid, since I've already proved it was a major CPU user when I
fixed it the first time :)
The problem with making the dma buffer smaller is having a buffer
mapped and sent to the kernel for relocs.
Dave.
On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
> [...]
Attached is a new version of the patch, which integrates the changes
discussed.
Cheers,
Jerome
Jerome Glisse wrote:
> [...]
> Attached is a new version of the patch, which integrates the changes
> discussed.
>
Hi, Jerome!
Still some outstanding things:
1) The AGP protection fixes compilation errors when AGP is not enabled,
but what about architectures that need the map_page_into_agp() semantics
for TTM even when AGP is not enabled? At the very least TTM should be
disabled on those architectures. The best option would be to make those
calls non-AGP-specific (see the sketch after this mail).
2) Why is the page refcount upped with get_page() after an alloc_page()?
3) It seems like pages are cache-transitioned one-by-one when freed.
Again, this is a global TLB flush per page. Can't we free a large chunk
of pages at once?
/Thanks,
Thomas
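
A minimal sketch of making the caching calls non-AGP-specific, per
point 1): on x86 the AGP mapping helpers are essentially caching
attribute changes, so TTM could carry its own wrappers and drop the AGP
header include. The ttm_set_pages_* names are hypothetical:

#include <linux/errno.h>
#include <linux/mm_types.h>

#ifdef CONFIG_X86
#include <asm/cacheflush.h>

static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
        return set_pages_uc(page, numpages);
}

static inline int ttm_set_pages_wb(struct page *page, int numpages)
{
        return set_pages_wb(page, numpages);
}
#else
/* Architectures without these semantics either provide their own
 * implementation here or get TTM disabled in Kconfig. */
static inline int ttm_set_pages_uc(struct page *page, int numpages)
{
        return -ENOSYS;
}

static inline int ttm_set_pages_wb(struct page *page, int numpages)
{
        return -ENOSYS;
}
#endif
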
2009/6/30 Thomas Hellström <[email protected]>:
> [...]
> 1) The AGP protection fixes compilation errors when AGP is not enabled, but
> what about architectures that need the map_page_into_agp() semantics for TTM
> even when AGP is not enabled? At the very least TTM should be disabled on
> those architectures. The best option would be to make those calls non-agp
> specific.
>
> 2) Why is the page refcount upped with get_page() after an alloc_page()?
>
> 3) It seems like pages are cache-transitioned one-by-one when freed. Again,
> this is a global TLB flush per page. Can't we free a large chunk of pages at
> once?
>
Jerome,
have we addressed these?
I'd really like to push this soon, as I'd like to fix up the 32 vs 36
bit dma masks if possible, which relies on us being able to tell the
allocator to use GFP_DMA32 on some hw (32-bit PAE mainly with a PCI
card); see the sketch after this mail.
Dave.
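
A minimal sketch of what telling the allocator about the dma mask could
look like: a per-ttm flag folded into the gfp mask at allocation time.
The flag name and helper are assumptions, though the gfp selection
mirrors what TTM does for its page allocations:

#include <linux/gfp.h>
#include <linux/mm.h>

#define TTM_PAGE_FLAG_DMA32 (1 << 0)    /* hypothetical flag */

static struct page *ttm_pool_alloc_page(unsigned page_flags)
{
        gfp_t gfp_flags = GFP_USER;

        /* A device that can only address 32 bits (e.g. a PCI card on
         * a 32-bit PAE box) must stay below 4GB; anyone else can take
         * highmem pages. */
        if (page_flags & TTM_PAGE_FLAG_DMA32)
                gfp_flags |= GFP_DMA32;
        else
                gfp_flags |= __GFP_HIGHMEM;

        return alloc_page(gfp_flags);
}
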
On Thu, 2009-07-09 at 16:06 +1000, Dave Airlie wrote:
> [...]
FWIW, I tried this patch on my PowerBook, and it didn't go too well:
With AGP enabled, the kernel panics before there's any KMS display.
With AGP disabled, I get a KMS display, but it shows a failure to
allocate the ring buffer, and then it stops updating.
--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer
On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> [...]
> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since
> it will always allocate cached pages and then transition them.
>
Okay, 4) is bad. Here is what happens (my brain is a bit melted down, so
I might be wrong):
1 - bo gets allocated, tt->state = unpopulated
2 - bo is mapped, a few pages are faulted, tt->state = unpopulated
3 - bo is cache transitioned, but tt->state == unpopulated; yet
    there are pages which have been touched by the cpu, so we need
    to clflush them and transition them. This never happens if
    we don't call ttm_tt_populate and proceed with the remainder
    of the cache transitioning functions.
As a workaround I will try to go through the page tables and
transition the existing pages (see the sketch after this mail).
Do you have any idea for a better plan?
Cheers,
Jerome
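
A minimal sketch of the workaround Jerome describes, with the per-page
transition batched through set_pages_array_uc() so the table walk costs
one flush; the simplified struct and helper names are not the real TTM
types:

#include <linux/errno.h>
#include <linux/slab.h>
#include <asm/cacheflush.h>     /* set_pages_array_uc() on x86 */

struct ttm_tt_sk {
        struct page **pages;    /* NULL where never faulted in */
        unsigned long num_pages;
};

static int ttm_tt_transition_existing(struct ttm_tt_sk *ttm)
{
        struct page **batch;
        unsigned long i;
        int n = 0;

        batch = kmalloc(ttm->num_pages * sizeof(*batch), GFP_KERNEL);
        if (!batch)
                return -ENOMEM;

        /* Only pages the CPU actually touched exist; the NULL slots
         * will later be populated straight from the pool in the right
         * caching state. */
        for (i = 0; i < ttm->num_pages; i++)
                if (ttm->pages[i])
                        batch[n++] = ttm->pages[i];

        set_pages_array_uc(batch, n);   /* one flush for the batch */
        kfree(batch);
        return 0;
}
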
On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> [...]
My workaround ruins the whole idea of pool allocation: what happens
is that most bos get cache transitioned page per page. My thinking
is that we should do the following:
 - if there is at least one page allocated, fully populate
   the object and do the cache transition on all the pages.
 - otherwise, update caching_state and leave the object unpopulated.
This requires that we somehow reflect the fact that there is at least
one page allocated; I am thinking of adding a new state for that:
ttm_partialy_populated (see the sketch after this mail).
Thomas, what do you think about that?
Cheers,
Jerome
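
A minimal sketch of the proposed branch in the caching path, with the
suggested extra state; every name carries a _sk suffix to mark it as a
sketch of the proposal rather than the patch:

enum tt_state_sk {
        tt_sk_unpopulated,
        tt_sk_partialy_populated,       /* at least one page faulted in */
        tt_sk_populated,
};

struct ttm_tt_state_sk {
        enum tt_state_sk state;
        int caching_state;
};

/* Stand-ins for the real populate / batched-transition paths. */
static int ttm_tt_populate_sk(struct ttm_tt_state_sk *ttm)
{
        return 0;
}

static int ttm_tt_transition_all_sk(struct ttm_tt_state_sk *ttm, int cstate)
{
        return 0;
}

static int ttm_tt_set_caching_sk(struct ttm_tt_state_sk *ttm, int cstate)
{
        if (ttm->state == tt_sk_unpopulated) {
                /* No page allocated yet: just record the new state;
                 * population later pulls pre-transitioned pool pages. */
                ttm->caching_state = cstate;
                return 0;
        }
        /* At least one page exists: fully populate, then transition
         * everything in one batch. */
        ttm_tt_populate_sk(ttm);
        return ttm_tt_transition_all_sk(ttm, cstate);
}
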
On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> [...]
Attached is an updated patch; it doesn't introduce ttm_partialy_populated
but keeps the populate call in the cache transition. So far it seems to
work properly on AGP platforms and helps quite a lot with performance.
I wonder if I should rather allocate some memory to store the pool
structure in ttm_page_pool_init rather than having quite a lot of
static variables? Has anyone thought about that?
Cheers,
Jerome
Jerome Glisse wrote:
> [...]
> Okay, 4) is bad. Here is what happens (my brain is a bit melted down, so
> I might be wrong):
> 1 - bo gets allocated, tt->state = unpopulated
> 2 - bo is mapped, a few pages are faulted, tt->state = unpopulated
> 3 - bo is cache transitioned, but tt->state == unpopulated; yet
>     there are pages which have been touched by the cpu, so we need
>     to clflush them and transition them,
Right.
> This never happens if we don't call ttm_tt_populate and proceed with
> the remainder of the cache transitioning functions.
>
Why can't we just skip ttm_tt_populate and proceed with the remainder of
the cache transitioning functions? Then all pages currently allocated to
the TTM will be transitioned.
/Thomas
Jerome Glisse wrote:
> [...]
> Attached is an updated patch; it doesn't introduce ttm_partialy_populated
> but keeps the populate call in the cache transition. So far it seems to
> work properly on AGP platforms and helps quite a lot with performance.
> I wonder if I should rather allocate some memory to store the pool
> structure in ttm_page_pool_init rather than having quite a lot of
> static variables? Has anyone thought about that?
>
Jerome,
TTM has a device struct per device and an optional global struct that is
common for all devices and intended to be per subsystem.
The only subsystem currently having a global structure is the memory
accounting subsystem:
struct ttm_mem_global
You can either put the global stuff there or create and register a
separate global struct for the page pool (see the sketch after this mail).
I'll probably also later add a global structure for the bo subsystem,
since we need a common buffer object memory footprint shrinker for
multiple cards.
/Thomas
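
A minimal sketch of registering a separate global item for the page pool
through the ttm_global interface Thomas describes, modeled on how drivers
already reference ttm_mem_global; the pool struct is empty here, and
TTM_GLOBAL_TTM_PAGE_POOL is a hypothetical new entry that would have to
be added to enum ttm_global_types:

#include <ttm/ttm_module.h>     /* struct ttm_global_reference */

struct ttm_pool_global {
        int placeholder;        /* pool lists, locks, counters ... */
};

static int ttm_pool_global_init(struct ttm_global_reference *ref)
{
        struct ttm_pool_global *pool = ref->object;

        pool->placeholder = 0;  /* real list/lock init goes here */
        return 0;
}

static void ttm_pool_global_release(struct ttm_global_reference *ref)
{
        /* hand all pooled pages back to the kernel here */
}

/* Per-device hook-up: the first ttm_global_item_ref() allocates and
 * inits the shared object; every later device referencing the same
 * global type gets the same instance. */
static int driver_pool_global_ref(struct ttm_global_reference *ref)
{
        ref->global_type = TTM_GLOBAL_TTM_PAGE_POOL;
        ref->size = sizeof(struct ttm_pool_global);
        ref->init = &ttm_pool_global_init;
        ref->release = &ttm_pool_global_release;
        return ttm_global_item_ref(ref);
}
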
On Wed, 2009-07-22 at 10:27 +0200, Thomas Hellström wrote:
> [...]
> Why can't we just skip ttm_tt_populate and proceed with the remainder of
> the cache transitioning functions? Then all pages currently allocated to
> the TTM will be transitioned.
>
Because not all tt->pages entries have a valid page; some are NULL, and
thus we have to go through the whole table and transition page by page.
I did that, and a quick benchmark showed that it's faster to fully
populate and fully transition.
Cheers,
Jerome
On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
> [...]
> Attached is an updated patch; it doesn't introduce ttm_partialy_populated
> but keeps the populate call in the cache transition. So far it seems to
> work properly on AGP platforms
Yeah, this one works for me as well.
> and helps quite a lot with performance.
Can't say I've noticed that however. How did you measure?
--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer
On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
> [...]
> > and helps quite a lot with performance.
> Can't say I've noticed that however. How did you measure?
gears & quake3, but I haven't yet done any X 2D benchmarks;
2D feels dead slow with KMS, at least that's the feeling I have.
Cheers,
Jerome
Jerome Glisse wrote:
> [...]
> Because not all tt->pages entries have a valid page; some are NULL, and
> thus we have to go through the whole table and transition page by page.
> I did that, and a quick benchmark showed that it's faster to fully
> populate and fully transition.
>
OK.
That's probably sufficient for now.
/Thomas
Jerome Glisse wrote:
> [...]
>> Can't say I've noticed that however. How did you measure?
>>
> gears
Hmm,
in gears there shouldn't really be any buffer allocation / freeing going
on at all once the display lists are set up, and gears should really be
GPU bound in most cases.
What's the source of the buffer allocations / frees when gears is run?
/Thomas
On Wed, 2009-07-22 at 21:13 +0200, Thomas Hellström wrote:
> [...]
> What's the source of the buffer allocations / frees when gears is run?
>
We free and reallocate the vertex buffer each frame, IIRC.
Cheers,
Jerome
On Wed, 2009-07-22 at 15:35 -0700, Jerome Glisse wrote:
> [...]
> We free and reallocate the vertex buffer each frame, IIRC.
Gears does everything in display lists which means geometry is held in
VBOs retained for the life of the application. Once the first frame is
rendered there shouldn't be any more uploads.
Keith
On Thu, Jul 23, 2009 at 9:24 AM, Keith Whitwell<[email protected]> wrote:
> [...]
> Gears does everything in display lists which means geometry is held in
> VBOs retained for the life of the application. Once the first frame is
> rendered there shouldn't be any more uploads.
>
I don't think the r300 driver does proper VBOs yet.
Dave.
On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
> TTM has a device struct per device and an optional global struct that is
> common for all devices and intended to be per subsystem.
>
> The only subsystem currently having a global structure is the memory
> accounting subsystem:
> struct ttm_mem_global
Thomas, I don't think the way we init ttm_mem_global today makes
it follow the "one struct ttm_mem_global for everyone" model. I think it
should be initialized and refcounted by the device struct.
So on first device creation a ttm_mem_global is created, and
then any time a new device is created the refcount of ttm_mem_global
is increased. This would mean some static global refcount inside
ttm_memory.c; maybe there is something similar to a singleton in
the linux toolbox.
Thoughts? Ideas?
Cheers,
Jerome
Jerome Glisse wrote:
> [...]
> Thomas, I don't think the way we init ttm_mem_global today makes
> it follow the "one struct ttm_mem_global for everyone" model. I think it
> should be initialized and refcounted by the device struct.
> So on first device creation a ttm_mem_global is created, and
> then any time a new device is created the refcount of ttm_mem_global
> is increased.
Jerome,
This is exactly what the current code intends to do.
Are you seeing something different?
/Thomas
On Tue, 2009-07-28 at 20:55 +0200, Thomas Hellström wrote:
> [...]
> Jerome,
> This is exactly what the current code intends to do.
>
> Are you seeing something different?
I definitely don't see that :) In radeon we create a structure
which holds the ttm_mem_global struct, so it's not shared at all;
it gets inited & destroyed along with the driver. This is why I think
it's better to remove the driver-side initialization and let the
bo_device init path take care of initializing one and only one object
which can be shared by multiple drivers.
So what I propose is: remove the mem_glob parameter from
ttm_bo_device_init, add a call to ttm_mem_global_init in
ttm_bo_device_init, and add some static refcount in ttm_memory.c:
if refcount == 0 then ttm_mem_global_init creates a ttm_mem_global
struct and initializes things; if refcount > 0 then it gives
back the already initialized ttm_mem_global (see the sketch after
this mail).
Of course we unref with ttm_mem_global_release and destroy
once the refcount reaches 0.
Cheers,
Jerome
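
A minimal sketch of the refcounted-singleton scheme Jerome proposes, as
it might sit inside ttm_memory.c; the stand-in struct and _sk names are
illustrative, and (as Thomas notes below) ttm_global already implements
this pattern:

#include <linux/mutex.h>
#include <linux/slab.h>

struct ttm_mem_global_sk {
        int zones_placeholder;  /* stand-in for the real accounting state */
};

static DEFINE_MUTEX(glob_mutex);
static struct ttm_mem_global_sk *glob; /* the shared singleton */
static unsigned glob_refcount;

/* First caller allocates and inits; later callers just take a ref. */
struct ttm_mem_global_sk *ttm_mem_global_get_sk(void)
{
        mutex_lock(&glob_mutex);
        if (glob_refcount == 0) {
                glob = kzalloc(sizeof(*glob), GFP_KERNEL);
                /* real zone / shrink-queue init would go here */
        }
        if (glob)
                glob_refcount++;
        mutex_unlock(&glob_mutex);
        return glob;
}

void ttm_mem_global_put_sk(void)
{
        mutex_lock(&glob_mutex);
        if (--glob_refcount == 0) {
                /* real teardown would go here */
                kfree(glob);
                glob = NULL;
        }
        mutex_unlock(&glob_mutex);
}
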
Jerome Glisse wrote:
> [...]
> I definitely don't see that :) In radeon we create a structure
> which holds the ttm_mem_global struct, so it's not shared at all;
> it gets inited & destroyed along with the driver.
>
Which radeon struct is holding the ttm_mem_global struct?
The radeon code looks very similar to the openchrome code in which the
struct ttm_mem_global is allocated at ttm_global.c, line 74 and freed at
ttm_global.c, line 108 when its refcount has reached zero.
So the device holds a struct ttm_global_reference that *only points* to
the global item, and which is destroyed on device takedown. If there are
more than one device pointing to the mem_global object, it won't get
destroyed.
So the code should be working perfectly fine unless there is a bug.
> So what I propose is: remove the mem_glob parameter from
> ttm_bo_device_init, add a call to ttm_mem_global_init in
> ttm_bo_device_init
Nope, the ttm_mem_global object is used by other ttm subsystems
(fencing, user-space objects), so that can't be done.
> and add some static refcount in ttm_memory.c:
> if refcount == 0 then ttm_mem_global_init creates a ttm_mem_global
> struct and initializes things; if refcount > 0 then it gives
> back the already initialized ttm_mem_global.
>
>
This is exactly what ttm_global was created to do, and what it hopefully
does. If you create two radeon devices the ttm_mem_global object should
be the same, even though the global references pointing to it are of
course different. Have you actually tried this?
/Thomas
On Wed, 2009-07-29 at 11:39 +0200, Thomas Hellström wrote:
> [...]
> This is exactly what ttm_global was created to do, and what it hopefully
> does. If you create two radeon devices the ttm_mem_global object should
> be the same, even though the global references pointing to it are of
> course different. Have you actually tried this?
>
OK, the code wasn't clear to me until I read ttm_global.c.
Cheers,
Jerome