Hi all,
I've started doing some work with using the new DRM memory manager
from TG for pixmaps in the X server using Intel 9xx series hardware.
The Intel hardware pretty much requires pages to be uncached for the
GPU to access them. It can use cached memory for some operations, but
that isn't very useful, and my attempts to use it ended in a lot of
crashes.
Now, one of the major usage patterns for pixmaps is:
allocate pixmap
copy data into pixmap
use pixmap from hardware
free pixmap
With the current memory manager plus Andi Kleen's updated
change_page_attr fixes (which use clflush when the CPU has it), it
operates something like this:
allocate pixmap gets cached memory
copy data into the pixmap
pre-use from hardware we flush the cache lines and tlb
use the pixmap in hardware
pre-free we need to set the page back to cached so we flush the tlb
free the memory.
The other path, if we never want to use the memory cached, is just:
allocate pixmap
flush cache lines/tlb
use uncached from CPU
use uncached from GPU
pre-free set the page back to cached, flush the TLB
free the page
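In code terms, both of those paths end in the same two expensive
steps. Very roughly, with the current i386
change_page_attr()/global_flush_tlb() interfaces (the wrapper names
here are made up):

#include <linux/mm.h>
#include <asm/pgtable.h>	/* PAGE_KERNEL, PAGE_KERNEL_NOCACHE */
#include <asm/cacheflush.h>	/* change_page_attr(), global_flush_tlb() */

static int pixmap_page_make_uncached(struct page *page)
{
	int ret = change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
	if (ret)
		return ret;
	/* Flushes the cache lines (clflush with Andi's patches, wbinvd
	 * otherwise) and shoots down the TLBs: an IPI to every CPU. */
	global_flush_tlb();
	return 0;
}

static void pixmap_page_make_cached(struct page *page)
{
	/* Same cost again on the way out, once per pixmap free. */
	change_page_attr(page, 1, PAGE_KERNEL);
	global_flush_tlb();
}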
Now the big issue here on SMP is that the cache and/or TLB flushes
require IPIs, and they are very noticeable in the profiles.
So after all that, I'd like to have some sort of uncached page list I
can allocate pages from, so that with frequent pixmap
creation/destruction I don't spend a lot of time in the cache-flushing
routines, and avoid the IPIs in particular.
The options I can roughly see are:
1. The DRM just allocates a bunch of uncached pages and manages a
cache of them for interacting with the hardware. This sounds wrong,
and we run into the question of how to size the pool correctly.
2. (Is this idea crazy??) We modify the VM somehow so we have an
uncached list: when we first allocate pages with GFP_UNCACHED they get
migrated to the uncached list, and a page flag marks them as uncached.
The DRM then just reuses things from that list. If we later end up
with memory pressure, the free pages on the uncached list could be
migrated back to the normal page lists by modifying the page
attributes and flushing the TLB.
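To make 2. a bit more concrete, the interface I'm imagining is roughly
the following; GFP_UNCACHED, the page flag and these helpers are all
made up, nothing like them exists today:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Entirely hypothetical: none of these names exist.  Pages allocated
 * through this interface come off a dedicated free list of pages whose
 * kernel mapping is already uncached (marked with a PG_uncached page
 * flag), so no per-allocation change_page_attr()/TLB shootdown is
 * needed.
 */
struct page *alloc_page_uncached(gfp_t gfp_mask);
void free_page_uncached(struct page *page);

/*
 * Under memory pressure the VM pulls free pages back off the uncached
 * list, restores the cached attribute for the whole batch, does a
 * single global cache/TLB flush, and hands the pages back to the
 * normal free lists.
 */
int uncached_list_reclaim(int nr_pages);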
Any other ideas and suggestions?
Dave.
> allocate pixmap gets cached memory
> copy data into the pixmap
> pre-use from hardware we flush the cache lines and tlb
> use the pixmap in hardware
> pre-free we need to set the page back to cached so we flush the tlb
> free the memory.
> Now the big issue here on SMP is that the cache and/or tlb flushes
> require IPIs and they are very noticeable on the profiles,
Blame intel ;)
> Any other ideas and suggestions?
Without knowing exactly what you are doing:
- Copies to uncached memory are very expensive on an x86 processor
(so it might be faster not to write and flush)
- It's not clear from your description how intelligent your transfer
system is.
I'd expect, for example, that the process was something like:

	Parse pending commands until either
		1. the queue empties, or
		2. a time target passes

	For each command we need to shove a pixmap over for, add it
	to the buffer to transfer

	Do a single CLFLUSH pass and maybe an IPI

	Fire up the command queue

	Keep the buffers hanging around until there is memory pressure,
	in case we reuse that pixmap
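Something like this for the flush step, say: one pass over everything
queued this round, then a single barrier, rather than a flush per
pixmap. (The structure is invented and the clflush loop is only a
sketch, not an existing kernel helper.)

#include <linux/list.h>
#include <linux/types.h>
#include <asm/processor.h>	/* boot_cpu_data.x86_clflush_size */
#include <asm/system.h>		/* wmb() */

/* One queued pixmap transfer; invented structure, for the sketch only. */
struct pending_xfer {
	struct list_head node;
	void *vaddr;
	size_t len;
};

/*
 * Flush the cache lines of everything queued this round, then one
 * write barrier before kicking the command queue.  clflush is coherent
 * across the coherence domain, so the data flush itself shouldn't need
 * an IPI; only a page-attribute change does.
 */
static void flush_pending_xfers(struct list_head *pending)
{
	int stride = boot_cpu_data.x86_clflush_size;
	struct pending_xfer *x;

	list_for_each_entry(x, pending, node) {
		char *p = x->vaddr;
		char *end = p + x->len;

		for (; p < end; p += stride)
			asm volatile("clflush (%0)" :: "r" (p) : "memory");
	}
	wmb();
}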
Can you clarify that?
If the hugepage anti-fragmentation stuff ever gets merged, that would
also help, as you could possibly grab a huge page from the allocator
for this purpose and have to flip only one TLB entry.
Alan
On 20 Aug, 01:50, "Dave Airlie" <[email protected]> wrote:
> Hi all,
>
> I've started doing some work with using the new DRM memory manager
> from TG for pixmaps in the X server using Intel 9xx series hardware.
>
> The intel hardware pretty much requires pages to be uncached for the
> GPU to access them. It can use cached memory for some operations but
> it isn't very useful and my attempts to use it ended in a lot of
> crashiness..
Write-combining access seems the correct thing here, followed by a
wmb(). Uncached writing would be horrendously slow.
[snip]
> So after all that I'd like to have some sort of uncached page list I
> can allocate pages from
This is exactly what Intel's PAT mechanism exists for - just mark the
desired access type (index) on the pages you've been allocated.
It's documented in the Intel Architecture Software Developer's
Manuals, but Linux's support is lacking in certain areas [see the
discussions on LKML], which a number of developers have been trying to
move forward.
Quite a few significant graphics/HPC etc. vendors are forced to use it
without this complete support, so it would be good to get some
additional impetus behind it...
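To illustrate the mechanism (this follows the SDM rather than any
existing Linux interface; the MSR define and helper name are mine):

#include <linux/types.h>
#include <asm/msr.h>		/* rdmsrl(), wrmsrl() */

#define IA32_CR_PAT_MSR	0x277	/* MSR number per the SDM; name is mine */
#define PAT_TYPE_WC	0x01	/* write-combining memory type */

/*
 * The IA32_PAT MSR holds eight memory-type entries; each PTE selects
 * one of them through its PAT, PCD and PWT bits.  Programming a spare
 * entry to WC and then pointing PTEs at it is the part mainline
 * doesn't really support yet.
 */
static void pat_program_entry_wc(int entry)	/* hypothetical helper */
{
	u64 pat;

	rdmsrl(IA32_CR_PAT_MSR, pat);
	pat &= ~(0xffULL << (entry * 8));
	pat |= (u64)PAT_TYPE_WC << (entry * 8);
	wrmsrl(IA32_CR_PAT_MSR, pat);

	/* Real code has to do this on every CPU, with the cache/TLB
	 * flush sequence the SDM prescribes around the write. */
}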
Daniel
--
Daniel J Blueman
> Blame intel ;)
>
> > Any other ideas and suggestions?
>
> Without knowing exactly what you are doing:
>
> - Copies to uncached memory are very expensive on an x86 processor
> (so it might be faster not to write and flush)
> - Its not clear from your description how intelligent your transfer
> system is.
It is still possible to change the transfer system, but it should be
intelligent enough already, or possible to make more intelligent.
I also realise I need PAT + write combining, but I believe that
problem is orthogonal...
>
> I'd expect for example that the process was something like
>
> Parse pending commands until either
> 1. Queue empties
> 2. A time target passes
>
> For each command we need to shove a pixmap over add it
> to the buffer to transfer
>
> Do a single CLFLUSH and maybe IPI
>
> Fire up the command queue
>
> Keep the buffers hanging around until there is memory pressure
> if we may reuse that pixmap
>
> Can you clarify that ?
So at the moment a pixmap maps directly to a kernel buffer object,
which is a bunch of pages that get faulted in on the CPU or allocated
when the buffer is to be used by the GPU. So when a pixmap is created
a buffer object is created, and when a pixmap is destroyed a buffer
object is destroyed. Perhaps I could cache a bunch of buffer objects
in userspace for re-use as pixmaps, but I'm not really sure that will
scale too well.
When X wants the GPU to access a buffer (pixmap), it calls into the
kernel with a single ioctl carrying a list of all the buffers the GPU
is going to access, along with a buffer containing the commands that
do the access. At the moment, when each of those buffers is bound into
the GART for the first time, the system does a change_page_attr for
each page and calls the global flush [1].
Now if a buffer is bound into the GART and later gets accessed from
the CPU again (software fallback), we have the choice of taking it
back out of the GART and letting the nopfn call fault the pages back
in uncached, or of flushing the TLB and bringing them back in cached.
We are hoping to avoid software fallbacks as much as possible on the
hardware platforms we want to work on.
Finally, when a buffer is destroyed, the pages are released back to
the system, so of course they have to be set back to cached, and we
need another TLB/cache flush per pixmap buffer destruction.
So you can see why some sort of uncached+writecombined page cache
would be useful: I could just allocate a bunch of pages at startup as
uncached+writecombined, allocate pixmaps from them, and then not need
the flush at all when I bind/free a pixmap. Now I'd really like this
to be part of the VM, so that under memory pressure it can just take
back the pages I've got in my cache and, after flushing, turn them
back into cached pages; the other option is for the DRM to do this on
its own and penalise the whole system.
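If it ends up living in the DRM rather than the VM, the core of it
would just be something like this (names and structure purely
illustrative):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/mm.h>

/* Illustrative only: a per-device stash of pages that are already
 * uncached, so pixmap create/destroy never touches page attributes. */
struct drm_uncached_pool {
	spinlock_t lock;
	struct list_head free_pages;	/* linked via page->lru */
	unsigned long count;
};

static struct page *drm_uncached_get(struct drm_uncached_pool *pool)
{
	struct page *page = NULL;

	spin_lock(&pool->lock);
	if (!list_empty(&pool->free_pages)) {
		page = list_entry(pool->free_pages.next, struct page, lru);
		list_del(&page->lru);
		pool->count--;
	}
	spin_unlock(&pool->lock);
	return page;	/* caller falls back to the slow path on NULL */
}

static void drm_uncached_put(struct drm_uncached_pool *pool,
			     struct page *page)
{
	/* Page stays uncached: no change_page_attr(), no IPI. */
	spin_lock(&pool->lock);
	list_add(&page->lru, &pool->free_pages);
	pool->count++;
	spin_unlock(&pool->lock);
}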
[1] This is one inefficiency: if multiple buffers are being bound in
for the first time it'll flush for each of them. I'm trying to get rid
of that, but I may need to tweak the order of things, as at the moment
it crashes hard if I try to leave the cache/TLB flush until later.
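For reference, the ordering I'm trying to get to in that path is
roughly the following. It's only a sketch against the current
change_page_attr()/global_flush_tlb() interfaces, the buffer-object
fields are stand-ins, and as I said it currently crashes when I
reorder things, so treat it as intent rather than working code:

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>

/* Stand-in for the DRM buffer object; only the fields the sketch needs. */
struct bo_stub {
	struct page **pages;
	int num_pages;
	int bound_uncached;
};

static int validate_buffers(struct bo_stub **bos, int count)
{
	int i, j, ret, need_flush = 0;

	for (i = 0; i < count; i++) {
		if (bos[i]->bound_uncached)
			continue;	/* already uncached/bound */

		for (j = 0; j < bos[i]->num_pages; j++) {
			ret = change_page_attr(bos[i]->pages[j], 1,
					       PAGE_KERNEL_NOCACHE);
			if (ret)
				return ret;
		}
		bos[i]->bound_uncached = 1;
		need_flush = 1;
	}

	if (need_flush)
		global_flush_tlb();	/* one cache+TLB flush for the lot */

	/* ...then bind the pages into the GART and fire the command
	 * buffer. */
	return 0;
}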
Dave.
>
> Write-combining access seems the correct thing here, followed by a
> wmb(). Uncached writing would be horrendously slow.
>
> [snip]
> > So after all that I'd like to have some sort of uncached page list I
> > can allocate pages from
>
> This is exactly what Intel's PAT mechanism exists for - just mark the
> desired access type (index) on the pages you've been allocated.
>
> It's documented in the Intel Architecture Software Design manuals, but
> Linux's support is lacking in certain areas [discussions on LKML],
> which a number of developers have been trying to move forward.
>
> Quite a few significant graphics/HPC etc vendors are forced to use it
> without this complete support, so it would be good to get this
> additional impetus involved...
I'm hoping to pick up the PAT cause at some point soon; this stuff is
definitely required to get any use out of modern graphics hardware.
It is slightly orthogonal to the issue I mentioned, though, in that I
still have the problem of allocating uncached memory without the
flushing overheads associated with constantly making pages
cached/uncached.
Dave.
>
> Daniel
> --
> Daniel J Blueman
>
On Tue, 2007-08-21 at 16:05 +1000, Dave Airlie wrote:
> So you can see why some sort of uncached+writecombined page cache
> would be useful, I could just allocate a bunch of pages at startup as
> uncached+writecombined, and allocate pixmaps from them and when I
> bind/free the pixmap I don't need the flush at all, now I'd really
> like this to be part of the VM so that under memory pressure it can
> just take the pages I've got in my cache back and after flushing turn
> them back into cached pages, the other option is for the DRM to do
> this on its own and penalise the whole system.
Can't you make these pages part of the regular VM by sticking them
all into an address_space?
And for this reclaim behaviour you'd only need to set PG_private and
have a_ops->releasepage() dtrt.
There is an uncached allocator in IA64 arch code
(linux/arch/ia64/kernel/uncached.c). Maybe having a look at
that will help? Jes wrote it.
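A minimal sketch of the releasepage idea, assuming the uncached pages
sit in a private address_space with PG_private set (names
illustrative):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>

/*
 * Reclaim calls this via try_to_release_page() when it wants one of
 * the uncached pages back; flip the page back to cached before letting
 * go.  (Real code would also unhook the page from the DRM's free list
 * here, and the per-page global_flush_tlb() is clearly the weak point
 * of this version.)
 */
static int drm_uncached_releasepage(struct page *page, gfp_t gfp_mask)
{
	if (change_page_attr(page, 1, PAGE_KERNEL))
		return 0;		/* couldn't restore it; keep the page */
	global_flush_tlb();

	ClearPagePrivate(page);
	return 1;			/* VM may now take the page */
}

static const struct address_space_operations drm_uncached_aops = {
	.releasepage	= drm_uncached_releasepage,
};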
Peter Zijlstra wrote:
> On Tue, 2007-08-21 at 16:05 +1000, Dave Airlie wrote:
>
>
>>So you can see why some sort of uncached+writecombined page cache
>>would be useful, I could just allocate a bunch of pages at startup as
>>uncached+writecombined, and allocate pixmaps from them and when I
>>bind/free the pixmap I don't need the flush at all, now I'd really
>>like this to be part of the VM so that under memory pressure it can
>>just take the pages I've got in my cache back and after flushing turn
>>them back into cached pages, the other option is for the DRM to do
>>this on its own and penalise the whole system.
>
>
> Can't you make these pages part of the regular VM by sticking them all
> into an address_space.
>
> And for this reclaim behaviour you'd only need to set PG_private and
> have a_ops->releasepage() dtrt.
I'd suggest Dave just registers a shrinker to start with.
You really want to be able to batch the TLB flushes as well, which
->releasepage may not be so good at (you could add more machinery
behind releasepage to build batches and so on, but anyway, a shrinker
is probably the quickest way to get something working).
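Roughly like this, using the register_shrinker() interface in current
kernels (the pool helpers are placeholders for whatever Dave ends up
with):

#include <linux/mm.h>		/* struct shrinker, register_shrinker() */
#include <linux/gfp.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>

/* Placeholder pool helpers: pop one free page off the uncached list /
 * report how many free pages are left on it. */
struct page *uncached_pool_pop(void);
unsigned long uncached_pool_count(void);

static int uncached_pool_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	int freed = 0;

	if (nr_to_scan) {
		struct page *page;

		while (freed < nr_to_scan &&
		       (page = uncached_pool_pop()) != NULL) {
			change_page_attr(page, 1, PAGE_KERNEL);
			__free_page(page);
			freed++;
		}
		if (freed)
			global_flush_tlb();	/* one flush for the batch */
	}
	return uncached_pool_count();	/* how much is still freeable */
}

static struct shrinker uncached_pool_shrinker = {
	.shrink	= uncached_pool_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* register_shrinker(&uncached_pool_shrinker) at init,
 * unregister_shrinker() at teardown. */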
--
SUSE Labs, Novell Inc.