2011-03-28 11:16:47

by r6144

Subject: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

Hello,

I am reporting a problem of significant desktop sluggishness caused by
mmap-related kernel algorithms. In particular, after a few days of use,
I encounter multiple-second delays switching between a workspace
containing Evolution and another containing e.g. firefox, which is very
annoying since I switch workspaces very frequently. Oprofile indicates
that, during workspace switching, over 30% of CPU time is spent in
find_vma(), likely called from arch_get_unmapped_area_topdown().

This is essentially a repost of https://lkml.org/lkml/2010/11/14/236 ,
with a bit more investigation and workarounds. The same issue has also
been reported in https://bugzilla.kernel.org/show_bug.cgi?id=17531 , but
that bug report has not received any attention either.

My kernel is Fedora 14's kernel-2.6.35.11-83.fc14.x86_64, and the open
source radeon (r600) driver is used.

Basically, the GEM/TTM-based r600 driver (and presumably many other
drivers as well) seems to allocate a buffer object for each XRender
picture or glyph, and most such objects are mapped into the X server's
address space with libdrm. After the system runs for a few days, the
number of mappings according to "wc -l /proc/$(pgrep Xorg)/maps" can
reach 5k-10k, with the vast majority being 4kB-sized mappings
to /dev/dri/card0 almost contiguously laid out in the address space.
Such a large number of mappings should be expected, given the numerous
distinct glyphs arising from different CJK characters, fonts, and sizes.
Note that libdrm_radeon's bo_unmap() keeps a buffer object mapped even
if it is no longer accessed by the CPU (and only calls munmap() when the
object is destroyed), which has certainly inflated the mapping count
significantly, but otherwise the mmap() overhead would be prohibitive.

Currently the kernel's arch_get_unmapped_area_topdown() is linear-time,
so further mmap() calls become very slow once so many mappings exist in
the X server's address space. Since redrawing a window usually involves
creating a significant number of temporary pixmaps or XRender pictures,
often requiring mapping by the X server, redrawing is slowed down
greatly. Although arch_get_unmapped_area_topdown() attempts to use
mm->free_area_cache to speed up the search, the cache is usually
invalidated by the mm->cached_hole_size test whenever the block size
being searched for is smaller than it was in the previous call; this
ensures that the function always finds the earliest sufficiently large
unmapped area in search order, thus reducing address space fragmentation
(commit 1363c3cd). Consequently, if successive mmap() calls use
different mapping sizes, as is often the case when dealing with pixmaps
larger than a page, the cache is invalidated almost half of the time,
and the amortized cost of each mmap() remains linear.
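
For reference, the search logic looks roughly like this (a condensed
paraphrase of the 2.6.3x mm/mmap.c code, not the verbatim source; it
assumes the kernel-internal mm types):

/* Condensed paraphrase of arch_get_unmapped_area_topdown(); many
 * details (alignment, MAP_FIXED, hint address, limits) are omitted. */
unsigned long get_area_topdown(struct mm_struct *mm, unsigned long len)
{
	struct vm_area_struct *vma;
	unsigned long addr;

	/* The cached position only helps if the request is at least as
	 * large as the largest hole skipped last time; otherwise the
	 * search must restart from the top of the mmap area. */
	if (len <= mm->cached_hole_size) {
		mm->cached_hole_size = 0;
		mm->free_area_cache = mm->mmap_base;
	}

	addr = mm->free_area_cache - len;
	do {
		/* One find_vma() per mapping stepped over, hence the
		 * O(number of vmas) cost of a single mmap() call. */
		vma = find_vma(mm, addr);
		if (!vma || addr + len <= vma->vm_start)
			return mm->free_area_cache = addr;

		/* Remember the largest hole skipped so far. */
		if (addr + mm->cached_hole_size < vma->vm_start)
			mm->cached_hole_size = vma->vm_start - addr;

		/* Try just below the current vma. */
		addr = vma->vm_start - len;
	} while (addr > len);

	return -ENOMEM;	/* the real code falls back to a bottom-up search */
}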

A quantitative measurement is made with the attached pbobench.cpp,
compiled with Makefile.pbobench. This program uses OpenGL pixel-buffer
objects (which correspond one-to-one to GEM buffer objects on my
system) to simulate the effect of having a large number of GEM-related
mappings in the X server. It first creates and maps N page-sized PBOs
to mimic the large number of XRender glyphs, then measures the time
needed to create/map/unmap/destroy more PBOs with sizes varying between
1 and 16384 bytes. The time spent per iteration (which does either a
create/map or an unmap/destroy) is clearly O(N):

N=100: 17.3us
N=1000: 19.9us
N=10000: 88.5us
N=20000: 205us
N=40000: 406us

and in oprofile results, the amount of CPU time spent in find_vma() can
reach 60-70%, while no other single function takes more than 3%.
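
For concreteness, the loop being timed looks roughly like the following
(a minimal sketch of what the attached benchmark does, not pbobench.cpp
itself; it assumes a GLUT-created GL context and Mesa-style
GL_GLEXT_PROTOTYPES):

/* Minimal sketch of the pbobench-style measurement; build with e.g.
 *   cc pbo_sketch.c -lglut -lGL -o pbo_sketch
 */
#define GL_GLEXT_PROTOTYPES 1
#include <GL/glut.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
	int N = argc > 1 ? atoi(argv[1]) : 10000, iters = 10000, i;
	GLuint *base = malloc(N * sizeof(*base));
	double t0, t1;

	glutInit(&argc, argv);
	glutCreateWindow("pbo-sketch");		/* just to get a GL context */

	/* N page-sized PBOs are created and left mapped, mimicking the
	 * long-lived glyph buffer objects in the X server. */
	glGenBuffers(N, base);
	for (i = 0; i < N; i++) {
		glBindBuffer(GL_PIXEL_UNPACK_BUFFER, base[i]);
		glBufferData(GL_PIXEL_UNPACK_BUFFER, 4096, NULL, GL_STREAM_DRAW);
		glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
	}

	/* Time create/map followed by unmap/destroy of further PBOs with
	 * sizes varying between 1 and 16384 bytes. */
	t0 = now();
	for (i = 0; i < iters; i++) {
		GLuint bo;
		glGenBuffers(1, &bo);
		glBindBuffer(GL_PIXEL_UNPACK_BUFFER, bo);
		glBufferData(GL_PIXEL_UNPACK_BUFFER, 1 + (i * 4099) % 16384,
			     NULL, GL_STREAM_DRAW);
		glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
		glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
		glDeleteBuffers(1, &bo);
	}
	t1 = now();
	printf("N=%d: %.1f us per create/map or unmap/destroy\n",
	       N, (t1 - t0) / iters / 2 * 1e6);
	return 0;
}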

I think this problem is not difficult to solve. While it isn't obvious
to me how to find the earliest sufficiently-large unmapped area quickly,
IMHO it is just as good, fragmentation-wise, to simply allocate from the
smallest sufficiently-large unmapped area regardless of its address; for
this purpose, the final "open-ended" unmapped area in the original
search order (i.e. the one with the lowest address in
arch_get_unmapped_area_topdown()) can be regarded as infinitely large,
so that it is only used (from the correct "end") when absolutely
necessary. In this way, a simple size-indexed rb-tree of the unmapped
areas allows the search to be performed in logarithmic time.
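
To illustrate the policy (a toy user-space model only, not a kernel
patch): pick the smallest hole that fits, treat the open-ended hole as
infinite, and allocate from the top end of the chosen hole. A real
implementation would keep the holes in an rb-tree indexed by size so
that this lookup is O(log n) instead of the linear scan shown here.

#include <stdio.h>

struct hole {
	unsigned long start, end;	/* [start, end); start == 0 marks the
					 * open-ended hole below all mappings */
};

static unsigned long best_fit(const struct hole *h, int n, unsigned long len)
{
	int i, best = -1;
	unsigned long best_sz = 0;

	for (i = 0; i < n; i++) {
		/* The open-ended hole is treated as infinitely large so
		 * that it is only chosen as a last resort. */
		unsigned long sz = h[i].start ? h[i].end - h[i].start : ~0UL;
		if (sz < len)
			continue;
		if (best < 0 || sz < best_sz) {
			best = i;
			best_sz = sz;
		}
	}
	/* Allocate top-down within the chosen hole; 0 means failure. */
	return best < 0 ? 0 : h[best].end - len;
}

int main(void)
{
	const struct hole holes[] = {
		{ 0,        0x600000 },	/* open-ended area below all mappings */
		{ 0x700000, 0x703000 },	/* 12kB hole between mappings */
		{ 0x900000, 0x901000 },	/* 4kB hole between mappings */
	};
	printf("%#lx\n", best_fit(holes, 3, 0x1000));	/* 0x900000: 4kB hole */
	printf("%#lx\n", best_fit(holes, 3, 0x2000));	/* 0x701000: 12kB hole */
	printf("%#lx\n", best_fit(holes, 3, 0x10000));	/* 0x5f0000: open-ended */
	return 0;
}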

As I'm not good at kernel hacking, for now I have written a userspace
workaround in libdrm, available from
https://github.com/r6144/libdrm/tree/my , which reserves some address
space and then allocates from it using MAP_FIXED; a rough sketch of the
idea follows the numbers below. Due to laziness, it is written in C++
and does not currently combine adjacent free blocks. This gives the
expected improvements in pbobench results:

N=100: 18.3us
N=1000: 18.0us
N=10000: 18.2us
N=20000: 18.9us
N=40000: 20.8us
N=80000: 23.5us
NOTE: N=80000 requires increasing /proc/sys/vm/max_map_count
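
The core of the workaround is roughly the following (a hand-wavy sketch
of the idea, not the actual code in the branch above; free-block reuse
and error handling are left out, so this is just a bump allocator):

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

#define RESERVE_SIZE	(1UL << 32)	/* 4GiB of address space (64-bit only) */

static uint8_t *reserve_base;
static size_t reserve_used;

/* Reserve a large PROT_NONE window once, so the kernel never has to
 * search for a hole on later bo mappings. */
static int reserve_init(void)
{
	reserve_base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	return reserve_base == MAP_FAILED ? -1 : 0;
}

/* Place a bo mapping at a caller-chosen address inside the window. */
static void *reserve_map_bo(int drm_fd, off_t offset, size_t size)
{
	void *addr = reserve_base + reserve_used;
	void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_FIXED, drm_fd, offset);
	if (p == MAP_FAILED)
		return NULL;
	reserve_used += (size + 4095) & ~4095UL;
	return p;
}

/* "Unmap" by re-establishing the PROT_NONE reservation over the range,
 * so the window stays contiguous and reusable. */
static void reserve_unmap_bo(void *p, size_t size)
{
	mmap(p, size, PROT_NONE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED, -1, 0);
}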

I am also running Xorg with this modified version of libdrm. So far it
runs okay, and seems to be somewhat snappier than before, although as
"wc -l /proc/$(pgrep Xorg)/maps" has only reached 4369 by now, the
improvement in responsiveness is not yet that great. I have not tested
the algorithm in 32-bit programs, but intuitively it should work.

(By the way, after this modification, SELinux's sidtab_search_context()
appears near the top of the profiling results due to the use of
linear-time search. It should eventually be optimized as well.)

Do you find it worthwhile to implement something similar in the kernel?
After all, the responsiveness improvement can be quite significant, and
it seems difficult to make the graphics subsystem do fewer mmap()'s
(e.g. by storing multiple XRender glyphs in a single buffer object).
Not to mention that other applications doing lots of mmap()'s for
whatever reason will benefit as well.

Please CC me as I'm not subscribed.

r6144


Attachments:
Makefile.pbobench (268.00 B)
pbobench.cpp (2.91 kB)

2011-03-28 18:24:19

by Lucas Stach

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

Hi,

I have seen this too in some traces I did with nouveau nvfx some time
ago. (The report in the kernel bugzilla is an outcome of this.) I'm
strongly in favour of fixing the kernel side, as I think a workaround in
userspace is a bad hack. In fact, doing so is on my long "list of things
to fix if I ever get a 48-hour day".

One thing that pulled me away from this is that doing something new in
mmap is a bit regression-prone: if we miss some corner case, it is very
easy to break someone's application.

--Lucas

On Monday, 2011-03-28 at 19:16 +0800, r6144 wrote:
> [...]

2011-03-29 14:22:57

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Mon, Mar 28, 2011 at 2:13 PM, Lucas Stach <[email protected]> wrote:
> [...]

Killer solution would be to have no mapping and a decent
upload/download ioctl that can take userpage.
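
Something along these lines, say (a purely hypothetical sketch, the
names are made up; the kernel would pin the userpages and DMA/blit
between them and the bo, so the client never maps the bo at all):

#include <stdint.h>

struct drm_bo_transfer {
	uint32_t handle;	/* buffer object to upload to / download from */
	uint32_t flags;		/* direction, sync behaviour, ... */
	uint64_t bo_offset;	/* byte offset inside the bo */
	uint64_t size;		/* number of bytes to transfer */
	uint64_t user_ptr;	/* ordinary user memory, pinned by the kernel */
};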

Cheers,
Jerome

2011-03-29 14:44:39

by r6144

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Tue, 2011-03-29 at 10:22 -0400, Jerome Glisse wrote:

> Killer solution would be to have no mapping and a decent
> upload/download ioctl that can take userpage.

Doesn't this sound like GEM's read/write interface implemented by e.g.
the i915 driver? But if I understand correctly, a mmap-like interface
should still be necessary if we want to implement e.g. glMapBuffer()
without extra copying.

r6144

2011-03-29 15:23:26

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

2011/3/29 r6144 <[email protected]>:
> On Tue, 2011-03-29 at 10:22 -0400, Jerome Glisse wrote:
>
>> Killer solution would be to have no mapping and a decent
>> upload/download ioctl that can take userpage.
>
> Doesn't this sound like GEM's read/write interface implemented by e.g.
> the i915 driver? But if I understand correctly, a mmap-like interface
> should still be necessary if we want to implement e.g. glMapBuffer()
> without extra copying.
>
> r6144
>
>
glMapBuffer should not be used; it's really not a good way to do things.
Anyway, the extra copy might be unavoidable given that sometimes the
front/back buffers might be in unmappable vram, or might have a memory
layout that is not the one specified at buffer creation (this is very
common when tiling is used, for instance). So even considering MapBuffer
or a similar function, I believe it's a lot better to not allow buffer
mapping in userspace, but to provide upload/download hooks that can use
userpages to avoid extra copies as much as possible.

Cheers,
Jerome

2011-03-29 18:01:08

by Lucas Stach

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Tuesday, 2011-03-29 at 11:23 -0400, Jerome Glisse wrote:
> [...]

Wouldn't this give us a performance penalty for short-lived resources
like vbo's, which are located in GART memory? Mmap allows us to write
directly to this drm-controlled portion of sysram. With a copy-based
implementation we would have to allocate the buffer in sysram just to
copy it over to another portion of sysram, which seems a little insane
to me, but I'm not an expert here.

-- Lucas

2011-03-29 19:45:37

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Tue, Mar 29, 2011 at 2:01 PM, Lucas Stach <[email protected]> wrote:
> [...]

Short-lived & small bos definitely wouldn't work well for this kind of
API; it would all be a function of the ioctl cost. But I am not sure
the drawback would be that big; Intel tested with pread/pwrite and gave
up, I don't remember why. For the vbo case you describe, the scheme I
was thinking of would be: allocate a bo and, on the BufferData call,
upload to the allocated bo using the bind-userpage feature, which would
mean zero extra copy operations. For the fire-and-forget case of vbos,
likely some kind of transient buffer would be more appropriate.

Cheers,
Jerome

2011-03-29 20:26:31

by Daniel Vetter

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Tue, Mar 29, 2011 at 03:45:34PM -0400, Jerome Glisse wrote:
> [...]

Just to clarify: Uploads to linear buffers are all done with pwrite (due
to an API foobar, it's not so great for 2d/tiled stuff). It's amazing how
much faster that is: Switching vbo's from mmap to pwrite has given a few
percent more fps in openarena in i915g! As long as the chunk you're gonna
write fits into the L1 cache, it's probably a net win.
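
For reference, the pwrite path is just one ioctl per upload, roughly
like this (sketch only, error handling omitted; it uses the structures
from libdrm's i915_drm.h):

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Upload 'size' bytes of ordinary memory into a bo without ever
 * mapping the bo into the client's address space. */
static int bo_pwrite(int fd, uint32_t handle, uint64_t offset,
		     const void *data, uint64_t size)
{
	struct drm_i915_gem_pwrite pwrite;

	memset(&pwrite, 0, sizeof(pwrite));
	pwrite.handle = handle;			/* GEM handle of the bo */
	pwrite.offset = offset;			/* byte offset inside the bo */
	pwrite.size = size;
	pwrite.data_ptr = (uintptr_t)data;	/* source in user memory */

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite);
}
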
-Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48

2011-03-29 21:05:03

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Tue, Mar 29, 2011 at 4:26 PM, Daniel Vetter <[email protected]> wrote:
> [...]

What I had in mind was something a little bit more advanced than pwrite,
something that would take the width, height, and pitch of a userpage
region and would be able to perform a proper blit. But yes, pwrite in
Intel is kind of limited.

Cheers,
Jerome

2011-03-29 21:57:51

by Dave Airlie

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, Mar 30, 2011 at 7:04 AM, Jerome Glisse <[email protected]> wrote:
> [...]

TTM has support for userpage binding we just don't use it.

Dave.

2011-03-30 07:32:39

by Chris Wilson

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, 30 Mar 2011 07:57:49 +1000, Dave Airlie <[email protected]> wrote:
> [...]
> TTM has support for userpage binding we just don't use it.

Yes, and I've been experimenting with the same in GEM to great effect in
the DDX. The complication remains in managing the CPU synchronisation,
which suggests that it would only be useful for STREAM_DRAW objects (and
perhaps the sub-region updates to STATIC_DRAW). (And for readback, if
retrieving the data were the actual bottleneck.)

And I did play with a new pread/pwrite interface that did as you suggest,
binding the user pages and performing a blit. But by the time you make the
interface asynchronous, it becomes much easier to let the client code
create the mapping and be fully aware of the barriers.

And yes, I do concur that vma bookkeeping does impose significant
overheads, and I have been removing as many mappings from our drivers as
I can, within the limitations of the pwrite interface.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2011-03-30 13:28:12

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, Mar 30, 2011 at 3:32 AM, Chris Wilson <[email protected]> wrote:
> [...]

What do you mean by CPU synchronisation? In what I had in mind, the
upload/download would block userspace until the operation is complete;
this would make upload/download a barrier. Of course it doesn't play
well with use cases where you keep uploading/downloading (an idea to
alleviate that is to allow several downloads/uploads in one ioctl call).

Cheers,
Jerome

2011-03-30 14:07:32

by Chris Wilson

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, 30 Mar 2011 09:28:07 -0400, Jerome Glisse <[email protected]> wrote:
> [...]

Yes, that is the issue: having to control access to the user pages whilst
they are in use by the GPU. A completely synchronous API for performing
a single pwrite with the blitter is too slow, much slower than doing an
uncached write with the CPU and queueing up multiple blits (as we
currently do).

The API I ended up with for the pwrite using the BLT was to specify a 2D
region (addr, width, height, stride, flags etc) and a list of clip
rects. At which point I grew disenchanted, and realised that simply
creating a bo for mapping user pages was the far better solution.
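
Conceptually it looked something like this (a hypothetical
reconstruction, never merged, and the names are made up):

#include <stdint.h>

struct gem_pwrite_2d {
	uint32_t handle;	/* destination bo */
	uint32_t flags;
	uint64_t user_addr;	/* source pixels in user memory */
	uint32_t width, height;	/* region size in pixels */
	uint32_t stride;	/* source stride in bytes */
	uint32_t dst_x, dst_y;	/* destination origin inside the bo */
	uint32_t num_cliprects;
	uint64_t cliprects_ptr;	/* array of clip rectangles */
};
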
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2011-03-30 15:07:11

by Jerome Glisse

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, Mar 30, 2011 at 10:07 AM, Chris Wilson <[email protected]> wrote:
> [...]

What kind of usage didn't play well with synchronous upload/download? X, GL?

Cheers,
Jerome

2011-03-30 15:23:20

by Chris Wilson

Subject: Re: GEM-related desktop sluggishness due to linear-time arch_get_unmapped_area_topdown()

On Wed, 30 Mar 2011 11:07:08 -0400, Jerome Glisse <[email protected]> wrote:
> What kind of usage didn't play well with synchronous upload/download? X, GL?

Performing fallback rendering for X using a shadow buffer.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre