2005-12-21 17:23:36

by Andrey Volkov

Subject: [RFC] genalloc != generic DEVICE memory allocator

Hello Jes and all,

I have been trying to use your allocator (gen_pool_xxx), and the idea
behind it is a neat one. But the current implementation is
inappropriate for _device_ (aka on-chip, like framebuffer) memory
allocation, for the following reasons:

1) Device memory is an expensive resource in terms of access time and/or
size. So we (usually) can't use this memory for the free block lists.
2) Device memory usually has special access requirements
(alignment/special insns). So we can't use part of the allocated
blocks for control structures (this problem is solved in your
implementation; it's a general remark).
3) The obvious (IMHO) workflow of a memory allocator looks like this:
- at startup time, the driver allocates some big
(almost) static memory chunk(s) for control/data structures.
- while the device is running, the driver allocates many small
memory blocks of almost identical size.
Such behavior makes the buddy method degenerate into a
first/best-fit method (with a long search
through the free node list).
4) The simple binary buddy method is far from perfect for a device
because of its large internal fragmentation, especially for
network/mfd devices, where the size of the allocated data is very
often not a power of 2.
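
To put a number on point 4, here is a minimal sketch of how much a
power-of-two buddy block wastes for a given request size. The helper
names are hypothetical and this is not genalloc code; it only assumes
that requests are rounded up to the next power of two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a request up to the next power of two, the way a simple binary
 * buddy allocator sizes its blocks (hypothetical helper, not genalloc).
 */
static size_t buddy_block_size(size_t request)
{
    size_t block = 1;

    assert(request > 0);
    while (block < request)
        block <<= 1;
    return block;
}

/* Bytes lost inside the block to internal fragmentation. */
static size_t buddy_waste(size_t request)
{
    return buddy_block_size(request) - request;
}
```

For a 1518-byte maximum-size Ethernet frame, buddy_block_size() gives
2048, so 530 bytes (about a quarter of the block) are lost per buffer.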

I have started modifying your code to meet the above demands,
but first I would like to hear your, or somebody else's, opinion.

I would be especially happy if somebody has, and could
share, some device-specific memory usage statistics.

--
Regards
Andrey Volkov


2005-12-22 08:43:56

by Pantelis Antoniou

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Andrey Volkov wrote:
> Hello Jes and all
>
> I try to use your allocator (gen_pool_xxx), idea of which
> is a cute nice thing. But current implementation of it is
> inappropriate for a _device_ (aka onchip, like framebuffer) memory
> allocation, by next reasons:
>
> 1) Device memory is expensive resource by access time and/or size cost.
> So we couldn't use (usually) this memory for the free blocks lists.
> 2) Device memory usually have special requirement of access to it
> (alignment/special insn). So we couldn't use part of allocated
> blocks for some control structures (this problem solved in your
> implementation, it's common remark)
> 3) Obvious (IMHO) workflow of mem. allocator look like:
> - at startup time, driver allocate some big
> (almost) static mem. chunk(s) for a control/data structures.
> - during work of the device, driver allocate many small
> mem. blocks with almost identical size.
> such behavior lead to degeneration of buddy method and
> transform it to the first/best fit method (with long seek
> by the free node list).
> 4) The simple binary buddy method is far away from perfect for a device
> due to a big internal fragmentation. Especially for a
> network/mfd devices, for which, size of allocated data very
> often is not a power of 2.
>
> I start to modify your code to satisfy above demands,
> but firstly I wish to know your, or somebody else, opinion.
>
> Especially I will very happy if somebody have and could
> provide to all, some device specific memory usage statistics.
>

Hi Andrey,

FYI, in arch/ppc/lib/rheap.c there's an implementation of a remote heap.

It is currently used for the management of Freescale's CPM1 & CPM2 internal
dual port RAM.

Take a look, it might be what you have in mind.

Regards

Pantelis

2005-12-22 13:41:33

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Hi Sylvain,

Sylvain Munaut wrote:
> Hi Andrey,
>
>
> Didn't I sent you the memory allocator I wrote a few month back for 5200
> SRAM ?
Yes, I received it and currently use it for Bestcomm, but,
as I wrote before, I am also writing another driver which needs an
allocator too, and sram_xxx/gen_pool_xxx is completely inappropriate for it
(since the device is PCI based). Also, trust me, it would be the 6th or 7th
allocator implementation I have written, which is more than enough to make me
sick of allocators.

Besides, IMHO, yet another allocator in the kernel (currently almost every
driver for a device with onboard dynamically allocated memory implements
some form of buddy/first-fit allocator) will bring yet more bugs into the
kernel that were ALREADY FIXED in a driver in the neighbouring directory.

>
> It uses the sram itself for the free block list but without using any
> (iow, you could allocate the whole SRAM, no memory is wasted). The SRAM
> is on-chip so pretty fast access. That kind of allocator is no good for
> memory on a PCI board or such though (bad access time ! using main
> memory would be better)
>
> Sylvain

Completely agree, but only for the BESTCOMM case. This is what I had in mind
when I wrote 'usually' in 1) ;). Also, don't forget about the storage
for the size of allocated blocks (which is later passed to free): in the
sram_xxx case main memory is used for it, either indirectly (when you pass
a constant as the parameter) or directly (when you store it in a data
struct). So, IMO, it is better to use main memory directly and control it
in one place than to try to catch bugs caused by an invalid size passed
to free.


>
>
> Andrey Volkov wrote:
>
>>Hello Jes and all
>>
>>I try to use your allocator (gen_pool_xxx), idea of which
>>is a cute nice thing. But current implementation of it is
>>inappropriate for a _device_ (aka onchip, like framebuffer) memory
>>allocation, by next reasons:
>>
>> 1) Device memory is expensive resource by access time and/or size cost.
>> So we couldn't use (usually) this memory for the free blocks lists.
>> 2) Device memory usually have special requirement of access to it
>> (alignment/special insn). So we couldn't use part of allocated
>> blocks for some control structures (this problem solved in your
>> implementation, it's common remark)
>> 3) Obvious (IMHO) workflow of mem. allocator look like:
>> - at startup time, driver allocate some big
>> (almost) static mem. chunk(s) for a control/data structures.
>> - during work of the device, driver allocate many small
>> mem. blocks with almost identical size.
>> such behavior lead to degeneration of buddy method and
>> transform it to the first/best fit method (with long seek
>> by the free node list).
>> 4) The simple binary buddy method is far away from perfect for a device
>> due to a big internal fragmentation. Especially for a
>> network/mfd devices, for which, size of allocated data very
>> often is not a power of 2.
>>
>>I start to modify your code to satisfy above demands,
>>but firstly I wish to know your, or somebody else, opinion.
>>
>>Especially I will very happy if somebody have and could
>>provide to all, some device specific memory usage statistics.
>>

--
Regards
Andrey Volkov

P.S. Oops, sorry for the duplicate, I forgot to add the CC in my previous reply.

2005-12-22 13:48:21

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Hi Pantelis,

Pantelis Antoniou wrote:
> Andrey Volkov wrote:
>
>> Hello Jes and all
>>
>> I try to use your allocator (gen_pool_xxx), idea of which
>> is a cute nice thing. But current implementation of it is
>> inappropriate for a _device_ (aka onchip, like framebuffer) memory
>> allocation, by next reasons:
>>
>> 1) Device memory is expensive resource by access time and/or size cost.
>> So we couldn't use (usually) this memory for the free blocks lists.
>> 2) Device memory usually have special requirement of access to it
>> (alignment/special insn). So we couldn't use part of allocated
>> blocks for some control structures (this problem solved in your
>> implementation, it's common remark)
>> 3) Obvious (IMHO) workflow of mem. allocator look like:
>> - at startup time, driver allocate some big
>> (almost) static mem. chunk(s) for a control/data structures.
>> - during work of the device, driver allocate many small
>> mem. blocks with almost identical size.
>> such behavior lead to degeneration of buddy method and
>> transform it to the first/best fit method (with long seek
>> by the free node list).
>> 4) The simple binary buddy method is far away from perfect for a device
>> due to a big internal fragmentation. Especially for a
>> network/mfd devices, for which, size of allocated data very
>> often is not a power of 2.
>>
>> I start to modify your code to satisfy above demands,
>> but firstly I wish to know your, or somebody else, opinion.
>>
>> Especially I will very happy if somebody have and could
>> provide to all, some device specific memory usage statistics.
>>
>
> Hi Andrey,
>
> FYI, on arch/ppc/lib/rheap.c theres an implementation of a remote heap.
>
> It is currently used for the management of freescale's CPM1 & CPM2 internal
> dual port RAM.
>
> Take a look, it might be what you have in mind.
>
> Regards
>
> Pantelis

Thanks, I missed it (and small wonder! :( ).

Andrew, has anybody counted HOW MANY device-specific implementations
of buddy/first-fit allocators are in the kernel now?

--
Regards
Andrey Volkov

2005-12-22 14:21:15

by Pantelis Antoniou

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Andrey Volkov wrote:
> Hi Pantelis,
>
> Pantelis Antoniou wrote:
>
>>Andrey Volkov wrote:
>>

[snip]

>>
>>Hi Andrey,
>>
>>FYI, on arch/ppc/lib/rheap.c theres an implementation of a remote heap.
>>
>>It is currently used for the management of freescale's CPM1 & CPM2 internal
>>dual port RAM.
>>
>>Take a look, it might be what you have in mind.
>>
>>Regards
>>
>>Pantelis
>
>
> Thanks I missed it (and small wonder! :( ).
>
> Andrew, Is somebody count HOW MANY dev specific implementation
> of buddy/first-fit allocators now in kernel?
>

Yes, it is indeed messy.

The rheap implementation is generic enough and I believe it can cover most
special memory allocator needs. If you'd like, I could move it somewhere
generic and test it.

Regards

Pantelis

2005-12-22 15:37:50

by Jes Sorensen

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

>>>>> "Andrey" == Andrey Volkov <[email protected]> writes:

Andrey> Hello Jes and all I try to use your allocator (gen_pool_xxx),
Andrey> idea of which is a cute nice thing. But current implementation
Andrey> of it is inappropriate for a _device_ (aka onchip, like
Andrey> framebuffer) memory allocation, by next reasons:

Andrey,

Keep in mind that genalloc was meant to be simple for basic memory
allocations. It was never meant to be an over complex super high
performance allocation mechanism.

Andrey> 1) Device memory is expensive resource by access time and/or
Andrey> size cost. So we couldn't use (usually) this memory for the
Andrey> free blocks lists.

This really is irrelevant, the space is only used within the object
when it's on the free list. Ie. if all memory is handed out there's
no space used for this purpose.

Andrey> 3) Obvious (IMHO) workflow of mem. allocator
Andrey> look like: - at startup time, driver allocate some big
Andrey> (almost) static mem. chunk(s) for a control/data structures.
Andrey> - during work of the device, driver allocate many small
Andrey> mem. blocks with almost identical size. such behavior lead to
Andrey> degeneration of buddy method and transform it to the
Andrey> first/best fit method (with long seek by the free node list).

This is only really valid for network devices, and even then it's not
quite so. For things like uncached allocations your observation is
completely off.

For the case of more traditional devices, the control structures will
be allocated from one end of the block, the rest will be used for
packet descriptors which will be going in and out of the memory pool
on a regular basis. In most normal cases these will all be of the same
size and it doesn't matter where in the memory space they were
allocated.

Andrey> 4) The simple binary buddy method is far away from perfect for
Andrey> a device due to a big internal fragmentation. Especially for a
Andrey> network/mfd devices, for which, size of allocated data very
Andrey> often is not a power of 2.

For network devices it's perfectly adequate as it will almost always
satisfy what I described above. Incoming packets will always be
allocated for a full MTU sized packet hence all allocated blocks will
be of the same size. For outgoing packets, the allocation is short
lived and while it may be that a good chunk of packets aren't all full
MTU sized, it is rarely worth the hassle of trying to make the
allocator allow to-the-byte sized allocations as the number of
outstanding outgoing packets will be very limited.

Andrey> I start to modify your code to satisfy above demands, but
Andrey> firstly I wish to know your, or somebody else, opinion.

I honestly don't think the majority of your demands are valid.
genalloc was meant to be simple, not an ultra fast at any random
block size allocator. So far I don't see any reason for changing
the allocation algorithm into anything much more complex - doesn't
mean there couldn't be a reason for doing so, but I don't think you
have described any so far.

You mentioned frame buffers, but what is the kernel supposed to do
with those allocation wise? If you have a frame buffer console, the
memory is allocated once and handed to the frame buffer driver.
Ie. you don't need a ton of on demand allocations for that and for
X, the memory management is handled in the X server, not by the
kernel.

The only thing I think would make sense to implement is to allow it to
use indirect descriptor blocks for the memory it manages. This is not
because it's wrong to use the memory for the free list, as it will
only be used for this when the chunk is not in use, but because access
to certain types of memory isn't always valid through normal direct
access. Ie. if one used descriptor blocks residing in normal
GFP_KERNEL memory, it would be possible to use the allocator to manage
memory sitting on the other side of a PCI bus.
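
As a rough illustration of the indirect-descriptor idea (a sketch only,
with made-up names; this is neither the genalloc nor the rheap API): the
free ranges are tracked in an ordinary host-memory array, and the
allocator only does arithmetic on device addresses without ever
dereferencing them. MAX_FREE, NO_SPACE, struct remote_pool and the
pool_* functions are all assumptions for the example.

```c
#include <assert.h>

#define MAX_FREE 16                       /* assumed descriptor limit */
#define NO_SPACE ((unsigned long)-1)      /* assumed failure value */

/* One free range of device memory, described from the host side. */
struct range {
    unsigned long start;
    unsigned long size;
};

/* The descriptor table lives in normal memory, not in the device. */
struct remote_pool {
    struct range free[MAX_FREE];
    int nfree;
};

static void pool_init(struct remote_pool *p, unsigned long base,
                      unsigned long size)
{
    p->free[0].start = base;
    p->free[0].size = size;
    p->nfree = 1;
}

/* First fit: carve the request off the front of the first range that fits. */
static unsigned long pool_alloc(struct remote_pool *p, unsigned long size)
{
    int i;

    assert(size > 0);
    for (i = 0; i < p->nfree; i++) {
        struct range *r = &p->free[i];
        unsigned long addr;

        if (r->size < size)
            continue;
        addr = r->start;
        r->start += size;
        r->size -= size;
        if (r->size == 0)   /* drop the exhausted descriptor */
            p->free[i] = p->free[--p->nfree];
        return addr;
    }
    return NO_SPACE;
}
```

Because the returned value is just a number, the same code could manage
a PCI BAR offset, an SRAM address or any other address space the CPU
cannot or should not touch directly.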

Regards,
Jes

2005-12-22 15:44:46

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Pantelis Antoniou wrote:
> Andrey Volkov wrote:
>
>> Hi Pantelis,
>>
>> Pantelis Antoniou wrote:
>>
>>> Andrey Volkov wrote:
>>>
>
> [snip]
>
>>>
>>> Hi Andrey,
>>>
>>> FYI, on arch/ppc/lib/rheap.c theres an implementation of a remote heap.
>>>
>>> It is currently used for the management of freescale's CPM1 & CPM2
>>> internal
>>> dual port RAM.
>>>
>>> Take a look, it might be what you have in mind.
>>>
>>> Regards
>>>
>>> Pantelis
>>
>>
>>
>> Thanks I missed it (and small wonder! :( ).
>>
>> Andrew, Is somebody count HOW MANY dev specific implementation
>> of buddy/first-fit allocators now in kernel?
>>
>
> Yes, it is indeed messy.
>
> The rheap implementation is generic enough and I believe can fit most of
> the
> special memory allocators needs. If you'd like I could move it somewhere
> generic and test it.
>
I'm sure lib/ would be an appropriate place, plus something like
"DON'T REINVENT THE WHEEL, FIX WHAT EXISTS" in documentation/ :).

Now a couple of words about rheap: I understand why you use a static
alignment in the allocator, but that is very specific to CPM. IMO, the
alignment must be a parameter of xx_alloc. For example, a device may demand
8-byte alignment, which is OK until... you try to map this memory into user
space (don't shoot me, remember framebuffer & co).
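
A minimal sketch of what a per-call alignment parameter could look like.
The function names are hypothetical (this is not the rheap interface),
and the 4096-byte page size used in the note below is an assumption.

```c
#include <assert.h>

/* Round addr up to a multiple of align; align must be a power of two. */
static unsigned long align_up(unsigned long addr, unsigned long align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Allocate 'want' bytes with the requested alignment from one free
 * range [*start, *start + *size). On success the range is shrunk and
 * the aligned address is returned; on failure (unsigned long)-1 is
 * returned and the range is left untouched. The padding skipped in
 * front of the block is simply lost in this sketch.
 */
static unsigned long range_alloc_aligned(unsigned long *start,
                                         unsigned long *size,
                                         unsigned long want,
                                         unsigned long align)
{
    unsigned long addr = align_up(*start, align);
    unsigned long pad = addr - *start;

    if (pad > *size || *size - pad < want)
        return (unsigned long)-1;
    *start = addr + want;
    *size -= pad + want;
    return addr;
}
```

The same range can then serve both cases: align = 8 for a descriptor the
device reads, align = 4096 for a buffer that will be mapped into user
space.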

--
Regards
Andrey Volkov

2005-12-22 15:59:22

by Pantelis Antoniou

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

>

[snip]

> I'm sure lib/ will be appropriate place. and something like
> "DON'T TRY REINVENT WHEEL, TRY FIX EXISTS" in documentation/ :).
>
> Now couple word about rheap: I understand why you are use static
> alignment in allocator, but its very specialized for CPM. IMO, align
> must be a param of xx_alloc. For ex: device may demand alignment by
> 8 bytes, which ok until... you are try map this memory to the user
> space (don't shoot at me, remember about framebuffer & co).
>

It is trivial to align to a given alignment in a call. Please search
the archives since this was needed for CPM2 and I've committed a patch.

As for mapping to user space, since rheap only deals with addresses and never
touches the memory it's supposed to control, you can do pretty much everything.

I still don't understand what you are trying to do, however.

Mind explaining?

> --
> Regards
> Andrey Volkov
>

Regards

Pantelis

2005-12-22 18:19:05

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

Hi Jes,

Jes Sorensen wrote:
>>>>>>"Andrey" == Andrey Volkov <[email protected]> writes:
>
>
> Andrey> Hello Jes and all I try to use your allocator (gen_pool_xxx),
> Andrey> idea of which is a cute nice thing. But current implementation
> Andrey> of it is inappropriate for a _device_ (aka onchip, like
> Andrey> framebuffer) memory allocation, by next reasons:
>
> Andrey,
>
> Keep in mind that genalloc was meant to be simple for basic memory
> allocations. It was never meant to be an over complex super high
> performance allocation mechanism.
>
> Andrey> 1) Device memory is expensive resource by access time and/or
> Andrey> size cost. So we couldn't use (usually) this memory for the
> Andrey> free blocks lists.
>
> This really is irrelevant, the space is only used within the object
> when it's on the free list. Ie. if all memory is handed out there's
> no space used for this purpose.

I pointed out 2 reasons, and ACCESS TIME was the first :). Take a very
widespread case: a PCI device with some onboard memory and any
N GHz processor. The result may be terrible: each access to device memory
(which is usually uncached) will slow this super fast processor down to
33 MHz, i.e. the same as if we busy-waited with interrupts disabled after
each read/write...

I was possibly unclear when I used 'control structures' in 2): I meant the
allocator's control structures (size/next etc.), not device-specific
control structs.

>
> Andrey> 3) Obvious (IMHO) workflow of mem. allocator
> Andrey> look like: - at startup time, driver allocate some big
> Andrey> (almost) static mem. chunk(s) for a control/data structures.
> Andrey> - during work of the device, driver allocate many small
> Andrey> mem. blocks with almost identical size. such behavior lead to
> Andrey> degeneration of buddy method and transform it to the
> Andrey> first/best fit method (with long seek by the free node list).
>
> This is only really valid for network devices, and even then it's not
> quite so. For things like uncached allocations your observation is
> completely off.

Could you give me some examples? Possibly I overlooked something
significant.

>
> For the case of more traditional devices, the control structures will
> be allocated from one end of the block, the rest will be used for
> packet descriptors which will be going in and out of the memory pool
> on a regular basis.

This was the main reason why I tried to modify genalloc: I need a
generic allocator for both short-lived, strictly aligned blocks and
long-lived blocks with size restrictions.

> In most normal cases these will all be of the same
> size and it doesn't matter where in the memory space they were
> allocated.

And that's also why I consider 'buddy' inappropriate as a
'generic' allocator (most cases == generic, isn't it :)?): when you allocate
mainly same-sized blocks, 'buddy' degrades to first-fit.

A possible solution, as I see it, is a mix of first-fit with lazy
coalescing for short-lived blocks and first-fit with immediate coalescing
for long-lived blocks. But, again, I may be overlooking something
significant. And I could certainly have overlooked someone else's allocator
implementation in some driver.
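
To make the "immediate coalescing" half concrete, here is a sketch of a
free path that merges a returned block with its neighbours right away.
It assumes the free ranges are kept sorted by address in a host-memory
array; struct free_list, MAX_FREE and free_coalesce() are made-up names.
The lazy variant would instead just record the block and run the same
merge later (for example only when an allocation fails).

```c
#include <assert.h>
#include <string.h>

#define MAX_FREE 16     /* assumed descriptor limit */

struct range {
    unsigned long start;
    unsigned long size;
};

struct free_list {
    struct range r[MAX_FREE];   /* kept sorted by start address */
    int n;
};

/*
 * Return [start, start + size) to the list, merging with adjacent free
 * ranges immediately. Returns 0 on success, -1 if the table is full.
 */
static int free_coalesce(struct free_list *fl, unsigned long start,
                         unsigned long size)
{
    int i = 0;
    int merge_prev, merge_next;

    assert(size > 0);
    /* Find the first free range that lies after the freed block. */
    while (i < fl->n && fl->r[i].start < start)
        i++;

    merge_prev = i > 0 && fl->r[i - 1].start + fl->r[i - 1].size == start;
    merge_next = i < fl->n && start + size == fl->r[i].start;

    if (merge_prev && merge_next) {
        /* The block bridges two ranges: fold all three into one. */
        fl->r[i - 1].size += size + fl->r[i].size;
        memmove(&fl->r[i], &fl->r[i + 1],
                (fl->n - i - 1) * sizeof(fl->r[0]));
        fl->n--;
    } else if (merge_prev) {
        fl->r[i - 1].size += size;
    } else if (merge_next) {
        fl->r[i].start = start;
        fl->r[i].size += size;
    } else {
        if (fl->n == MAX_FREE)
            return -1;
        /* No neighbour to merge with: insert a new descriptor. */
        memmove(&fl->r[i + 1], &fl->r[i],
                (fl->n - i) * sizeof(fl->r[0]));
        fl->r[i].start = start;
        fl->r[i].size = size;
        fl->n++;
    }
    return 0;
}
```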

>
> Andrey> 4) The simple binary buddy method is far away from perfect for
> Andrey> a device due to a big internal fragmentation. Especially for a
> Andrey> network/mfd devices, for which, size of allocated data very
> Andrey> often is not a power of 2.
>
snip
>
> Andrey> I start to modify your code to satisfy above demands, but
> Andrey> firstly I wish to know your, or somebody else, opinion.
>
> I honestly don't think the majority of your demands are valid.
> genalloc was meant to be simple, not an ultra fast at any random
> block size allocator. So far I don't see any reason for changing to
> the allocation algorithm into anything much more complex - doesn't
> mean there couldn't be a reason for doing so, but I don't think you
> have described any so far.
I disagree here: a generic allocator can't be very simple and slow, because
in that case simply no one will use it, and hence we'll get today's
picture: reimplemented allocators in many drivers.

>
> You mentioned frame buffers, but what is the kernel supposed to do
> with those allocation wise? If you have a frame buffer console, the
> memory is allocated once and handed to the frame buffer driver.
> Ie. you don't need a ton of on demand allocations for that and for
> X, the memory management is handled in the X server, not by the
> kernel.

For a video-only device this is true, but if the device is multifunctional,
which is a frequent case in embedded systems, then the kernel must control
device memory allocation. Nowadays, however, even video cards for
desktops are becoming more and more multifunctional (VIVO/audio etc.).

>
> The only thing I think would make sense to implement is to allow it to
> use indirect descriptor blocks for the memory it manages. This is not
> because it's wrong to use the memory for the free list, as it will
> only be used for this when the chunk is not in use, but because access
> to certain types of memory isn't always valid through normal direct
> access. Ie. if one used descriptor blocks residing in normal
> GFP_KERNEL memory, it would be possible to use the allocator to manage
> memory sitting on the other side of a PCI bus.
I described above why we couldn't/wouldn't use onboard memory for
allocator-specific data.

Pantelis, have I answered your question (...what are you trying to
do...) too?

--
Regards
Andrey Volkov

2005-12-22 18:23:07

by Pantelis Antoniou

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

On Thursday 22 December 2005 20:18, Andrey Volkov wrote:
> Hi Jes,
>
> Jes Sorensen wrote:
> >>>>>>"Andrey" == Andrey Volkov <[email protected]> writes:
> >
> >
> > Andrey> Hello Jes and all I try to use your allocator (gen_pool_xxx),
> > Andrey> idea of which is a cute nice thing. But current implementation
> > Andrey> of it is inappropriate for a _device_ (aka onchip, like
> > Andrey> framebuffer) memory allocation, by next reasons:
> >
> > Andrey,
> >
> > Keep in mind that genalloc was meant to be simple for basic memory
> > allocations. It was never meant to be an over complex super high
> > performance allocation mechanism.
> >
> > Andrey> 1) Device memory is expensive resource by access time and/or
> > Andrey> size cost. So we couldn't use (usually) this memory for the
> > Andrey> free blocks lists.
> >
> > This really is irrelevant, the space is only used within the object
> > when it's on the free list. Ie. if all memory is handed out there's
> > no space used for this purpose.
>
> I point out 2 reasons: ACCESS TIME was first :), let take very
> widespread case: PCI device with some onboard memory and any
> N GHz proc. - result may be terrible: each access to device mem (which
> usually uncached) will slowed down this super fast proc to 33 MHZ, i.e
> same as we made busy-wait with disabled interrupts after each read/write...
>
> I possible awry when use 'control structures' in 2), I've in view
> allocator's control structures (size/next etc), not device specific
> control structs.
>
> >
> > Andrey> 3) Obvious (IMHO) workflow of mem. allocator
> > Andrey> look like: - at startup time, driver allocate some big
> > Andrey> (almost) static mem. chunk(s) for a control/data structures.
> > Andrey> - during work of the device, driver allocate many small
> > Andrey> mem. blocks with almost identical size. such behavior lead to
> > Andrey> degeneration of buddy method and transform it to the
> > Andrey> first/best fit method (with long seek by the free node list).
> >
> > This is only really valid for network devices, and even then it's not
> > quite so. For things like uncached allocations your observation is
> > completely off.
>
> Could you give me some examples? Possible I overlooked something
> significant.
>
> >
> > For the case of more traditional devices, the control structures will
> > be allocated from one end of the block, the rest will be used for
> > packet descriptors which will be going in and out of the memory pool
> > on a regular basis.
>
> This was main reason why I try to modify genalloc: I needed in
> generic allocator for both short-live strictly aligned blocks and
> long-live blocks with restriction by size.
>
> > In most normal cases these will all be of the same
> > size and it doesn't matter where in the memory space they were
> > allocated.
>
> And thats also why I consider that 'buddy' is not appropriate to be
> 'generic' (most cases == generic, isn't is :)?): when you're allocate
> mainly same sized blocks, 'buddy' degraded to the first-fit.
>
> Possible solution I see in mixed first-fit with lazy coalescent for
> short lived blocks and first-fit with immediately coalescent for
> long-lived blocks. But, again, I may overlook something significant.
> And, certainly, I could overlooked someone else allocator implementation
> in some driver.
>
> >
> > Andrey> 4) The simple binary buddy method is far away from perfect for
> > Andrey> a device due to a big internal fragmentation. Especially for a
> > Andrey> network/mfd devices, for which, size of allocated data very
> > Andrey> often is not a power of 2.
> >
> snip
> >
> > Andrey> I start to modify your code to satisfy above demands, but
> > Andrey> firstly I wish to know your, or somebody else, opinion.
> >
> > I honestly don't think the majority of your demands are valid.
> > genalloc was meant to be simple, not an ultra fast at any random
> > block size allocator. So far I don't see any reason for changing to
> > the allocation algorithm into anything much more complex - doesn't
> > mean there couldn't be a reason for doing so, but I don't think you
> > have described any so far.
> I disagree here, generic couldn't be very simple and slow, because in
> this case simply no one will be use it, and hence we'll get today's
> picture: reimplemented allocators in many drivers.
>
> >
> > You mentioned frame buffers, but what is the kernel supposed to do
> > with those allocation wise? If you have a frame buffer console, the
> > memory is allocated once and handed to the frame buffer driver.
> > Ie. you don't need a ton of on demand allocations for that and for
> > X, the memory management is handled in the X server, not by the
> > kernel.
>
> For video-only device this is true, but if device is a multifunctional,
> which is frequent case in embedded systems, then kernel must control of
> device memory allocation. Currently, however, even video cards for
> desktops become more and more multifunctional (VIVO/audio etc.).
>
> >
> > The only thing I think would make sense to implement is to allow it to
> > use indirect descriptor blocks for the memory it manages. This is not
> > because it's wrong to use the memory for the free list, as it will
> > only be used for this when the chunk is not in use, but because access
> > to certain types of memory isn't always valid through normal direct
> > access. Ie. if one used descriptor blocks residing in normal
> > GFP_KERNEL memory, it would be possible to use the allocator to manage
> > memory sitting on the other side of a PCI bus.
> I describe above, why we couldn't/wouldn't use onboard memory for
> allocator specific data.
>
> Pantelis, Am I answered to your question (...what are you trying to
> do...) too?
>

Yes. rheap seems to cover your cases...

> --
> Regards
> Andrey Volkov
> _______________________________________________
> Linuxppc-embedded mailing list
> [email protected]
> https://ozlabs.org/mailman/listinfo/linuxppc-embedded
>

Regards

Pantelis

2005-12-23 07:38:32

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator



Pantelis Antoniou wrote:
> On Thursday 22 December 2005 20:18, Andrey Volkov wrote:
>
>>Hi Jes,
>>
>>Jes Sorensen wrote:
>>
>>>>>>>>"Andrey" == Andrey Volkov <[email protected]> writes:
>>>
>>>
>>>Andrey> Hello Jes and all I try to use your allocator (gen_pool_xxx),
>>>Andrey> idea of which is a cute nice thing. But current implementation
>>>Andrey> of it is inappropriate for a _device_ (aka onchip, like
>>>Andrey> framebuffer) memory allocation, by next reasons:
>>>
>>>Andrey,
>>>
>>>Keep in mind that genalloc was meant to be simple for basic memory
>>>allocations. It was never meant to be an over complex super high
>>>performance allocation mechanism.
>>>
>>>Andrey> 1) Device memory is expensive resource by access time and/or
>>>Andrey> size cost. So we couldn't use (usually) this memory for the
>>>Andrey> free blocks lists.
>>>
>>>This really is irrelevant, the space is only used within the object
>>>when it's on the free list. Ie. if all memory is handed out there's
>>>no space used for this purpose.
>>
>>I point out 2 reasons: ACCESS TIME was first :), let take very
>>widespread case: PCI device with some onboard memory and any
>>N GHz proc. - result may be terrible: each access to device mem (which
>>usually uncached) will slowed down this super fast proc to 33 MHZ, i.e
>>same as we made busy-wait with disabled interrupts after each read/write...
>>
>>I possible awry when use 'control structures' in 2), I've in view
>>allocator's control structures (size/next etc), not device specific
>>control structs.
>>
>>
>>>Andrey> 3) Obvious (IMHO) workflow of mem. allocator
>>>Andrey> look like: - at startup time, driver allocate some big
>>>Andrey> (almost) static mem. chunk(s) for a control/data structures.
>>>Andrey> - during work of the device, driver allocate many small
>>>Andrey> mem. blocks with almost identical size. such behavior lead to
>>>Andrey> degeneration of buddy method and transform it to the
>>>Andrey> first/best fit method (with long seek by the free node list).
>>>
>>>This is only really valid for network devices, and even then it's not
>>>quite so. For things like uncached allocations your observation is
>>>completely off.
>>
>>Could you give me some examples? Possible I overlooked something
>>significant.
>>
>>
>>>For the case of more traditional devices, the control structures will
>>>be allocated from one end of the block, the rest will be used for
>>>packet descriptors which will be going in and out of the memory pool
>>>on a regular basis.
>>
>>This was main reason why I try to modify genalloc: I needed in
>>generic allocator for both short-live strictly aligned blocks and
>>long-live blocks with restriction by size.
>>
>>
>>>In most normal cases these will all be of the same
>>>size and it doesn't matter where in the memory space they were
>>>allocated.
>>
>>And thats also why I consider that 'buddy' is not appropriate to be
>>'generic' (most cases == generic, isn't is :)?): when you're allocate
>>mainly same sized blocks, 'buddy' degraded to the first-fit.
>>
>>Possible solution I see in mixed first-fit with lazy coalescent for
>>short lived blocks and first-fit with immediately coalescent for
>>long-lived blocks. But, again, I may overlook something significant.
>>And, certainly, I could overlooked someone else allocator implementation
>>in some driver.
>>
>>
>>>Andrey> 4) The simple binary buddy method is far away from perfect for
>>>Andrey> a device due to a big internal fragmentation. Especially for a
>>>Andrey> network/mfd devices, for which, size of allocated data very
>>>Andrey> often is not a power of 2.
>>>
>>
>>snip
>>
>>>Andrey> I start to modify your code to satisfy above demands, but
>>>Andrey> firstly I wish to know your, or somebody else, opinion.
>>>
>>>I honestly don't think the majority of your demands are valid.
>>>genalloc was meant to be simple, not an ultra fast at any random
>>>block size allocator. So far I don't see any reason for changing to
>>>the allocation algorithm into anything much more complex - doesn't
>>>mean there couldn't be a reason for doing so, but I don't think you
>>>have described any so far.
>>
>>I disagree here, generic couldn't be very simple and slow, because in
>>this case simply no one will be use it, and hence we'll get today's
>>picture: reimplemented allocators in many drivers.
>>
>>
>>>You mentioned frame buffers, but what is the kernel supposed to do
>>>with those allocation wise? If you have a frame buffer console, the
>>>memory is allocated once and handed to the frame buffer driver.
>>>Ie. you don't need a ton of on demand allocations for that and for
>>>X, the memory management is handled in the X server, not by the
>>>kernel.
>>
>>For video-only device this is true, but if device is a multifunctional,
>>which is frequent case in embedded systems, then kernel must control of
>>device memory allocation. Currently, however, even video cards for
>>desktops become more and more multifunctional (VIVO/audio etc.).
>>
>>
>>>The only thing I think would make sense to implement is to allow it to
>>>use indirect descriptor blocks for the memory it manages. This is not
>>>because it's wrong to use the memory for the free list, as it will
>>>only be used for this when the chunk is not in use, but because access
>>>to certain types of memory isn't always valid through normal direct
>>>access. Ie. if one used descriptor blocks residing in normal
>>>GFP_KERNEL memory, it would be possible to use the allocator to manage
>>>memory sitting on the other side of a PCI bus.
>>
>>I describe above, why we couldn't/wouldn't use onboard memory for
>>allocator specific data.
>>
>>Pantelis, Am I answered to your question (...what are you trying to
>>do...) too?
>>
>
>
> Yes. rheap seems to cover your cases...
>
Agreed, I can't see anything better as a basis for a generic device allocator.

So it would be much better if it were moved to lib/.

Does anyone have more comments on the subject?

--
Regards
Andrey Volkov

2005-12-23 07:52:29

by Pantelis Antoniou

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

>

[snip]

>
> Agreed, I can't see anything better as a basis for a generic device allocator.
>
> So it would be much better if it were moved to lib/.
>
> Does anyone have more comments on the subject?
>

Sure, but the call has to be made by a core developer.

Andrew?

Regards

Pantelis

2005-12-23 10:18:06

by Andrey Volkov

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator



Pantelis Antoniou wrote:
>>
>
> [snip]
>
>>
>> Agreed, I can't see anything better as a basis for a generic device
>> allocator.
>>
>> So it would be much better if it were moved to lib/.
>>
>> Does anyone have more comments on the subject?
>>
>
> Sure, but the call has to be made by a core developer.
>
> Andrew?

Pantelis, what do you think about renaming rheap.c and rh_xxx to
something like dev_xxx, since, for example, rh_alloc overlaps with
__rh_alloc (__RegionHash__alloc) in drivers/md/dm-raid1.c?


--
Regards
Andrey Volkov

2005-12-23 10:59:25

by Jes Sorensen

Subject: Re: [RFC] genalloc != generic DEVICE memory allocator

>>>>> "Andrey" == Andrey Volkov writes:

Andrey> Hi Jes,
Andrey> Jes Sorensen wrote:
>> This really is irrelevant, the space is only used within the
>> object when it's on the free list. Ie. if all memory is handed out
>> there's no space used for this purpose.

Andrey> I point out 2 reasons: ACCESS TIME was first :), let take very
Andrey> widespread case: PCI device with some onboard memory and any N
Andrey> GHz proc. - result may be terrible: each access to device mem
Andrey> (which usually uncached) will slowed down this super fast proc
Andrey> to 33 MHZ, i.e same as we made busy-wait with disabled
Andrey> interrupts after each read/write...

Andrey,

As I said in my response, you need the control blocks because you are
not allowed to directly access things on the other side of the PCI bus
without using the readl/writel equivalent macros. It's got nothing to
do with access speed.

>> For the case of more traditional devices, the control structures
>> will be allocated from one end of the block, the rest will be used
>> for packet descriptors which will be going in and out of the memory
>> pool on a regular basis.

Andrey> This was main reason why I try to modify genalloc: I needed in
Andrey> generic allocator for both short-live strictly aligned blocks
Andrey> and long-live blocks with restriction by size.

genalloc is perfectly adequate for that purpose. The long lived
allocations will just be taken out first, the rest will be used for
the short lived.

>> In most normal cases these will all be of the same size and it
>> doesn't matter where in the memory space they were allocated.

Andrey> And thats also why I consider that 'buddy' is not appropriate
Andrey> to be 'generic' (most cases == generic, isn't is :)?): when
Andrey> you're allocate mainly same sized blocks, 'buddy' degraded to
Andrey> the first-fit.

huh?

>> I honestly don't think the majority of your demands are valid.
>> genalloc was meant to be simple, not an ultra fast at any random
>> block size allocator. So far I don't see any reason for changing to
>> the allocation algorithm into anything much more complex - doesn't
>> mean there couldn't be a reason for doing so, but I don't think you
>> have described any so far.
Andrey> I disagree here, generic couldn't be very simple and slow,
Andrey> because in this case simply no one will be use it, and hence
Andrey> we'll get today's picture: reimplemented allocators in many
Andrey> drivers.

Of course it can. I will continue to claim that you are trying to turn
it into something it doesn't need to be. The allocator I used was
based on the allocator from the old sym2 driver, which is a perfect
example of it being used by a device driver.

>> You mentioned frame buffers, but what is the kernel supposed to do
>> with those allocation wise? If you have a frame buffer console, the
>> memory is allocated once and handed to the frame buffer driver.
>> Ie. you don't need a ton of on demand allocations for that and for
>> X, the memory management is handled in the X server, not by the
>> kernel.

Andrey> For video-only device this is true, but if device is a
Andrey> multifunctional, which is frequent case in embedded systems,
Andrey> then kernel must control of device memory
Andrey> allocation. Currently, however, even video cards for desktops
Andrey> become more and more multifunctional (VIVO/audio etc.).

For multi functional devices you still often split the memory up at
init time. Some memory is never going to be given back (like the frame
buffer itself), other blocks are like the network packet descriptors
in a network device.

>> The only thing I think would make sense to implement is to allow
>> it to use indirect descriptor blocks for the memory it
>> manages. This is not because it's wrong to use the memory for the
>> free list, as it will only be used for this when the chunk is not
>> in use, but because access to certain types of memory isn't always
>> valid through normal direct access. Ie. if one used descriptor
>> blocks residing in normal GFP_KERNEL memory, it would be possible
>> to use the allocator to manage memory sitting on the other side of
>> a PCI bus.
Andrey> I describe above, why we couldn't/wouldn't use onboard memory
Andrey> for allocator specific data.

As I pointed out, your description wasn't valid. You are not allowed
to directly dereference memory on the other side of a PCI bus.

Regards,
Jes
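
For illustration, the "indirect descriptor" idea discussed in this thread
(bookkeeping kept in normal host memory, the managed region itself never
dereferenced) can be sketched in plain userspace C. This is only a sketch
under stated assumptions, not genalloc or rheap code: every name here
(dev_pool, dev_pool_init, dev_pool_alloc, dev_pool_free, the 64-extent
limit) is hypothetical, the allocator is a simple first-fit with immediate
coalescing, and it hands out offsets rather than pointers so the caller
would still go through readl/writel-style accessors to touch the memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEV_POOL_MAX_EXTENTS 64            /* hypothetical limit for this sketch */
#define DEV_POOL_FAIL ((uint32_t)-1)       /* returned when nothing fits */

/* One free range of device memory, described entirely in host memory. */
struct dev_extent { uint32_t start; uint32_t len; };

struct dev_pool {
    struct dev_extent free[DEV_POOL_MAX_EXTENTS]; /* sorted by start offset */
    size_t nfree;
};

static void dev_pool_init(struct dev_pool *p, uint32_t base, uint32_t size)
{
    p->free[0].start = base;
    p->free[0].len = size;
    p->nfree = 1;
}

/* First-fit allocation; align must be a power of two. Returns an offset. */
static uint32_t dev_pool_alloc(struct dev_pool *p, uint32_t size, uint32_t align)
{
    assert(size > 0 && align > 0 && (align & (align - 1)) == 0);
    for (size_t i = 0; i < p->nfree; i++) {
        struct dev_extent *e = &p->free[i];
        uint32_t start = (e->start + align - 1) & ~(align - 1);
        uint32_t pad = start - e->start;    /* bytes skipped for alignment */
        if (pad > e->len || e->len - pad < size)
            continue;
        uint32_t tail = e->len - pad - size;
        if (pad && tail) {                  /* block carved from the middle */
            if (p->nfree == DEV_POOL_MAX_EXTENTS)
                return DEV_POOL_FAIL;
            memmove(e + 2, e + 1, (p->nfree - i - 1) * sizeof(*e));
            e[1].start = start + size;
            e[1].len = tail;
            e->len = pad;
            p->nfree++;
        } else if (pad) {                   /* only the front padding remains */
            e->len = pad;
        } else if (tail) {                  /* only the tail remains */
            e->start = start + size;
            e->len = tail;
        } else {                            /* extent consumed entirely */
            memmove(e, e + 1, (p->nfree - i - 1) * sizeof(*e));
            p->nfree--;
        }
        return start;
    }
    return DEV_POOL_FAIL;
}

/* Return a block to the pool, merging with adjacent free extents at once. */
static int dev_pool_free(struct dev_pool *p, uint32_t start, uint32_t size)
{
    size_t i = 0;
    while (i < p->nfree && p->free[i].start < start)
        i++;                                /* i: first extent after the block */
    int prev = i > 0 && p->free[i - 1].start + p->free[i - 1].len == start;
    int next = i < p->nfree && start + size == p->free[i].start;
    if (prev && next) {
        p->free[i - 1].len += size + p->free[i].len;
        memmove(&p->free[i], &p->free[i + 1],
                (p->nfree - i - 1) * sizeof(p->free[0]));
        p->nfree--;
    } else if (prev) {
        p->free[i - 1].len += size;
    } else if (next) {
        p->free[i].start = start;
        p->free[i].len += size;
    } else {
        if (p->nfree == DEV_POOL_MAX_EXTENTS)
            return -1;
        memmove(&p->free[i + 1], &p->free[i],
                (p->nfree - i) * sizeof(p->free[0]));
        p->free[i].start = start;
        p->free[i].len = size;
        p->nfree++;
    }
    return 0;
}
```

As in gen_pool_free, the caller passes the size back on free, so no
per-block header has to live inside the device memory. A real
implementation would also need locking and a policy for what happens
when the extent table fills up; both are left out of this sketch.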