2011-06-10 09:57:28

by Marek Szyprowski

[permalink] [raw]
Subject: [PATCHv10 0/10] Contiguous Memory Allocator

Hello everyone,

As I promised during the Memory Management summit at the Linaro Meeting
in Budapest, I have continued the development of CMA. The goal is to
integrate it as tightly as possible with other kernel subsystems (like
memory management and dma-mapping) and finally merge it into mainline.

This version introduces integration with DMA-mapping subsystem for ARM
architecture, but I believe that similar integration can be done for
other archs too. I've also rebased all the code onto latest v3.0-rc2
kernel.

A few words for those who are seeing CMA for the first time:

The Contiguous Memory Allocator (CMA) makes it possible for device
drivers to allocate big contiguous chunks of memory after the system
has booted.

The main difference from similar frameworks is that CMA allows the
memory region reserved for big chunk allocations to be transparently
reused as system memory, so no memory is wasted when no big chunk is
allocated. Once an allocation request is issued, the framework migrates
system pages to create the required big chunk of physically contiguous
memory.

For more information see the changelog and links to previous versions
of CMA framework.

The current version of CMA is just an allocator that handles allocation
of contiguous memory blocks. The differences between this patchset and
Kamezawa's alloc_contig_pages() are:

1. alloc_contig_pages() requires MAX_ORDER alignment of allocations,
which may be unsuitable for embedded systems where only a few MiBs are
required.

The lack of an alignment requirement means that several threads might
try to access the same pageblock/page. To prevent this from happening,
CMA uses a mutex so that only one cm_alloc()/cm_free() call may run at
a time.
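The serialization described above can be sketched with a userspace
analogue (a hypothetical illustration using pthreads; cm_alloc() and
cm_free() themselves are kernel functions, and the fake_* helpers below
are invented for this sketch):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace analogue of the cm_alloc()/cm_free()
 * serialization: a single mutex ensures only one allocate or free
 * operation manipulates the shared allocator state at a time. */
static pthread_mutex_t cma_mutex = PTHREAD_MUTEX_INITIALIZER;
static size_t bytes_in_use;	/* shared allocator state */

static int fake_cm_alloc(size_t size)
{
	pthread_mutex_lock(&cma_mutex);
	bytes_in_use += size;	/* stands in for migration + allocation */
	pthread_mutex_unlock(&cma_mutex);
	return 0;
}

static void fake_cm_free(size_t size)
{
	pthread_mutex_lock(&cma_mutex);
	bytes_in_use -= size;
	pthread_mutex_unlock(&cma_mutex);
}
```

With only one lock in play, any interleaving of these two calls leaves
the shared state consistent, which is all the coarse-grained mutex in
CMA is meant to guarantee.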

2. CMA may use its own migratetype (MIGRATE_CMA) which behaves
similarly to ZONE_MOVABLE but can be put in arbitrary places.

This is required for us since we need to define two disjoint memory
ranges inside system RAM (i.e. in two memory banks; do not confuse
these with nodes).

3. alloc_contig_pages() scans memory in search of a range that could
be migrated. CMA on the other hand maintains its own allocator to
decide where to allocate memory for device drivers and then tries
to migrate pages from that part if needed. This is not strictly
required but I somehow feel it might be faster.

The integration with the ARM DMA-mapping subsystem is quite
straightforward. Once a CMA context is available, alloc_pages() calls
can be replaced by cm_alloc() calls.

The current version has been tested on a Samsung S5PC110 based Aquila
machine and the s5p-fimc V4L2 driver. The driver itself uses the
videobuf2 dma-contig memory allocator, which in turn relies on
dma_alloc_coherent() from the DMA-mapping subsystem. By integrating CMA
with DMA-mapping we managed to get this driver working with CMA without
a single change required in the driver or the videobuf2-dma-contig
allocator.

TODO:
1. use struct page * or pfn internally instead of physical addresses
2. use some simple bitmap based allocator instead of genalloc
3. provide a function similar to dma_declare_coherent_memory(), which
will create and register a CMA area for a particular device
4. code cleanup and simplification
5. discussion
6. double-mapping issues with ARMv6+ and coherent memory

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


Links to previous versions of the patchset:
v9: <http://article.gmane.org/gmane.linux.kernel.mm/60787>
v8: <http://article.gmane.org/gmane.linux.kernel.mm/56855>
v7: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v6: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v5: (intentionally left out as CMA v5 was identical to CMA v4)
v4: <http://article.gmane.org/gmane.linux.kernel.mm/52010>
v3: <http://article.gmane.org/gmane.linux.kernel.mm/51573>
v2: <http://article.gmane.org/gmane.linux.kernel.mm/50986>
v1: <http://article.gmane.org/gmane.linux.kernel.mm/50669>


Changelog:

v10:
1. Rebased onto 3.0-rc2 and resolved all conflicts

2. Simplified CMA to be just a pure memory allocator, for use
with platform/bus specific subsystems, like dma-mapping.
Removed all device specific functions and calls.

3. Integrated with ARM DMA-mapping subsystem.

4. Code cleanup here and there.

5. Removed private context support.

v9: 1. Rebased onto 2.6.39-rc1 and resolved all conflicts

2. Fixed a bunch of nasty bugs that happened when the allocation
failed (mainly kernel oops due to NULL ptr dereference).

3. Introduced testing code: cma-regions compatibility layer and
videobuf2-cma memory allocator module.

v8: 1. The alloc_contig_range() function has now been separated from
CMA and put in page_allocator.c. This function tries to
migrate all LRU pages in specified range and then allocate the
range using alloc_contig_freed_pages().

2. Support for MIGRATE_CMA has been separated from the CMA code.
I have not tested if CMA works with ZONE_MOVABLE but I see no
reasons why it shouldn't.

3. I have added a @private argument when creating CMA contexts so
that one can reserve memory and not share it with the rest of
the system. This way, CMA acts only as allocation algorithm.

v7: 1. A lot of functionality that handled driver->allocator_context
mapping has been removed from the patchset. This is not to say
that this code is not needed, it's just not worth posting
everything in one patchset.

Currently, CMA is "just" an allocator. It uses its own
migratetype (MIGRATE_CMA) for defining ranges of pageblocks
which behave just like ZONE_MOVABLE but, unlike the latter, can
be put in arbitrary places.

2. The migration code that was introduced in the previous version
actually started working.


v6: 1. Most importantly, v6 introduces support for memory migration.
The implementation is not yet complete though.

Migration support means that when CMA is not using memory
reserved for it, page allocator can allocate pages from it.
When CMA wants to use the memory, the pages have to be moved
and/or evicted to make room for CMA.

To make it possible it must be guaranteed that only movable and
reclaimable pages are allocated in CMA controlled regions.
This is done by introducing a MIGRATE_CMA migrate type that
guarantees exactly that.

Some of the migration code is "borrowed" from Kamezawa
Hiroyuki's alloc_contig_pages() implementation. The main
difference is that, thanks to the MIGRATE_CMA migrate type, CMA
assumes that memory controlled by CMA is always movable or
reclaimable, so it makes allocation decisions regardless of
whether some pages are actually allocated and migrates them
if needed.

The most interesting patches from the patchset that implement
the functionality are:

09/13: mm: alloc_contig_free_pages() added
10/13: mm: MIGRATE_CMA migration type added
11/13: mm: MIGRATE_CMA isolation functions added
12/13: mm: cma: Migration support added [wip]

Currently, kernel panics in some situations which I am trying
to investigate.

2. cma_pin() and cma_unpin() functions have been added (after
a conversation with Johan Mossberg). The idea is that whenever
hardware does not use the memory (no transaction is in progress) the
chunk can be moved around. This would allow defragmentation to
be implemented if desired. No defragmentation algorithm is
provided at this time.

3. Sysfs support has been replaced with debugfs. I always felt
unsure about the sysfs interface and when Greg KH pointed it
out I finally got to rewrite it to debugfs.


v5: (intentionally left out as CMA v5 was identical to CMA v4)


v4: 1. The "asterisk" flag has been removed in favour of requiring
that platform will provide a "*=<regions>" rule in the map
attribute.

2. The terminology has been changed slightly, renaming "kind" to
"type" of memory. In the previous revisions, the documentation
indicated that device drivers define memory kinds; now they define
memory types.
v3: 1. The command line parameters have been removed (and moved to
a separate patch, the fourth one). As a consequence, the
cma_set_defaults() function has been changed -- it no longer
accepts a string with list of regions but an array of regions.

2. The "asterisk" attribute has been removed. Now, each region
has an "asterisk" flag which lets one specify whether this
region should by considered "asterisk" region.

3. SysFS support has been moved to a separate patch (the third one
in the series) and now also includes list of regions.

v2: 1. The "cma_map" command line have been removed. In exchange,
a SysFS entry has been created under kernel/mm/contiguous.

The intended way of specifying the attributes is
a cma_set_defaults() function called by platform initialisation
code. "regions" attribute (the string specified by "cma"
command line parameter) can be overwritten with command line
parameter; the other attributes can be changed during run-time
using the SysFS entries.

2. The behaviour of the "map" attribute has been modified
slightly. Currently, if no rule matches given device it is
assigned regions specified by the "asterisk" attribute. It is
by default built from the region names given in "regions"
attribute.

3. Devices can register private regions as well as regions that
can be shared but are not reserved using standard CMA
mechanisms. A private region has no name and can be accessed
only by devices that have the pointer to it.

4. The way allocators are registered has changed. Currently,
a cma_allocator_register() function is used for that purpose.
Moreover, allocators are attached to regions the first time
memory is registered from the region or when allocator is
registered which means that allocators can be dynamic modules
that are loaded after the kernel booted (of course, it won't be
possible to allocate a chunk of memory from a region if
allocator is not loaded).

5. Index of new functions:

+static inline dma_addr_t __must_check
+cma_alloc_from(const char *regions, size_t size,
+ dma_addr_t alignment)

+static inline int
+cma_info_about(struct cma_info *info, const char *regions)

+int __must_check cma_region_register(struct cma_region *reg);

+dma_addr_t __must_check
+cma_alloc_from_region(struct cma_region *reg,
+ size_t size, dma_addr_t alignment);

+static inline dma_addr_t __must_check
+cma_alloc_from(const char *regions,
+ size_t size, dma_addr_t alignment);

+int cma_allocator_register(struct cma_allocator *alloc);


Patches in this patchset:

lib: bitmap: Added alignment offset for bitmap_find_next_zero_area()
lib: genalloc: Generic allocator improvements

Some improvements to genalloc API (most importantly possibility to
allocate memory with alignment requirement).

mm: move some functions from memory_hotplug.c to page_isolation.c
mm: alloc_contig_freed_pages() added

Code "stolen" from Kamezawa. The first patch just moves code
around and the second provide function for "allocates" already
freed memory.

mm: alloc_contig_range() added

This is what Kamezawa asked for: a function that tries to migrate
all pages from a given range and then uses alloc_contig_freed_pages()
(defined by the previous commit) to allocate those pages.

mm: MIGRATE_CMA migration type added
mm: MIGRATE_CMA isolation functions added

Introduction of the new migratetype and support for it in CMA.
MIGRATE_CMA works similarly to ZONE_MOVABLE except that almost any
memory range can be marked as one.

mm: cma: Contiguous Memory Allocator added

The core CMA code. Manages CMA contexts and performs memory
allocations.

ARM: integrate CMA with dma-mapping subsystem

The main client of the CMA framework. CMA serves as an
alloc_pages() replacement if the device has a CMA context assigned.

ARM: S5PV210: add CMA support for FIMC devices on Aquila board

Example of platform/board specific code that creates a CMA
context and assigns it to particular devices.


Patch summary:

KAMEZAWA Hiroyuki (2):
mm: move some functions from memory_hotplug.c to page_isolation.c
mm: alloc_contig_freed_pages() added

Marek Szyprowski (3):
mm: cma: Contiguous Memory Allocator added
ARM: integrate CMA with dma-mapping subsystem
ARM: S5PV210: add CMA support for FIMC devices on Aquila board

Michal Nazarewicz (5):
lib: bitmap: Added alignment offset for bitmap_find_next_zero_area()
lib: genalloc: Generic allocator improvements
mm: alloc_contig_range() added
mm: MIGRATE_CMA migration type added
mm: MIGRATE_CMA isolation functions added

arch/arm/include/asm/device.h | 3 +
arch/arm/include/asm/dma-mapping.h | 19 ++
arch/arm/mach-s5pv210/Kconfig | 1 +
arch/arm/mach-s5pv210/mach-aquila.c | 26 +++
arch/arm/mm/dma-mapping.c | 60 +++++--
include/linux/bitmap.h | 24 ++-
include/linux/cma.h | 189 ++++++++++++++++++
include/linux/genalloc.h | 50 +++---
include/linux/mmzone.h | 43 ++++-
include/linux/page-isolation.h | 50 ++++--
lib/bitmap.c | 22 ++-
lib/genalloc.c | 190 +++++++++++--------
mm/Kconfig | 29 +++-
mm/Makefile | 1 +
mm/cma.c | 358 +++++++++++++++++++++++++++++++++++
mm/compaction.c | 10 +
mm/internal.h | 3 +
mm/memory_hotplug.c | 111 -----------
mm/page_alloc.c | 292 ++++++++++++++++++++++++++---
mm/page_isolation.c | 130 ++++++++++++-
20 files changed, 1319 insertions(+), 292 deletions(-)
create mode 100644 include/linux/cma.h
create mode 100644 mm/cma.c

--
1.7.1.569.g6f426


2011-06-10 09:55:14

by Marek Szyprowski

[permalink] [raw]
Subject: [PATCH 01/10] lib: bitmap: Added alignment offset for bitmap_find_next_zero_area()

From: Michal Nazarewicz <[email protected]>

This commit adds a bitmap_find_next_zero_area_off() function which
works like the bitmap_find_next_zero_area() function except that it
allows an offset to be specified when alignment is checked. This lets
the caller request a bit such that its number plus the offset is
aligned according to the mask.
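The alignment arithmetic at the heart of the patch can be sketched in
userspace (an illustrative reimplementation of the __ALIGN_MASK step,
not the kernel code itself): with align_mask one less than a power of
two, the returned bit index plus align_offset is a multiple of that
power of two.

```c
/* Userspace sketch of the alignment step used by
 * bitmap_find_next_zero_area_off(): round "index" up so that
 * index + align_offset is aligned according to align_mask
 * (align_mask must be one less than a power of two). */
#define __ALIGN_MASK(x, mask) (((x) + (mask)) & ~(unsigned long)(mask))

static unsigned long align_with_offset(unsigned long index,
				       unsigned long align_mask,
				       unsigned long align_offset)
{
	return __ALIGN_MASK(index + align_offset, align_mask) - align_offset;
}
```

For example, with align_mask = 3 (4-bit alignment) and align_offset = 1,
a search starting at bit 0 lands on bit 3, since 3 + 1 = 4 is a
multiple of 4; with align_offset = 0 this degenerates to the old
behaviour of bitmap_find_next_zero_area().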

Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/bitmap.h | 24 +++++++++++++++++++-----
lib/bitmap.c | 22 ++++++++++++----------
2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index dcafe0b..50e2c16 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -45,6 +45,7 @@
* bitmap_set(dst, pos, nbits) Set specified bit area
* bitmap_clear(dst, pos, nbits) Clear specified bit area
* bitmap_find_next_zero_area(buf, len, pos, n, mask) Find bit free area
+ * bitmap_find_next_zero_area_off(buf, len, pos, n, mask) as above
* bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n
* bitmap_shift_left(dst, src, n, nbits) *dst = *src << n
* bitmap_remap(dst, src, old, new, nbits) *dst = map(old, new)(src)
@@ -114,11 +115,24 @@ extern int __bitmap_weight(const unsigned long *bitmap, int bits);

extern void bitmap_set(unsigned long *map, int i, int len);
extern void bitmap_clear(unsigned long *map, int start, int nr);
-extern unsigned long bitmap_find_next_zero_area(unsigned long *map,
- unsigned long size,
- unsigned long start,
- unsigned int nr,
- unsigned long align_mask);
+
+extern unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
+ unsigned long size,
+ unsigned long start,
+ unsigned int nr,
+ unsigned long align_mask,
+ unsigned long align_offset);
+
+static inline unsigned long
+bitmap_find_next_zero_area(unsigned long *map,
+ unsigned long size,
+ unsigned long start,
+ unsigned int nr,
+ unsigned long align_mask)
+{
+ return bitmap_find_next_zero_area_off(map, size, start, nr,
+ align_mask, 0);
+}

extern int bitmap_scnprintf(char *buf, unsigned int len,
const unsigned long *src, int nbits);
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 41baf02..bad4f20 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -315,30 +315,32 @@ void bitmap_clear(unsigned long *map, int start, int nr)
}
EXPORT_SYMBOL(bitmap_clear);

-/*
+/**
* bitmap_find_next_zero_area - find a contiguous aligned zero area
* @map: The address to base the search on
* @size: The bitmap size in bits
* @start: The bitnumber to start searching at
* @nr: The number of zeroed bits we're looking for
* @align_mask: Alignment mask for zero area
+ * @align_offset: Alignment offset for zero area.
*
* The @align_mask should be one less than a power of 2; the effect is that
- * the bit offset of all zero areas this function finds is multiples of that
- * power of 2. A @align_mask of 0 means no alignment is required.
+ * the bit offset of all zero areas this function finds plus @align_offset
+ * is multiple of that power of 2.
*/
-unsigned long bitmap_find_next_zero_area(unsigned long *map,
- unsigned long size,
- unsigned long start,
- unsigned int nr,
- unsigned long align_mask)
+unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
+ unsigned long size,
+ unsigned long start,
+ unsigned int nr,
+ unsigned long align_mask,
+ unsigned long align_offset)
{
unsigned long index, end, i;
again:
index = find_next_zero_bit(map, size, start);

/* Align allocation */
- index = __ALIGN_MASK(index, align_mask);
+ index = __ALIGN_MASK(index + align_offset, align_mask) - align_offset;

end = index + nr;
if (end > size)
@@ -350,7 +352,7 @@ again:
}
return index;
}
-EXPORT_SYMBOL(bitmap_find_next_zero_area);
+EXPORT_SYMBOL(bitmap_find_next_zero_area_off);

/*
* Bitmap printing & parsing functions: first version by Bill Irwin,
--
1.7.1.569.g6f426

2011-06-10 09:55:17

by Marek Szyprowski

[permalink] [raw]
Subject: [PATCH 02/10] lib: genalloc: Generic allocator improvements

From: Michal Nazarewicz <[email protected]>

This commit adds a gen_pool_alloc_aligned() function to the
generic allocator API. It allows specifying alignment for the
allocated block. This feature uses
the bitmap_find_next_zero_area_off() function.

It also fixes a possible issue with the bitmap's last element not
being fully allocated (i.e. the space allocated for chunk->bits is
not a multiple of sizeof(long)).

It also makes some other smaller changes:
- moves structure definitions out of the header file,
- adds __must_check to functions returning a value,
- makes gen_pool_add() return -ENOMEM rather than -1 on error,
- changes list_for_each to list_for_each_entry, and
- makes use of bitmap_clear().
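The alignment_order semantics can be illustrated with the mask and
size conversions the allocator performs internally (a userspace sketch
under the assumptions visible in the patch below; pool_order is the
log2 of the bytes each bitmap bit represents):

```c
#include <stddef.h>

/* Sketch of how gen_pool_alloc_aligned() translates a byte-level
 * alignment request into a bitmap alignment mask. An alignment_order
 * of 20 means the allocated space must be aligned to 1 MiB. */
static unsigned long bitmap_align_mask(unsigned pool_order,
				       unsigned alignment_order)
{
	if (alignment_order > pool_order)
		return (1UL << (alignment_order - pool_order)) - 1;
	return 0;	/* bitmap granularity already satisfies it */
}

/* Number of bitmap bits needed for a byte count, rounded up. */
static size_t bytes_to_bits(size_t size, unsigned pool_order)
{
	return (size + (1UL << pool_order) - 1) >> pool_order;
}
```

With a pool_order of 12 (4 KiB per bit), a 1 MiB alignment request
(alignment_order = 20) becomes a 255-bit alignment mask on the bitmap,
matching the computation in gen_pool_alloc_aligned() below.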

Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
[m.szyprowski: rebased and updated to Linux v3.0-rc1]
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/genalloc.h | 50 ++++++------
lib/genalloc.c | 190 +++++++++++++++++++++++++++-------------------
2 files changed, 138 insertions(+), 102 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 5bbebda..af44e88 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -11,28 +11,11 @@

#ifndef __GENALLOC_H__
#define __GENALLOC_H__
-/*
- * General purpose special memory pool descriptor.
- */
-struct gen_pool {
- rwlock_t lock;
- struct list_head chunks; /* list of chunks in this pool */
- int min_alloc_order; /* minimum allocation order */
-};

-/*
- * General purpose special memory pool chunk descriptor.
- */
-struct gen_pool_chunk {
- spinlock_t lock;
- struct list_head next_chunk; /* next chunk in pool */
- phys_addr_t phys_addr; /* physical starting address of memory chunk */
- unsigned long start_addr; /* starting address of memory chunk */
- unsigned long end_addr; /* ending address of memory chunk */
- unsigned long bits[0]; /* bitmap for allocating memory chunk */
-};
-
-extern struct gen_pool *gen_pool_create(int, int);
+struct gen_pool;
+
+struct gen_pool *__must_check gen_pool_create(unsigned order, int nid);
+
extern phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long);
extern int gen_pool_add_virt(struct gen_pool *, unsigned long, phys_addr_t,
size_t, int);
@@ -53,7 +36,26 @@ static inline int gen_pool_add(struct gen_pool *pool, unsigned long addr,
{
return gen_pool_add_virt(pool, addr, -1, size, nid);
}
-extern void gen_pool_destroy(struct gen_pool *);
-extern unsigned long gen_pool_alloc(struct gen_pool *, size_t);
-extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+void gen_pool_destroy(struct gen_pool *pool);
+
+unsigned long __must_check
+gen_pool_alloc_aligned(struct gen_pool *pool, size_t size,
+ unsigned alignment_order);
+
+/**
+ * gen_pool_alloc() - allocate special memory from the pool
+ * @pool: Pool to allocate from.
+ * @size: Number of bytes to allocate from the pool.
+ *
+ * Allocate the requested number of bytes from the specified pool.
+ * Uses a first-fit algorithm.
+ */
+static inline unsigned long __must_check
+gen_pool_alloc(struct gen_pool *pool, size_t size)
+{
+ return gen_pool_alloc_aligned(pool, size, 0);
+}
+
+void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size);
#endif /* __GENALLOC_H__ */
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 577ddf8..b41dd90 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -16,23 +16,46 @@
#include <linux/genalloc.h>


+/* General purpose special memory pool descriptor. */
+struct gen_pool {
+ rwlock_t lock; /* protects chunks list */
+ struct list_head chunks; /* list of chunks in this pool */
+ unsigned order; /* minimum allocation order */
+};
+
+/* General purpose special memory pool chunk descriptor. */
+struct gen_pool_chunk {
+ spinlock_t lock; /* protects bits */
+ struct list_head next_chunk; /* next chunk in pool */
+ phys_addr_t phys_addr; /* physical starting address of memory chunk */
+ unsigned long start; /* start of memory chunk */
+ unsigned long size; /* number of bits */
+ unsigned long bits[0]; /* bitmap for allocating memory chunk */
+};
+
+
/**
- * gen_pool_create - create a new special memory pool
- * @min_alloc_order: log base 2 of number of bytes each bitmap bit represents
- * @nid: node id of the node the pool structure should be allocated on, or -1
+ * gen_pool_create() - create a new special memory pool
+ * @order: Log base 2 of number of bytes each bitmap bit
+ * represents.
+ * @nid: Node id of the node the pool structure should be allocated
+ * on, or -1. This will be also used for other allocations.
*
* Create a new special memory pool that can be used to manage special purpose
* memory not managed by the regular kmalloc/kfree interface.
*/
-struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
+struct gen_pool *__must_check gen_pool_create(unsigned order, int nid)
{
struct gen_pool *pool;

- pool = kmalloc_node(sizeof(struct gen_pool), GFP_KERNEL, nid);
- if (pool != NULL) {
+ if (WARN_ON(order >= BITS_PER_LONG))
+ return NULL;
+
+ pool = kmalloc_node(sizeof *pool, GFP_KERNEL, nid);
+ if (pool) {
rwlock_init(&pool->lock);
INIT_LIST_HEAD(&pool->chunks);
- pool->min_alloc_order = min_alloc_order;
+ pool->order = order;
}
return pool;
}
@@ -40,33 +63,41 @@ EXPORT_SYMBOL(gen_pool_create);

/**
* gen_pool_add_virt - add a new chunk of special memory to the pool
- * @pool: pool to add new memory chunk to
- * @virt: virtual starting address of memory chunk to add to pool
- * @phys: physical starting address of memory chunk to add to pool
- * @size: size in bytes of the memory chunk to add to pool
- * @nid: node id of the node the chunk structure and bitmap should be
- * allocated on, or -1
+ * @pool: Pool to add new memory chunk to
+ * @virt: Virtual starting address of memory chunk to add to pool
+ * @phys: Physical starting address of memory chunk to add to pool
+ * @size: Size in bytes of the memory chunk to add to pool
+ * @nid: Node id of the node the chunk structure and bitmap should be
+ * allocated on, or -1
*
* Add a new chunk of special memory to the specified pool.
*
* Returns 0 on success or a -ve errno on failure.
*/
-int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phys,
- size_t size, int nid)
+int __must_check
+gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phys,
+ size_t size, int nid)
{
struct gen_pool_chunk *chunk;
- int nbits = size >> pool->min_alloc_order;
- int nbytes = sizeof(struct gen_pool_chunk) +
- (nbits + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
+ size_t nbytes;

- chunk = kmalloc_node(nbytes, GFP_KERNEL | __GFP_ZERO, nid);
- if (unlikely(chunk == NULL))
+ if (WARN_ON(!virt || virt + size < virt ||
+ (virt & ((1 << pool->order) - 1))))
+ return -EINVAL;
+
+ size = size >> pool->order;
+ if (WARN_ON(!size))
+ return -EINVAL;
+
+ nbytes = sizeof *chunk + BITS_TO_LONGS(size) * sizeof *chunk->bits;
+ chunk = kzalloc_node(nbytes, GFP_KERNEL, nid);
+ if (!chunk)
return -ENOMEM;

spin_lock_init(&chunk->lock);
+ chunk->start = virt >> pool->order;
+ chunk->size = size;
chunk->phys_addr = phys;
- chunk->start_addr = virt;
- chunk->end_addr = virt + size;

write_lock(&pool->lock);
list_add(&chunk->next_chunk, &pool->chunks);
@@ -90,10 +121,12 @@ phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long addr)

read_lock(&pool->lock);
list_for_each(_chunk, &pool->chunks) {
+ unsigned long start_addr;
chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);

- if (addr >= chunk->start_addr && addr < chunk->end_addr)
- return chunk->phys_addr + addr - chunk->start_addr;
+ start_addr = chunk->start << pool->order;
+ if (addr >= start_addr && addr < start_addr + chunk->size)
+ return chunk->phys_addr + addr - start_addr;
}
read_unlock(&pool->lock);

@@ -102,115 +135,116 @@ phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long addr)
EXPORT_SYMBOL(gen_pool_virt_to_phys);

/**
- * gen_pool_destroy - destroy a special memory pool
- * @pool: pool to destroy
+ * gen_pool_destroy() - destroy a special memory pool
+ * @pool: Pool to destroy.
*
* Destroy the specified special memory pool. Verifies that there are no
* outstanding allocations.
*/
void gen_pool_destroy(struct gen_pool *pool)
{
- struct list_head *_chunk, *_next_chunk;
struct gen_pool_chunk *chunk;
- int order = pool->min_alloc_order;
- int bit, end_bit;
-
+ int bit;

- list_for_each_safe(_chunk, _next_chunk, &pool->chunks) {
- chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+ while (!list_empty(&pool->chunks)) {
+ chunk = list_entry(pool->chunks.next, struct gen_pool_chunk,
+ next_chunk);
list_del(&chunk->next_chunk);

- end_bit = (chunk->end_addr - chunk->start_addr) >> order;
- bit = find_next_bit(chunk->bits, end_bit, 0);
- BUG_ON(bit < end_bit);
+ bit = find_next_bit(chunk->bits, chunk->size, 0);
+ BUG_ON(bit < chunk->size);

kfree(chunk);
}
kfree(pool);
- return;
}
EXPORT_SYMBOL(gen_pool_destroy);

/**
- * gen_pool_alloc - allocate special memory from the pool
- * @pool: pool to allocate from
- * @size: number of bytes to allocate from the pool
+ * gen_pool_alloc_aligned() - allocate special memory from the pool
+ * @pool: Pool to allocate from.
+ * @size: Number of bytes to allocate from the pool.
+ * @alignment_order: Order the allocated space should be
+ * aligned to (eg. 20 means allocated space
+ * must be aligned to 1MiB).
*
* Allocate the requested number of bytes from the specified pool.
* Uses a first-fit algorithm.
*/
-unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
+unsigned long __must_check
+gen_pool_alloc_aligned(struct gen_pool *pool, size_t size,
+ unsigned alignment_order)
{
- struct list_head *_chunk;
+ unsigned long addr, align_mask = 0, flags, start;
struct gen_pool_chunk *chunk;
- unsigned long addr, flags;
- int order = pool->min_alloc_order;
- int nbits, start_bit, end_bit;

if (size == 0)
return 0;

- nbits = (size + (1UL << order) - 1) >> order;
+ if (alignment_order > pool->order)
+ align_mask = (1 << (alignment_order - pool->order)) - 1;

- read_lock(&pool->lock);
- list_for_each(_chunk, &pool->chunks) {
- chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+ size = (size + (1UL << pool->order) - 1) >> pool->order;

- end_bit = (chunk->end_addr - chunk->start_addr) >> order;
+ read_lock(&pool->lock);
+ list_for_each_entry(chunk, &pool->chunks, next_chunk) {
+ if (chunk->size < size)
+ continue;

spin_lock_irqsave(&chunk->lock, flags);
- start_bit = bitmap_find_next_zero_area(chunk->bits, end_bit, 0,
- nbits, 0);
- if (start_bit >= end_bit) {
+ start = bitmap_find_next_zero_area_off(chunk->bits, chunk->size,
+ 0, size, align_mask,
+ chunk->start);
+ if (start >= chunk->size) {
spin_unlock_irqrestore(&chunk->lock, flags);
continue;
}

- addr = chunk->start_addr + ((unsigned long)start_bit << order);
-
- bitmap_set(chunk->bits, start_bit, nbits);
+ bitmap_set(chunk->bits, start, size);
spin_unlock_irqrestore(&chunk->lock, flags);
- read_unlock(&pool->lock);
- return addr;
+ addr = (chunk->start + start) << pool->order;
+ goto done;
}
+
+ addr = 0;
+done:
read_unlock(&pool->lock);
- return 0;
+ return addr;
}
-EXPORT_SYMBOL(gen_pool_alloc);
+EXPORT_SYMBOL(gen_pool_alloc_aligned);

/**
- * gen_pool_free - free allocated special memory back to the pool
- * @pool: pool to free to
- * @addr: starting address of memory to free back to pool
- * @size: size in bytes of memory to free
+ * gen_pool_free() - free allocated special memory back to the pool
+ * @pool: Pool to free to.
+ * @addr: Starting address of memory to free back to pool.
+ * @size: Size in bytes of memory to free.
*
* Free previously allocated special memory back to the specified pool.
*/
void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
{
- struct list_head *_chunk;
struct gen_pool_chunk *chunk;
unsigned long flags;
- int order = pool->min_alloc_order;
- int bit, nbits;

- nbits = (size + (1UL << order) - 1) >> order;
+ if (!size)
+ return;

- read_lock(&pool->lock);
- list_for_each(_chunk, &pool->chunks) {
- chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+ addr = addr >> pool->order;
+ size = (size + (1UL << pool->order) - 1) >> pool->order;

- if (addr >= chunk->start_addr && addr < chunk->end_addr) {
- BUG_ON(addr + size > chunk->end_addr);
+ BUG_ON(addr + size < addr);
+
+ read_lock(&pool->lock);
+ list_for_each_entry(chunk, &pool->chunks, next_chunk)
+ if (addr >= chunk->start &&
+ addr + size <= chunk->start + chunk->size) {
spin_lock_irqsave(&chunk->lock, flags);
- bit = (addr - chunk->start_addr) >> order;
- while (nbits--)
- __clear_bit(bit++, chunk->bits);
+ bitmap_clear(chunk->bits, addr - chunk->start, size);
spin_unlock_irqrestore(&chunk->lock, flags);
- break;
+ goto done;
}
- }
- BUG_ON(nbits > 0);
+ BUG_ON(1);
+done:
read_unlock(&pool->lock);
}
EXPORT_SYMBOL(gen_pool_free);
--
1.7.1.569.g6f426

2011-06-10 09:57:30

by Marek Szyprowski

[permalink] [raw]
Subject: [PATCH 03/10] mm: move some functions from memory_hotplug.c to page_isolation.c

From: KAMEZAWA Hiroyuki <[email protected]>

Memory hotplug contains logic for making pages unused in a specified
range of pfns, so some of its core logic can be reused for other
purposes, such as allocating a very large contiguous memory block.

This patch moves some functions from mm/memory_hotplug.c to
mm/page_isolation.c. This helps add a function for large allocations
in page_isolation.c using the memory-unplug technique.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
[m.nazarewicz: reworded commit message]
Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
[m.szyprowski: rebased and updated to Linux v3.0-rc1]
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/page-isolation.h | 7 +++
mm/memory_hotplug.c | 111 --------------------------------------
mm/page_isolation.c | 115 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 122 insertions(+), 111 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 051c1b1..58cdbac 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,5 +33,12 @@ test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page);

+/*
+ * For migration.
+ */
+
+int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn);
+unsigned long scan_lru_pages(unsigned long start, unsigned long end);
+int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);

#endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9f64637..aee6dc0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -637,117 +637,6 @@ int is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
}

/*
- * Confirm all pages in a range [start, end) is belongs to the same zone.
- */
-static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
- struct zone *zone = NULL;
- struct page *page;
- int i;
- for (pfn = start_pfn;
- pfn < end_pfn;
- pfn += MAX_ORDER_NR_PAGES) {
- i = 0;
- /* This is just a CONFIG_HOLES_IN_ZONE check.*/
- while ((i < MAX_ORDER_NR_PAGES) && !pfn_valid_within(pfn + i))
- i++;
- if (i == MAX_ORDER_NR_PAGES)
- continue;
- page = pfn_to_page(pfn + i);
- if (zone && page_zone(page) != zone)
- return 0;
- zone = page_zone(page);
- }
- return 1;
-}
-
-/*
- * Scanning pfn is much easier than scanning lru list.
- * Scan pfn from start to end and Find LRU page.
- */
-static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
-{
- unsigned long pfn;
- struct page *page;
- for (pfn = start; pfn < end; pfn++) {
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageLRU(page))
- return pfn;
- }
- }
- return 0;
-}
-
-static struct page *
-hotremove_migrate_alloc(struct page *page, unsigned long private, int **x)
-{
- /* This should be improooooved!! */
- return alloc_page(GFP_HIGHUSER_MOVABLE);
-}
-
-#define NR_OFFLINE_AT_ONCE_PAGES (256)
-static int
-do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
- struct page *page;
- int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
- int not_managed = 0;
- int ret = 0;
- LIST_HEAD(source);
-
- for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- if (!get_page_unless_zero(page))
- continue;
- /*
- * We can skip free pages. And we can only deal with pages on
- * LRU.
- */
- ret = isolate_lru_page(page);
- if (!ret) { /* Success */
- put_page(page);
- list_add_tail(&page->lru, &source);
- move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
-
- } else {
-#ifdef CONFIG_DEBUG_VM
- printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
- pfn);
- dump_page(page);
-#endif
- put_page(page);
- /* Because we don't have big zone->lock. we should
- check this again here. */
- if (page_count(page)) {
- not_managed++;
- ret = -EBUSY;
- break;
- }
- }
- }
- if (!list_empty(&source)) {
- if (not_managed) {
- putback_lru_pages(&source);
- goto out;
- }
- /* this function returns # of failed pages */
- ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
- true, true);
- if (ret)
- putback_lru_pages(&source);
- }
-out:
- return ret;
-}
-
-/*
* remove from free_area[] and mark all as Reserved.
*/
static int
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 4ae42bb..15b41ec 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -5,6 +5,9 @@
#include <linux/mm.h>
#include <linux/page-isolation.h>
#include <linux/pageblock-flags.h>
+#include <linux/memcontrol.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
#include "internal.h"

static inline struct page *
@@ -139,3 +142,115 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
spin_unlock_irqrestore(&zone->lock, flags);
return ret ? 0 : -EBUSY;
}
+
+
+/*
+ * Confirm all pages in a range [start, end) is belongs to the same zone.
+ */
+int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct zone *zone = NULL;
+ struct page *page;
+ int i;
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn += MAX_ORDER_NR_PAGES) {
+ i = 0;
+ /* This is just a CONFIG_HOLES_IN_ZONE check.*/
+ while ((i < MAX_ORDER_NR_PAGES) && !pfn_valid_within(pfn + i))
+ i++;
+ if (i == MAX_ORDER_NR_PAGES)
+ continue;
+ page = pfn_to_page(pfn + i);
+ if (zone && page_zone(page) != zone)
+ return 0;
+ zone = page_zone(page);
+ }
+ return 1;
+}
+
+/*
+ * Scanning pfn is much easier than scanning lru list.
+ * Scan pfn from start to end and Find LRU page.
+ */
+unsigned long scan_lru_pages(unsigned long start, unsigned long end)
+{
+ unsigned long pfn;
+ struct page *page;
+ for (pfn = start; pfn < end; pfn++) {
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageLRU(page))
+ return pfn;
+ }
+ }
+ return 0;
+}
+
+struct page *
+hotremove_migrate_alloc(struct page *page, unsigned long private, int **x)
+{
+ /* This should be improooooved!! */
+ return alloc_page(GFP_HIGHUSER_MOVABLE);
+}
+
+#define NR_OFFLINE_AT_ONCE_PAGES (256)
+int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+ int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
+ int not_managed = 0;
+ int ret = 0;
+ LIST_HEAD(source);
+
+ for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ if (!get_page_unless_zero(page))
+ continue;
+ /*
+ * We can skip free pages. And we can only deal with pages on
+ * LRU.
+ */
+ ret = isolate_lru_page(page);
+ if (!ret) { /* Success */
+ put_page(page);
+ list_add_tail(&page->lru, &source);
+ move_pages--;
+ inc_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+
+ } else {
+#ifdef CONFIG_DEBUG_VM
+ printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
+ pfn);
+ dump_page(page);
+#endif
+ put_page(page);
+ /* Because we don't have big zone->lock. we should
+ check this again here. */
+ if (page_count(page)) {
+ not_managed++;
+ ret = -EBUSY;
+ break;
+ }
+ }
+ }
+ if (!list_empty(&source)) {
+ if (not_managed) {
+ putback_lru_pages(&source);
+ goto out;
+ }
+ /* this function returns # of failed pages */
+ ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
+ true, true);
+ if (ret)
+ putback_lru_pages(&source);
+ }
+out:
+ return ret;
+}
+
--
1.7.1.569.g6f426

2011-06-10 09:55:09

by Marek Szyprowski

Subject: [PATCH 04/10] mm: alloc_contig_freed_pages() added

From: KAMEZAWA Hiroyuki <[email protected]>

This commit introduces the alloc_contig_freed_pages() function,
which allocates (i.e. removes from the buddy system) free pages
in a range. The caller has to guarantee that all pages in the
range are in the buddy system.

Along with this function, a free_contig_pages() function is
provided which frees all (or a subset of) pages allocated
with alloc_contig_freed_pages().

Michal Nazarewicz has modified the function to make it easier
to allocate ranges that are not MAX_ORDER_NR_PAGES aligned, by
making it return the PFN of one past the last allocated page.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/page-isolation.h | 3 ++
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 58cdbac..f1417ed 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -32,6 +32,9 @@ test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
*/
extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page);
+extern unsigned long alloc_contig_freed_pages(unsigned long start,
+ unsigned long end, gfp_t flag);
+extern void free_contig_pages(struct page *page, int nr_pages);

/*
* For migration.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..00e9b24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5600,6 +5600,50 @@ out:
spin_unlock_irqrestore(&zone->lock, flags);
}

+unsigned long alloc_contig_freed_pages(unsigned long start, unsigned long end,
+ gfp_t flag)
+{
+ unsigned long pfn = start, count;
+ struct page *page;
+ struct zone *zone;
+ int order;
+
+ VM_BUG_ON(!pfn_valid(start));
+ zone = page_zone(pfn_to_page(start));
+
+ spin_lock_irq(&zone->lock);
+
+ page = pfn_to_page(pfn);
+ for (;;) {
+ VM_BUG_ON(page_count(page) || !PageBuddy(page));
+ list_del(&page->lru);
+ order = page_order(page);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+ pfn += 1 << order;
+ if (pfn >= end)
+ break;
+ VM_BUG_ON(!pfn_valid(pfn));
+ page += 1 << order;
+ }
+
+ spin_unlock_irq(&zone->lock);
+
+ /* After this, pages in the range can be freed one by one */
+ page = pfn_to_page(start);
+ for (count = pfn - start; count; --count, ++page)
+ prep_new_page(page, 0, flag);
+
+ return pfn;
+}
+
+void free_contig_pages(struct page *page, int nr_pages)
+{
+ for (; nr_pages; --nr_pages, ++page)
+ __free_page(page);
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be isolated before calling this.
--
1.7.1.569.g6f426

2011-06-10 09:56:46

by Marek Szyprowski

Subject: [PATCH 05/10] mm: alloc_contig_range() added

From: Michal Nazarewicz <[email protected]>

This commit adds the alloc_contig_range() function, which tries
to allocate the given range of pages. It tries to migrate all
already-allocated pages that fall in the range, thus freeing them.
Once all pages in the range are freed, they are removed from the
buddy system and thus allocated for the caller to use.

Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
[m.szyprowski: renamed some variables for easier code reading]
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/page-isolation.h | 2 +
mm/page_alloc.c | 144 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 146 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index f1417ed..c5d1a7c 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,6 +34,8 @@ extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page);
extern unsigned long alloc_contig_freed_pages(unsigned long start,
unsigned long end, gfp_t flag);
+extern int alloc_contig_range(unsigned long start, unsigned long end,
+ gfp_t flags);
extern void free_contig_pages(struct page *page, int nr_pages);

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00e9b24..2cea044 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5638,6 +5638,150 @@ unsigned long alloc_contig_freed_pages(unsigned long start, unsigned long end,
return pfn;
}

+static unsigned long pfn_to_maxpage(unsigned long pfn)
+{
+ return pfn & ~(MAX_ORDER_NR_PAGES - 1);
+}
+
+static unsigned long pfn_to_maxpage_up(unsigned long pfn)
+{
+ return ALIGN(pfn, MAX_ORDER_NR_PAGES);
+}
+
+#define MIGRATION_RETRY 5
+static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
+{
+ int migration_failed = 0, ret;
+ unsigned long pfn = start;
+
+ /*
+ * Some code "borrowed" from KAMEZAWA Hiroyuki's
+ * __alloc_contig_pages().
+ */
+
+ for (;;) {
+ pfn = scan_lru_pages(pfn, end);
+ if (!pfn || pfn >= end)
+ break;
+
+ ret = do_migrate_range(pfn, end);
+ if (!ret) {
+ migration_failed = 0;
+ } else if (ret != -EBUSY
+ || ++migration_failed >= MIGRATION_RETRY) {
+ return ret;
+ } else {
+ /* There are unstable pages on the pagevec. */
+ lru_add_drain_all();
+ /*
+ * there may be pages on pcplist before
+ * we mark the range as ISOLATED.
+ */
+ drain_all_pages();
+ }
+ cond_resched();
+ }
+
+ if (!migration_failed) {
+ /* drop all pages in pagevec and pcp list */
+ lru_add_drain_all();
+ drain_all_pages();
+ }
+
+ /* Make sure all pages are isolated */
+ if (WARN_ON(test_pages_isolated(start, end)))
+ return -EBUSY;
+
+ return 0;
+}
+
+/**
+ * alloc_contig_range() -- tries to allocate given range of pages
+ * @start: start PFN to allocate
+ * @end: one-past-the-last PFN to allocate
+ * @flags: flags passed to alloc_contig_freed_pages().
+ *
+ * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
+ * aligned; however, it is the caller's responsibility to guarantee that we
+ * are the only thread that changes migrate type of pageblocks the
+ * pages fall in.
+ *
+ * Returns zero on success or negative error code. On success all
+ * pages whose PFN is in [start, end) are allocated for the caller and
+ * need to be freed with free_contig_pages().
+ */
+int alloc_contig_range(unsigned long start, unsigned long end,
+ gfp_t flags)
+{
+ unsigned long outer_start, outer_end;
+ int ret;
+
+ /*
+ * What we do here is we mark all pageblocks in range as
+ * MIGRATE_ISOLATE. Because of the way the page allocator works, we
+ * align the range to MAX_ORDER pages so that page allocator
+ * won't try to merge buddies from different pageblocks and
+ * change MIGRATE_ISOLATE to some other migration type.
+ *
+ * Once the pageblocks are marked as MIGRATE_ISOLATE, we
+ * migrate the pages from an unaligned range (ie. pages that
+ * we are interested in). This will put all the pages in
+ * range back to page allocator as MIGRATE_ISOLATE.
+ *
+ * When this is done, we take the pages in range from page
+ * allocator removing them from the buddy system. This way
+ * page allocator will never consider using them.
+ *
+ * This lets us mark the pageblocks back as
+ * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
+ * MAX_ORDER aligned range but not in the unaligned, original
+ * range are put back to page allocator so that buddy can use
+ * them.
+ */
+
+ ret = start_isolate_page_range(pfn_to_maxpage(start),
+ pfn_to_maxpage_up(end));
+ if (ret)
+ goto done;
+
+ ret = __alloc_contig_migrate_range(start, end);
+ if (ret)
+ goto done;
+
+ /*
+ * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
+ * aligned blocks that are marked as MIGRATE_ISOLATE. What's
+ * more, all pages in [start, end) are free in page allocator.
+ * What we are going to do is to allocate all pages from
+ * [start, end) (that is, remove them from the page allocator).
+ *
+ * The only problem is that pages at the beginning and at the
+ * end of the interesting range may not be aligned with pages that
+ * page allocator holds, ie. they can be part of higher order
+ * pages. Because of this, we reserve the bigger range and
+ * once this is done free the pages we are not interested in.
+ */
+
+ ret = 0;
+ while (!PageBuddy(pfn_to_page(start & (~0UL << ret))))
+ if (WARN_ON(++ret >= MAX_ORDER))
+ return -EINVAL;
+
+ outer_start = start & (~0UL << ret);
+ outer_end = alloc_contig_freed_pages(outer_start, end, flags);
+
+ /* Free head and tail (if any) */
+ if (start != outer_start)
+ free_contig_pages(pfn_to_page(outer_start), start - outer_start);
+ if (end != outer_end)
+ free_contig_pages(pfn_to_page(end), outer_end - end);
+
+ ret = 0;
+done:
+ undo_isolate_page_range(pfn_to_maxpage(start), pfn_to_maxpage_up(end));
+ return ret;
+}
+
void free_contig_pages(struct page *page, int nr_pages)
{
for (; nr_pages; --nr_pages, ++page)
--
1.7.1.569.g6f426

2011-06-10 09:55:21

by Marek Szyprowski

Subject: [PATCH 06/10] mm: MIGRATE_CMA migration type added

From: Michal Nazarewicz <[email protected]>

The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.

This guarantees that pages in a MIGRATE_CMA pageblock can
always be migrated somewhere else (unless there is no memory left
in the system).

It is designed to be used with Contiguous Memory Allocator
(CMA) for allocating big chunks (eg. 10MiB) of physically
contiguous memory. Once driver requests contiguous memory,
CMA will migrate pages from MIGRATE_CMA pageblocks.

To minimise the number of migrations, the MIGRATE_CMA migration
type is the last type tried when the page allocator falls back to
migration types other than the requested one.

Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
[m.szyprowski: cleaned up Kconfig]
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/mmzone.h | 43 ++++++++++++++++++----
mm/Kconfig | 8 ++++-
mm/compaction.c | 10 +++++
mm/internal.h | 3 ++
mm/page_alloc.c | 93 ++++++++++++++++++++++++++++++++++++++----------
5 files changed, 130 insertions(+), 27 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c928dac..a2eeacf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,13 +35,37 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3

-#define MIGRATE_UNMOVABLE 0
-#define MIGRATE_RECLAIMABLE 1
-#define MIGRATE_MOVABLE 2
-#define MIGRATE_PCPTYPES 3 /* the number of types on the pcp lists */
-#define MIGRATE_RESERVE 3
-#define MIGRATE_ISOLATE 4 /* can't allocate from here */
-#define MIGRATE_TYPES 5
+enum {
+ MIGRATE_UNMOVABLE,
+ MIGRATE_RECLAIMABLE,
+ MIGRATE_MOVABLE,
+ MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
+ MIGRATE_RESERVE = MIGRATE_PCPTYPES,
+#ifdef CONFIG_CMA_MIGRATE_TYPE
+ /*
+ * MIGRATE_CMA migration type is designed to mimic the way
+ * ZONE_MOVABLE works. Only movable pages can be allocated
+ * from MIGRATE_CMA pageblocks and page allocator never
+ * implicitly changes the migration type of a MIGRATE_CMA pageblock.
+ *
+ * The way to use it is to change migratetype of a range of
+ * pageblocks to MIGRATE_CMA which can be done by
+ * __free_pageblock_cma() function. What is important though
+ * is that a range of pageblocks must be aligned to
+ * MAX_ORDER_NR_PAGES should biggest page be bigger then
+ * a single pageblock.
+ */
+ MIGRATE_CMA,
+#endif
+ MIGRATE_ISOLATE, /* can't allocate from here */
+ MIGRATE_TYPES
+};
+
+#ifdef CONFIG_CMA_MIGRATE_TYPE
+# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
+#else
+# define is_migrate_cma(migratetype) false
+#endif

#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
@@ -54,6 +78,11 @@ static inline int get_pageblock_migratetype(struct page *page)
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}

+static inline bool is_pageblock_cma(struct page *page)
+{
+ return is_migrate_cma(get_pageblock_migratetype(page));
+}
+
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
diff --git a/mm/Kconfig b/mm/Kconfig
index 8ca47a5..6ffedd8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -189,7 +189,7 @@ config COMPACTION
config MIGRATION
bool "Page migration"
def_bool y
- depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
+ depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA_MIGRATE_TYPE
help
Allows the migration of the physical location of pages of processes
while the virtual addresses are not changed. This is useful in
@@ -198,6 +198,12 @@ config MIGRATION
pages as migration can relocate pages to satisfy a huge page
allocation instead of reclaiming.

+config CMA_MIGRATE_TYPE
+ bool
+ help
+ This enables the use of the MIGRATE_CMA migrate type, which lets CMA
+ work on an almost arbitrary memory range, not only inside ZONE_MOVABLE.
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT

diff --git a/mm/compaction.c b/mm/compaction.c
index 021a296..6d013c3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -119,6 +119,16 @@ static bool suitable_migration_target(struct page *page)
if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
return false;

+ /* Keep MIGRATE_CMA alone as well. */
+ /*
+ * XXX Revisit. We currently cannot let compaction touch CMA
+ * pages since compaction insists on changing their migration
+ * type to MIGRATE_MOVABLE (see split_free_page() called from
+ * isolate_freepages_block() above).
+ */
+ if (is_migrate_cma(migratetype))
+ return false;
+
/* If the page is a large free page, then allow migration */
if (PageBuddy(page) && page_order(page) >= pageblock_order)
return true;
diff --git a/mm/internal.h b/mm/internal.h
index d071d380..b44566c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -49,6 +49,9 @@ extern void putback_lru_page(struct page *page);
* in mm/page_alloc.c
*/
extern void __free_pages_bootmem(struct page *page, unsigned int order);
+#ifdef CONFIG_CMA_MIGRATE_TYPE
+extern void __free_pageblock_cma(struct page *page);
+#endif
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2cea044..c13c0dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -719,6 +719,30 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
}
}

+#ifdef CONFIG_CMA_MIGRATE_TYPE
+
+/*
+ * Free a whole pageblock and set its migration type to MIGRATE_CMA.
+ */
+void __init __free_pageblock_cma(struct page *page)
+{
+ struct page *p = page;
+ unsigned i = pageblock_nr_pages;
+
+ prefetchw(p);
+ do {
+ if (--i)
+ prefetchw(p + 1);
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+ } while (++p, i);
+
+ set_page_refcounted(page);
+ set_pageblock_migratetype(page, MIGRATE_CMA);
+ __free_pages(page, pageblock_order);
+}
+
+#endif

/*
* The order of subdivision here is critical for the IO subsystem.
@@ -827,11 +851,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
*/
-static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
+static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
+#ifdef CONFIG_CMA_MIGRATE_TYPE
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_CMA , MIGRATE_RESERVE },
+#else
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
- [MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */
+#endif
+ [MIGRATE_RESERVE] = { MIGRATE_RESERVE }, /* Never used */
};

/*
@@ -926,12 +954,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
/* Find the largest possible block of pages in the other list */
for (current_order = MAX_ORDER-1; current_order >= order;
--current_order) {
- for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+ for (i = 0; i < ARRAY_SIZE(fallbacks[0]); i++) {
migratetype = fallbacks[start_migratetype][i];

/* MIGRATE_RESERVE handled later if necessary */
if (migratetype == MIGRATE_RESERVE)
- continue;
+ break;

area = &(zone->free_area[current_order]);
if (list_empty(&area->free_list[migratetype]))
@@ -946,19 +974,29 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
* pages to the preferred allocation list. If falling
* back for a reclaimable kernel allocation, be more
* aggressive about taking ownership of free pages
+ *
+ * On the other hand, never change migration
+ * type of MIGRATE_CMA pageblocks nor move CMA
+ * pages on different free lists. We don't
+ * want unmovable pages to be allocated from
+ * MIGRATE_CMA areas.
*/
- if (unlikely(current_order >= (pageblock_order >> 1)) ||
- start_migratetype == MIGRATE_RECLAIMABLE ||
- page_group_by_mobility_disabled) {
- unsigned long pages;
+ if (!is_pageblock_cma(page) &&
+ (unlikely(current_order >= pageblock_order / 2) ||
+ start_migratetype == MIGRATE_RECLAIMABLE ||
+ page_group_by_mobility_disabled)) {
+ int pages;
pages = move_freepages_block(zone, page,
- start_migratetype);
+ start_migratetype);

- /* Claim the whole block if over half of it is free */
+ /*
+ * Claim the whole block if over half
+ * of it is free
+ */
if (pages >= (1 << (pageblock_order-1)) ||
- page_group_by_mobility_disabled)
+ page_group_by_mobility_disabled)
set_pageblock_migratetype(page,
- start_migratetype);
+ start_migratetype);

migratetype = start_migratetype;
}
@@ -968,11 +1006,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
rmv_page_order(page);

/* Take ownership for orders >= pageblock_order */
- if (current_order >= pageblock_order)
+ if (current_order >= pageblock_order &&
+ !is_pageblock_cma(page))
change_pageblock_range(page, current_order,
start_migratetype);

- expand(zone, page, order, current_order, area, migratetype);
+ expand(zone, page, order, current_order, area,
+ is_migrate_cma(start_migratetype)
+ ? start_migratetype : migratetype);

trace_mm_page_alloc_extfrag(page, order, current_order,
start_migratetype, migratetype);
@@ -1044,7 +1085,12 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add(&page->lru, list);
else
list_add_tail(&page->lru, list);
- set_page_private(page, migratetype);
+#ifdef CONFIG_CMA_MIGRATE_TYPE
+ if (is_pageblock_cma(page))
+ set_page_private(page, MIGRATE_CMA);
+ else
+#endif
+ set_page_private(page, migratetype);
list = &page->lru;
}
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
@@ -1185,9 +1231,16 @@ void free_hot_cold_page(struct page *page, int cold)
* offlined but treat RESERVE as movable pages so we can get those
* areas back if necessary. Otherwise, we may have to free
* excessively into the page allocator
+ *
+ * Still, do not change migration type of MIGRATE_CMA pages (if
+ * they'd be recorded as MIGRATE_MOVABLE an unmovable page could
+ * be allocated from MIGRATE_CMA block and we don't want to allow
+ * that). In this respect, treat MIGRATE_CMA like
+ * MIGRATE_ISOLATE.
*/
if (migratetype >= MIGRATE_PCPTYPES) {
- if (unlikely(migratetype == MIGRATE_ISOLATE)) {
+ if (unlikely(migratetype == MIGRATE_ISOLATE
+ || is_migrate_cma(migratetype))) {
free_one_page(zone, page, 0, migratetype);
goto out;
}
@@ -1276,7 +1329,9 @@ int split_free_page(struct page *page)
if (order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages)
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ if (!is_pageblock_cma(page))
+ set_pageblock_migratetype(page,
+ MIGRATE_MOVABLE);
}

return 1 << order;
@@ -5486,8 +5541,8 @@ __count_immobile_pages(struct zone *zone, struct page *page, int count)
*/
if (zone_idx(zone) == ZONE_MOVABLE)
return true;
-
- if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+ if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
+ is_pageblock_cma(page))
return true;

pfn = page_to_pfn(page);
--
1.7.1.569.g6f426

2011-06-10 09:56:16

by Marek Szyprowski

Subject: [PATCH 07/10] mm: MIGRATE_CMA isolation functions added

From: Michal Nazarewicz <[email protected]>

This commit changes various functions that switch the migrate type
of pages and pageblocks between MIGRATE_ISOLATE and MIGRATE_MOVABLE
so that they can also work with the MIGRATE_CMA migrate type.

Signed-off-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/page-isolation.h | 40 +++++++++++++++++++++++++++-------------
mm/page_alloc.c | 19 ++++++++++++-------
mm/page_isolation.c | 15 ++++++++-------
3 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index c5d1a7c..177b307 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -3,39 +3,53 @@

/*
* Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
- * If specified range includes migrate types other than MOVABLE,
+ * If specified range includes migrate types other than MOVABLE or CMA,
* this will fail with -EBUSY.
*
* To finally isolate all pages in the range, the caller has to
* free all pages in the range. test_pages_isolated() can be used
* to test it.
*/
-extern int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+int __start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ unsigned migratetype);
+
+static inline int
+start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ return __start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+}
+
+int __undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ unsigned migratetype);

/*
* Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
* target range is [start_pfn, end_pfn)
*/
-extern int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+static inline int
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ return __undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+}

/*
- * test all pages in [start_pfn, end_pfn)are isolated or not.
+ * Test all pages in [start_pfn, end_pfn) are isolated or not.
*/
-extern int
-test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
+int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);

/*
- * Internal funcs.Changes pageblock's migrate type.
- * Please use make_pagetype_isolated()/make_pagetype_movable().
+ * Internal functions. Changes pageblock's migrate type.
*/
-extern int set_migratetype_isolate(struct page *page);
-extern void unset_migratetype_isolate(struct page *page);
+int set_migratetype_isolate(struct page *page);
+void __unset_migratetype_isolate(struct page *page, unsigned migratetype);
+static inline void unset_migratetype_isolate(struct page *page)
+{
+ __unset_migratetype_isolate(page, MIGRATE_MOVABLE);
+}
extern unsigned long alloc_contig_freed_pages(unsigned long start,
unsigned long end, gfp_t flag);
extern int alloc_contig_range(unsigned long start, unsigned long end,
- gfp_t flags);
+ gfp_t flags, unsigned migratetype);
extern void free_contig_pages(struct page *page, int nr_pages);

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c13c0dc..c5b78ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5641,7 +5641,7 @@ out:
return ret;
}

-void unset_migratetype_isolate(struct page *page)
+void __unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
unsigned long flags;
@@ -5649,8 +5649,8 @@ void unset_migratetype_isolate(struct page *page)
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- move_freepages_block(zone, page, MIGRATE_MOVABLE);
+ set_pageblock_migratetype(page, migratetype);
+ move_freepages_block(zone, page, migratetype);
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -5755,6 +5755,10 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
* @flags: flags passed to alloc_contig_freed_pages().
+ * @migratetype: migratetype of the underlying pageblocks (either
+ * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
+ * in range must have the same migratetype and it must
+ * be either of the two.
*
* The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
* aligned; however, it is the caller's responsibility to guarantee that we
@@ -5766,7 +5770,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
* need to be freed with free_contig_pages().
*/
int alloc_contig_range(unsigned long start, unsigned long end,
- gfp_t flags)
+ gfp_t flags, unsigned migratetype)
{
unsigned long outer_start, outer_end;
int ret;
@@ -5794,8 +5798,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
* them.
*/

- ret = start_isolate_page_range(pfn_to_maxpage(start),
- pfn_to_maxpage_up(end));
+ ret = __start_isolate_page_range(pfn_to_maxpage(start),
+ pfn_to_maxpage_up(end), migratetype);
if (ret)
goto done;

@@ -5833,7 +5837,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,

ret = 0;
done:
- undo_isolate_page_range(pfn_to_maxpage(start), pfn_to_maxpage_up(end));
+ __undo_isolate_page_range(pfn_to_maxpage(start), pfn_to_maxpage_up(end),
+ migratetype);
return ret;
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 15b41ec..f8beab5 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -23,10 +23,11 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
}

/*
- * start_isolate_page_range() -- make page-allocation-type of range of pages
+ * __start_isolate_page_range() -- make page-allocation-type of range of pages
* to be MIGRATE_ISOLATE.
* @start_pfn: The lower PFN of the range to be isolated.
* @end_pfn: The upper PFN of the range to be isolated.
+ * @migratetype: migrate type to set in error recovery.
*
* Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
* the range will never be allocated. Any free pages and pages freed in the
@@ -35,8 +36,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* start_pfn/end_pfn must be aligned to pageblock_order.
* Returns 0 on success and -EBUSY if any part of range cannot be isolated.
*/
-int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+int __start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ unsigned migratetype)
{
unsigned long pfn;
unsigned long undo_pfn;
@@ -59,7 +60,7 @@ undo:
for (pfn = start_pfn;
pfn < undo_pfn;
pfn += pageblock_nr_pages)
- unset_migratetype_isolate(pfn_to_page(pfn));
+ __unset_migratetype_isolate(pfn_to_page(pfn), migratetype);

return -EBUSY;
}
@@ -67,8 +68,8 @@ undo:
/*
* Make isolated pages available again.
*/
-int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+int __undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ unsigned migratetype)
{
unsigned long pfn;
struct page *page;
@@ -80,7 +81,7 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
page = __first_valid_page(pfn, pageblock_nr_pages);
if (!page || get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
continue;
- unset_migratetype_isolate(page);
+ __unset_migratetype_isolate(page, migratetype);
}
return 0;
}
--
1.7.1.569.g6f426

2011-06-10 09:57:05

by Marek Szyprowski

Subject: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

The Contiguous Memory Allocator is a set of functions that lets
one initialise a region of memory from which contiguous memory
chunks can then be allocated.

CMA allows for the creation of separate contexts. The kernel is
allowed to allocate movable pages within CMA's managed memory so
that it can be used for the page cache when CMA devices do not
use it. On a cm_alloc() request such pages are migrated out of
the CMA area to free the required contiguous block.

This code is heavily based on earlier works by Michal Nazarewicz.

Signed-off-by: Marek Szyprowski <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
CC: Michal Nazarewicz <[email protected]>
---
include/linux/cma.h | 189 +++++++++++++++++++++++++++
mm/Kconfig | 21 +++
mm/Makefile | 1 +
mm/cma.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 569 insertions(+), 0 deletions(-)
create mode 100644 include/linux/cma.h
create mode 100644 mm/cma.c

diff --git a/include/linux/cma.h b/include/linux/cma.h
new file mode 100644
index 0000000..70a993c
--- /dev/null
+++ b/include/linux/cma.h
@@ -0,0 +1,189 @@
+#ifndef __LINUX_CMA_H
+#define __LINUX_CMA_H
+
+/*
+ * Contiguous Memory Allocator
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ * Michal Nazarewicz <[email protected]>
+ * Marek Szyprowski <[email protected]>
+ */
+
+/*
+ * Contiguous Memory Allocator
+ *
+ * The Contiguous Memory Allocator (CMA) makes it possible for
+ * device drivers to allocate big contiguous chunks of memory after
+ * the system has booted.
+ *
+ * It requires some machine- and/or platform-specific initialisation
+ * code which prepares memory ranges to be used with CMA; later,
+ * device drivers can allocate memory from those ranges.
+ *
+ * Why is it needed?
+ *
+ * Various devices on embedded systems have no scatter-gather and/or
+ * IO mapping support and require contiguous blocks of memory to
+ * operate. They include devices such as cameras, hardware video
+ * codecs, etc.
+ *
+ * Such devices often require big memory buffers (a full HD frame
+ * is, for instance, more than 2 megapixels, i.e. more than 6
+ * MB of memory), which makes mechanisms such as kmalloc() or
+ * alloc_page() ineffective.
+ *
+ * At the same time, a solution where a big memory region is
+ * reserved for a device is suboptimal since often more memory is
+ * reserved than strictly required and, moreover, the memory is
+ * inaccessible to the page allocator even if device drivers don't use it.
+ *
+ * CMA tries to solve this issue by operating on memory regions
+ * from which only movable pages can be allocated. This way, the
+ * kernel can use the memory for page cache and, when a device
+ * driver requests it, the allocated pages can be migrated away.
+ *
+ * Driver usage
+ *
+ * CMA should not be used directly by device drivers. It should
+ * be considered a helper framework for the dma-mapping subsystem
+ * and respective (platform) bus drivers.
+ *
+ * The CMA client needs to have a pointer to a CMA context
+ * represented by a struct cma (which is an opaque data type).
+ *
+ * Once such a pointer is obtained, a caller may allocate a contiguous
+ * memory chunk using the following function:
+ *
+ * cm_alloc()
+ *
+ * This function returns a pointer to the first struct page which
+ * represents a contiguous memory chunk. This pointer
+ * may be used with the following function:
+ *
+ * cm_free() -- frees allocated contiguous memory
+ *
+ * Platform/machine integration
+ *
+ * A CMA context must be created on platform or machine initialisation
+ * and passed to the respective subsystem that will be its client.
+ * The latter may be done via a global variable or some field in
+ * struct device. For the former, CMA provides the following functions:
+ *
+ * cma_init_migratetype()
+ * cma_reserve()
+ * cma_create()
+ *
+ * The first one initialises a portion of reserved memory so that it
+ * can be used with CMA. The second one first tries to reserve memory
+ * (using memblock) and then initialises it.
+ *
+ * The cma_reserve() function must be called when memblock is still
+ * operational and reserving memory with it is still possible. On
+ * ARM platform the "reserve" machine callback is a perfect place to
+ * call it.
+ *
+ * The last function creates a CMA context on a range of previously
+ * initialised memory addresses. Because it uses kmalloc() it needs
+ * to be called after SLAB is initialised.
+ */
+
+/***************************** Kernel level API *****************************/
+
+#ifdef __KERNEL__
+
+struct cma;
+struct page;
+
+#ifdef CONFIG_CMA
+
+/**
+ * cma_init_migratetype() - initialises range of physical memory to be used
+ * with CMA context.
+ * @start: start address of the memory range in bytes.
+ * @size: size of the memory range in bytes.
+ *
+ * The range must be MAX_ORDER_NR_PAGES aligned and must have already
+ * been reserved (e.g. with memblock).
+ *
+ * The actual initialisation is deferred until subsys initcalls are
+ * evaluated (unless this has already happened).
+ *
+ * Returns zero on success or negative error.
+ */
+int cma_init_migratetype(unsigned long start, unsigned long size);
+
+/**
+ * cma_reserve() - reserves memory.
+ * @start: start address hint of the memory range in bytes; if unsure,
+ * pass zero.
+ * @size: size of the memory to reserve in bytes.
+ *
+ * It will use memblock to allocate memory. It will also call
+ * cma_init_migratetype() on reserved region so that a CMA context can
+ * be created on given range.
+ *
+ * @start and @size will be aligned to (MAX_ORDER_NR_PAGES << PAGE_SHIFT).
+ *
+ * Returns the physical address of the reserved area, or a value that
+ * yields true when checked with IS_ERR_VALUE().
+ */
+unsigned long cma_reserve(unsigned long start, unsigned long size);
+
+/**
+ * cma_create() - creates a CMA context.
+ * @start: start address of the context in bytes.
+ * @size: size of the context in bytes.
+ *
+ * The range must be page aligned. Different contexts cannot overlap.
+ *
+ * The memory range must either lie in ZONE_MOVABLE or must have been
+ * initialised with the cma_init_migratetype() function.
+ *
+ * @start and @size must be page aligned.
+ *
+ * Because this function uses kmalloc() it must be called after SLAB
+ * is initialised. In particular, this means that it cannot be called
+ * right after cma_reserve(), since the latter needs to be run much
+ * earlier.
+ *
+ * Returns pointer to CMA context or a pointer-error on error.
+ */
+struct cma *cma_create(unsigned long start, unsigned long size);
+
+/**
+ * cma_destroy() - destroys CMA context.
+ * @cma: context to destroy.
+ */
+void cma_destroy(struct cma *cma);
+
+/**
+ * cm_alloc() - allocates contiguous memory.
+ * @cma: CMA context to use
+ * @count: desired chunk size in pages (must be non-zero)
+ * @order: desired alignment in pages
+ *
+ * Returns pointer to first page structure representing contiguous memory
+ * or a pointer-error on error.
+ */
+struct page *cm_alloc(struct cma *cma, int count, unsigned int order);
+
+/**
+ * cm_free() - frees contiguous memory.
+ * @cma: CMA context to use
+ * @pages: contiguous memory to free
+ * @count: number of pages in the chunk (must be non-zero)
+ */
+void cm_free(struct cma *cma, struct page *pages, int count);
+
+#else
+static inline struct page *cm_alloc(struct cma *cma, int count, unsigned int order)
+{
+ return NULL;
+}
+static inline void cm_free(struct cma *cma, struct page *pages, int count) { }
+#endif
+
+#endif
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 6ffedd8..b85de4c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -346,6 +346,27 @@ choice
benefit.
endchoice

+config CMA
+ bool "Contiguous Memory Allocator"
+ select CMA_MIGRATE_TYPE
+ select MIGRATION
+ select GENERIC_ALLOCATOR
+ help
+ This enables the Contiguous Memory Allocator which allows drivers
+ to allocate big physically-contiguous blocks of memory for use with
+ hardware components that do not support I/O mapping or scatter-gather.
+
+ For more information see <include/linux/cma.h>. If unsure, say "n".
+
+config CMA_DEBUG
+ bool "CMA debug messages (DEVELOPMENT)"
+ depends on CMA
+ help
+ Turns on debug messages in CMA. This produces KERN_DEBUG
+ messages for every CMA call as well as various messages while
+ processing calls such as cm_alloc(). This option does not
+ affect warning and error messages.
+
#
# UP and nommu archs use km based percpu allocator
#
diff --git a/mm/Makefile b/mm/Makefile
index 836e416..4610320 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,3 +50,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_CMA) += cma.o
diff --git a/mm/cma.c b/mm/cma.c
new file mode 100644
index 0000000..aee306c
--- /dev/null
+++ b/mm/cma.c
@@ -0,0 +1,358 @@
+/*
+ * Contiguous Memory Allocator
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ * Michal Nazarewicz <[email protected]>
+ * Marek Szyprowski <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ */
+
+/*
+ * See include/linux/cma.h for details.
+ */
+
+#define pr_fmt(fmt) "cma: " fmt
+
+#ifdef CONFIG_CMA_DEBUG
+# define DEBUG
+#endif
+
+#include <asm/page.h>
+#include <asm/errno.h>
+
+#include <linux/cma.h>
+#include <linux/memblock.h>
+#include <linux/err.h>
+#include <linux/genalloc.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/page-isolation.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/mm_types.h>
+
+#include "internal.h"
+
+/* XXX Revisit */
+#ifdef phys_to_pfn
+/* nothing to do */
+#elif defined __phys_to_pfn
+# define phys_to_pfn __phys_to_pfn
+#else
+# warning correct phys_to_pfn implementation needed
+static unsigned long phys_to_pfn(phys_addr_t phys)
+{
+ return virt_to_pfn(phys_to_virt(phys));
+}
+#endif
+
+
+/************************* Initialise CMA *************************/
+
+static struct cma_grabbed {
+ unsigned long start;
+ unsigned long size;
+} cma_grabbed[8] __initdata;
+static unsigned cma_grabbed_count __initdata;
+
+#ifdef CONFIG_DEBUG_VM
+
+static int __cma_give_back(unsigned long start, unsigned long size)
+{
+ unsigned long pfn = phys_to_pfn(start);
+ unsigned i = size >> PAGE_SHIFT;
+ struct zone *zone;
+
+ pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+ VM_BUG_ON(!pfn_valid(pfn));
+ zone = page_zone(pfn_to_page(pfn));
+
+ do {
+ VM_BUG_ON(!pfn_valid(pfn));
+ VM_BUG_ON(page_zone(pfn_to_page(pfn)) != zone);
+ if (!(pfn & (pageblock_nr_pages - 1)))
+ __free_pageblock_cma(pfn_to_page(pfn));
+ ++pfn;
+ ++totalram_pages;
+ } while (--i);
+
+ return 0;
+}
+
+#else
+
+static int __cma_give_back(unsigned long start, unsigned long size)
+{
+ unsigned i = size >> (PAGE_SHIFT + pageblock_order);
+ struct page *p = phys_to_page(start);
+
+ pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+ do {
+ __free_pageblock_cma(p);
+ p += pageblock_nr_pages;
+ totalram_pages += pageblock_nr_pages;
+ } while (--i);
+
+ return 0;
+}
+
+#endif
+
+static int __init __cma_queue_give_back(unsigned long start, unsigned long size)
+{
+ if (cma_grabbed_count == ARRAY_SIZE(cma_grabbed))
+ return -ENOSPC;
+
+ cma_grabbed[cma_grabbed_count].start = start;
+ cma_grabbed[cma_grabbed_count].size = size;
+ ++cma_grabbed_count;
+ return 0;
+}
+
+static int (*cma_give_back)(unsigned long start, unsigned long size) =
+ __cma_queue_give_back;
+
+static int __init cma_give_back_queued(void)
+{
+ struct cma_grabbed *r = cma_grabbed;
+ unsigned i = cma_grabbed_count;
+
+ pr_debug("%s(): will give %u range(s)\n", __func__, i);
+
+ cma_give_back = __cma_give_back;
+
+ for (; i; --i, ++r)
+ __cma_give_back(r->start, r->size);
+
+ return 0;
+}
+core_initcall(cma_give_back_queued);
+
+int __ref cma_init_migratetype(unsigned long start, unsigned long size)
+{
+ pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+ if (!size)
+ return -EINVAL;
+ if ((start | size) & ((MAX_ORDER_NR_PAGES << PAGE_SHIFT) - 1))
+ return -EINVAL;
+ if (start + size < start)
+ return -EOVERFLOW;
+
+ return cma_give_back(start, size);
+}
+
+unsigned long cma_reserve(unsigned long start, unsigned long size)
+{
+ unsigned long alignment;
+ int ret;
+
+ pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+ /* Sanity checks */
+ if (!size)
+ return (unsigned long)-EINVAL;
+
+ /* Sanitise input arguments */
+ start = ALIGN(start, MAX_ORDER_NR_PAGES << PAGE_SHIFT);
+ size = ALIGN(size , MAX_ORDER_NR_PAGES << PAGE_SHIFT);
+ alignment = MAX_ORDER_NR_PAGES << PAGE_SHIFT;
+
+ /* Reserve memory */
+ if (start) {
+ if (memblock_is_region_reserved(start, size) ||
+ memblock_reserve(start, size) < 0)
+ return (unsigned long)-EBUSY;
+ } else {
+ /*
+ * Use __memblock_alloc_base() since
+ * memblock_alloc_base() panic()s.
+ */
+ u64 addr = __memblock_alloc_base(size, alignment, 0);
+ if (!addr) {
+ return (unsigned long)-ENOMEM;
+ } else if (addr + size > ~(unsigned long)0) {
+ memblock_free(addr, size);
+ return (unsigned long)-EOVERFLOW;
+ } else {
+ start = addr;
+ }
+ }
+
+ /* CMA Initialise */
+ ret = cma_init_migratetype(start, size);
+ if (ret < 0) {
+ memblock_free(start, size);
+ return ret;
+ }
+
+ return start;
+}
+
+
+/************************** CMA context ***************************/
+
+struct cma {
+ int migratetype;
+ struct gen_pool *pool;
+};
+
+static int __cma_check_range(unsigned long start, unsigned long size)
+{
+ int migratetype = MIGRATE_MOVABLE;
+ unsigned long pfn, count;
+ struct page *page;
+ struct zone *zone;
+
+ start = phys_to_pfn(start);
+ if (WARN_ON(!pfn_valid(start)))
+ return -EINVAL;
+
+ if (page_zonenum(pfn_to_page(start)) != ZONE_MOVABLE)
+ migratetype = MIGRATE_CMA;
+
+ /* First check if all pages are valid and in the same zone */
+ zone = page_zone(pfn_to_page(start));
+ count = size >> PAGE_SHIFT;
+ pfn = start;
+ while (++pfn, --count) {
+ if (WARN_ON(!pfn_valid(pfn)) ||
+ WARN_ON(page_zone(pfn_to_page(pfn)) != zone))
+ return -EINVAL;
+ }
+
+ /* Now check migratetype of their pageblocks. */
+ start = start & ~(pageblock_nr_pages - 1);
+ pfn = ALIGN(pfn, pageblock_nr_pages);
+ page = pfn_to_page(start);
+ count = (pfn - start) >> PAGE_SHIFT;
+ do {
+ if (WARN_ON(get_pageblock_migratetype(page) != migratetype))
+ return -EINVAL;
+ page += pageblock_nr_pages;
+ } while (--count);
+
+ return migratetype;
+}
+
+struct cma *cma_create(unsigned long start, unsigned long size)
+{
+ struct gen_pool *pool;
+ int migratetype, ret;
+ struct cma *cma;
+
+ pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+ if (!size)
+ return ERR_PTR(-EINVAL);
+ if ((start | size) & (PAGE_SIZE - 1))
+ return ERR_PTR(-EINVAL);
+ if (start + size < start)
+ return ERR_PTR(-EOVERFLOW);
+
+ migratetype = __cma_check_range(start, size);
+ if (migratetype < 0)
+ return ERR_PTR(migratetype);
+
+ cma = kmalloc(sizeof *cma, GFP_KERNEL);
+ if (!cma)
+ return ERR_PTR(-ENOMEM);
+
+ pool = gen_pool_create(ffs(PAGE_SIZE) - 1, -1);
+ if (!pool) {
+ ret = -ENOMEM;
+ goto error1;
+ }
+
+ ret = gen_pool_add(pool, start, size, -1);
+ if (unlikely(ret))
+ goto error2;
+
+ cma->migratetype = migratetype;
+ cma->pool = pool;
+
+ pr_debug("%s: returning <%p>\n", __func__, (void *)cma);
+ return cma;
+
+error2:
+ gen_pool_destroy(pool);
+error1:
+ kfree(cma);
+ return ERR_PTR(ret);
+}
+
+void cma_destroy(struct cma *cma)
+{
+ pr_debug("%s(<%p>)\n", __func__, (void *)cma);
+ gen_pool_destroy(cma->pool);
+}
+
+
+/************************* Allocate and free *************************/
+
+/* Protects cm_alloc(), cm_free() as well as gen_pools of each cm. */
+static DEFINE_MUTEX(cma_mutex);
+
+struct page *cm_alloc(struct cma *cma, int count, unsigned int order)
+{
+ unsigned long start;
+ unsigned long size = count << PAGE_SHIFT;
+
+ if (!cma)
+ return NULL;
+
+ pr_debug("%s(<%p>, %lx/%d)\n", __func__, (void *)cma, size, order);
+
+ if (!size)
+ return NULL;
+
+ mutex_lock(&cma_mutex);
+
+ start = gen_pool_alloc_aligned(cma->pool, size, order+12);
+ if (!start)
+ goto error1;
+
+ if (cma->migratetype) {
+ unsigned long pfn = phys_to_pfn(start);
+ int ret = alloc_contig_range(pfn, pfn + (size >> PAGE_SHIFT),
+ 0, cma->migratetype);
+ if (ret)
+ goto error2;
+ }
+
+ mutex_unlock(&cma_mutex);
+
+ pr_debug("%s(): returning [%p]\n", __func__, (void *)phys_to_page(start));
+ return phys_to_page(start);
+error2:
+ gen_pool_free(cma->pool, start, size);
+error1:
+ mutex_unlock(&cma_mutex);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(cm_alloc);
+
+void cm_free(struct cma *cma, struct page *pages, int count)
+{
+ unsigned long size = count << PAGE_SHIFT;
+ pr_debug("%s([%p])\n", __func__, (void *)pages);
+
+ if (!cma || !pages)
+ return;
+
+ mutex_lock(&cma_mutex);
+
+ gen_pool_free(cma->pool, page_to_phys(pages), size);
+ if (cma->migratetype)
+ free_contig_pages(pages, count);
+
+ mutex_unlock(&cma_mutex);
+}
+EXPORT_SYMBOL_GPL(cm_free);
--
1.7.1.569.g6f426

2011-06-10 09:56:21

by Marek Szyprowski

Subject: [PATCH 09/10] ARM: integrate CMA with dma-mapping subsystem

This patch adds CMA support to the dma-mapping subsystem for the ARM
architecture. A CMA area can be defined individually for each device in
the system. It is up to the board startup code to create a CMA area and
assign it to the devices.

Buffer alignment is derived from the buffer size, but never less than
1 MiB: buffers smaller than 1 MiB are aligned up to 1 MiB.

Signed-off-by: Marek Szyprowski <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
---
arch/arm/include/asm/device.h | 3 ++
arch/arm/include/asm/dma-mapping.h | 19 +++++++++++
arch/arm/mm/dma-mapping.c | 60 +++++++++++++++++++++++++++---------
3 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/device.h b/arch/arm/include/asm/device.h
index 9f390ce..942913e 100644
--- a/arch/arm/include/asm/device.h
+++ b/arch/arm/include/asm/device.h
@@ -10,6 +10,9 @@ struct dev_archdata {
#ifdef CONFIG_DMABOUNCE
struct dmabounce_device_info *dmabounce;
#endif
+#ifdef CONFIG_CMA
+ struct cma *cma_area;
+#endif
};

struct pdev_archdata {
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 4fff837..e387ea7 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -14,6 +14,25 @@
#error Please update to __arch_pfn_to_dma
#endif

+struct cma;
+
+#ifdef CONFIG_CMA
+static inline struct cma *get_dev_cma_area(struct device *dev)
+{
+ return dev->archdata.cma_area;
+}
+
+static inline void set_dev_cma_area(struct device *dev, struct cma *cma)
+{
+ dev->archdata.cma_area = cma;
+}
+#else
+static inline struct cma *get_dev_cma_area(struct device *dev)
+{
+ return NULL;
+}
+#endif
+
/*
* dma_to_pfn/pfn_to_dma/dma_to_virt/virt_to_dma are architecture private
* functions used internally by the DMA-mapping API to provide DMA
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 82a093c..233e34a 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -18,6 +18,7 @@
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/highmem.h>
+#include <linux/cma.h>

#include <asm/memory.h>
#include <asm/highmem.h>
@@ -52,16 +53,36 @@ static u64 get_coherent_dma_mask(struct device *dev)
return mask;
}

+
+static struct page *__alloc_system_pages(size_t count, unsigned int order, gfp_t gfp)
+{
+ struct page *page, *p, *e;
+
+ page = alloc_pages(gfp, order);
+ if (!page)
+ return NULL;
+
+ /*
+ * Now split the huge page and free the excess pages
+ */
+ split_page(page, order);
+ for (p = page + count, e = page + (1 << order); p < e; p++)
+ __free_page(p);
+ return page;
+}
+
/*
* Allocate a DMA buffer for 'dev' of size 'size' using the
* specified gfp mask. Note that 'size' must be page aligned.
*/
static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gfp)
{
- unsigned long order = get_order(size);
- struct page *page, *p, *e;
+ struct cma *cma = get_dev_cma_area(dev);
+ struct page *page;
+ size_t count = size >> PAGE_SHIFT;
void *ptr;
u64 mask = get_coherent_dma_mask(dev);
+ unsigned long order = get_order(count << PAGE_SHIFT);

#ifdef CONFIG_DMA_API_DEBUG
u64 limit = (mask + 1) & ~mask;
@@ -78,16 +99,19 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
if (mask < 0xffffffffULL)
gfp |= GFP_DMA;

- page = alloc_pages(gfp, order);
- if (!page)
- return NULL;
+ /*
+ * First, try to allocate memory from contiguous area aligned up to 1MiB
+ */
+ page = cm_alloc(cma, count, order < 8 ? 8 : order);

/*
- * Now split the huge page and free the excess pages
+ * Fallback if contiguous alloc fails or is not available
*/
- split_page(page, order);
- for (p = page + (size >> PAGE_SHIFT), e = page + (1 << order); p < e; p++)
- __free_page(p);
+ if (!page)
+ page = __alloc_system_pages(count, order, gfp);
+
+ if (!page)
+ return NULL;

/*
* Ensure that the allocated pages are zeroed, and that any data
@@ -104,13 +128,19 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
/*
* Free a DMA buffer. 'size' must be page aligned.
*/
-static void __dma_free_buffer(struct page *page, size_t size)
+static void __dma_free_buffer(struct device *dev, struct page *page, size_t size)
{
- struct page *e = page + (size >> PAGE_SHIFT);
+ struct cma *cma = get_dev_cma_area(dev);
+ size_t count = size >> PAGE_SHIFT;
+ struct page *e = page + count;

- while (page < e) {
- __free_page(page);
- page++;
+ if (cma) {
+ cm_free(cma, page, count);
+ } else {
+ while (page < e) {
+ __free_page(page);
+ page++;
+ }
}
}

@@ -416,7 +446,7 @@ void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr
if (!arch_is_coherent())
__dma_free_remap(cpu_addr, size);

- __dma_free_buffer(pfn_to_page(dma_to_pfn(dev, handle)), size);
+ __dma_free_buffer(dev, pfn_to_page(dma_to_pfn(dev, handle)), size);
}
EXPORT_SYMBOL(dma_free_coherent);

--
1.7.1.569.g6f426

2011-06-10 09:55:22

by Marek Szyprowski

Subject: [PATCH 10/10] ARM: S5PV210: add CMA support for FIMC devices on Aquila board

This patch is an example of how CMA can be activated for particular
devices in the system. It creates one CMA region and assigns it to all
s5p-fimc devices on the Samsung Aquila S5PC110 board.

Signed-off-by: Marek Szyprowski <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
---
arch/arm/mach-s5pv210/Kconfig | 1 +
arch/arm/mach-s5pv210/mach-aquila.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/arm/mach-s5pv210/Kconfig b/arch/arm/mach-s5pv210/Kconfig
index 37b5a97..c09a92c 100644
--- a/arch/arm/mach-s5pv210/Kconfig
+++ b/arch/arm/mach-s5pv210/Kconfig
@@ -64,6 +64,7 @@ menu "S5PC110 Machines"
config MACH_AQUILA
bool "Aquila"
select CPU_S5PV210
+ select CMA
select S3C_DEV_FB
select S5P_DEV_FIMC0
select S5P_DEV_FIMC1
diff --git a/arch/arm/mach-s5pv210/mach-aquila.c b/arch/arm/mach-s5pv210/mach-aquila.c
index 4e1d8ff..8c404e5 100644
--- a/arch/arm/mach-s5pv210/mach-aquila.c
+++ b/arch/arm/mach-s5pv210/mach-aquila.c
@@ -21,6 +21,8 @@
#include <linux/gpio_keys.h>
#include <linux/input.h>
#include <linux/gpio.h>
+#include <linux/cma.h>
+#include <linux/dma-mapping.h>

#include <asm/mach/arch.h>
#include <asm/mach/map.h>
@@ -650,6 +652,19 @@ static void __init aquila_map_io(void)
s5p_set_timer_source(S5P_PWM3, S5P_PWM4);
}

+unsigned long cma_area_start;
+unsigned long cma_area_size = 32 << 20;
+
+static void __init aquila_reserve(void)
+{
+ unsigned long ret = cma_reserve(cma_area_start, cma_area_size);
+
+ if (!IS_ERR_VALUE(ret)) {
+ cma_area_start = ret;
+ printk(KERN_INFO "cma: reserved %ld bytes at %lx\n", cma_area_size, cma_area_start);
+ }
+}
+
static void __init aquila_machine_init(void)
{
/* PMIC */
@@ -672,6 +687,16 @@ static void __init aquila_machine_init(void)
s3c_fb_set_platdata(&aquila_lcd_pdata);

platform_add_devices(aquila_devices, ARRAY_SIZE(aquila_devices));
+
+ if (cma_area_start) {
+ struct cma *cma;
+ cma = cma_create(cma_area_start, cma_area_size);
+ if (!IS_ERR(cma)) {
+ set_dev_cma_area(&s5p_device_fimc0.dev, cma);
+ set_dev_cma_area(&s5p_device_fimc1.dev, cma);
+ set_dev_cma_area(&s5p_device_fimc2.dev, cma);
+ }
+ }
}

MACHINE_START(AQUILA, "Aquila")
@@ -683,4 +708,5 @@ MACHINE_START(AQUILA, "Aquila")
.map_io = aquila_map_io,
.init_machine = aquila_machine_init,
.timer = &s5p_timer,
+ .reserve = aquila_reserve,
MACHINE_END
--
1.7.1.569.g6f426

2011-06-10 11:24:45

by Alan

Subject: Re: [PATCH 02/10] lib: genalloc: Generic allocator improvements

I am curious about one thing

Why do we need this allocator. Why not use allocate_resource and friends.
The kernel generic resource handler already handles object alignment and
subranges. It just seems to be a surplus allocator that could perhaps be
mostly removed by using the kernel resource allocator we already have ?

Alan

2011-06-10 12:22:14

by Marek Szyprowski

Subject: RE: [PATCH 02/10] lib: genalloc: Generic allocator improvements

Hello,

On Friday, June 10, 2011 1:25 PM Alan Cox wrote:

> I am curious about one thing
>
> Why do we need this allocator. Why not use allocate_resource and friends.
> The kernel generic resource handler already handles object alignment and
> subranges. It just seems to be a surplus allocator that could perhaps be
> mostly removed by using the kernel resource allocator we already have ?

genalloc was used mainly for historical reasons (in the earlier versions we
were looking for a direct replacement for the first-fit allocator).

I plan to replace it with a lib/bitmap.c bitmap_*-based allocator (similar
to how it is used by dma_declare_coherent_memory() and friends in
drivers/base/dma-coherent.c). We need something really simple for CMA area
management.

IMHO allocate_resource and friends are a bit too heavy here, but it is good
to know that such an allocator also exists.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

2011-06-10 12:51:31

by Alan

Subject: Re: [PATCH 02/10] lib: genalloc: Generic allocator improvements

> I plan to replace it with a lib/bitmap.c bitmap_*-based allocator (similar
> to how it is used by dma_declare_coherent_memory() and friends in
> drivers/base/dma-coherent.c). We need something really simple for CMA area
> management.
>
> IMHO allocate_resource and friends are a bit too heavy here, but it is good
> to know that such an allocator also exists.

Not sure I'd class allocate_resource as heavyweight, but that's fine provided
it ends up using something that already exists rather than inventing yet
another allocator.

This wants dealing with before it goes upstream, though, so the changes in
lib/*.c etc. never have to reach mainline and then get changed back.

Alan

2011-06-10 16:22:09

by Arnd Bergmann

Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Friday 10 June 2011, Marek Szyprowski wrote:
> The Contiguous Memory Allocator is a set of functions that lets
> one initialise a region of memory from which contiguous memory
> chunks can then be allocated.
>
> CMA allows for the creation of separate contexts. The kernel is
> allowed to allocate movable pages within CMA's managed memory so
> that it can be used for the page cache when CMA devices do not
> use it. On a cm_alloc() request such pages are migrated out of
> the CMA area to free the required contiguous block.

Hi Marek,

I'm generally happy with patches 1 through 7, i.e. the heavy lifting
to make contiguous allocations work. Thank you very much for keeping
up the work and submitting these in good shape.

I do think that we need to discuss the driver-visible API a bit more.
My feeling is that this is rather un-Linux-like and it needs to be
simplified some more. Of course, I don't mind being overruled by the
memory management experts here, or if you can argue that it's really
the right way to do it.

> + * Driver usage
> + *
> + * CMA should not be used directly by device drivers. It should
> + * be considered a helper framework for the dma-mapping subsystem
> + * and respective (platform) bus drivers.
> + *
> + * The CMA client needs to have a pointer to a CMA context
> + * represented by a struct cma (which is an opaque data type).
> + *
> + * Once such a pointer is obtained, a caller may allocate a contiguous
> + * memory chunk using the following function:
> + *
> + * cm_alloc()
> + *
> + * This function returns a pointer to the first struct page which
> + * represents a contiguous memory chunk. This pointer
> + * may be used with the following function:
> + *
> + * cm_free() -- frees allocated contiguous memory

Please explain why you want a new top-level API here. I think it
would be much nicer if a device driver could simply call
alloc_pages(..., GFP_CMA) or similar, where all the complexity
in here gets hidden behind a conditional deep inside of the
page allocator.

Evidently, the two functions you describe here have an extra argument
for the context. Can you elaborate why that is really needed? What
is the specific requirement to have multiple such contexts in one
system and what is the worst-case effect that you would get when
the interface is simplified to have only one for all devices?

Alternatively, would it be possible to provide the cm_alloc/cm_free
functions only as backing to the dma mapping API and not export them
as a generic device driver interface?

> + * Platform/machine integration
> + *
> + * CMA context must be created on platform or machine initialisation
> + * and passed to respective subsystem that will be a client for CMA.
> + * The latter may be done by a global variable or some filed in
> + * struct device. For the former CMA provides the following functions:
> + *
> + * cma_init_migratetype()
> + * cma_reserve()
> + * cma_create()
> + *
> + * The first one initialises a portion of reserved memory so that it
> + * can be used with CMA. The second one first tries to reserve memory
> + * (using memblock) and then initialise it.
> + *
> + * The cma_reserve() function must be called when memblock is still
> + * operational and reserving memory with it is still possible. On
> + * ARM platform the "reserve" machine callback is a perfect place to
> + * call it.
> + *
> + * The last function creates a CMA context on a range of previously
> + * initialised memory addresses. Because it uses kmalloc() it needs
> + * to be called after SLAB is initialised.
> + */

This interface looks flawed to me for multiple reasons:

* It requires you to call three distinct functions in order to do one
thing, and they all take the same arguments (more or less). Why not
have one function call at the latest possible point where you can
still change the memory attributes, and have everything else
happen automatically?

* It requires you to pass the exact location of the area. I can see why
you want that on certain machines that require DMA areas to be spread
across multiple memory buses, but IMHO it's not appropriate for a
generic API.

* It requires you to hardcode the size in a machine specific source file.
This probably seems to be a natural thing to do when you have worked a
lot on the ARM architecture, but it's really not. We really want to
get rid of board files and replace them with generic probing based on
the device tree, and the size of the area is more policy than a property
of the hardware that can be accurately described in the device tree or
a board file.

I'm sure that we can find a solution for all of these problems, it just
takes a few good ideas. Especially for the last point, I don't have a
better suggestion yet, but hopefully someone else can come up with one.

Arnd

2011-06-10 17:16:50

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 02/10] lib: genalloc: Generic allocator improvements

On Fri, 10 Jun 2011 14:52:17 +0200, Alan Cox <[email protected]>
wrote:

>> I plan to replace it with lib/bitmap.c bitmap_* based allocator
>> (similar to how
>> it is used by dma_declare_coherent_memory() and friends in
>> drivers/base/dma-coherent.c). We need something really simple for CMA
>> area
>> management.
>>
>> IMHO allocate_resource and friends are a bit too heavy here, but good to
>> know
>> that such allocator also exists.
>
> Not sure I'd class allocate_resource as heavyweight but providing it's
> using something that already exists rather than inventing yet another
> allocator.

genalloc is already in the kernel and is used in a few places, so we
either let everyone use it as they see fit or we deprecate the library.
If we don't deprecate it I see no reason why CMA should not use it.
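
As an editorial aside, the bitmap scheme Marek refers to (the one used by
dma_declare_coherent_memory() and friends in drivers/base/dma-coherent.c)
can be sketched in plain C. This is a userspace illustration only — the
kernel code uses bitmap_find_free_region() and real bitops, not this
byte-per-page array:

```c
#include <string.h>

#define POOL_PAGES 64

/* One flag per page. A sketch of the bitmap allocation scheme used by
 * drivers/base/dma-coherent.c; the real code packs this into longs and
 * uses bitmap_find_free_region(). */
static unsigned char used[POOL_PAGES];

/* First-fit search for 'count' consecutive free pages.
 * Returns the starting page index, or -1 if no run is free. */
static int bmp_alloc(int count)
{
    for (int start = 0; start + count <= POOL_PAGES; start++) {
        int free_run = 1;
        for (int i = 0; i < count; i++) {
            if (used[start + i]) {
                free_run = 0;
                break;
            }
        }
        if (free_run) {
            memset(used + start, 1, count);
            return start;
        }
    }
    return -1;
}

/* Release a previously allocated run of pages. */
static void bmp_free(int start, int count)
{
    memset(used + start, 0, count);
}
```

The appeal for CMA area management is exactly what is argued above: a few
dozen lines, no extra data structures, and the region metadata is just one
bit (here, one byte) per page.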

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-13 09:06:20

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Friday, June 10, 2011 6:22 PM Arnd Bergmann wrote:

> On Friday 10 June 2011, Marek Szyprowski wrote:
> >The Contiguous Memory Allocator is a set of functions that lets
> >one initialise a region of memory which then can be used to perform
> >allocations of contiguous memory chunks from.
> >
> >CMA allows for creation of separate contexts. Kernel is allowed to
> >allocate movable pages within CMA's managed memory so that it can be
> >used for page cache when CMA devices do not use it. On cm_alloc()
> >request such pages are migrated out of CMA area to free required
> >contiguous block.
>
> Hi Marek,
>
> I'm generally happy with the patches 1 through 7, i.e the heavy lifting
> to make contiguous allocations work. Thank you very much for keeping
> up the work and submitting these in a good shape.
>
> I do think that we need to discuss the driver-visible API a bit more.
> My feeling is that this is rather un-Linux-like and it needs to be
> simplified some more. Of course, I don't mind being overruled by the
> memory management experts here, or if you can argue that it's really
> the right way to do it.

Thanks for your comments!

> > + * Driver usage
> > + *
> > + * CMA should not be used directly by the device drivers. It should
> > + * be considered a helper framework for the dma-mapping subsystem and
> > + * respective (platform)bus drivers.
> > + *
> > + * The CMA client needs to have a pointer to a CMA context
> > + * represented by a struct cma (which is an opaque data type).
> > + *
> > + * Once such pointer is obtained, a caller may allocate contiguous
> > + * memory chunk using the following function:
> > + *
> > + * cm_alloc()
> > + *
> > + * This function returns a pointer to the first struct page which
> > + * represents a contiguous memory chunk. This pointer
> > + * may be used with the following function:
> > + *
> > + * cm_free() -- frees allocated contiguous memory
>
> Please explain why you want a new top-level API here. I think it
> would be much nicer if a device driver could simply call
> alloc_pages(..., GFP_CMA) or similar, where all the complexity
> in here gets hidden behind a conditional deep inside of the
> page allocator.
>
> Evidently, the two functions you describe here have an extra argument
> for the context. Can you elaborate why that is really needed? What
> is the specific requirement to have multiple such contexts in one
> system and what is the worst-case effect that you would get when
> the interface is simplified to have only one for all devices?
>
> Alternatively, would it be possible to provide the cm_alloc/cm_free
> functions only as backing to the dma mapping API and not export them
> as a generic device driver interface?

cm_alloc/free are definitely not meant to be called from device drivers.
They should be only considered as a backend for dma-mapping.

'Raw' contiguous memory block doesn't really make sense for the device
drivers. What the drivers require is a contiguous memory block that is
somehow mapped into respective bus address space, so dma-mapping
framework is the right interface.

alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
The driver will need to map such buffer to dma context anyway, so imho
dma_alloc_attributed() will give the drivers much more flexibility. In
terms of dma-mapping the context argument isn't anything odd.

If possible I would like to make cma something similar to
declare_dma_coherent()&friends, so the board/platform/bus startup code
will just call declare_dma_contiguous() to enable support for cma for
particular devices.
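
To illustrate the shape of the interface Marek is proposing, here is a
hedged userspace sketch modelled on the dma_declare_coherent_memory()
pattern. The function names, the string-keyed registry, and the struct
layout are all hypothetical stand-ins, not the actual patchset API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of per-device CMA registration in the style of
 * dma_declare_coherent_memory(). In the kernel this would hang off
 * struct device; here a name string stands in for the device pointer. */

struct cma {
    const char *name;
    size_t size;
};

struct dev_cma_entry {
    const char *dev_name;   /* stand-in for struct device * */
    struct cma region;
};

#define MAX_DEVS 8
static struct dev_cma_entry registry[MAX_DEVS];
static int nr_entries;

/* Board/platform startup code would call this once per device. */
static int dma_declare_contiguous(const char *dev_name, size_t size)
{
    if (nr_entries >= MAX_DEVS)
        return -1;
    registry[nr_entries].dev_name = dev_name;
    registry[nr_entries].region.name = dev_name;
    registry[nr_entries].region.size = size;
    nr_entries++;
    return 0;
}

/* The dma-mapping backend would look the region up at allocation time. */
static struct cma *dev_get_cma(const char *dev_name)
{
    for (int i = 0; i < nr_entries; i++)
        if (strcmp(registry[i].dev_name, dev_name) == 0)
            return &registry[i].region;
    return NULL;   /* could fall back to a global default area */
}
```

The point of the design is that drivers never see the CMA context at all:
board code registers a region against the device, and cm_alloc()/cm_free()
stay hidden behind the dma-mapping calls.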

> > + * Platform/machine integration
> > + *
> > + * CMA context must be created on platform or machine initialisation
> > + * and passed to respective subsystem that will be a client for CMA.
> > + * The latter may be done by a global variable or some field in
> > + * struct device. For the former CMA provides the following
> functions:
> > + *
> > + * cma_init_migratetype()
> > + * cma_reserve()
> > + * cma_create()
> > + *
> > + * The first one initialises a portion of reserved memory so that it
> > + * can be used with CMA. The second one first tries to reserve memory
> > + * (using memblock) and then initialise it.
> > + *
> > + * The cma_reserve() function must be called when memblock is still
> > + * operational and reserving memory with it is still possible. On
> > + * ARM platform the "reserve" machine callback is a perfect place to
> > + * call it.
> > + *
> > + * The last function creates a CMA context on a range of previously
> > + * initialised memory addresses. Because it uses kmalloc() it needs
> > + * to be called after SLAB is initialised.
> > + */
>
> This interface looks flawed to me for multiple reasons:
>
> * It requires you to call three distinct functions in order to do one
> thing, and they all take the same arguments (more or less). Why not
> have one function call at the latest possible point where you can
> still change the memory attributes, and have everything else
> happen automatically?

The initialization part will definitely be simplified. I must confess that
I was in a hurry to post the patches before the weekend and just forgot to
clean up this part...

> * It requires you to pass the exact location of the area. I can see why
> you want that on certain machines that require DMA areas to be spread
> across multiple memory buses, but IMHO it's not appropriate for a
> generic API.

IMHO we can also use a NULL context to indicate a global, system-wide CMA
area, and again: in terms of the dma-mapping API, having a context isn't
anything uncommon.

> * It requires you to hardcode the size in a machine specific source file.
> This probably seems to be a natural thing to do when you have worked a
> lot on the ARM architecture, but it's really not. We really want to
> get rid of board files and replace them with generic probing based on
> the device tree, and the size of the area is more policy than a property
> of the hardware that can be accurately described in the device tree or
> a board file.

The problem is the fact that right now, we still have board files and we
have to live with them for a while (with all advantages and disadvantages).
I hope that you won't require me to rewrite the whole support for all ARM
platforms to get rid of board files to get CMA merged ;)

I see no problem defining CMA areas in device tree, as this is something
really specific to particular board configuration.

> I'm sure that we can find a solution for all of these problems, it just
> takes a few good ideas. Especially for the last point, I don't have a
> better suggestion yet, but hopefully someone else can come up with one.

I hope we will manage to get agreement :)

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


2011-06-14 13:50:06

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Monday 13 June 2011, Marek Szyprowski wrote:
> cm_alloc/free are definitely not meant to be called from device drivers.
> They should be only considered as a backend for dma-mapping.
>
> 'Raw' contiguous memory block doesn't really make sense for the device
> drivers. What the drivers require is a contiguous memory block that is
> somehow mapped into respective bus address space, so dma-mapping
> framework is the right interface.
>
> alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
> The driver will need to map such buffer to dma context anyway, so imho
> dma_alloc_attributed() will give the drivers much more flexibility. In
> terms of dma-mapping the context argument isn't anything odd.

Ok.

> If possible I would like to make cma something similar to
> declare_dma_coherent()&friends, so the board/platform/bus startup code
> will just call declare_dma_contiguous() to enable support for cma for
> particular devices.

Sounds good, I like that.

> > This interface looks flawed to me for multiple reasons:
> >
> > * It requires you to call three distinct functions in order to do one
> > thing, and they all take the same arguments (more or less). Why not
> > have one function call at the latest possible point where you can
> > still change the memory attributes, and have everything else
> > happen automatically?
>
> The initialization part will definitely be simplified. I must confess that
> I was in a hurry to post the patches before the weekend and just forgot to
> clean up this part...

Fair enough. In cases like this, it's often good to add a TODO section
to the patch description, or a FIXME comment in the patch itself.

> > * It requires you to pass the exact location of the area. I can see why
> > you want that on certain machines that require DMA areas to be spread
> > across multiple memory buses, but IMHO it's not appropriate for a
> > generic API.
>
> IMHO we can also use a NULL context to indicate a global, system-wide CMA
> area, and again: in terms of the dma-mapping API, having a context isn't
> anything uncommon.

Please explain the exact requirements that lead you to defining multiple
contexts. My feeling is that in most cases we don't need them and can
simply live with a single area. Depending on how obscure the cases are
where we do need something beyond that, we can then come up with
special-case solutions for them.

> > * It requires you to hardcode the size in a machine specific source file.
> > This probably seems to be a natural thing to do when you have worked a
> > lot on the ARM architecture, but it's really not. We really want to
> > get rid of board files and replace them with generic probing based on
> > the device tree, and the size of the area is more policy than a property
> > of the hardware that can be accurately described in the device tree or
> > a board file.
>
> The problem is the fact that right now, we still have board files and we
> have to live with them for a while (with all advantages and disadvantages).
> I hope that you won't require me to rewrite the whole support for all ARM
> platforms to get rid of board files to get CMA merged ;)

Of course not. But we need to know what we want a platform with device
tree support to look like when it's using CMA, so we don't make it
harder to change the platforms over than it already is.

> I see no problem defining CMA areas in device tree, as this is something
> really specific to particular board configuration.

The problem with defining CMA areas in the device tree is that it's not
a property of the hardware, but really policy. The device tree file
should not contain anything related to a specific operating system
because you might want to boot something on the board that doesn't
know about CMA, and even when you only care about using Linux, the
implementation might change to the point where hardcoded CMA areas
no longer make sense.

IMHO we should first aim for a solution that works almost everywhere
without the kernel knowing what board it's running on, and then we
can add quirks for devices that have special requirements. I think
the situation is similar to the vmalloc virtual address space, which
normally has a hardcoded size that works almost everywhere, but there
are certain drivers etc that require much more, or there are situations
where you want to make it smaller in order to avoid highmem.

Arnd

2011-06-14 13:55:26

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <[email protected]> wrote:
> Please explain the exact requirements that lead you to defining multiple
> contexts.

Some devices may have access only to some banks of memory. Some devices
may use different banks of memory for different purposes.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-14 15:49:27

by Jordan Crouse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 02/10] lib: genalloc: Generic allocator improvements

On 06/10/2011 06:52 AM, Alan Cox wrote:
>> I plan to replace it with lib/bitmap.c bitmap_* based allocator (similar to
>> how it is used by dma_declare_coherent_memory() and friends in
>> drivers/base/dma-coherent.c). We need something really simple for CMA area
>> management.
>>
>> IMHO allocate_resource and friends are a bit too heavy here, but good to know
>> that such allocator also exists.
>
> Not sure I'd class allocate_resource as heavyweight but providing it's
> using something that already exists rather than inventing yet another
> allocator.
>
> This wants dealing with before it goes upstream though, so the changes in
> lib/*c etc never have to reach mainline and then get changed back.

Even if CMA doesn't end up using genalloc, there are existing consumers of
the API and I think the _aligned function has value.

I agree that allocate_resource isn't overly heavy, but comparatively genalloc
is really simple and lightweight for a driver to manage a contiguous address
space without a lot of extra thought. I think both APIs serve slightly
different masters, but if somebody thought about it long enough there could
be some consolidation (same goes for the internal guts of
dma_declare_coherent_memory).

I agree with Michal - if genalloc goes deprecated, then so be it, but as
long as it lives, I think it's useful to get these functions upstream.
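
To make the value of the _aligned addition concrete: hardware blocks often
require buffers aligned to a power-of-two boundary, which a plain first-fit
allocator cannot guarantee. The following userspace sketch (not the
genalloc API itself) shows aligned first-fit over a page bitmap; 'align' is
in pages and assumed to be a power of two:

```c
#include <string.h>

#define POOL_PAGES 128
static unsigned char busy[POOL_PAGES];

/* Aligned first-fit over a page bitmap: the kind of service a genalloc
 * "_aligned" allocation helper provides. Only candidate start positions
 * that are multiples of 'align' are considered, so the returned run is
 * always suitably aligned. */
static int alloc_aligned(int count, int align)
{
    for (int start = 0; start + count <= POOL_PAGES; start += align) {
        int free_run = 1;
        for (int i = 0; i < count; i++) {
            if (busy[start + i]) {
                free_run = 0;
                break;
            }
        }
        if (free_run) {
            memset(busy + start, 1, count);
            return start;
        }
    }
    return -1;
}
```

A driver managing its own contiguous aperture gets alignment for free this
way, which is exactly the "without a lot of extra thought" convenience
argued for above.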

Jordan

2011-06-14 16:03:10

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tuesday 14 June 2011, Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <[email protected]> wrote:
> > Please explain the exact requirements that lead you to defining multiple
> > contexts.
>
> Some devices may have access only to some banks of memory. Some devices
> may use different banks of memory for different purposes.

For all I know, that is something that is only true for a few very special
Samsung devices, and is completely unrelated of the need for contiguous
allocations, so this approach becomes pointless as soon as the next
generation of that chip grows an IOMMU, where we don't handle the special
bank attributes. Also, the way I understood the situation for the Samsung
SoC during the Budapest discussion, it's only a performance hack, not a
functional requirement, unless you count '1080p playback' as a functional
requirement.

Supporting contiguous allocation is a very useful goal and many people want
this, but supporting a crazy one-off hardware design with lots of generic
infrastructure is going a bit too far. If you can't be more specific than
'some devices may need this', I would suggest going forward without having
multiple regions:

* Remove the registration of specific addresses from the initial patch
set (but keep the patch).
* Add a heuristic plus command-line override to automatically come up
with a reasonable location+size for *one* CMA area in the system.
* Ship the patch to add support for multiple CMA areas with the BSP
for the boards that need it (if any).
* Wait for someone on a non-Samsung SoC to run into the same problem,
then have /them/ get the final patch in.
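
As an illustration of the command-line override suggested above, a boot
option of the form "cma=<size>[@<start>]" could be parsed as follows. The
option syntax here is hypothetical, loosely following the kernel's
memparse() suffix convention (K/M/G); this is a userspace sketch, not code
from the patchset:

```c
#include <stdlib.h>

/* Parse a size with an optional K/M/G suffix, in the spirit of the
 * kernel's memparse(). On return, *end points past the parsed text. */
static unsigned long long parse_size(const char *s, const char **end)
{
    char *e;
    unsigned long long v = strtoull(s, &e, 0);
    switch (*e) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; e++; break;
    }
    if (end)
        *end = e;
    return v;
}

/* Parse a hypothetical "cma=<size>[@<start>]" option value.
 * Returns 0 on success, filling size and (optionally) base. */
static int parse_cma_opt(const char *opt,
                         unsigned long long *size,
                         unsigned long long *base)
{
    const char *end;

    *size = parse_size(opt, &end);
    *base = 0;
    if (*end == '@')
        *base = parse_size(end + 1, &end);
    return *end == '\0' ? 0 : -1;
}
```

A heuristic default plus such an override would let one global CMA area
work on most boards without any board-file involvement, with the explicit
"@<start>" form kept only for machines that genuinely need a fixed
location.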

Even if you think you can convince enough people that having support
for distinct predefined regions is a good idea, I would recommend
splitting that out of the initial merge so we can have that discussion
separately from the other issues.

Arnd

2011-06-14 16:58:43

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

>> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <[email protected]> wrote:
>>> Please explain the exact requirements that lead you to defining
>>> multiple contexts.

> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> Some devices may have access only to some banks of memory. Some devices
>> may use different banks of memory for different purposes.

On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> For all I know, that is something that is only true for a few very
> special Samsung devices,

Maybe. I'm just answering your question. :)

Ah yes, I forgot that separate regions for different purposes could
decrease fragmentation.

> I would suggest going forward without having multiple regions:

Is having support for multiple regions a bad thing? Frankly,
removing this support will change code from reading context passed
as argument to code reading context from global variable. Nothing
is gained; functionality is lost.

> * Remove the registration of specific addresses from the initial patch
> set (but keep the patch).
> * Add a heuristic plus command-line override to automatically come up
> with a reasonable location+size for *one* CMA area in the system.

I'm not arguing those.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-14 17:12:46

by Daniel Stone

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hi,

On Tue, Jun 14, 2011 at 06:03:00PM +0200, Arnd Bergmann wrote:
> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
> > On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <[email protected]> wrote:
> > > Please explain the exact requirements that lead you to defining multiple
> > > contexts.
> >
> > Some devices may have access only to some banks of memory. Some devices
> > may use different banks of memory for different purposes.
>
> For all I know, that is something that is only true for a few very special
> Samsung devices, and is completely unrelated of the need for contiguous
> allocations, so this approach becomes pointless as soon as the next
> generation of that chip grows an IOMMU, where we don't handle the special
> bank attributes. Also, the way I understood the situation for the Samsung
> SoC during the Budapest discussion, it's only a performance hack, not a
> functional requirement, unless you count '1080p playback' as a functional
> requirement.

Hm, I think that was something similar but not quite the same: talking
about having allocations split to lie between two banks of RAM to
maximise the read/write speed for performance reasons. That's something
that can be handled in the allocator, rather than an API constraint, as
this is.

Not that I know of any hardware which is limited as such, but eh.

Cheers,
Daniel

2011-06-14 18:31:04

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> > For all I know, that is something that is only true for a few very
> > special Samsung devices,
>
> Maybe. I'm just answering your question. :)
>
> Ah yes, I forgot that separate regions for different purposes could
> decrease fragmentation.

That is indeed a good point, but having a good allocator algorithm
could also solve this. I don't know too much about these allocation
algorithms, but there are probably multiple working approaches to this.

> > I would suggest going forward without having multiple regions:
>
> Is having support for multiple regions a bad thing? Frankly,
> removing this support will change code from reading context passed
> as argument to code reading context from global variable. Nothing
> is gained; functionality is lost.

What is bad IMHO is making them the default, which forces the board
code to care about memory management details. I would much prefer
to have contiguous allocation parameters tuned automatically to just
work on most boards before we add ways to do board-specific hacks.

Arnd

2011-06-14 18:40:54

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>> Is having support for multiple regions a bad thing? Frankly,
>> removing this support will change code from reading context passed
>> as argument to code reading context from global variable. Nothing
>> is gained; functionality is lost.

On Tue, 14 Jun 2011 20:30:07 +0200, Arnd Bergmann wrote:
> What is bad IMHO is making them the default, which forces the board
> code to care about memory management details. I would much prefer
> to have contiguous allocation parameters tuned automatically to just
> work on most boards before we add ways to do board-specific hacks.

I see those as orthogonal problems. The code can have support for
multiple contexts but by default use a single global context exported
as cma_global variable (or some such).

And I'm not arguing against having “contiguous allocation parameters
tuned automatically to just work on most boards”. I just don't see
the reason to delete functionality that is already there, does not
add much code and can be useful.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-14 18:58:28

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 14 June 2011 12:01, Daniel Stone <[email protected]> wrote:
> Hi,
>
> On Tue, Jun 14, 2011 at 06:03:00PM +0200, Arnd Bergmann wrote:
>> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> > On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <[email protected]> wrote:
>> > > Please explain the exact requirements that lead you to defining multiple
>> > > contexts.
>> >
>> > Some devices may have access only to some banks of memory. Some devices
>> > may use different banks of memory for different purposes.
>>
>> For all I know, that is something that is only true for a few very special
>> Samsung devices, and is completely unrelated of the need for contiguous
>> allocations, so this approach becomes pointless as soon as the next
>> generation of that chip grows an IOMMU, where we don't handle the special
>> bank attributes. Also, the way I understood the situation for the Samsung
>> SoC during the Budapest discussion, it's only a performance hack, not a
>> functional requirement, unless you count '1080p playback' as a functional
>> requirement.

Coming in mid topic...

I've seen this split bank allocation in Qualcomm and TI SoCs, with
Samsung, that makes 3 major SoC vendors (I would be surprised if
Nvidia didn't also need to do this) - so I think some configurable
method to control allocations is necessary. The chips can't do
decode without it (and by can't do I mean 1080P and higher decode is
not functionally useful). Far from special, this would appear to be
the default.

> Hm, I think that was something similar but not quite the same: talking
> about having allocations split to lie between two banks of RAM to
> maximise the read/write speed for performance reasons. That's something
> that can be handled in the allocator, rather than an API constraint, as
> this is.
>
> Not that I know of any hardware which is limited as such, but eh.
>
> Cheers,
> Daniel

2011-06-14 20:42:46

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> I've seen this split bank allocation in Qualcomm and TI SoCs, with
> Samsung, that makes 3 major SoC vendors (I would be surprised if
> Nvidia didn't also need to do this) - so I think some configurable
> method to control allocations is necessary. The chips can't do
> decode without it (and by can't do I mean 1080P and higher decode is
> not functionally useful). Far from special, this would appear to be
> the default.

Thanks for the insight, that's a much better argument than 'something
may need it'. Are those all chips without an IOMMU or do we also
need to solve the IOMMU case with split bank allocation?

I think I'd still prefer to see the support for multiple regions split
out into one of the later patches, especially since that would defer
the question of how to do the initialization for this case and make
sure we first get a generic way.

You've convinced me that we need to solve the problem of allocating
memory from a specific bank eventually, but separating it from the
one at hand (contiguous allocation) should help getting the important
groundwork in at first.

The possible conflict that I still see with per-bank CMA regions are:

* It completely destroys memory power management in cases where that
is based on powering down entire memory banks.

* We still need to solve the same problem in case of IOMMU mappings
at some point, even if today's hardware doesn't have this combination.
It would be good to use the same solution for both.

Arnd

2011-06-14 21:01:31

by Jordan Crouse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 06/14/2011 02:42 PM, Arnd Bergmann wrote:
> On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
>> I've seen this split bank allocation in Qualcomm and TI SoCs, with
>> Samsung, that makes 3 major SoC vendors (I would be surprised if
>> Nvidia didn't also need to do this) - so I think some configurable
>> method to control allocations is necessary. The chips can't do
>> decode without it (and by can't do I mean 1080P and higher decode is
>> not functionally useful). Far from special, this would appear to be
>> the default.
>
> Thanks for the insight, that's a much better argument than 'something
> may need it'. Are those all chips without an IOMMU or do we also
> need to solve the IOMMU case with split bank allocation?

Yes. The IOMMU case with split bank allocation is key, especially for shared
buffers. Consider the case where video is using a certain bank for performance
purposes and that frame is shared with the GPU.

Jordan

2011-06-15 06:01:22

by Subash Patel

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hi Arnd,

On 06/14/2011 09:33 PM, Arnd Bergmann wrote:
> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann<[email protected]> wrote:
>>> Please explain the exact requirements that lead you to defining multiple
>>> contexts.
>>
>> Some devices may have access only to some banks of memory. Some devices
>> may use different banks of memory for different purposes.
>
> For all I know, that is something that is only true for a few very special
> Samsung devices, and is completely unrelated of the need for contiguous
> allocations, so this approach becomes pointless as soon as the next
> generation of that chip grows an IOMMU, where we don't handle the special
> bank attributes. Also, the way I understood the situation for the Samsung
> SoC during the Budapest discussion, it's only a performance hack, not a
> functional requirement, unless you count '1080p playback' as a functional
> requirement.
>
1080p@30fps is indeed a functional requirement, as the IP has the
capability to achieve it. This IP itself can act as more than one AXI
master and control more than one memory port (bank), so this is not a
*performance hack*.

Also, if I recall, during the Budapest discussion (I was on irc), I
recall that this requirement can become the information available to the
actual allocator. Below is the summary point I could collect from summit
notes:

* May also need to specify more attributes (specific physical
memory region)

As per this point, the requirement (as above) must be an attribute passed
to the allocator, which is CMA in this case.

> Supporting contiguous allocation is a very useful goal and many people want
> this, but supporting a crazy one-off hardware design with lots of generic
> infrastructure is going a bit too far. If you can't be more specific than
> 'some devices may need this', I would suggest going forward without having
> multiple regions:
>
> * Remove the registration of specific addresses from the initial patch
> set (but keep the patch).
> * Add a heuristic plus command-line override to automatically come up
> with a reasonable location+size for *one* CMA area in the system.
> * Ship the patch to add support for multiple CMA areas with the BSP
> for the boards that need it (if any).
> * Wait for someone on a non-Samsung SoC to run into the same problem,
> then have /them/ get the final patch in.
>
> Even if you think you can convince enough people that having support
> for distinct predefined regions is a good idea, I would recommend
> splitting that out of the initial merge so we can have that discussion
> separately from the other issues.
>
> Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Regards,
Subash
SISO-SLG

2011-06-15 07:12:15

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Tuesday, June 14, 2011 8:30 PM Arnd Bergmann wrote:

> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> > On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> > > For all I know, that is something that is only true for a few very
> > > special Samsung devices,
> >
> > Maybe. I'm just answering your question. :)
> >
> > Ah yes, I forgot that separate regions for different purposes could
> > decrease fragmentation.
>
> That is indeed a good point, but having a good allocator algorithm
> could also solve this. I don't know too much about these allocation
> algorithms, but there are probably multiple working approaches to this.
>
> > > I would suggest going forward without having multiple regions:
> >
> > Is having support for multiple regions a bad thing? Frankly,
> > removing this support will change code from reading context passed
> > as argument to code reading context from global variable. Nothing
> > is gained; functionality is lost.
>
> What is bad IMHO is making them the default, which forces the board
> code to care about memory management details. I would much prefer
> to have contiguous allocation parameters tuned automatically to just
> work on most boards before we add ways to do board-specific hacks.

I see your concerns, but I really wonder how to determine the properties
of the global/default CMA pool. You definitely don't want to give all
available memory to CMA, because that would have a negative impact on
kernel operation (the kernel really needs to allocate unmovable pages
from time to time).

The only solution I see now is to provide a Kconfig entry to determine
the size of the global CMA pool, but this still has some issues,
especially for multi-board kernels (each board will probably have a
different amount of RAM and different memory-consuming devices
available). It looks like each board's startup code might still need to
tweak the size of the CMA pool. I can add a kernel command line option
for it, but such a solution will not solve all the cases either (AFAIR
there was a discussion about kernel command line parameters for memory
configuration and the conclusion was that they should be avoided).

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


2011-06-15 07:38:41

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> I see your concerns, but I really wonder how to determine the properties
> of the global/default cma pool. You definitely don't want to give all
> available memory o CMA, because it will have negative impact on kernel
> operation (kernel really needs to allocate unmovable pages from time to
> time).

Exactly. This is a hard problem, so I would prefer to see a solution for
coming up with reasonable defaults.

> The only solution I see now is to provide Kconfig entry to determine
> the size of the global CMA pool, but this still have some issues,
> especially for multi-board kernels (each board probably will have
> different amount of RAM and different memory-consuming devices
> available). It looks that each board startup code still might need to
> tweak the size of CMA pool. I can add a kernel command line option for
> it, but such solution also will not solve all the cases (afair there
> was a discussion about kernel command line parameters for memory
> configuration and the conclusion was that it should be avoided).

The command line option can be a last resort if the heuristics fail,
but it's not much better than a fixed Kconfig setting.

How about a Kconfig option that defines the percentage of memory
to set aside for contiguous allocations?

Arnd

2011-06-15 08:03:32

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Tuesday, June 14, 2011 3:49 PM Arnd Bergmann wrote:

> On Monday 13 June 2011, Marek Szyprowski wrote:
> > cm_alloc/free are definitely not meant to be called from device drivers.
> > They should be only considered as a backend for dma-mapping.
> >
> > 'Raw' contiguous memory block doesn't really make sense for the device
> > drivers. What the drivers require is a contiguous memory block that is
> > somehow mapped into respective bus address space, so dma-mapping
> > framework is the right interface.
> >
> > alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
> > The driver will need to map such buffer to dma context anyway, so imho
> > dma_alloc_attributed() will give the drivers much more flexibility. In
> > terms of dma-mapping the context argument isn't anything odd.
>
> Ok.
>
> > If possible I would like to make cma something similar to
> > declare_dma_coherent()&friends, so the board/platform/bus startup code
> > will just call declare_dma_contiguous() to enable support for cma for
> > particular devices.
>
> Sounds good, I like that.

Thanks. I thought a bit more on this and decided that I want to make this
declare_dma_contiguous() optional for the drivers. It should be used only
for some sophisticated cases like for example our video codec with two
memory interfaces for 2 separate banks. By default the dma-mapping will
use system-wide cma pool.

(snipped)

> > > * It requires you to pass the exact location of the area. I can see why
> > > you want that on certain machines that require DMA areas to be spread
> > > across multiple memory buses, but IMHO it's not appropriate for a
> > > generic API.
> >
> > IMHO we can also use some NULL context to indicate some global, system
> > wide CMA area and again -> in terms of dma-mapping api having a context
> > isn't anything uncommon.
>
> Please explain the exact requirements that lead you to defining multiple
> contexts. My feeling is that in most cases we don't need them and can
> simply live with a single area. Depending on how obscure the cases are
> where we do need something beyond that, we can then come up with
> special-case solutions for them.

Like it was already stated we need such feature for our multimedia codec
to allocate buffers from different memory banks. I really don't see any
problems with the possibility to have additional cma areas for special
purposes.

> > > * It requires you to hardcode the size in a machine specific source file.
> > >   This probably seems to be a natural thing to do when you have worked a
> > >   lot on the ARM architecture, but it's really not. We really want to
> > >   get rid of board files and replace them with generic probing based on
> > >   the device tree, and the size of the area is more policy than a property
> > >   of the hardware that can be accurately described in the device tree or
> > >   a board file.
> >
> > The problem is the fact that right now, we still have board files and we
> > have to live with them for a while (with all advantages and disadvantages).
> > I hope that you won't require me to rewrite the whole support for all ARM
> > platforms to get rid of board files to get CMA merged ;)
>
> Of course not. But we need to know what we want a platform with device
> tree support to look like when it's using CMA, so we don't make it
> harder to change the platforms over than it already is.
>
> > I see no problem defining CMA areas in device tree, as this is something
> > really specific to particular board configuration.
>
> The problem with defining CMA areas in the device tree is that it's not
> a property of the hardware, but really policy. The device tree file
> should not contain anything related to a specific operating system
> because you might want to boot something on the board that doesn't
> know about CMA, and even when you only care about using Linux, the
> implementation might change to the point where hardcoded CMA areas
> no longer make sense.

I really doubt that the device tree will carry only system-independent
information. Anyway, the preferred or required memory areas/banks for
buffer allocation are a property of the hardware, not of OS policy.

> IMHO we should first aim for a solution that works almost everywhere
> without the kernel knowing what board it's running on, and then we
> can add quirks for devices that have special requirements. I think
> the situation is similar to the vmalloc virtual address space, which
> normally has a hardcoded size that works almost everywhere, but there
> are certain drivers etc that require much more, or there are situations
> where you want to make it smaller in order to avoid highmem.

I'm trying to create something that will fulfill the requirements of my
hardware, that's why I cannot focus on a generic case only.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


2011-06-15 08:15:35

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Wednesday, June 15, 2011 9:37 AM Arnd Bergmann wrote:

> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> > I see your concerns, but I really wonder how to determine the properties
> > of the global/default cma pool. You definitely don't want to give all
> > available memory o CMA, because it will have negative impact on kernel
> > operation (kernel really needs to allocate unmovable pages from time to
> > time).
>
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.

The problem is to define these reasonable defaults, because they also depend
on the target usage pattern for the board. If one doesn't plan to use the
video codec at all, then the values calculated for full HD movie decoding
are definitely too high.

> > The only solution I see now is to provide Kconfig entry to determine
> > the size of the global CMA pool, but this still have some issues,
> > especially for multi-board kernels (each board probably will have
> > different amount of RAM and different memory-consuming devices
> > available). It looks that each board startup code still might need to
> > tweak the size of CMA pool. I can add a kernel command line option for
> > it, but such solution also will not solve all the cases (afair there
> > was a discussion about kernel command line parameters for memory
> > configuration and the conclusion was that it should be avoided).
>
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
>
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?

There can probably be both types of Kconfig entries: an absolute value
and a percentage of total memory. But still, creating a
fully-functional multi-board kernel will be really hard.

However, there is one more issue here. It is quite common that embedded
systems have memory that is not really contiguous in the address space
(there are holes that split it into 'banks' or regions). So we
might have a system with 256 MiB of memory split into 2 memory banks.
In this case the CMA pool can use only one of them (the pool must be
internally contiguous). I'm not sure whether such systems might require
more memory for contiguous buffers.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

2011-06-15 08:36:59

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:

> On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > Nvidia didn't also need to do this) - so I think some configurable
> > method to control allocations is necessarily. The chips can't do
> > decode without it (and by can't do I mean 1080P and higher decode is
> > not functionally useful). Far from special, this would appear to be
> > the default.
>
> Thanks for the insight, that's a much better argument than 'something
> may need it'. Are those all chips without an IOMMU or do we also
> need to solve the IOMMU case with split bank allocation?
>
> I think I'd still prefer to see the support for multiple regions split
> out into one of the later patches, especially since that would defer
> the question of how to do the initialization for this case and make
> sure we first get a generic way.
>
> You've convinced me that we need to solve the problem of allocating
> memory from a specific bank eventually, but separating it from the
> one at hand (contiguous allocation) should help getting the important
> groundwork in at first.
>
> The possible conflict that I still see with per-bank CMA regions are:
>
> * It completely destroys memory power management in cases where that
> is based on powering down entire memory banks.

I don't think that per-bank CMA regions destroy memory power management
any more than the global CMA pool does. Please note that contiguous buffers
(or DMA buffers in general) are currently unmovable, so they don't fit
well into memory power management.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

2011-06-15 09:26:54

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tue, 14 Jun 2011 22:42:24 +0200, Arnd Bergmann <[email protected]> wrote:
> * We still need to solve the same problem in case of IOMMU mappings
> at some point, even if today's hardware doesn't have this combination.
> It would be good to use the same solution for both.

I don't think I follow. What does the IOMMU have to do with CMA?

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-15 11:16:12

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 15 June 2011, Marek Szyprowski wrote:

> > > If possible I would like to make cma something similar to
> > > declare_dma_coherent()&friends, so the board/platform/bus startup code
> > > will just call declare_dma_contiguous() to enable support for cma for
> > > particular devices.
> >
> > Sounds good, I like that.
>
> Thanks. I thought a bit more on this and decided that I want to make this
> declare_dma_contiguous() optional for the drivers. It should be used only
> for some sophisticated cases like for example our video codec with two
> memory interfaces for 2 separate banks. By default the dma-mapping will
> use system-wide cma pool.
>
> (snipped)

Ok, good.

> > Please explain the exact requirements that lead you to defining multiple
> > contexts. My feeling is that in most cases we don't need them and can
> > simply live with a single area. Depending on how obscure the cases are
> > where we do need something beyond that, we can then come up with
> > special-case solutions for them.
>
> Like it was already stated we need such feature for our multimedia codec
> to allocate buffers from different memory banks. I really don't see any
> problems with the possibility to have additional cma areas for special
> purposes.

It's not a problem for special cases, I just want to make sure that
the common case works well enough that we don't need too many special
cases.

> > The problem with defining CMA areas in the device tree is that it's not
> > a property of the hardware, but really policy. The device tree file
> > should not contain anything related to a specific operating system
> > because you might want to boot something on the board that doesn't
> > know about CMA, and even when you only care about using Linux, the
> > implementation might change to the point where hardcoded CMA areas
> > no longer make sense.
>
> I really doubt that the device tree will carry only system-independent
> information.

So far, this has worked well enough.

> Anyway, the preferred or required memory areas/banks for
> buffer allocation is something that is a property of the hardware not
> the OS policy.

That is true. If there are hard requirements of the hardware, we
can and should definitely document them in the device tree. It's
totally fine to describe the layout of memory banks and affinity
of devices to specific banks where that exists in hardware.

The part that should not be in the device tree is a specific location
of a buffer inside the bank when there is no hardware reason for
the location.

I guess it's also fine to describe requirements per device regarding
how much contiguous memory it will need to operate, if you can provide
that number independent of the application you want to run. The early
boot code can walk the device tree and ensure that the region is
large enough.
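A hypothetical device-tree fragment along these lines might look as follows. The binding and property names here are invented purely for illustration (no such binding existed at the time); the split follows Arnd's distinction between hardware facts (bank affinity) and application-independent capacity requirements:

```dts
video-codec@12340000 {
        compatible = "vendor,video-codec";      /* illustrative */
        reg = <0x12340000 0x10000>;

        /* Hardware fact: this AXI master must allocate from bank 1. */
        memory-region-affinity = <&bank1>;

        /* Application-independent worst-case contiguous need (32 MiB),
         * which early boot code could sum when sizing the CMA region. */
        contiguous-memory-size = <0x2000000>;
};
```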

> > IMHO we should first aim for a solution that works almost everywhere
> > without the kernel knowing what board it's running on, and then we
> > can add quirks for devices that have special requirements. I think
> > the situation is similar to the vmalloc virtual address space, which
> > normally has a hardcoded size that works almost everywhere, but there
> > are certain drivers etc that require much more, or there are situations
> > where you want to make it smaller in order to avoid highmem.
>
> I'm trying to create something that will fulfill the requirements of my
> hardware, that's why I cannot focus on a generic case only.

Yes, that is the obvious conflict. My main interest is to have something
that works not just for your hardware but as well as possible on
the widest possible range of hardware, even if that means it may not run
quite as well in some of the cases.

Arnd

2011-06-15 11:20:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 15 June 2011, Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 22:42:24 +0200, Arnd Bergmann <[email protected]> wrote:
> > * We still need to solve the same problem in case of IOMMU mappings
> > at some point, even if today's hardware doesn't have this combination.
> > It would be good to use the same solution for both.
>
> I don't think I follow. What does IOMMU has to do with CMA?

The point is that on the higher level device drivers, we want to
hide the presence of CMA and/or IOMMU behind the dma mapping API,
but the device drivers do need to know about the bank properties.

If we want to solve the problem of allocating per-bank memory inside
of CMA, we also need to solve it inside of the IOMMU code, using
the same device driver interface.

Arnd

2011-06-15 11:28:12

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tuesday 14 June 2011, Jordan Crouse wrote:
>
> On 06/14/2011 02:42 PM, Arnd Bergmann wrote:
> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> >> I've seen this split bank allocation in Qualcomm and TI SoCs, with
> >> Samsung, that makes 3 major SoC vendors (I would be surprised if
> >> Nvidia didn't also need to do this) - so I think some configurable
> >> method to control allocations is necessarily. The chips can't do
> >> decode without it (and by can't do I mean 1080P and higher decode is
> >> not functionally useful). Far from special, this would appear to be
> >> the default.
> >
> > Thanks for the insight, that's a much better argument than 'something
> > may need it'. Are those all chips without an IOMMU or do we also
> > need to solve the IOMMU case with split bank allocation?
>
> Yes. The IOMMU case with split bank allocation is key, especially for shared
> buffers. Consider the case where video is using a certain bank for performance
> purposes and that frame is shared with the GPU.

Could we use the non-uniform memory access (NUMA) code for this? That code
does more than what we've been talking about, and we're currently thinking
only of a degenerate case (one CPU node with multiple memory nodes), but my
feeling is that we can still build on top of it.

The NUMA code can describe relations between different areas of memory
and how they interact with devices and processes, so you can attach a
device to a specific node and have all allocations done from there.
You can also set policy in user space, e.g. to have a video decoder
process running on the bank that is not used by the GPU.

In the DMA mapping API, that would mean we add another dma_attr to
dma_alloc_* that lets you pass a node identifier.

Arnd

2011-06-15 11:30:56

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wed, 15 Jun 2011 13:20:42 +0200, Arnd Bergmann <[email protected]> wrote:
> The point is that on the higher level device drivers, we want to
> hide the presence of CMA and/or IOMMU behind the dma mapping API,
> but the device drivers do need to know about the bank properties.

Gotcha, thanks.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-15 11:53:26

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann <[email protected]> wrote:
> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>> Ah yes, I forgot that separate regions for different purposes could
>> decrease fragmentation.
>
> That is indeed a good point, but having a good allocator algorithm
> could also solve this. I don't know too much about these allocation
> algorithms, but there are probably multiple working approaches to this.

imo no allocator algorithm is gonna help if you have comparably large,
variable-sized contiguous allocations out of a restricted address range.
It might work well enough if there are only a few sizes and/or there's
decent headroom. But for really generic workloads this would require
sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
with ttm).

So if this is only a requirement on very few platforms and can be
cheaply fixed with multiple cma allocation areas (heck, we have
slabs for the same reasons in the kernel), it might be a sensible
compromise.
-Daniel
--
Daniel Vetter
[email protected] - +41 (0) 79 365 57 48 - http://blog.ffwll.ch

2011-06-15 13:13:24

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 06/15/2011 01:53 PM, Daniel Vetter wrote:
> On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann<[email protected]> wrote:
>
>> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>>
>>> Ah yes, I forgot that separate regions for different purposes could
>>> decrease fragmentation.
>>>
>> That is indeed a good point, but having a good allocator algorithm
>> could also solve this. I don't know too much about these allocation
>> algorithms, but there are probably multiple working approaches to this.
>>
> imo no allocator algorithm is gonna help if you have comparably large,
> variable-sized contiguous allocations out of a restricted address range.
> It might work well enough if there are only a few sizes and/or there's
> decent headroom. But for really generic workloads this would require
> sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
> with ttm).
>

Indeed, IIRC on the meeting I pointed out that there is no way to
generically solve the fragmentation problem without movable buffers.
(I'd do it as a simple CMA backend to TTM). This is exactly the same
problem as trying to fit buffers in a limited VRAM area.

/Thomas

2011-06-15 21:40:05

by Larry Bassel

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 15 Jun 11 10:36, Marek Szyprowski wrote:
> Hello,
>
> On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
>
> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > Nvidia didn't also need to do this) - so I think some configurable
> > > method to control allocations is necessarily. The chips can't do
> > > decode without it (and by can't do I mean 1080P and higher decode is
> > > not functionally useful). Far from special, this would appear to be
> > > the default.

We at Qualcomm have some platforms with memory of different
performance characteristics; some drivers will need a way of
specifying that they need fast memory for an allocation (and would prefer
an error if it is not available rather than a fallback to slower
memory). It would also be bad if allocations that don't need fast
memory got it "accidentally", depriving those that really do.

> >
> > Thanks for the insight, that's a much better argument than 'something
> > may need it'. Are those all chips without an IOMMU or do we also
> > need to solve the IOMMU case with split bank allocation?
> >
> > I think I'd still prefer to see the support for multiple regions split
> > out into one of the later patches, especially since that would defer
> > the question of how to do the initialization for this case and make
> > sure we first get a generic way.
> >
> > You've convinced me that we need to solve the problem of allocating
> > memory from a specific bank eventually, but separating it from the
> > one at hand (contiguous allocation) should help getting the important
> > groundwork in at first.
> >
> > The possible conflict that I still see with per-bank CMA regions are:
> >
> > * It completely destroys memory power management in cases where that
> > is based on powering down entire memory banks.
>
> I don't think that per-bank CMA regions destroys memory power management
> more than the global CMA pool. Please note that the contiguous buffers
> (or in general dma-buffers) right now are unmovable so they don't fit
> well into memory power management.

We also have platforms where a well-defined part of the memory
can be powered off, and other parts can't (or won't). We need a way
to steer the place allocations come from to the memory that won't be
turned off (so that CMA allocations are not an obstacle to memory
hotremove).

>
> Best regards
> --
> Marek Szyprowski
> Samsung Poland R&D Center
>
>
>
> _______________________________________________
> Linaro-mm-sig mailing list
> [email protected]
> http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Larry Bassel

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2011-06-15 22:07:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 15 June 2011 23:39:58 Larry Bassel wrote:
> On 15 Jun 11 10:36, Marek Szyprowski wrote:
> > On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> >
> > > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > > Nvidia didn't also need to do this) - so I think some configurable
> > > > method to control allocations is necessarily. The chips can't do
> > > > decode without it (and by can't do I mean 1080P and higher decode is
> > > > not functionally useful). Far from special, this would appear to be
> > > > the default.
>
> We at Qualcomm have some platforms that have memory of different
> performance characteristics, some drivers will need a way of
> specifying that they need fast memory for an allocation (and would prefer
> an error if it is not available rather than a fallback to slower
> memory). It would also be bad if allocators who don't need fast
> memory got it "accidentally", depriving those who really need it.

Can you describe how the memory areas differ specifically?
Is there one that is always faster but very small, or are there
just specific circumstances under which some memory is faster than
another?

> > > The possible conflict that I still see with per-bank CMA regions are:
> > >
> > > * It completely destroys memory power management in cases where that
> > > is based on powering down entire memory banks.
> >
> > I don't think that per-bank CMA regions destroys memory power management
> > more than the global CMA pool. Please note that the contiguous buffers
> > (or in general dma-buffers) right now are unmovable so they don't fit
> > well into memory power management.
>
> We also have platforms where a well-defined part of the memory
> can be powered off, and other parts can't (or won't). We need a way
> to steer the place allocations come from to the memory that won't be
> turned off (so that CMA allocations are not an obstacle to memory
> hotremove).

We already established that we have to know something about the banks,
and your additional input makes it even clearer that we need to consider
the bigger picture here: We need to describe parts of memory separately
regarding general performance, device specific allocations and hotplug
characteristics.

It still sounds to me that this can be done using the NUMA properties
that Linux already understands, and teaching more subsystems about it,
but maybe the memory hotplug developers have already come up with
another scheme. The way that memory hotplug and CMA choose their
memory regions certainly needs to take both into account. As far as
I can see there are both conflicting and synergistic effects when
you combine the two.

Arnd

2011-06-16 00:54:59

by Philip Balister

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 06/15/2011 12:37 AM, Arnd Bergmann wrote:
> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
>> I see your concerns, but I really wonder how to determine the properties
>> of the global/default cma pool. You definitely don't want to give all
>> available memory o CMA, because it will have negative impact on kernel
>> operation (kernel really needs to allocate unmovable pages from time to
>> time).
>
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.

Is this a situation where passing the information from device tree might
help? I know this does not help short term, but I am trying to
understand the sorts of problems device tree can help solve.

Philip

>
>> The only solution I see now is to provide Kconfig entry to determine
>> the size of the global CMA pool, but this still have some issues,
>> especially for multi-board kernels (each board probably will have
>> different amount of RAM and different memory-consuming devices
>> available). It looks that each board startup code still might need to
>> tweak the size of CMA pool. I can add a kernel command line option for
>> it, but such solution also will not solve all the cases (afair there
>> was a discussion about kernel command line parameters for memory
>> configuration and the conclusion was that it should be avoided).
>
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
>
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?
>
> Arnd
>

2011-06-16 03:20:25

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 15 June 2011 16:39, Larry Bassel <[email protected]> wrote:
> On 15 Jun 11 10:36, Marek Szyprowski wrote:
>> Hello,
>>
>> On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
>>
>> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
>> > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
>> > > Samsung, that makes 3 major SoC vendors (I would be surprised if
>> > > Nvidia didn't also need to do this) - so I think some configurable
>> > > method to control allocations is necessary. The chips can't do
>> > > decode without it (and by can't do I mean 1080P and higher decode is
>> > > not functionally useful). Far from special, this would appear to be
>> > > the default.
>
> We at Qualcomm have some platforms that have memory of different
> performance characteristics, some drivers will need a way of
> specifying that they need fast memory for an allocation (and would prefer
> an error if it is not available rather than a fallback to slower
> memory). It would also be bad if allocators who don't need fast
> memory got it "accidentally", depriving those who really need it.

I think this statement actually applies to all the SoCs that are
coming out now and in the future from TI, Samsung, Nvidia, Freescale,
ST Ericsson and others. It seems that in all cases users will want to:

1. Allocate memory with a per-SoC physical memory mapping policy that
is usually manually specified, i.e. use this physical memory bank set
for this allocation or nothing.
2. Be able to easily pass a token to this memory between various
userspace processes and the kernel.
3. Be able to easily and explicitly access attributes of an allocation
from all contexts.
4. Be able to save and reload this memory without giving up the
virtual address allocation.

In essence they want an architecture-independent map object that can
bounce around the system with a unique handle.
>> >
>> > Thanks for the insight, that's a much better argument than 'something
>> > may need it'. Are those all chips without an IOMMU or do we also
>> > need to solve the IOMMU case with split bank allocation?
>> >
>> > I think I'd still prefer to see the support for multiple regions split
>> > out into one of the later patches, especially since that would defer
>> > the question of how to do the initialization for this case and make
>> > sure we first get a generic way.
>> >
>> > You've convinced me that we need to solve the problem of allocating
>> > memory from a specific bank eventually, but separating it from the
>> > one at hand (contiguous allocation) should help getting the important
>> > groundwork in at first.
>> >
>> > The possible conflict that I still see with per-bank CMA regions are:
>> >
>> > * It completely destroys memory power management in cases where that
>> >   is based on powering down entire memory banks.
>>
>> I don't think that per-bank CMA regions destroy memory power management
>> any more than the global CMA pool does. Please note that the contiguous
>> buffers (or dma-buffers in general) are currently unmovable, so they
>> don't fit well into memory power management.
>
> We also have platforms where a well-defined part of the memory
> can be powered off, and other parts can't (or won't). We need a way
> to steer the place allocations come from to the memory that won't be
> turned off (so that CMA allocations are not an obstacle to memory
> hotremove).
>
>>
>> Best regards
>> --
>> Marek Szyprowski
>> Samsung Poland R&D Center
>>
>
> Larry Bassel
>
> --
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>

2011-06-16 07:03:43

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Thursday 16 June 2011 02:48:12 Philip Balister wrote:
> On 06/15/2011 12:37 AM, Arnd Bergmann wrote:
> > On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> >> I see your concerns, but I really wonder how to determine the properties
> >> of the global/default cma pool. You definitely don't want to give all
> >> available memory to CMA, because it will have a negative impact on kernel
> >> operation (the kernel really needs to allocate unmovable pages from time
> >> to time).
> >
> > Exactly. This is a hard problem, so I would prefer to see a solution for
> > coming up with reasonable defaults.
>
> Is this a situation where passing the information from device tree might
> help? I know this does not help short term, but I am trying to
> understand the sorts of problems device tree can help solve.

The device tree is a good place to describe any hardware properties such
as 'this device will need 32 MB of contiguous allocations on the memory
bank described in that other device node'.

It is however not a good place to describe user settings such as 'I want
to give this device a 200 MB pool for large allocations so I can run
application X efficiently', because that would require knowledge in the
boot loader about local policy, which it should generally not care about.

Arnd

2011-06-16 17:01:41

by Larry Bassel

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On 16 Jun 11 00:06, Arnd Bergmann wrote:
> On Wednesday 15 June 2011 23:39:58 Larry Bassel wrote:
> > On 15 Jun 11 10:36, Marek Szyprowski wrote:
> > > On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> > >
> > > > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > > > Nvidia didn't also need to do this) - so I think some configurable
> > > > > method to control allocations is necessary. The chips can't do
> > > > > decode without it (and by can't do I mean 1080P and higher decode is
> > > > > not functionally useful). Far from special, this would appear to be
> > > > > the default.
> >
> > We at Qualcomm have some platforms that have memory of different
> > performance characteristics, some drivers will need a way of
> > specifying that they need fast memory for an allocation (and would prefer
> > an error if it is not available rather than a fallback to slower
> > memory). It would also be bad if allocators who don't need fast
> > memory got it "accidentally", depriving those who really need it.
>
> Can you describe how the memory areas differ specifically?
> Is there one that is always faster but very small, or are there
> just specific circumstances under which some memory is faster than
> another?

One is always faster, but very small (generally 2-10% the size
of "normal" memory).

Larry


2011-06-17 12:47:14

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Thursday 16 June 2011 19:01:33 Larry Bassel wrote:
> > Can you describe how the memory areas differ specifically?
> > Is there one that is always faster but very small, or are there
> > just specific circumstances under which some memory is faster than
> > another?
>
> One is always faster, but very small (generally 2-10% the size
> of "normal" memory).
>

Ok, that sounds like the "SRAM" regions that we are handling on some
ARM platforms using the various interfaces. It should probably
remain outside of the regular allocator, but we can try to generalize
the SRAM support further. There are many possible uses for it.

Arnd

2011-06-17 16:09:59

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 15 June 2011, Daniel Vetter wrote:
> On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann <[email protected]> wrote:
> > On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> >> Ah yes, I forgot that separate regions for different purposes could
> >> decrease fragmentation.
> >
> > That is indeed a good point, but having a good allocator algorithm
> > could also solve this. I don't know too much about these allocation
> > algorithms, but there are probably multiple working approaches to this.
>
> imo no allocator algorithm is gonna help if you have comparably large,
> variable-sized contiguous allocations out of a restricted address range.
> It might work well enough if there are only a few sizes and/or there's
> decent headroom. But for really generic workloads this would require
> sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
> with ttm).

The requirements are quite different depending on what system you
look at. In a lot of cases, the constraints are not that tight at all,
and CMA will easily help to turn "works sometimes" into "works almost
always". Let's get there first and then look into the harder problems.

Unfortunately, memory allocation gets nondeterministic in the corner
cases: you can simply get the system into a state where you don't
have enough memory when you try to do too many things at once. This
may sound like a platitude but it's really what is behind all this:

If we had unlimited amounts of RAM, we would never need CMA, we could
simply set aside a lot of memory at boot time. Having one CMA area
with movable page eviction lets you build systems capable of doing
the same thing with less RAM than without CMA. Adding more complexity
lets you reduce that amount further.

The other aspects that have been mentioned about bank affinity and
SRAM are pretty orthogonal to the allocation, so we should also
treat them separately.

> So if this is only a requirement on very few platforms and can be
> cheaply fixed with multiple cma allocation areas (heck, we have
> slabs for the same reasons in the kernel), it might be a sensible
> compromise.

Yes, we can probably add it later when we find out what the limits
of the generic approach are. I don't really mind having the per-device
pointers to CMA areas, we just need to come up with a good way to
initialize them.

Arnd

2011-06-22 07:05:14

by Hans Verkuil

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday, June 15, 2011 09:37:18 Arnd Bergmann wrote:
> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> > I see your concerns, but I really wonder how to determine the properties
> > of the global/default cma pool. You definitely don't want to give all
> > available memory to CMA, because it will have a negative impact on kernel
> > operation (the kernel really needs to allocate unmovable pages from time
> > to time).
>
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.
>
> > The only solution I see now is to provide a Kconfig entry to determine
> > the size of the global CMA pool, but this still has some issues,
> > especially for multi-board kernels (each board will probably have a
> > different amount of RAM and different memory-consuming devices
> > available). It looks like each board's startup code might still need to
> > tweak the size of the CMA pool. I can add a kernel command line option
> > for it, but such a solution will also not solve all the cases (AFAIR
> > there was a discussion about kernel command line parameters for memory
> > configuration and the conclusion was that they should be avoided).
>
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
>
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?

I would actually like to see a cma_size kernel option of some sort. This would
be for the global CMA pool only as I don't think we should try to do anything
more complicated here.

While it is relatively easy for embedded systems to do a recompile every time
you need to change the pool size, this isn't an option on 'normal' desktop
systems.

While such systems usually have more than enough memory and don't need
CMA, there are a number of cases where you do want to reserve sufficient
memory. Usually these involve lots of video capture cards in one system.

What I was wondering about is how this patch series changes the allocation
in case it can't allocate from the CMA pool. Will it attempt to fall back
to a 'normal' allocation?

The reason I ask is that for desktop systems you could just start with a CMA
pool of size 0. And only in specific situations would you need to add a
cma_size kernel parameter depending on your needs. But this scheme would
require a fallback scenario in case of a global CMA pool of size 0.

Hmm, perhaps this fallback scenario is more driver specific. For SoC platform
video devices you may not want a fallback, whereas for PCI(e)/USB devices you
do. I don't know what's best, frankly.

Regards,

Hans

2011-06-22 07:32:22

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wed, 22 Jun 2011 09:03:30 +0200, Hans Verkuil <[email protected]>
wrote:
> What I was wondering about is how this patch series changes the
> allocation in case it can't allocate from the CMA pool. Will it
> attempt to fall back to a 'normal' allocation?

Unless Marek changed something since I wrote the code, which I doubt,
if CMA cannot obtain memory from the CMA region, it will fail.

Part of the reason is that CMA lacks the knowledge of where to allocate
memory from. For instance, in the case of several memory banks,
it does not know which memory bank to allocate from.

It is, in my opinion, a task for higher-level functions (read: the
DMA layer) to try another mechanism if CMA fails.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-22 12:43:16

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 22 June 2011, Hans Verkuil wrote:
> > How about a Kconfig option that defines the percentage of memory
> > to set aside for contiguous allocations?
>
> I would actually like to see a cma_size kernel option of some sort. This would
> be for the global CMA pool only as I don't think we should try to do anything
> more complicated here.

A command line option is probably good for overriding the compile-time default, yes.

We could also go further and add a runtime sysctl mechanism like the one
for hugepages, where you can grow the pool at run time as long as there is
enough free contiguous memory (e.g. from init scripts), or shrink it later
if you want to allow larger nonmovable allocations.

My feeling is that we need to find a way to integrate the global settings
for four kinds of allocations:

* nonmovable kernel pages
* hugetlb pages
* CMA
* memory hotplug

These essentially fight over the same memory (though things are slightly
different with dynamic hugepages), and they all face the same basic problem
of getting as much for themselves without starving the other three.

Arnd

2011-06-22 13:15:47

by Marek Szyprowski

[permalink] [raw]
Subject: RE: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

Hello,

On Wednesday, June 22, 2011 2:42 PM Arnd Bergmann wrote:

> On Wednesday 22 June 2011, Hans Verkuil wrote:
> > > How about a Kconfig option that defines the percentage of memory
> > > to set aside for contiguous allocations?
> >
> > I would actually like to see a cma_size kernel option of some sort. This
> would
> > be for the global CMA pool only as I don't think we should try to do
> anything
> > more complicated here.
>
> A command line is probably good to override the compile-time default, yes.
>
> We could also go further and add a runtime sysctl mechanism like the one
> for hugepages, where you can grow the pool at run time as long as there is
> enough free contiguous memory (e.g. from init scripts), or shrink it later
> if you want to allow larger nonmovable allocations.

Sounds really good, but it might be really hard to implement, at least for
CMA, because it needs to tweak parameters of memory management internal
structures very early, when the buddy allocator has not been activated yet.

> My feeling is that we need to find a way to integrate the global settings
> for four kinds of allocations:
>
> * nonmovable kernel pages
> * hugetlb pages
> * CMA
> * memory hotplug
>
> These essentially fight over the same memory (though things are slightly
> different with dynamic hugepages), and they all face the same basic problem
> of getting as much for themselves without starving the other three.

I'm not sure we can solve all such issues in the first version. Maybe we should
first have each of the above fully working in mainline separately and then
start the integration works.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


2011-06-22 13:39:33

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wednesday 22 June 2011, Marek Szyprowski wrote:
> Sounds really good, but it might be really hard to implement, at least for
> CMA, because it needs to tweak parameters of memory management internal
> structures very early, when the buddy allocator has not been activated yet.

Why that? I would expect you can do the same that hugepages (used to) do
and just attempt high-order allocations. If they succeed, you can add them
as a CMA region and free them again, into the movable set of pages, otherwise
you just fail the request from user space when the memory is already
fragmented.

> > These essentially fight over the same memory (though things are slightly
> > different with dynamic hugepages), and they all face the same basic problem
> > of getting as much for themselves without starving the other three.
>
> I'm not sure we can solve all such issues in the first version. Maybe we should
> first have each of the above fully working in mainline separately and then
> start the integration works.

Yes, makes sense. We just need to be careful not to introduce user-visible
interfaces that we cannot change any more in the process.

Arnd

2011-06-22 15:54:56

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

> On Wednesday, June 22, 2011 2:42 PM Arnd Bergmann wrote:
>> We could also go further and add a runtime sysctl mechanism like the
>> one for hugepages, where you can grow the pool at run time as long
>> as there is enough free contiguous memory (e.g. from init scripts),
>> or shrink it later if you want to allow larger nonmovable allocations.

On Wed, 22 Jun 2011 15:15:35 +0200, Marek Szyprowski wrote:
> Sounds really good, but it might be really hard to implement, at
> least for CMA, because it needs to tweak parameters of memory
> management internal structures very early, when buddy allocator
> has not been activated yet.

If you are able to allocate a pageblock of free memory from the buddy
system, you should be able to convert it to CMA memory with no problems.

Also, if you want to convert CMA memory back to regular memory you
should be able to do that even if some of the memory is used by CMA
(it just won't be available right away but only when CMA frees it).

It is important to note that, because of the use of migration types,
all such conversions have to be performed on a pageblock basis.

I don't think this is a feature we should consider for the first patch
though. We started with an overgrown idea about what CMA might do,
and it didn't get us far. Let's first get the basics right and
then start implementing features as they become needed.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--

2011-06-22 16:04:48

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 08/10] mm: cma: Contiguous Memory Allocator added

On Wed, 22 Jun 2011 15:39:23 +0200, Arnd Bergmann <[email protected]> wrote:
> Why that? I would expect you can do the same that hugepages (used to) do
> and just attempt high-order allocations. If they succeed, you can add
> them as a CMA region and free them again, into the movable set of pages,
> otherwise you just fail the request from user space when the memory is
> already fragmented.

Problem with that is that CMA needs to have whole pageblocks allocated
and buddy can allocate at most half a pageblock.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--