2010-11-19 08:16:15

by Kamezawa Hiroyuki

Subject: [PATCH 0/4] big chunk memory allocator v4

Hi, this is an updated version.

No major changes from the last one except for the page allocation function.
The RFC tag has been removed.

Order of patches is

[1/4] move some functions from memory_hotplug.c to page_isolation.c
[2/4] search physically contiguous range suitable for big chunk alloc.
[3/4] allocate big chunk memory based on memory hotplug(migration) technique
[4/4] modify page allocation function.

For what:

I hear there are requirements to allocate chunks of pages larger than
MAX_ORDER. Today, some (embedded) devices use a big memory chunk: they hide
a memory range with a boot option (mem=) and use the hidden memory for their
own purposes. But this looks like a missing feature in memory management.

This patch adds
alloc_contig_pages(start, end, nr_pages, gfp_mask)
to allocate a chunk of pages whose length is nr_pages from the [start, end)
physical address range. This uses logic similar to memory unplug, which tries
to offline [start, end) pages. With this, drivers can allocate 30M, 128M, or
much bigger memory chunks on demand. (I allocated a 1G chunk in my test.)

But yes, because of fragmentation, this cannot guarantee a 100% successful
allocation. If alloc_contig_pages() is called at system boot or a movable
zone is used, the allocation succeeds at a high rate.
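
A rough usage sketch, just for illustration (hypothetical driver-side code,
not part of the patches; alloc_contig_pages_host() and free_contig_pages()
are the helpers added in patch 3/4, the buffer name and size are made up):

#include <linux/page-isolation.h>  /* alloc_contig_pages_host(), free_contig_pages() */

#define BIG_BUF_PAGES   ((128UL << 20) >> PAGE_SHIFT)   /* 128MB worth of pages */
static struct page *big_buf;

static int example_get_buffer(void)
{
        /* search all of RAM; align_order 0 is raised to MAX_ORDER internally */
        big_buf = alloc_contig_pages_host(BIG_BUF_PAGES, 0);
        if (!big_buf)
                return -ENOMEM;         /* fragmentation can still defeat us */
        return 0;
}

static void example_put_buffer(void)
{
        free_contig_pages(big_buf, BIG_BUF_PAGES);
}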

I tested this on x86-64, and it seems to work as expected. Feedback from
embedded folks is appreciated because I think they are the main users of
this function.

Thanks,
-Kame





2010-11-19 08:18:21

by Kamezawa Hiroyuki

Subject: [PATCH 1/4] alloc_contig_pages() move some functions to page_isolation.c

From: KAMEZAWA Hiroyuki <[email protected]>

Memory hotplug contains logic for making pages unused in a specified range
of pfns, and some of that core logic can be reused for other purposes, such
as allocating a very large contiguous memory block.

This patch moves some functions from mm/memory_hotplug.c to
mm/page_isolation.c. This helps in adding a large-allocation function to
page_isolation.c that uses the memory-unplug technique.

Changelog: 2010/10/26
- adjusted to mmotm-1024 + Bob's 3 clean ups.
Changelog: 2010/10/21
- adjusted to mmotm-1020

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page-isolation.h | 7 ++
mm/memory_hotplug.c | 108 ---------------------------------------
mm/page_isolation.c | 111 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 118 insertions(+), 108 deletions(-)

Index: mmotm-1117/include/linux/page-isolation.h
===================================================================
--- mmotm-1117.orig/include/linux/page-isolation.h
+++ mmotm-1117/include/linux/page-isolation.h
@@ -33,5 +33,12 @@ test_pages_isolated(unsigned long start_
extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page);

+/*
+ * For migration.
+ */
+
+int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn);
+unsigned long scan_lru_pages(unsigned long start, unsigned long end);
+int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);

#endif
Index: mmotm-1117/mm/memory_hotplug.c
===================================================================
--- mmotm-1117.orig/mm/memory_hotplug.c
+++ mmotm-1117/mm/memory_hotplug.c
@@ -615,114 +615,6 @@ int is_mem_section_removable(unsigned lo
}

/*
- * Confirm all pages in a range [start, end) is belongs to the same zone.
- */
-static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
- struct zone *zone = NULL;
- struct page *page;
- int i;
- for (pfn = start_pfn;
- pfn < end_pfn;
- pfn += MAX_ORDER_NR_PAGES) {
- i = 0;
- /* This is just a CONFIG_HOLES_IN_ZONE check.*/
- while ((i < MAX_ORDER_NR_PAGES) && !pfn_valid_within(pfn + i))
- i++;
- if (i == MAX_ORDER_NR_PAGES)
- continue;
- page = pfn_to_page(pfn + i);
- if (zone && page_zone(page) != zone)
- return 0;
- zone = page_zone(page);
- }
- return 1;
-}
-
-/*
- * Scanning pfn is much easier than scanning lru list.
- * Scan pfn from start to end and Find LRU page.
- */
-static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
-{
- unsigned long pfn;
- struct page *page;
- for (pfn = start; pfn < end; pfn++) {
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageLRU(page))
- return pfn;
- }
- }
- return 0;
-}
-
-static struct page *
-hotremove_migrate_alloc(struct page *page, unsigned long private, int **x)
-{
- /* This should be improooooved!! */
- return alloc_page(GFP_HIGHUSER_MOVABLE);
-}
-
-#define NR_OFFLINE_AT_ONCE_PAGES (256)
-static int
-do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
- struct page *page;
- int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
- int not_managed = 0;
- int ret = 0;
- LIST_HEAD(source);
-
- for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- if (!page_count(page))
- continue;
- /*
- * We can skip free pages. And we can only deal with pages on
- * LRU.
- */
- ret = isolate_lru_page(page);
- if (!ret) { /* Success */
- list_add_tail(&page->lru, &source);
- move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
-
- } else {
-#ifdef CONFIG_DEBUG_VM
- printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
- pfn);
- dump_page(page);
-#endif
- /* Becasue we don't have big zone->lock. we should
- check this again here. */
- if (page_count(page)) {
- not_managed++;
- ret = -EBUSY;
- break;
- }
- }
- }
- if (!list_empty(&source)) {
- if (not_managed) {
- putback_lru_pages(&source);
- goto out;
- }
- /* this function returns # of failed pages */
- ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 1);
- if (ret)
- putback_lru_pages(&source);
- }
-out:
- return ret;
-}
-
-/*
* remove from free_area[] and mark all as Reserved.
*/
static int
Index: mmotm-1117/mm/page_isolation.c
===================================================================
--- mmotm-1117.orig/mm/page_isolation.c
+++ mmotm-1117/mm/page_isolation.c
@@ -5,6 +5,9 @@
#include <linux/mm.h>
#include <linux/page-isolation.h>
#include <linux/pageblock-flags.h>
+#include <linux/memcontrol.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
#include "internal.h"

static inline struct page *
@@ -139,3 +142,111 @@ int test_pages_isolated(unsigned long st
spin_unlock_irqrestore(&zone->lock, flags);
return ret ? 0 : -EBUSY;
}
+
+
+/*
+ * Confirm all pages in a range [start, end) is belongs to the same zone.
+ */
+int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct zone *zone = NULL;
+ struct page *page;
+ int i;
+ for (pfn = start_pfn;
+ pfn < end_pfn;
+ pfn += MAX_ORDER_NR_PAGES) {
+ i = 0;
+ /* This is just a CONFIG_HOLES_IN_ZONE check.*/
+ while ((i < MAX_ORDER_NR_PAGES) && !pfn_valid_within(pfn + i))
+ i++;
+ if (i == MAX_ORDER_NR_PAGES)
+ continue;
+ page = pfn_to_page(pfn + i);
+ if (zone && page_zone(page) != zone)
+ return 0;
+ zone = page_zone(page);
+ }
+ return 1;
+}
+
+/*
+ * Scanning pfn is much easier than scanning lru list.
+ * Scan pfn from start to end and Find LRU page.
+ */
+unsigned long scan_lru_pages(unsigned long start, unsigned long end)
+{
+ unsigned long pfn;
+ struct page *page;
+ for (pfn = start; pfn < end; pfn++) {
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageLRU(page))
+ return pfn;
+ }
+ }
+ return 0;
+}
+
+struct page *
+hotremove_migrate_alloc(struct page *page, unsigned long private, int **x)
+{
+ /* This should be improooooved!! */
+ return alloc_page(GFP_HIGHUSER_MOVABLE);
+}
+
+#define NR_OFFLINE_AT_ONCE_PAGES (256)
+int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pfn;
+ struct page *page;
+ int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
+ int not_managed = 0;
+ int ret = 0;
+ LIST_HEAD(source);
+
+ for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ if (!page_count(page))
+ continue;
+ /*
+ * We can skip free pages. And we can only deal with pages on
+ * LRU.
+ */
+ ret = isolate_lru_page(page);
+ if (!ret) { /* Success */
+ list_add_tail(&page->lru, &source);
+ move_pages--;
+ inc_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+
+ } else {
+#ifdef CONFIG_DEBUG_VM
+ printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
+ pfn);
+ dump_page(page);
+#endif
+ /* Because we don't have big zone->lock. we should
+ check this again here. */
+ if (page_count(page)) {
+ not_managed++;
+ ret = -EBUSY;
+ break;
+ }
+ }
+ }
+ if (!list_empty(&source)) {
+ if (not_managed) {
+ putback_lru_pages(&source);
+ goto out;
+ }
+ /* this function returns # of failed pages */
+ ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 1);
+ if (ret)
+ putback_lru_pages(&source);
+ }
+out:
+ return ret;
+}

2010-11-19 08:19:59

by Kamezawa Hiroyuki

Subject: [PATCH 2/4] alloc_contig_pages() find appropriate physical memory range

From: KAMEZAWA Hiroyuki <[email protected]>

Unlike memory hotplug, when allocating a contiguous memory range the exact
address may not matter. IOW, if a requester wants to allocate 100M of
contiguous memory, the placement of the allocated memory may not be a problem.
So, "finding a range of memory which seems to be MOVABLE" is required.

This patch adds a function to isolate a length of memory within [start, end).
The function returns the pfn of the first page of an isolated contiguous
chunk of the given length within [start, end).

If no_search=true is passed as an argument, the start address is always the
same as the specified "base" address.

After isolation, free memory within this area will never be allocated.
But some pages will remain as "Used/LRU" pages. They should be dropped by
page reclaim or migration.
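
For reference, the rough calling sequence that patch 3/4 is expected to use
around this helper looks like the sketch below (a condensed, hypothetical
sketch: retries, pcp/pagevec draining and failure cleanup are omitted):

static struct page *grab_range_sketch(unsigned long base, unsigned long end,
                        unsigned long nr_pages, int align_order,
                        struct zone *zone)
{
        unsigned long pfn, rs;

        lru_add_drain_all();    /* fewer false positives from per-cpu pagevecs */
        pfn = find_contig_block(base, end, nr_pages, align_order, zone);
        if (!pfn)
                return NULL;    /* no candidate range in [base, end) */
        /* free pages in the range are now isolated; migrate the LRU ones */
        for (rs = scan_lru_pages(pfn, pfn + nr_pages);
             rs && rs < pfn + nr_pages;
             rs = scan_lru_pages(rs, pfn + nr_pages))
                do_migrate_range(rs, pfn + nr_pages);
        if (test_pages_isolated(pfn, pfn + nr_pages))
                return NULL;    /* could not empty the range (cleanup omitted) */
        return pfn_to_page(pfn);        /* patch 3/4 then claims the freed pages */
}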

Changelog: 2010-11-17
- fixed some coding style (if-then-else)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/page_isolation.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)

Index: mmotm-1117/mm/page_isolation.c
===================================================================
--- mmotm-1117.orig/mm/page_isolation.c
+++ mmotm-1117/mm/page_isolation.c
@@ -7,6 +7,7 @@
#include <linux/pageblock-flags.h>
#include <linux/memcontrol.h>
#include <linux/migrate.h>
+#include <linux/memory_hotplug.h>
#include <linux/mm_inline.h>
#include "internal.h"

@@ -250,3 +251,148 @@ int do_migrate_range(unsigned long start
out:
return ret;
}
+
+/*
+ * Functions for getting contiguous MOVABLE pages in a zone.
+ */
+struct page_range {
+ unsigned long base; /* Base address of searching contigouous block */
+ unsigned long end;
+ unsigned long pages;/* Length of contiguous block */
+ int align_order;
+ unsigned long align_mask;
+};
+
+int __get_contig_block(unsigned long pfn, unsigned long nr_pages, void *arg)
+{
+ struct page_range *blockinfo = arg;
+ unsigned long end;
+
+ end = pfn + nr_pages;
+ pfn = ALIGN(pfn, 1 << blockinfo->align_order);
+ end = end & ~(MAX_ORDER_NR_PAGES - 1);
+
+ if (end < pfn)
+ return 0;
+ if (end - pfn >= blockinfo->pages) {
+ blockinfo->base = pfn;
+ blockinfo->end = end;
+ return 1;
+ }
+ return 0;
+}
+
+static void __trim_zone(struct zone *zone, struct page_range *range)
+{
+ unsigned long pfn;
+ /*
+ * skip pages which dones'nt under the zone.
+ * There are some archs which zones are not in linear layout.
+ */
+ if (page_zone(pfn_to_page(range->base)) != zone) {
+ for (pfn = range->base;
+ pfn < range->end;
+ pfn += MAX_ORDER_NR_PAGES) {
+ if (page_zone(pfn_to_page(pfn)) == zone)
+ break;
+ }
+ range->base = min(pfn, range->end);
+ }
+ /* Here, range-> base is in the zone if range->base != range->end */
+ for (pfn = range->base;
+ pfn < range->end;
+ pfn += MAX_ORDER_NR_PAGES) {
+ if (zone != page_zone(pfn_to_page(pfn))) {
+ pfn = pfn - MAX_ORDER_NR_PAGES;
+ break;
+ }
+ }
+ range->end = min(pfn, range->end);
+ return;
+}
+
+/*
+ * This function is for finding a contiguous memory block which has length
+ * of pages and MOVABLE. If it finds, make the range of pages as ISOLATED
+ * and return the first page's pfn.
+ * This checks all pages in the returned range is free of Pg_LRU. To reduce
+ * the risk of false-positive testing, lru_add_drain_all() should be called
+ * before this function to reduce pages on pagevec for zones.
+ */
+
+static unsigned long find_contig_block(unsigned long base,
+ unsigned long end, unsigned long pages,
+ int align_order, struct zone *zone)
+{
+ unsigned long pfn, pos;
+ struct page_range blockinfo;
+ int ret;
+
+ VM_BUG_ON(pages & (MAX_ORDER_NR_PAGES - 1));
+ VM_BUG_ON(base & ((1 << align_order) - 1));
+retry:
+ blockinfo.base = base;
+ blockinfo.end = end;
+ blockinfo.pages = pages;
+ blockinfo.align_order = align_order;
+ blockinfo.align_mask = (1 << align_order) - 1;
+ /*
+ * At first, check physical page layout and skip memory holes.
+ */
+ ret = walk_system_ram_range(base, end - base, &blockinfo,
+ __get_contig_block);
+ if (!ret)
+ return 0;
+ /* check contiguous pages in a zone */
+ __trim_zone(zone, &blockinfo);
+
+ /*
+ * Ok, we found contiguous memory chunk of size. Isolate it.
+ * We just search MAX_ORDER aligned range.
+ */
+ for (pfn = blockinfo.base; pfn + pages <= blockinfo.end;
+ pfn += (1 << align_order)) {
+ struct zone *z = page_zone(pfn_to_page(pfn));
+ if (z != zone)
+ continue;
+
+ spin_lock_irq(&z->lock);
+ pos = pfn;
+ /*
+ * Check the range only contains free pages or LRU pages.
+ */
+ while (pos < pfn + pages) {
+ struct page *p;
+
+ if (!pfn_valid_within(pos))
+ break;
+ p = pfn_to_page(pos);
+ if (PageReserved(p))
+ break;
+ if (!page_count(p)) {
+ if (!PageBuddy(p))
+ pos++;
+ else
+ pos += (1 << page_order(p));
+ } else if (PageLRU(p)) {
+ pos++;
+ } else
+ break;
+ }
+ spin_unlock_irq(&z->lock);
+ if ((pos == pfn + pages)) {
+ if (!start_isolate_page_range(pfn, pfn + pages))
+ return pfn;
+ } else/* the chunk including "pos" should be skipped */
+ pfn = pos & ~((1 << align_order) - 1);
+ cond_resched();
+ }
+
+ /* failed */
+ if (blockinfo.end + pages <= end) {
+ /* Move base address and find the next block of RAM. */
+ base = blockinfo.end;
+ goto retry;
+ }
+ return 0;
+}

2010-11-19 08:21:05

by Kamezawa Hiroyuki

Subject: [PATCH 3/4] alloc_contig_pages() allocate big chunk memory using migration

From: KAMEZAWA Hiroyuki <[email protected]>

Add a function to allocate contiguous memory larger than MAX_ORDER. The main
difference from the usual page allocator is that this one uses the memory
offline technique (isolate pages and migrate the remaining ones).

I think this is not a 100% solution because we can't avoid fragmentation, but
we have the kernelcore= boot option and can create a MOVABLE zone. That helps
us allocate a contiguous range on demand.

The new function is

alloc_contig_pages(base, end, nr_pages, alignment)

This function will allocate nr_pages of contiguous pages from the range
[base, end). If [base, end) is bigger than nr_pages, some pfn which meets the
alignment will be chosen. If the alignment is smaller than MAX_ORDER, it will
be raised to MAX_ORDER.

__alloc_contig_pages() takes more arguments.


Some drivers allocate contiguous pages via bootmem or by hiding some memory
from the kernel at boot. But if contiguous pages are necessary only in some
situations, the kernelcore= boot option plus page migration is an option.
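
As a hedged example of the range-restricted form (the device constraint,
bounds and names below are made up for illustration):

/* hypothetical: a device that wants a 64MB buffer somewhere below 1GB physical */
#define DEV_BUF_PAGES   ((64UL << 20) >> PAGE_SHIFT)
#define PFN_1G          ((1UL << 30) >> PAGE_SHIFT)

static struct page *example_get_low_buffer(void)
{
        /* align_order 0 is raised to MAX_ORDER internally */
        return alloc_contig_pages(0, PFN_1G, DEV_BUF_PAGES, 0);
}

alloc_contig_pages_node() can be used instead when the chunk only needs to be
node-local rather than inside a specific physical window.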

Changelog: 2010-11-19
- removed no_search
- removed some drain_ functions because they are heavy.
- check -ENOMEM case

Changelog: 2010-10-26
- support gfp_t
- support zonelist/nodemask
- support [base, end)
- support alignment

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page-isolation.h | 15 ++
mm/page_alloc.c | 29 ++++
mm/page_isolation.c | 242 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 286 insertions(+)

Index: mmotm-1117/mm/page_isolation.c
===================================================================
--- mmotm-1117.orig/mm/page_isolation.c
+++ mmotm-1117/mm/page_isolation.c
@@ -5,6 +5,7 @@
#include <linux/mm.h>
#include <linux/page-isolation.h>
#include <linux/pageblock-flags.h>
+#include <linux/swap.h>
#include <linux/memcontrol.h>
#include <linux/migrate.h>
#include <linux/memory_hotplug.h>
@@ -396,3 +397,244 @@ retry:
}
return 0;
}
+
+/*
+ * Comparing caller specified [user_start, user_end) with physical memory layout
+ * [phys_start, phys_end). If no intersection is longer than nr_pages, return 1.
+ * If there is an intersection, return 0 and fill range in [*start, *end)
+ */
+static int
+__calc_search_range(unsigned long user_start, unsigned long user_end,
+ unsigned long nr_pages,
+ unsigned long phys_start, unsigned long phys_end,
+ unsigned long *start, unsigned long *end)
+{
+ if ((user_start >= phys_end) || (user_end <= phys_start))
+ return 1;
+ if (user_start <= phys_start) {
+ *start = phys_start;
+ *end = min(user_end, phys_end);
+ } else {
+ *start = user_start;
+ *end = min(user_end, phys_end);
+ }
+ if (*end - *start < nr_pages)
+ return 1;
+ return 0;
+}
+
+
+/**
+ * __alloc_contig_pages - allocate a contiguous physical pages
+ * @base: the lowest pfn which caller wants.
+ * @end: the highest pfn which caller wants.
+ * @nr_pages: the length of a chunk of pages to be allocated.
+ * @align_order: alignment of start address of returned chunk in order.
+ * Returned' page's order will be aligned to (1 << align_order).If smaller
+ * than MAX_ORDER, it's raised to MAX_ORDER.
+ * @node: allocate near memory to the node, If -1, current node is used.
+ * @gfpflag: used to specify what zone the memory should be from.
+ * @nodemask: allocate memory within the nodemask.
+ *
+ * Search a memory range [base, end) and allocates physically contiguous
+ * pages. If end - base is larger than nr_pages, a chunk in [base, end) will
+ * be allocated
+ *
+ * This returns a page of the beginning of contiguous block. At failure, NULL
+ * is returned.
+ *
+ * Limitation: at allocation, nr_pages may be increased to be aligned to
+ * MAX_ORDER before searching a range. So, even if there is a enough chunk
+ * for nr_pages, it may not be able to be allocated. Extra tail pages of
+ * allocated chunk is returned to buddy allocator before returning the caller.
+ */
+
+#define MIGRATION_RETRY (5)
+struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
+ unsigned long nr_pages, int align_order,
+ int node, gfp_t gfpflag, nodemask_t *mask)
+{
+ unsigned long found, aligned_pages, start;
+ struct page *ret = NULL;
+ int migration_failed;
+ unsigned long align_mask;
+ struct zoneref *z;
+ struct zone *zone;
+ struct zonelist *zonelist;
+ enum zone_type highzone_idx = gfp_zone(gfpflag);
+ unsigned long zone_start, zone_end, rs, re, pos;
+
+ if (node == -1)
+ node = numa_node_id();
+
+ /* check unsupported flags */
+ if (gfpflag & __GFP_NORETRY)
+ return NULL;
+ if ((gfpflag & (__GFP_WAIT | __GFP_IO | __GFP_FS)) !=
+ (__GFP_WAIT | __GFP_IO | __GFP_FS))
+ return NULL;
+
+ if (gfpflag & __GFP_THISNODE)
+ zonelist = &NODE_DATA(node)->node_zonelists[1];
+ else
+ zonelist = &NODE_DATA(node)->node_zonelists[0];
+ /*
+ * Base/nr_page/end should be aligned to MAX_ORDER
+ */
+ found = 0;
+
+ if (align_order < MAX_ORDER)
+ align_order = MAX_ORDER;
+
+ align_mask = (1 << align_order) - 1;
+ /*
+ * We allocates MAX_ORDER aligned pages and cut tail pages later.
+ */
+ aligned_pages = ALIGN(nr_pages, (1 << MAX_ORDER));
+ /*
+ * If end - base == nr_pages, we can't search range. base must be
+ * aligned.
+ */
+ if ((end - base == nr_pages) && (base & align_mask))
+ return NULL;
+
+ base = ALIGN(base, (1 << align_order));
+ if ((end <= base) || (end - base < aligned_pages))
+ return NULL;
+
+ /*
+ * searching contig memory range within [pos, end).
+ * pos is updated at migration failure to find next chunk in zone.
+ * pos is reset to the base at searching next zone.
+ * (see for_each_zone_zonelist_nodemask in mmzone.h)
+ *
+ * Note: we cannot assume zones/nodes are in linear memory layout.
+ */
+ z = first_zones_zonelist(zonelist, highzone_idx, mask, &zone);
+ pos = base;
+retry:
+ if (!zone)
+ return NULL;
+
+ zone_start = ALIGN(zone->zone_start_pfn, 1 << align_order);
+ zone_end = zone->zone_start_pfn + zone->spanned_pages;
+
+ /* check [pos, end) is in this zone. */
+ if ((pos >= end) ||
+ (__calc_search_range(pos, end, aligned_pages,
+ zone_start, zone_end, &rs, &re))) {
+next_zone:
+ /* go to the next zone */
+ z = next_zones_zonelist(++z, highzone_idx, mask, &zone);
+ /* reset the pos */
+ pos = base;
+ goto retry;
+ }
+ /* [pos, end) is trimmed to [rs, re) in this zone. */
+ pos = rs;
+
+ found = find_contig_block(rs, re, aligned_pages, align_order, zone);
+ if (!found)
+ goto next_zone;
+
+ /*
+ * Because we isolated the range, free pages in the range will never
+ * be (re)allocated. scan_lru_pages() finds the next PG_lru page in
+ * the range and returns 0 if it reaches the end.
+ */
+ migration_failed = 0;
+ rs = found;
+ re = found + aligned_pages;
+ for (rs = scan_lru_pages(rs, re);
+ rs && rs < re;
+ rs = scan_lru_pages(rs, re)) {
+ int rc = do_migrate_range(rs, re);
+ if (!rc)
+ migration_failed = 0;
+ else {
+ /* it's better to try another block ? */
+ if (++migration_failed >= MIGRATION_RETRY)
+ break;
+ if (rc == -EBUSY) {
+ /* There are unstable pages.on pagevec. */
+ lru_add_drain_all();
+ /*
+ * there may be pages on pcplist before
+ * we mark the range as ISOLATED.
+ */
+ drain_all_pages();
+ } else if (rc == -ENOMEM)
+ goto nomem;
+ }
+ cond_resched();
+ }
+ if (!migration_failed) {
+ /* drop all pages in pagevec and pcp list */
+ lru_add_drain_all();
+ drain_all_pages();
+ }
+ /* Check all pages are isolated */
+ if (test_pages_isolated(found, found + aligned_pages)) {
+ undo_isolate_page_range(found, aligned_pages);
+ /*
+ * We failed at [found...found+aligned_pages) migration.
+ * "rs" is the last pfn scan_lru_pages() found that the page
+ * is LRU page. Update pos and try next chunk.
+ */
+ pos = ALIGN(rs + 1, (1 << align_order));
+ goto retry; /* goto next chunk */
+ }
+ /*
+ * OK, here, [found...found+pages) memory are isolated.
+ * All pages in the range will be moved into the list with
+ * page_count(page)=1.
+ */
+ ret = pfn_to_page(found);
+ alloc_contig_freed_pages(found, found + aligned_pages, gfpflag);
+ /* unset ISOLATE */
+ undo_isolate_page_range(found, aligned_pages);
+ /* Free unnecessary pages in tail */
+ for (start = found + nr_pages; start < found + aligned_pages; start++)
+ __free_page(pfn_to_page(start));
+ return ret;
+nomem:
+ undo_isolate_page_range(found, aligned_pages);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(__alloc_contig_pages);
+
+void free_contig_pages(struct page *page, int nr_pages)
+{
+ int i;
+ for (i = 0; i < nr_pages; i++)
+ __free_page(page + i);
+}
+EXPORT_SYMBOL_GPL(free_contig_pages);
+
+/*
+ * Allocated pages will not be MOVABLE but MOVABLE zone is a suitable
+ * for allocating big chunk. So, using ZONE_MOVABLE is a default.
+ */
+
+struct page *alloc_contig_pages(unsigned long base, unsigned long end,
+ unsigned long nr_pages, int align_order)
+{
+ return __alloc_contig_pages(base, end, nr_pages, align_order, -1,
+ GFP_KERNEL | __GFP_MOVABLE, NULL);
+}
+EXPORT_SYMBOL_GPL(alloc_contig_pages);
+
+struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order)
+{
+ return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, -1,
+ GFP_KERNEL | __GFP_MOVABLE, NULL);
+}
+EXPORT_SYMBOL_GPL(alloc_contig_pages_host);
+
+struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
+ int align_order)
+{
+ return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, nid,
+ GFP_KERNEL | __GFP_THISNODE | __GFP_MOVABLE, NULL);
+}
+EXPORT_SYMBOL_GPL(alloc_contig_pages_node);
Index: mmotm-1117/include/linux/page-isolation.h
===================================================================
--- mmotm-1117.orig/include/linux/page-isolation.h
+++ mmotm-1117/include/linux/page-isolation.h
@@ -32,6 +32,8 @@ test_pages_isolated(unsigned long start_
*/
extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page);
+extern void alloc_contig_freed_pages(unsigned long pfn,
+ unsigned long pages, gfp_t flag);

/*
* For migration.
@@ -41,4 +43,17 @@ int test_pages_in_a_zone(unsigned long s
unsigned long scan_lru_pages(unsigned long start, unsigned long end);
int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);

+/*
+ * For large alloc.
+ */
+struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
+ unsigned long nr_pages, int align_order,
+ int node, gfp_t flag, nodemask_t *mask);
+struct page *alloc_contig_pages(unsigned long base, unsigned long end,
+ unsigned long nr_pages, int align_order);
+struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order);
+struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
+ int align_order);
+void free_contig_pages(struct page *page, int nr_pages);
+
#endif
Index: mmotm-1117/mm/page_alloc.c
===================================================================
--- mmotm-1117.orig/mm/page_alloc.c
+++ mmotm-1117/mm/page_alloc.c
@@ -5447,6 +5447,35 @@ out:
spin_unlock_irqrestore(&zone->lock, flags);
}

+
+void alloc_contig_freed_pages(unsigned long pfn, unsigned long end, gfp_t flag)
+{
+ struct page *page;
+ struct zone *zone;
+ int order;
+ unsigned long start = pfn;
+
+ zone = page_zone(pfn_to_page(pfn));
+ spin_lock_irq(&zone->lock);
+ while (pfn < end) {
+ VM_BUG_ON(!pfn_valid(pfn));
+ page = pfn_to_page(pfn);
+ VM_BUG_ON(page_count(page));
+ VM_BUG_ON(!PageBuddy(page));
+ list_del(&page->lru);
+ order = page_order(page);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+ pfn += 1 << order;
+ }
+ spin_unlock_irq(&zone->lock);
+
+ /*After this, pages in the range can be freed one be one */
+ for (pfn = start; pfn < end; pfn++)
+ prep_new_page(pfn_to_page(pfn), 0, flag);
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be isolated before calling this.

2010-11-19 08:22:32

by Kamezawa Hiroyuki

Subject: [PATCH 4/4] alloc_contig_pages() use better allocation function for migration


From: KAMEZAWA Hiroyuki <[email protected]>

Old story: because we cannot assume which memory section will be offlined
next, hotremove_migrate_alloc() just uses alloc_page(), i.e. it makes no
decision about where the page should be migrated to. Considering memory
hotplug's nature, a memory section near the one being removed is likely to be
removed next. So, migrating pages to the same node as the original page
doesn't make sense in many cases; it just increases load. The migration
destination page is therefore allocated on the node where the offlining
script runs.

Now, contiguous alloc uses do_migrate_range(). In this case, the migration
destination node should be the same node as the migration source page.

This patch modifies hotremove_migrate_alloc() and passes a "nid" to it.
Memory hotremove passes the local node, so the page is still moved to the
node where the offlining script runs; no behavior changes there.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page-isolation.h | 3 ++-
mm/memory_hotplug.c | 2 +-
mm/page_isolation.c | 21 ++++++++++++++++-----
3 files changed, 19 insertions(+), 7 deletions(-)

Index: mmotm-1117/include/linux/page-isolation.h
===================================================================
--- mmotm-1117.orig/include/linux/page-isolation.h
+++ mmotm-1117/include/linux/page-isolation.h
@@ -41,7 +41,8 @@ extern void alloc_contig_freed_pages(uns

int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn);
unsigned long scan_lru_pages(unsigned long start, unsigned long end);
-int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);
+int do_migrate_range(unsigned long start_pfn,
+ unsigned long end_pfn, int node);

/*
* For large alloc.
Index: mmotm-1117/mm/memory_hotplug.c
===================================================================
--- mmotm-1117.orig/mm/memory_hotplug.c
+++ mmotm-1117/mm/memory_hotplug.c
@@ -724,7 +724,7 @@ repeat:

pfn = scan_lru_pages(start_pfn, end_pfn);
if (pfn) { /* We have page on LRU */
- ret = do_migrate_range(pfn, end_pfn);
+ ret = do_migrate_range(pfn, end_pfn, numa_node_id());
if (!ret) {
drain = 1;
goto repeat;
Index: mmotm-1117/mm/page_isolation.c
===================================================================
--- mmotm-1117.orig/mm/page_isolation.c
+++ mmotm-1117/mm/page_isolation.c
@@ -193,12 +193,21 @@ unsigned long scan_lru_pages(unsigned lo
struct page *
hotremove_migrate_alloc(struct page *page, unsigned long private, int **x)
{
- /* This should be improooooved!! */
- return alloc_page(GFP_HIGHUSER_MOVABLE);
+ return alloc_pages_node(private, GFP_HIGHUSER_MOVABLE, 0);
}

+/*
+ * Migrate pages in the range to somewhere. Migration target page is allocated
+ * by hotremove_migrate_alloc(). If on_node is specicied, new page will be
+ * selected from nearby nodes. At hotremove, this "allocate from near node"
+ * can be harmful because we may remove other pages in the node for removing
+ * more pages in node. contiguous_alloc() uses on_node=true for avoiding
+ * unnecessary migration to far node.
+ */
+
#define NR_OFFLINE_AT_ONCE_PAGES (256)
-int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
+int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn,
+ int node)
{
unsigned long pfn;
struct page *page;
@@ -245,7 +254,7 @@ int do_migrate_range(unsigned long start
goto out;
}
/* this function returns # of failed pages */
- ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 1);
+ ret = migrate_pages(&source, hotremove_migrate_alloc, node, 1);
if (ret)
putback_lru_pages(&source);
}
@@ -463,6 +472,7 @@ struct page *__alloc_contig_pages(unsign
struct zonelist *zonelist;
enum zone_type highzone_idx = gfp_zone(gfpflag);
unsigned long zone_start, zone_end, rs, re, pos;
+ int target_node;

if (node == -1)
node = numa_node_id();
@@ -516,6 +526,7 @@ retry:
if (!zone)
return NULL;

+ target_node = zone->zone_pgdat->node_id;
zone_start = ALIGN(zone->zone_start_pfn, 1 << align_order);
zone_end = zone->zone_start_pfn + zone->spanned_pages;

@@ -548,7 +559,7 @@ next_zone:
for (rs = scan_lru_pages(rs, re);
rs && rs < re;
rs = scan_lru_pages(rs, re)) {
- int rc = do_migrate_range(rs, re);
+ int rc = do_migrate_range(rs, re, target_node);
if (!rc)
migration_failed = 0;
else {

2010-11-19 20:57:08

by Andrew Morton

Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Fri, 19 Nov 2010 17:10:33 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Hi, this is an updated version.
>
> No major changes from the last one except for page allocation function.
> removed RFC.
>
> Order of patches is
>
> [1/4] move some functions from memory_hotplug.c to page_isolation.c
> [2/4] search physically contiguous range suitable for big chunk alloc.
> [3/4] allocate big chunk memory based on memory hotplug(migration) technique
> [4/4] modify page allocation function.
>
> For what:
>
> I hear there is requirements to allocate a chunk of page which is larger than
> MAX_ORDER. Now, some (embeded) device use a big memory chunk. To use memory,
> they hide some memory range by boot option (mem=) and use hidden memory
> for its own purpose. But this seems a lack of feature in memory management.
>
> This patch adds
> alloc_contig_pages(start, end, nr_pages, gfp_mask)
> to allocate a chunk of page whose length is nr_pages from [start, end)
> phys address. This uses similar logic of memory-unplug, which tries to
> offline [start, end) pages. By this, drivers can allocate 30M or 128M or
> much bigger memory chunk on demand. (I allocated 1G chunk in my test).
>
> But yes, because of fragmentation, this cannot guarantee 100% alloc.
> If alloc_contig_pages() is called in system boot up or movable_zone is used,
> this allocation succeeds at high rate.

So this is an alternative implementation of the functionality offered
by Michal's "The Contiguous Memory Allocator framework".

> I tested this on x86-64, and it seems to work as expected. But feedback from
> embeded guys are appreciated because I think they are main user of this
> function.

From where I sit, feedback from the embedded guys is *vital*, because
they are indeed the main users.

Michal, I haven't made a note of all the people who are interested in
and who are potential users of this code. Your patch series has a
billion cc's and is up to version 6. Could I ask that you review and
test this code, and also hunt down other people (probably at other
organisations) who can do likewise for us? Because until we hear from
those people that this work satisfies their needs, we can't really
proceed much further.

Thanks.


2010-11-21 15:07:34

by Minchan Kim

Subject: Re: [PATCH 1/4] alloc_contig_pages() move some functions to page_isolation.c

On Fri, Nov 19, 2010 at 05:12:39PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Memory hotplug is a logic for making pages unused in the specified range
> of pfn. So, some of core logics can be used for other purpose as
> allocating a very large contigous memory block.
>
> This patch moves some functions from mm/memory_hotplug.c to
> mm/page_isolation.c. This helps adding a function for large-alloc in
> page_isolation.c with memory-unplug technique.
>
> Changelog: 2010/10/26
> - adjusted to mmotm-1024 + Bob's 3 clean ups.
> Changelog: 2010/10/21
> - adjusted to mmotm-1020
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-11-21 15:21:42

by Minchan Kim

Subject: Re: [PATCH 2/4] alloc_contig_pages() find appropriate physical memory range

On Fri, Nov 19, 2010 at 05:14:15PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Unlike memory hotplug, at an allocation of contigous memory range, address
> may not be a problem. IOW, if a requester of memory wants to allocate 100M of
> of contigous memory, placement of allocated memory may not be a problem.
> So, "finding a range of memory which seems to be MOVABLE" is required.
>
> This patch adds a functon to isolate a length of memory within [start, end).
> This function returns a pfn which is 1st page of isolated contigous chunk
> of given length within [start, end).
>
> If no_search=true is passed as argument, start address is always same to
> the specified "base" addresss.
>
> After isolation, free memory within this area will never be allocated.
> But some pages will remain as "Used/LRU" pages. They should be dropped by
> page reclaim or migration.
>
> Changelog: 2010-11-17
> - fixed some conding style (if-then-else)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Minchan Kim <[email protected]>

Just some trivial comment below.

Intentionally, I don't add Reviewed-by.
Instead of it, I add Acked-by since I support this work.

I reviewed your old version but have forgot it. :(
So I will have a time to review your code and then add Reviewed-by.

> ---
> mm/page_isolation.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 146 insertions(+)
>
> Index: mmotm-1117/mm/page_isolation.c
> ===================================================================
> --- mmotm-1117.orig/mm/page_isolation.c
> +++ mmotm-1117/mm/page_isolation.c
> @@ -7,6 +7,7 @@
> #include <linux/pageblock-flags.h>
> #include <linux/memcontrol.h>
> #include <linux/migrate.h>
> +#include <linux/memory_hotplug.h>
> #include <linux/mm_inline.h>
> #include "internal.h"
>
> @@ -250,3 +251,148 @@ int do_migrate_range(unsigned long start
> out:
> return ret;
> }
> +
> +/*
> + * Functions for getting contiguous MOVABLE pages in a zone.
> + */
> +struct page_range {
> + unsigned long base; /* Base address of searching contigouous block */
> + unsigned long end;
> + unsigned long pages;/* Length of contiguous block */
> + int align_order;
> + unsigned long align_mask;
> +};
> +
> +int __get_contig_block(unsigned long pfn, unsigned long nr_pages, void *arg)
> +{
> + struct page_range *blockinfo = arg;
> + unsigned long end;
> +
> + end = pfn + nr_pages;
> + pfn = ALIGN(pfn, 1 << blockinfo->align_order);
> + end = end & ~(MAX_ORDER_NR_PAGES - 1);
> +
> + if (end < pfn)
> + return 0;
> + if (end - pfn >= blockinfo->pages) {
> + blockinfo->base = pfn;
> + blockinfo->end = end;
> + return 1;
> + }
> + return 0;
> +}
> +
> +static void __trim_zone(struct zone *zone, struct page_range *range)
> +{
> + unsigned long pfn;
> + /*
> + * skip pages which dones'nt under the zone.

typo

> + * There are some archs which zones are not in linear layout.
> + */
> + if (page_zone(pfn_to_page(range->base)) != zone) {
> + for (pfn = range->base;
> + pfn < range->end;
> + pfn += MAX_ORDER_NR_PAGES) {
> + if (page_zone(pfn_to_page(pfn)) == zone)
> + break;
> + }
> + range->base = min(pfn, range->end);
> + }
> + /* Here, range-> base is in the zone if range->base != range->end */
> + for (pfn = range->base;
> + pfn < range->end;
> + pfn += MAX_ORDER_NR_PAGES) {
> + if (zone != page_zone(pfn_to_page(pfn))) {
> + pfn = pfn - MAX_ORDER_NR_PAGES;
> + break;
> + }
> + }
> + range->end = min(pfn, range->end);
> + return;
> +}
> +
> +/*
> + * This function is for finding a contiguous memory block which has length
> + * of pages and MOVABLE. If it finds, make the range of pages as ISOLATED
> + * and return the first page's pfn.
> + * This checks all pages in the returned range is free of Pg_LRU. To reduce

typo

> + * the risk of false-positive testing, lru_add_drain_all() should be called
> + * before this function to reduce pages on pagevec for zones.
> + */
> +
> +static unsigned long find_contig_block(unsigned long base,
> + unsigned long end, unsigned long pages,
> + int align_order, struct zone *zone)
> +{
> + unsigned long pfn, pos;
> + struct page_range blockinfo;
> + int ret;
> +
> + VM_BUG_ON(pages & (MAX_ORDER_NR_PAGES - 1));
> + VM_BUG_ON(base & ((1 << align_order) - 1));
> +retry:
> + blockinfo.base = base;
> + blockinfo.end = end;
> + blockinfo.pages = pages;
> + blockinfo.align_order = align_order;
> + blockinfo.align_mask = (1 << align_order) - 1;
> + /*
> + * At first, check physical page layout and skip memory holes.
> + */
> + ret = walk_system_ram_range(base, end - base, &blockinfo,
> + __get_contig_block);

We need #include <linux/ioport.h>

> + if (!ret)
> + return 0;
> + /* check contiguous pages in a zone */
> + __trim_zone(zone, &blockinfo);
> +
> + /*
> + * Ok, we found contiguous memory chunk of size. Isolate it.
> + * We just search MAX_ORDER aligned range.
> + */
> + for (pfn = blockinfo.base; pfn + pages <= blockinfo.end;
> + pfn += (1 << align_order)) {
> + struct zone *z = page_zone(pfn_to_page(pfn));
> + if (z != zone)
> + continue;
> +
> + spin_lock_irq(&z->lock);
> + pos = pfn;
> + /*
> + * Check the range only contains free pages or LRU pages.
> + */
> + while (pos < pfn + pages) {
> + struct page *p;
> +
> + if (!pfn_valid_within(pos))
> + break;
> + p = pfn_to_page(pos);
> + if (PageReserved(p))
> + break;
> + if (!page_count(p)) {
> + if (!PageBuddy(p))
> + pos++;
> + else
> + pos += (1 << page_order(p));
> + } else if (PageLRU(p)) {
> + pos++;
> + } else
> + break;
> + }
> + spin_unlock_irq(&z->lock);
> + if ((pos == pfn + pages)) {
> + if (!start_isolate_page_range(pfn, pfn + pages))
> + return pfn;
> + } else/* the chunk including "pos" should be skipped */
> + pfn = pos & ~((1 << align_order) - 1);
> + cond_resched();
> + }
> +
> + /* failed */
> + if (blockinfo.end + pages <= end) {
> + /* Move base address and find the next block of RAM. */
> + base = blockinfo.end;
> + goto retry;
> + }
> + return 0;
> +}
>

--
Kind regards,
Minchan Kim

2010-11-21 15:26:07

by Minchan Kim

Subject: Re: [PATCH 3/4] alloc_contig_pages() allocate big chunk memory using migration

On Fri, Nov 19, 2010 at 05:15:28PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Add an function to allocate contiguous memory larger than MAX_ORDER.
> The main difference between usual page allocator is that this uses
> memory offline technique (Isolate pages and migrate remaining pages.).
>
> I think this is not 100% solution because we can't avoid fragmentation,
> but we have kernelcore= boot option and can create MOVABLE zone. That
> helps us to allow allocate a contiguous range on demand.
>
> The new function is
>
> alloc_contig_pages(base, end, nr_pages, alignment)
>
> This function will allocate contiguous pages of nr_pages from the range
> [base, end). If [base, end) is bigger than nr_pages, some pfn which
> meats alignment will be allocated. If alignment is smaller than MAX_ORDER,
> it will be raised to be MAX_ORDER.
>
> __alloc_contig_pages() has much more arguments.
>
>
> Some drivers allocates contig pages by bootmem or hiding some memory
> from the kernel at boot. But if contig pages are necessary only in some
> situation, kernelcore= boot option and using page migration is a choice.
>
> Changelog: 2010-11-19
> - removed no_search
> - removed some drain_ functions because they are heavy.
> - check -ENOMEM case
>
> Changelog: 2010-10-26
> - support gfp_t
> - support zonelist/nodemask
> - support [base, end)
> - support alignment
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Minchan Kim <[email protected]>

Trivial comment below.

> +EXPORT_SYMBOL_GPL(alloc_contig_pages);
> +
> +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order)
> +{
> + return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, -1,
> + GFP_KERNEL | __GFP_MOVABLE, NULL);
> +}

We need include #include <linux/bootmem.h> for using max_pfn.

--
Kind regards,
Minchan Kim

2010-11-22 00:10:18

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Fri, 19 Nov 2010 12:56:53 -0800
Andrew Morton <[email protected]> wrote:

> On Fri, 19 Nov 2010 17:10:33 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > Hi, this is an updated version.
> >
> > No major changes from the last one except for page allocation function.
> > removed RFC.
> >
> > Order of patches is
> >
> > [1/4] move some functions from memory_hotplug.c to page_isolation.c
> > [2/4] search physically contiguous range suitable for big chunk alloc.
> > [3/4] allocate big chunk memory based on memory hotplug(migration) technique
> > [4/4] modify page allocation function.
> >
> > For what:
> >
> > I hear there is requirements to allocate a chunk of page which is larger than
> > MAX_ORDER. Now, some (embeded) device use a big memory chunk. To use memory,
> > they hide some memory range by boot option (mem=) and use hidden memory
> > for its own purpose. But this seems a lack of feature in memory management.
> >
> > This patch adds
> > alloc_contig_pages(start, end, nr_pages, gfp_mask)
> > to allocate a chunk of page whose length is nr_pages from [start, end)
> > phys address. This uses similar logic of memory-unplug, which tries to
> > offline [start, end) pages. By this, drivers can allocate 30M or 128M or
> > much bigger memory chunk on demand. (I allocated 1G chunk in my test).
> >
> > But yes, because of fragmentation, this cannot guarantee 100% alloc.
> > If alloc_contig_pages() is called in system boot up or movable_zone is used,
> > this allocation succeeds at high rate.
>
> So this is an alternatve implementation for the functionality offered
> by Michal's "The Contiguous Memory Allocator framework".
>

Yes, this will be a backend for that kind of work.

I think there are two ways to allocate contiguous pages larger than MAX_ORDER.

1) hide some memory at boot and add another memory allocator.
2) support a range allocator over [start, end)

This is a trial of approach 2). I used the memory-hotplug technique because I
know it somewhat. This patch itself has no "map" or "management" functions, so
those should be developed in another patch (but maybe that won't be my work).

> > I tested this on x86-64, and it seems to work as expected. But feedback from
> > embeded guys are appreciated because I think they are main user of this
> > function.
>
> From where I sit, feedback from the embedded guys is *vital*, because
> they are indeed the main users.
>
> Michal, I haven't made a note of all the people who are interested in
> and who are potential users of this code. Your patch series has a
> billion cc's and is up to version 6. Could I ask that you review and
> test this code, and also hunt down other people (probably at other
> organisations) who can do likewise for us? Because until we hear from
> those people that this work satisfies their needs, we can't really
> proceed much further.
>

yes. please.

Thanks,
-Kame

2010-11-22 00:17:15

by Kamezawa Hiroyuki

Subject: Re: [PATCH 2/4] alloc_contig_pages() find appropriate physical memory range

On Mon, 22 Nov 2010 00:21:31 +0900
Minchan Kim <[email protected]> wrote:

> Acked-by: Minchan Kim <[email protected]>
>
> Just some trivial comment below.
>
> Intentionally, I don't add Reviewed-by.
> Instead of it, I add Acked-by since I support this work.
Thanks.

>
> I reviewed your old version but have forgot it. :(

Sorry, I had a vacation ;(

> So I will have a time to review your code and then add Reviewed-by.
>
> > ---
> > mm/page_isolation.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 146 insertions(+)
> >
> > Index: mmotm-1117/mm/page_isolation.c
> > ===================================================================
> > --- mmotm-1117.orig/mm/page_isolation.c
> > +++ mmotm-1117/mm/page_isolation.c
> > @@ -7,6 +7,7 @@
> > #include <linux/pageblock-flags.h>
> > #include <linux/memcontrol.h>
> > #include <linux/migrate.h>
> > +#include <linux/memory_hotplug.h>
> > #include <linux/mm_inline.h>
> > #include "internal.h"
> >
> > @@ -250,3 +251,148 @@ int do_migrate_range(unsigned long start
> > out:
> > return ret;
> > }
> > +
> > +/*
> > + * Functions for getting contiguous MOVABLE pages in a zone.
> > + */
> > +struct page_range {
> > + unsigned long base; /* Base address of searching contigouous block */
> > + unsigned long end;
> > + unsigned long pages;/* Length of contiguous block */
> > + int align_order;
> > + unsigned long align_mask;
> > +};
> > +
> > +int __get_contig_block(unsigned long pfn, unsigned long nr_pages, void *arg)
> > +{
> > + struct page_range *blockinfo = arg;
> > + unsigned long end;
> > +
> > + end = pfn + nr_pages;
> > + pfn = ALIGN(pfn, 1 << blockinfo->align_order);
> > + end = end & ~(MAX_ORDER_NR_PAGES - 1);
> > +
> > + if (end < pfn)
> > + return 0;
> > + if (end - pfn >= blockinfo->pages) {
> > + blockinfo->base = pfn;
> > + blockinfo->end = end;
> > + return 1;
> > + }
> > + return 0;
> > +}
> > +
> > +static void __trim_zone(struct zone *zone, struct page_range *range)
> > +{
> > + unsigned long pfn;
> > + /*
> > + * skip pages which dones'nt under the zone.
>
> typo
>
will fix.


> > + * There are some archs which zones are not in linear layout.
> > + */
> > + if (page_zone(pfn_to_page(range->base)) != zone) {
> > + for (pfn = range->base;
> > + pfn < range->end;
> > + pfn += MAX_ORDER_NR_PAGES) {
> > + if (page_zone(pfn_to_page(pfn)) == zone)
> > + break;
> > + }
> > + range->base = min(pfn, range->end);
> > + }
> > + /* Here, range-> base is in the zone if range->base != range->end */
> > + for (pfn = range->base;
> > + pfn < range->end;
> > + pfn += MAX_ORDER_NR_PAGES) {
> > + if (zone != page_zone(pfn_to_page(pfn))) {
> > + pfn = pfn - MAX_ORDER_NR_PAGES;
> > + break;
> > + }
> > + }
> > + range->end = min(pfn, range->end);
> > + return;
> > +}
> > +
> > +/*
> > + * This function is for finding a contiguous memory block which has length
> > + * of pages and MOVABLE. If it finds, make the range of pages as ISOLATED
> > + * and return the first page's pfn.
> > + * This checks all pages in the returned range is free of Pg_LRU. To reduce
>
> typo
>
will fix.

> > + * the risk of false-positive testing, lru_add_drain_all() should be called
> > + * before this function to reduce pages on pagevec for zones.
> > + */
> > +
> > +static unsigned long find_contig_block(unsigned long base,
> > + unsigned long end, unsigned long pages,
> > + int align_order, struct zone *zone)
> > +{
> > + unsigned long pfn, pos;
> > + struct page_range blockinfo;
> > + int ret;
> > +
> > + VM_BUG_ON(pages & (MAX_ORDER_NR_PAGES - 1));
> > + VM_BUG_ON(base & ((1 << align_order) - 1));
> > +retry:
> > + blockinfo.base = base;
> > + blockinfo.end = end;
> > + blockinfo.pages = pages;
> > + blockinfo.align_order = align_order;
> > + blockinfo.align_mask = (1 << align_order) - 1;
> > + /*
> > + * At first, check physical page layout and skip memory holes.
> > + */
> > + ret = walk_system_ram_range(base, end - base, &blockinfo,
> > + __get_contig_block);
>
> We need #include <linux/ioport.h>
>

ok.

Thanks,
-Kame

2010-11-22 00:19:14

by Kamezawa Hiroyuki

Subject: Re: [PATCH 3/4] alloc_contig_pages() allocate big chunk memory using migration

On Mon, 22 Nov 2010 00:25:56 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, Nov 19, 2010 at 05:15:28PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Add an function to allocate contiguous memory larger than MAX_ORDER.
> > The main difference between usual page allocator is that this uses
> > memory offline technique (Isolate pages and migrate remaining pages.).
> >
> > I think this is not 100% solution because we can't avoid fragmentation,
> > but we have kernelcore= boot option and can create MOVABLE zone. That
> > helps us to allow allocate a contiguous range on demand.
> >
> > The new function is
> >
> > alloc_contig_pages(base, end, nr_pages, alignment)
> >
> > This function will allocate contiguous pages of nr_pages from the range
> > [base, end). If [base, end) is bigger than nr_pages, some pfn which
> > meats alignment will be allocated. If alignment is smaller than MAX_ORDER,
> > it will be raised to be MAX_ORDER.
> >
> > __alloc_contig_pages() has much more arguments.
> >
> >
> > Some drivers allocates contig pages by bootmem or hiding some memory
> > from the kernel at boot. But if contig pages are necessary only in some
> > situation, kernelcore= boot option and using page migration is a choice.
> >
> > Changelog: 2010-11-19
> > - removed no_search
> > - removed some drain_ functions because they are heavy.
> > - check -ENOMEM case
> >
> > Changelog: 2010-10-26
> > - support gfp_t
> > - support zonelist/nodemask
> > - support [base, end)
> > - support alignment
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
>
> Trivial comment below.
>
> > +EXPORT_SYMBOL_GPL(alloc_contig_pages);
> > +
> > +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order)
> > +{
> > + return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, -1,
> > + GFP_KERNEL | __GFP_MOVABLE, NULL);
> > +}
>
> We need include #include <linux/bootmem.h> for using max_pfn.
>

will add that.

Thanks,
-Kame

2010-11-22 00:30:44

by Felipe Contreras

Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Fri, Nov 19, 2010 at 10:56 PM, Andrew Morton
<[email protected]> wrote:
> On Fri, 19 Nov 2010 17:10:33 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> Hi, this is an updated version.
>>
>> No major changes from the last one except for page allocation function.
>> removed RFC.
>>
>> Order of patches is
>>
>> [1/4] move some functions from memory_hotplug.c to page_isolation.c
>> [2/4] search physically contiguous range suitable for big chunk alloc.
>> [3/4] allocate big chunk memory based on memory hotplug(migration) technique
>> [4/4] modify page allocation function.
>>
>> For what:
>>
>>   I hear there is requirements to allocate a chunk of page which is larger than
>>   MAX_ORDER. Now, some (embeded) device use a big memory chunk. To use memory,
>>   they hide some memory range by boot option (mem=) and use hidden memory
>>   for its own purpose. But this seems a lack of feature in memory management.

Actually, that's not needed any more when using memblock:
http://article.gmane.org/gmane.linux.ports.arm.omap/44978

>>   This patch adds
>>       alloc_contig_pages(start, end, nr_pages, gfp_mask)
>>   to allocate a chunk of page whose length is nr_pages from [start, end)
>>   phys address. This uses similar logic of memory-unplug, which tries to
>>   offline [start, end) pages. By this, drivers can allocate 30M or 128M or
>>   much bigger memory chunk on demand. (I allocated 1G chunk in my test).
>>
>>   But yes, because of fragmentation, this cannot guarantee 100% alloc.
>>   If alloc_contig_pages() is called in system boot up or movable_zone is used,
>>   this allocation succeeds at high rate.
>
> So this is an alternatve implementation for the functionality offered
> by Michal's "The Contiguous Memory Allocator framework".
>
>>   I tested this on x86-64, and it seems to work as expected. But feedback from
>>   embeded guys are appreciated because I think they are main user of this
>>   function.
>
> From where I sit, feedback from the embedded guys is *vital*, because
> they are indeed the main users.
>
> Michal, I haven't made a note of all the people who are interested in
> and who are potential users of this code.  Your patch series has a
> billion cc's and is up to version 6.  Could I ask that you review and
> test this code, and also hunt down other people (probably at other
> organisations) who can do likewise for us?  Because until we hear from
> those people that this work satisfies their needs, we can't really
> proceed much further.

As I've explained before, a contiguous memory allocator would be nice,
but on ARM many drivers need not only contiguous but also non-cacheable
memory, and that requires removing the memory from the normal kernel
mapping in early boot.

Cheers.

--
Felipe Contreras

2010-11-22 09:00:29

by Andi Kleen

Subject: RE: [PATCH 0/4] big chunk memory allocator v4

> > But yes, because of fragmentation, this cannot guarantee 100% alloc.
> > If alloc_contig_pages() is called in system boot up or movable_zone is used,
> > this allocation succeeds at high rate.
>
> So this is an alternatve implementation for the functionality offered
> by Michal's "The Contiguous Memory Allocator framework".

I see them more as orthogonal: Michal's code relies on preallocation
and manages the memory after that.

This code supplies the infrastructure to replace preallocation
with just using movable zones.

-Andi

2010-11-22 11:20:18

by Minchan Kim

Subject: Re: [PATCH 2/4] alloc_contig_pages() find appropriate physical memory range

On Fri, Nov 19, 2010 at 5:14 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Unlike memory hotplug, at an allocation of contigous memory range, address
> may not be a problem. IOW, if a requester of memory wants to allocate 100M of
> of contigous memory, placement of allocated memory may not be a problem.
> So, "finding a range of memory which seems to be MOVABLE" is required.
>
> This patch adds a functon to isolate a length of memory within [start, end).
> This function returns a pfn which is 1st page of isolated contigous chunk
> of given length within [start, end).
>
> If no_search=true is passed as argument, start address is always same to
> the specified "base" addresss.
>
> After isolation, free memory within this area will never be allocated.
> But some pages will remain as "Used/LRU" pages. They should be dropped by
> page reclaim or migration.
>
> Changelog: 2010-11-17
> - fixed some conding style (if-then-else)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/page_isolation.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 146 insertions(+)
>
> Index: mmotm-1117/mm/page_isolation.c
> ===================================================================
> --- mmotm-1117.orig/mm/page_isolation.c
> +++ mmotm-1117/mm/page_isolation.c
> @@ -7,6 +7,7 @@
> #include <linux/pageblock-flags.h>
> #include <linux/memcontrol.h>
> #include <linux/migrate.h>
> +#include <linux/memory_hotplug.h>
> #include <linux/mm_inline.h>
> #include "internal.h"
>
> @@ -250,3 +251,148 @@ int do_migrate_range(unsigned long start
> out:
> return ret;
> }
> +
> +/*
> + * Functions for getting contiguous MOVABLE pages in a zone.
> + */
> +struct page_range {
> + unsigned long base; /* Base address of searching contigouous block */
> + unsigned long end;
> + unsigned long pages;/* Length of contiguous block */

Nitpick.
You used nr_pages in other places.
I hope you keep the naming consistent.

> + ? ? ? int align_order;
> + ? ? ? unsigned long align_mask;

Do we really need this field 'align_mask'?
We can always derive it from align_order.
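
Something like this, perhaps (untested and purely illustrative; the
helper name below is made up):

	/* derive the mask from align_order wherever it is needed */
	static inline unsigned long range_align_mask(const struct page_range *range)
	{
		return (1UL << range->align_order) - 1;
	}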

> +};
> +
> +int __get_contig_block(unsigned long pfn, unsigned long nr_pages, void *arg)
> +{
> + ? ? ? struct page_range *blockinfo = arg;
> + ? ? ? unsigned long end;
> +
> + ? ? ? end = pfn + nr_pages;
> + ? ? ? pfn = ALIGN(pfn, 1 << blockinfo->align_order);
> + ? ? ? end = end & ~(MAX_ORDER_NR_PAGES - 1);
> +
> + ? ? ? if (end < pfn)
> + ? ? ? ? ? ? ? return 0;
> + ? ? ? if (end - pfn >= blockinfo->pages) {
> + ? ? ? ? ? ? ? blockinfo->base = pfn;
> + ? ? ? ? ? ? ? blockinfo->end = end;
> + ? ? ? ? ? ? ? return 1;
> + ? ? ? }
> + ? ? ? return 0;
> +}
> +
> +static void __trim_zone(struct zone *zone, struct page_range *range)
> +{
> + ? ? ? unsigned long pfn;
> + ? ? ? /*
> + ? ? ? ?* skip pages which dones'nt under the zone.

typo dones'nt -> doesn't :)

> + ? ? ? ?* There are some archs which zones are not in linear layout.
> + ? ? ? ?*/
> + ? ? ? if (page_zone(pfn_to_page(range->base)) != zone) {
> + ? ? ? ? ? ? ? for (pfn = range->base;
> + ? ? ? ? ? ? ? ? ? ? ? pfn < range->end;
> + ? ? ? ? ? ? ? ? ? ? ? pfn += MAX_ORDER_NR_PAGES) {
> + ? ? ? ? ? ? ? ? ? ? ? if (page_zone(pfn_to_page(pfn)) == zone)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? range->base = min(pfn, range->end);
> + ? ? ? }
> + ? ? ? /* Here, range-> base is in the zone if range->base != range->end */
> + ? ? ? for (pfn = range->base;
> + ? ? ? ? ? ?pfn < range->end;
> + ? ? ? ? ? ?pfn += MAX_ORDER_NR_PAGES) {
> + ? ? ? ? ? ? ? if (zone != page_zone(pfn_to_page(pfn))) {
> + ? ? ? ? ? ? ? ? ? ? ? pfn = pfn - MAX_ORDER_NR_PAGES;
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? }
> + ? ? ? }
> + ? ? ? range->end = min(pfn, range->end);
> + ? ? ? return;

Remove return

> +}
> +
> +/*
> + * This function is for finding a contiguous memory block which has length
> + * of pages and MOVABLE. If it finds, make the range of pages as ISOLATED
> + * and return the first page's pfn.
> + * This checks all pages in the returned range is free of Pg_LRU. To reduce
> + * the risk of false-positive testing, lru_add_drain_all() should be called
> + * before this function to reduce pages on pagevec for zones.
> + */
> +
> +static unsigned long find_contig_block(unsigned long base,
> + ? ? ? ? ? ? ? unsigned long end, unsigned long pages,
> + ? ? ? ? ? ? ? int align_order, struct zone *zone)
> +{
> + ? ? ? unsigned long pfn, pos;
> + ? ? ? struct page_range blockinfo;
> + ? ? ? int ret;
> +
> + ? ? ? VM_BUG_ON(pages & (MAX_ORDER_NR_PAGES - 1));
> + ? ? ? VM_BUG_ON(base & ((1 << align_order) - 1));
> +retry:
> + ? ? ? blockinfo.base = base;
> + ? ? ? blockinfo.end = end;
> + ? ? ? blockinfo.pages = pages;
> + ? ? ? blockinfo.align_order = align_order;
> + ? ? ? blockinfo.align_mask = (1 << align_order) - 1;

We don't need this.

> + ? ? ? /*
> + ? ? ? ?* At first, check physical page layout and skip memory holes.
> + ? ? ? ?*/
> + ? ? ? ret = walk_system_ram_range(base, end - base, &blockinfo,
> + ? ? ? ? ? ? ? __get_contig_block);
> + ? ? ? if (!ret)
> + ? ? ? ? ? ? ? return 0;
> + ? ? ? /* check contiguous pages in a zone */
> + ? ? ? __trim_zone(zone, &blockinfo);
> +
> + ? ? ? /*
> + ? ? ? ?* Ok, we found contiguous memory chunk of size. Isolate it.
> + ? ? ? ?* We just search MAX_ORDER aligned range.
> + ? ? ? ?*/
> + ? ? ? for (pfn = blockinfo.base; pfn + pages <= blockinfo.end;
> + ? ? ? ? ? ?pfn += (1 << align_order)) {
> + ? ? ? ? ? ? ? struct zone *z = page_zone(pfn_to_page(pfn));
> + ? ? ? ? ? ? ? if (z != zone)
> + ? ? ? ? ? ? ? ? ? ? ? continue;

Could we make sure that a range which passes __trim_zone has every pfn
in the zone we want?
Repeating the zone check here is rather annoying.
I mean, let __get_contig_block or __trim_zone do the zone check
so that we can remove the zone check here.

> +
> + ? ? ? ? ? ? ? spin_lock_irq(&z->lock);
> + ? ? ? ? ? ? ? pos = pfn;
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* Check the range only contains free pages or LRU pages.
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? while (pos < pfn + pages) {
> + ? ? ? ? ? ? ? ? ? ? ? struct page *p;
> +
> + ? ? ? ? ? ? ? ? ? ? ? if (!pfn_valid_within(pos))
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? ? ? ? ? p = pfn_to_page(pos);
> + ? ? ? ? ? ? ? ? ? ? ? if (PageReserved(p))
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? ? ? ? ? if (!page_count(p)) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (!PageBuddy(p))
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? pos++;
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? else
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? pos += (1 << page_order(p));
> + ? ? ? ? ? ? ? ? ? ? ? } else if (PageLRU(p)) {

Could we check get_pageblock_migratetype(page) == MIGRATE_MOVABLE
here and bail out early?
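
Something like the following, maybe (untested; just to show one place
the check could go in the loop quoted above):

	} else if (PageLRU(p)) {
		/* bail out early if this pageblock is not MOVABLE anyway */
		if (get_pageblock_migratetype(p) != MIGRATE_MOVABLE)
			break;
		pos++;
	} else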

> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? pos++;
> + ? ? ? ? ? ? ? ? ? ? ? } else
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? spin_unlock_irq(&z->lock);
> + ? ? ? ? ? ? ? if ((pos == pfn + pages)) {
> + ? ? ? ? ? ? ? ? ? ? ? if (!start_isolate_page_range(pfn, pfn + pages))
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return pfn;
> + ? ? ? ? ? ? ? } else/* the chunk including "pos" should be skipped */
> + ? ? ? ? ? ? ? ? ? ? ? pfn = pos & ~((1 << align_order) - 1);
> + ? ? ? ? ? ? ? cond_resched();
> + ? ? ? }
> +
> + ? ? ? /* failed */
> + ? ? ? if (blockinfo.end + pages <= end) {
> + ? ? ? ? ? ? ? /* Move base address and find the next block of RAM. */
> + ? ? ? ? ? ? ? base = blockinfo.end;
> + ? ? ? ? ? ? ? goto retry;
> + ? ? ? }
> + ? ? ? return 0;

If the base is 0, couldn't a successful allocation legitimately return pfn 0?
On x86 with FLATMEM that can't happen, but I think it might be possible on
some architectures. Just guessing.

How about returning a negative value on failure and passing the first and
last page pfns back through out parameters (base, end)?
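
As an illustration only (the exact convention and names here are made up),
the prototype could then look something like:

	/*
	 * Returns 0 on success and reports the isolated range through
	 * *startp/*endp, or a negative errno on failure, so that pfn 0
	 * can still be a valid first page.
	 */
	static int find_contig_block(unsigned long base, unsigned long end,
				     unsigned long nr_pages, int align_order,
				     struct zone *zone,
				     unsigned long *startp, unsigned long *endp);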

> +}
>
>



--
Kind regards,
Minchan Kim

2010-11-22 11:44:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/4] alloc_contig_pages() allocate big chunk memory using migration

On Fri, Nov 19, 2010 at 5:15 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Add an function to allocate contiguous memory larger than MAX_ORDER.
> The main difference between usual page allocator is that this uses
> memory offline technique (Isolate pages and migrate remaining pages.).
>
> I think this is not 100% solution because we can't avoid fragmentation,
> but we have kernelcore= boot option and can create MOVABLE zone. That
> helps us to allow allocate a contiguous range on demand.

And later we can use compaction and reclaim, too.
So I think this approach is the way we have to go.

>
> The new function is
>
> ?alloc_contig_pages(base, end, nr_pages, alignment)
>
> This function will allocate contiguous pages of nr_pages from the range
> [base, end). If [base, end) is bigger than nr_pages, some pfn which
> meats alignment will be allocated. If alignment is smaller than MAX_ORDER,

typo: meet

> it will be raised to be MAX_ORDER.
>
> __alloc_contig_pages() has much more arguments.
>
>
> Some drivers allocates contig pages by bootmem or hiding some memory
> from the kernel at boot. But if contig pages are necessary only in some
> situation, kernelcore= boot option and using page migration is a choice.
>
> Changelog: 2010-11-19
> ?- removed no_search
> ?- removed some drain_ functions because they are heavy.
> ?- check -ENOMEM case
>
> Changelog: 2010-10-26
> ?- support gfp_t
> ?- support zonelist/nodemask
> ?- support [base, end)
> ?- support alignment
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> ?include/linux/page-isolation.h | ? 15 ++
> ?mm/page_alloc.c ? ? ? ? ? ? ? ?| ? 29 ++++
> ?mm/page_isolation.c ? ? ? ? ? ?| ?242 +++++++++++++++++++++++++++++++++++++++++
> ?3 files changed, 286 insertions(+)
>
> Index: mmotm-1117/mm/page_isolation.c
> ===================================================================
> --- mmotm-1117.orig/mm/page_isolation.c
> +++ mmotm-1117/mm/page_isolation.c
> @@ -5,6 +5,7 @@
> ?#include <linux/mm.h>
> ?#include <linux/page-isolation.h>
> ?#include <linux/pageblock-flags.h>
> +#include <linux/swap.h>
> ?#include <linux/memcontrol.h>
> ?#include <linux/migrate.h>
> ?#include <linux/memory_hotplug.h>
> @@ -396,3 +397,244 @@ retry:
> ? ? ? ?}
> ? ? ? ?return 0;
> ?}
> +
> +/*
> + * Comparing caller specified [user_start, user_end) with physical memory layout
> + * [phys_start, phys_end). If no intersection is longer than nr_pages, return 1.
> + * If there is an intersection, return 0 and fill range in [*start, *end)

I understand the goal of the function,
but the comment is rather awkward.

> + */
> +static int
> +__calc_search_range(unsigned long user_start, unsigned long user_end,

Personally, I don't like the function name.
How about "__adjust_search_range"?
But I am not against this name strongly. :)

> + ? ? ? ? ? ? ? unsigned long nr_pages,
> + ? ? ? ? ? ? ? unsigned long phys_start, unsigned long phys_end,
> + ? ? ? ? ? ? ? unsigned long *start, unsigned long *end)
> +{
> + ? ? ? if ((user_start >= phys_end) || (user_end <= phys_start))
> + ? ? ? ? ? ? ? return 1;
> + ? ? ? if (user_start <= phys_start) {
> + ? ? ? ? ? ? ? *start = phys_start;
> + ? ? ? ? ? ? ? *end = min(user_end, phys_end);
> + ? ? ? } else {
> + ? ? ? ? ? ? ? *start = user_start;
> + ? ? ? ? ? ? ? *end = min(user_end, phys_end);
> + ? ? ? }
> + ? ? ? if (*end - *start < nr_pages)
> + ? ? ? ? ? ? ? return 1;
> + ? ? ? return 0;
> +}
> +
> +
> +/**
> + * __alloc_contig_pages - allocate a contiguous physical pages
> + * @base: the lowest pfn which caller wants.
> + * @end: ?the highest pfn which caller wants.
> + * @nr_pages: the length of a chunk of pages to be allocated.

the number of pages to be allocated.

> + * @align_order: alignment of start address of returned chunk in order.
> + * ? Returned' page's order will be aligned to (1 << align_order).If smaller
> + * ? than MAX_ORDER, it's raised to MAX_ORDER.
> + * @node: allocate near memory to the node, If -1, current node is used.
> + * @gfpflag: used to specify what zone the memory should be from.
> + * @nodemask: allocate memory within the nodemask.
> + *
> + * Search a memory range [base, end) and allocates physically contiguous
> + * pages. If end - base is larger than nr_pages, a chunk in [base, end) will
> + * be allocated
> + *
> + * This returns a page of the beginning of contiguous block. At failure, NULL
> + * is returned.
> + *
> + * Limitation: at allocation, nr_pages may be increased to be aligned to
> + * MAX_ORDER before searching a range. So, even if there is a enough chunk
> + * for nr_pages, it may not be able to be allocated. Extra tail pages of
> + * allocated chunk is returned to buddy allocator before returning the caller.
> + */
> +
> +#define MIGRATION_RETRY ? ? ? ?(5)
> +struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
> + ? ? ? ? ? ? ? ? ? ? ? unsigned long nr_pages, int align_order,
> + ? ? ? ? ? ? ? ? ? ? ? int node, gfp_t gfpflag, nodemask_t *mask)
> +{
> + ? ? ? unsigned long found, aligned_pages, start;
> + ? ? ? struct page *ret = NULL;
> + ? ? ? int migration_failed;
> + ? ? ? unsigned long align_mask;
> + ? ? ? struct zoneref *z;
> + ? ? ? struct zone *zone;
> + ? ? ? struct zonelist *zonelist;
> + ? ? ? enum zone_type highzone_idx = gfp_zone(gfpflag);
> + ? ? ? unsigned long zone_start, zone_end, rs, re, pos;
> +
> + ? ? ? if (node == -1)
> + ? ? ? ? ? ? ? node = numa_node_id();
> +
> + ? ? ? /* check unsupported flags */
> + ? ? ? if (gfpflag & __GFP_NORETRY)
> + ? ? ? ? ? ? ? return NULL;
> + ? ? ? if ((gfpflag & (__GFP_WAIT | __GFP_IO | __GFP_FS)) !=
> + ? ? ? ? ? ? ? (__GFP_WAIT | __GFP_IO | __GFP_FS))
> + ? ? ? ? ? ? ? return NULL;

Why do we have to care about __GFP_IO|__GFP_FS?
If you consider compaction/reclaim later, I am OK.

> +
> + ? ? ? if (gfpflag & __GFP_THISNODE)
> + ? ? ? ? ? ? ? zonelist = &NODE_DATA(node)->node_zonelists[1];
> + ? ? ? else
> + ? ? ? ? ? ? ? zonelist = &NODE_DATA(node)->node_zonelists[0];
> + ? ? ? /*
> + ? ? ? ?* Base/nr_page/end should be aligned to MAX_ORDER
> + ? ? ? ?*/
> + ? ? ? found = 0;
> +
> + ? ? ? if (align_order < MAX_ORDER)
> + ? ? ? ? ? ? ? align_order = MAX_ORDER;
> +
> + ? ? ? align_mask = (1 << align_order) - 1;
> + ? ? ? /*
> + ? ? ? ?* We allocates MAX_ORDER aligned pages and cut tail pages later.
> + ? ? ? ?*/
> + ? ? ? aligned_pages = ALIGN(nr_pages, (1 << MAX_ORDER));
> + ? ? ? /*
> + ? ? ? ?* If end - base == nr_pages, we can't search range. base must be
> + ? ? ? ?* aligned.
> + ? ? ? ?*/
> + ? ? ? if ((end - base == nr_pages) && (base & align_mask))
> + ? ? ? ? ? ? ? return NULL;
> +
> + ? ? ? base = ALIGN(base, (1 << align_order));
> + ? ? ? if ((end <= base) || (end - base < aligned_pages))
> + ? ? ? ? ? ? ? return NULL;
> +
> + ? ? ? /*
> + ? ? ? ?* searching contig memory range within [pos, end).
> + ? ? ? ?* pos is updated at migration failure to find next chunk in zone.
> + ? ? ? ?* pos is reset to the base at searching next zone.
> + ? ? ? ?* (see for_each_zone_zonelist_nodemask in mmzone.h)
> + ? ? ? ?*
> + ? ? ? ?* Note: we cannot assume zones/nodes are in linear memory layout.
> + ? ? ? ?*/
> + ? ? ? z = first_zones_zonelist(zonelist, highzone_idx, mask, &zone);
> + ? ? ? pos = base;
> +retry:
> + ? ? ? if (!zone)
> + ? ? ? ? ? ? ? return NULL;
> +
> + ? ? ? zone_start = ALIGN(zone->zone_start_pfn, 1 << align_order);
> + ? ? ? zone_end = zone->zone_start_pfn + zone->spanned_pages;
> +
> + ? ? ? /* check [pos, end) is in this zone. */
> + ? ? ? if ((pos >= end) ||
> + ? ? ? ? ? ?(__calc_search_range(pos, end, aligned_pages,
> + ? ? ? ? ? ? ? ? ? ? ? zone_start, zone_end, &rs, &re))) {
> +next_zone:
> + ? ? ? ? ? ? ? /* go to the next zone */
> + ? ? ? ? ? ? ? z = next_zones_zonelist(++z, highzone_idx, mask, &zone);
> + ? ? ? ? ? ? ? /* reset the pos */
> + ? ? ? ? ? ? ? pos = base;
> + ? ? ? ? ? ? ? goto retry;
> + ? ? ? }
> + ? ? ? /* [pos, end) is trimmed to [rs, re) in this zone. */
> + ? ? ? pos = rs;

The 'pos' isn't used anywhere below this point.

> +
> + ? ? ? found = find_contig_block(rs, re, aligned_pages, align_order, zone);
> + ? ? ? if (!found)
> + ? ? ? ? ? ? ? goto next_zone;
> +
> + ? ? ? /*
> + ? ? ? ?* Because we isolated the range, free pages in the range will never
> + ? ? ? ?* be (re)allocated. scan_lru_pages() finds the next PG_lru page in
> + ? ? ? ?* the range and returns 0 if it reaches the end.
> + ? ? ? ?*/
> + ? ? ? migration_failed = 0;
> + ? ? ? rs = found;
> + ? ? ? re = found + aligned_pages;
> + ? ? ? for (rs = scan_lru_pages(rs, re);
> + ? ? ? ? ? ?rs && rs < re;
> + ? ? ? ? ? ?rs = scan_lru_pages(rs, re)) {
> + ? ? ? ? ? ? ? int rc = do_migrate_range(rs, re);
> + ? ? ? ? ? ? ? if (!rc)
> + ? ? ? ? ? ? ? ? ? ? ? migration_failed = 0;
> + ? ? ? ? ? ? ? else {
> + ? ? ? ? ? ? ? ? ? ? ? /* it's better to try another block ? */
> + ? ? ? ? ? ? ? ? ? ? ? if (++migration_failed >= MIGRATION_RETRY)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? ? ? ? ? if (rc == -EBUSY) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* There are unstable pages.on pagevec. */
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_add_drain_all();
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?* there may be pages on pcplist before
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?* we mark the range as ISOLATED.
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? drain_all_pages();
> + ? ? ? ? ? ? ? ? ? ? ? } else if (rc == -ENOMEM)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto nomem;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? cond_resched();
> + ? ? ? }
> + ? ? ? if (!migration_failed) {
> + ? ? ? ? ? ? ? /* drop all pages in pagevec and pcp list */
> + ? ? ? ? ? ? ? lru_add_drain_all();
> + ? ? ? ? ? ? ? drain_all_pages();
> + ? ? ? }
> + ? ? ? /* Check all pages are isolated */
> + ? ? ? if (test_pages_isolated(found, found + aligned_pages)) {
> + ? ? ? ? ? ? ? undo_isolate_page_range(found, aligned_pages);
> + ? ? ? ? ? ? ? /*
> + ? ? ? ? ? ? ? ?* We failed at [found...found+aligned_pages) migration.
> + ? ? ? ? ? ? ? ?* "rs" is the last pfn scan_lru_pages() found that the page
> + ? ? ? ? ? ? ? ?* is LRU page. Update pos and try next chunk.
> + ? ? ? ? ? ? ? ?*/
> + ? ? ? ? ? ? ? pos = ALIGN(rs + 1, (1 << align_order));
> + ? ? ? ? ? ? ? goto retry; /* goto next chunk */
> + ? ? ? }
> + ? ? ? /*
> + ? ? ? ?* OK, here, [found...found+pages) memory are isolated.
> + ? ? ? ?* All pages in the range will be moved into the list with
> + ? ? ? ?* page_count(page)=1.
> + ? ? ? ?*/
> + ? ? ? ret = pfn_to_page(found);
> + ? ? ? alloc_contig_freed_pages(found, found + aligned_pages, gfpflag);
> + ? ? ? /* unset ISOLATE */
> + ? ? ? undo_isolate_page_range(found, aligned_pages);
> + ? ? ? /* Free unnecessary pages in tail */
> + ? ? ? for (start = found + nr_pages; start < found + aligned_pages; start++)
> + ? ? ? ? ? ? ? __free_page(pfn_to_page(start));
> + ? ? ? return ret;
> +nomem:
> + ? ? ? undo_isolate_page_range(found, aligned_pages);
> + ? ? ? return NULL;
> +}
> +EXPORT_SYMBOL_GPL(__alloc_contig_pages);
> +
> +void free_contig_pages(struct page *page, int nr_pages)
> +{
> + ? ? ? int i;
> + ? ? ? for (i = 0; i < nr_pages; i++)
> + ? ? ? ? ? ? ? __free_page(page + i);
> +}
> +EXPORT_SYMBOL_GPL(free_contig_pages);
> +
> +/*
> + * Allocated pages will not be MOVABLE but MOVABLE zone is a suitable
> + * for allocating big chunk. So, using ZONE_MOVABLE is a default.
> + */
> +
> +struct page *alloc_contig_pages(unsigned long base, unsigned long end,
> + ? ? ? ? ? ? ? ? ? ? ? unsigned long nr_pages, int align_order)
> +{
> + ? ? ? return __alloc_contig_pages(base, end, nr_pages, align_order, -1,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GFP_KERNEL | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages);
> +
> +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order)
> +{
> + ? ? ? return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, -1,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GFP_KERNEL | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages_host);
> +
> +struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int align_order)
> +{
> + ? ? ? return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, nid,
> + ? ? ? ? ? ? ? ? ? ? ? GFP_KERNEL | __GFP_THISNODE | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages_node);
> Index: mmotm-1117/include/linux/page-isolation.h
> ===================================================================
> --- mmotm-1117.orig/include/linux/page-isolation.h
> +++ mmotm-1117/include/linux/page-isolation.h
> @@ -32,6 +32,8 @@ test_pages_isolated(unsigned long start_
> ?*/
> ?extern int set_migratetype_isolate(struct page *page);
> ?extern void unset_migratetype_isolate(struct page *page);
> +extern void alloc_contig_freed_pages(unsigned long pfn,
> + ? ? ? ? ? ? ? unsigned long pages, gfp_t flag);
>
> ?/*
> ?* For migration.
> @@ -41,4 +43,17 @@ int test_pages_in_a_zone(unsigned long s
> ?unsigned long scan_lru_pages(unsigned long start, unsigned long end);
> ?int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);
>
> +/*
> + * For large alloc.
> + */
> +struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long nr_pages, int align_order,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int node, gfp_t flag, nodemask_t *mask);
> +struct page *alloc_contig_pages(unsigned long base, unsigned long end,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long nr_pages, int align_order);
> +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order);
> +struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
> + ? ? ? ? ? ? ? int align_order);
> +void free_contig_pages(struct page *page, int nr_pages);
> +
> ?#endif
> Index: mmotm-1117/mm/page_alloc.c
> ===================================================================
> --- mmotm-1117.orig/mm/page_alloc.c
> +++ mmotm-1117/mm/page_alloc.c
> @@ -5447,6 +5447,35 @@ out:
> ? ? ? ?spin_unlock_irqrestore(&zone->lock, flags);
> ?}
>
> +
> +void alloc_contig_freed_pages(unsigned long pfn, ?unsigned long end, gfp_t flag)
> +{
> + ? ? ? struct page *page;
> + ? ? ? struct zone *zone;
> + ? ? ? int order;
> + ? ? ? unsigned long start = pfn;
> +
> + ? ? ? zone = page_zone(pfn_to_page(pfn));
> + ? ? ? spin_lock_irq(&zone->lock);
> + ? ? ? while (pfn < end) {
> + ? ? ? ? ? ? ? VM_BUG_ON(!pfn_valid(pfn));
> + ? ? ? ? ? ? ? page = pfn_to_page(pfn);
> + ? ? ? ? ? ? ? VM_BUG_ON(page_count(page));
> + ? ? ? ? ? ? ? VM_BUG_ON(!PageBuddy(page));
> + ? ? ? ? ? ? ? list_del(&page->lru);
> + ? ? ? ? ? ? ? order = page_order(page);
> + ? ? ? ? ? ? ? zone->free_area[order].nr_free--;
> + ? ? ? ? ? ? ? rmv_page_order(page);
> + ? ? ? ? ? ? ? __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> + ? ? ? ? ? ? ? pfn += 1 << order;
> + ? ? ? }
> + ? ? ? spin_unlock_irq(&zone->lock);
> +
> + ? ? ? /*After this, pages in the range can be freed one be one */
> + ? ? ? for (pfn = start; pfn < end; pfn++)
> + ? ? ? ? ? ? ? prep_new_page(pfn_to_page(pfn), 0, flag);
> +}
> +
> ?#ifdef CONFIG_MEMORY_HOTREMOVE
> ?/*
> ?* All pages in the range must be isolated before calling this.
>
>



--
Kind regards,
Minchan Kim

2010-11-22 12:01:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 4/4] alloc_contig_pages() use better allocation function for migration

On Fri, Nov 19, 2010 at 5:16 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Old story.
> Because we cannot assume which memory section will be offlined next,
> hotremove_migrate_alloc() just uses alloc_page(). i.e. make no decision
> where the page should be migrate into. Considering memory hotplug's
> nature, the next memory section near to a section which is being removed
> will be removed in the next. So, migrate pages to the same node of original
> page doesn't make sense in many case, it just increases load.
> Migration destination page is allocated from the node where offlining script
> runs.
>
> Now, contiguous-alloc uses do_migrate_range(). In this case, migration
> destination node should be the same node of migration source page.
>
> This patch modifies hotremove_migrate_alloc() and pass "nid" to it.
> Memory hotremove will pass -1. So, if the page will be moved to
> the node where offlining script runs....no behavior changes.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2010-11-23 15:44:18

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Mon, 22 Nov 2010 09:59:57 +0100, Kleen, Andi <[email protected]> wrote:

>> > But yes, because of fragmentation, this cannot guarantee 100%
>> alloc.
>> > If alloc_contig_pages() is called in system boot up or movable_zone
>> is used,
>> > this allocation succeeds at high rate.
>>
>> So this is an alternative implementation for the functionality offered
>> by Michal's "The Contiguous Memory Allocator framework".
>
> I see them more as orthogonal: Michal's code relies on preallocation
> and manages the memory after that.

Yes and no. The v6 version adds not-yet-finished support for sharing
the preallocated blocks with the page allocator (so if CMA is not using
the memory, the page allocator can allocate it, and when CMA finally
wants to use it, the allocated pages are migrated).

In the v6 implementation I have added a new migration type (I cannot seem
to find who proposed such an approach first). Once I finish debugging the
code I'll try to work things out without adding an additional entity (that
is, a new migration type).

--
Best regards, _ _
| Humble Liege of Serenely Enlightened Majesty of o' \,=./ `o
| Computer Science, Michał "mina86" Nazarewicz (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--

2010-11-23 15:46:08

by Michal Nazarewicz

[permalink] [raw]
Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Mon, 22 Nov 2010 01:04:31 +0100, KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Fri, 19 Nov 2010 12:56:53 -0800
> Andrew Morton <[email protected]> wrote:
>
>> On Fri, 19 Nov 2010 17:10:33 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>> > Hi, this is an updated version.
>> >
>> > No major changes from the last one except for page allocation function.
>> > removed RFC.
>> >
>> > Order of patches is
>> >
>> > [1/4] move some functions from memory_hotplug.c to page_isolation.c
>> > [2/4] search physically contiguous range suitable for big chunk alloc.
>> > [3/4] allocate big chunk memory based on memory hotplug(migration) technique
>> > [4/4] modify page allocation function.
>> >
>> > For what:
>> >
>> > I hear there is requirements to allocate a chunk of page which is larger than
>> > MAX_ORDER. Now, some (embeded) device use a big memory chunk. To use memory,
>> > they hide some memory range by boot option (mem=) and use hidden memory
>> > for its own purpose. But this seems a lack of feature in memory management.
>> >
>> > This patch adds
>> > alloc_contig_pages(start, end, nr_pages, gfp_mask)
>> > to allocate a chunk of page whose length is nr_pages from [start, end)
>> > phys address. This uses similar logic of memory-unplug, which tries to
>> > offline [start, end) pages. By this, drivers can allocate 30M or 128M or
>> > much bigger memory chunk on demand. (I allocated 1G chunk in my test).
>> >
>> > But yes, because of fragmentation, this cannot guarantee 100% alloc.
>> > If alloc_contig_pages() is called in system boot up or movable_zone is used,
>> > this allocation succeeds at high rate.
>>
>> So this is an alternative implementation for the functionality offered
>> by Michal's "The Contiguous Memory Allocator framework".
>>
>
> Yes, this will be a backend for that kind of work.

As a matter of fact CMA's v6 tries to use code "borrowed" from the alloc_contig_pages()
patches.

The most important difference is that alloc_contig_pages() would look for a chunk
of memory that can be allocated and then perform migration whereas CMA assumes that
regions it controls are always "migratable".

Also, I've tried to remove the requirement for MAX_ORDER alignment.

> I think there are two ways to allocate contiguous pages larger than MAX_ORDER.
>
> 1) hide some memory at boot and add an another memory allocator.
> 2) support a range allocator as [start, end)
>
> This is a trial of 2). I used the memory-hotplug technique because I know it
> to some extent. This patch itself has no "map" and "management" function, so
> that should be developed in another patch (but maybe it will not be my work.)

Yes, this is also a valid point. For my use cases, alloc_contig_pages()
would probably not be enough on its own and would require some management
code to be added on top.

>> > I tested this on x86-64, and it seems to work as expected. But feedback from
>> > embeded guys are appreciated because I think they are main user of this
>> > function.
>>
>> From where I sit, feedback from the embedded guys is *vital*, because
>> they are indeed the main users.
>>
>> Michal, I haven't made a note of all the people who are interested in
>> and who are potential users of this code. Your patch series has a
>> billion cc's and is up to version 6.

Ah, yes... I was thinking about shrinking the cc list but didn't want to
seem rude by removing people who have shown interest in the previously
posted versions.

>> Could I ask that you review and
>> test this code, and also hunt down other people (probably at other
>> organisations) who can do likewise for us? Because until we hear from
>> those people that this work satisfies their needs, we can't really
>> proceed much further.

A few things then:

1. As Felipe mentioned, on ARM it is often desired to have the memory
mapped as non-cacheable, which most often means that the memory never
reaches the page allocator. This means that alloc_contig_pages()
would not be suitable for cases where one needs such memory.

Or could this be overcome by adding the memory back as highmem? But
then, it would force highmem support to be compiled in even if the
platform does not really need it.

2. Device drivers should not know by themselves what ranges of memory to
allocate from. Moreover, some device drivers could require allocating
different buffers from different ranges. As such, this would require
some management code on top of alloc_contig_pages().

3. When posting hwmem, Johan Mossberg mentioned that he'd like to see a
notion of "pinning" chunks (so that unpinned chunks can be moved around,
while hardware is not using them, to defragment memory). This would
again require some management code on top of alloc_contig_pages().

4. I might be mistaken here, but the way I understand ZONE_MOVABLE to
work is that it is cut off from the end of memory. Or am I talking
nonsense? My concern is that at least one chip I'm working with requires
allocations from different memory banks, which would basically mean that
there would have to be two movable zones, i.e.:

+-------------------+-------------------+
|   Memory Bank #1  |   Memory Bank #2  |
+---------+---------+---------+---------+
|  normal | movable |  normal | movable |
+---------+---------+---------+---------+

So even though I'm personally somehow drawn by alloc_contig_pages()'s
simplicity (compared to CMA at least), those quick thoughts make me think
that alloc_contig_pages() would work rather as a backend (as Kamezawa
mentioned) for some, maybe even tiny but still present, management code
which would handle "marking memory fragments as ZONE_MOVABLE" (whatever
that would involve) and deciding which memory ranges drivers can allocate
from.
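
Just to give a feel for what I mean by such a management layer, a rough,
untested sketch (the contig_region/region_alloc names are made up; only
alloc_contig_pages() is from the posted patches):

	/* a driver-visible "region" that simply remembers which pfn range
	 * the platform code decided this driver may allocate from */
	struct contig_region {
		unsigned long start_pfn;
		unsigned long end_pfn;
	};

	static struct page *region_alloc(struct contig_region *r,
					 unsigned long nr_pages, int align_order)
	{
		/* alloc_contig_pages() does the searching and migration;
		 * the layer above only chooses the allowed range */
		return alloc_contig_pages(r->start_pfn, r->end_pfn,
					  nr_pages, align_order);
	}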

I'm also wondering whether alloc_contig_pages()'s first-fit is suitable but
that probably cannot be judged without some benchmarks.

--
Best regards, _ _
| Humble Liege of Serenely Enlightened Majesty of o' \,=./ `o
| Computer Science, Michał "mina86" Nazarewicz (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--

2010-11-24 00:21:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] alloc_contig_pages() find appropriate physical memory range

On Mon, 22 Nov 2010 20:20:14 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, Nov 19, 2010 at 5:14 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Unlike memory hotplug, at an allocation of contigous memory range, address
> > may not be a problem. IOW, if a requester of memory wants to allocate 100M of
> > of contigous memory, placement of allocated memory may not be a problem.
> > So, "finding a range of memory which seems to be MOVABLE" is required.
> >
> > This patch adds a functon to isolate a length of memory within [start, end).
> > This function returns a pfn which is 1st page of isolated contigous chunk
> > of given length within [start, end).
> >
> > If no_search=true is passed as argument, start address is always same to
> > the specified "base" addresss.
> >
> > After isolation, free memory within this area will never be allocated.
> > But some pages will remain as "Used/LRU" pages. They should be dropped by
> > page reclaim or migration.
> >
> > Changelog: 2010-11-17
> >  - fixed some conding style (if-then-else)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  mm/page_isolation.c |  146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 146 insertions(+)
> >
> > Index: mmotm-1117/mm/page_isolation.c
> > ===================================================================
> > --- mmotm-1117.orig/mm/page_isolation.c
> > +++ mmotm-1117/mm/page_isolation.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/pageblock-flags.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/migrate.h>
> > +#include <linux/memory_hotplug.h>
> >  #include <linux/mm_inline.h>
> >  #include "internal.h"
> >
> > @@ -250,3 +251,148 @@ int do_migrate_range(unsigned long start
> >  out:
> >        return ret;
> >  }
> > +
> > +/*
> > + * Functions for getting contiguous MOVABLE pages in a zone.
> > + */
> > +struct page_range {
> > +       unsigned long base; /* Base address of searching contigouous block */
> > +       unsigned long end;
> > +       unsigned long pages;/* Length of contiguous block */
>
> Nitpick.
> You used nr_pages in other places.
> I hope you keep the naming consistent.
>
Sure, I'll fix it.

> > +       int align_order;
> > +       unsigned long align_mask;
>
> Do we really need this field 'align_mask'?

No.

> We can always derive it from align_order.
>

So always write ((1 << align_order) - 1) inline where it's needed? Hmm.


> > +};
> > +
> > +int __get_contig_block(unsigned long pfn, unsigned long nr_pages, void *arg)
> > +{
> > +       struct page_range *blockinfo = arg;
> > +       unsigned long end;
> > +
> > +       end = pfn + nr_pages;
> > +       pfn = ALIGN(pfn, 1 << blockinfo->align_order);
> > +       end = end & ~(MAX_ORDER_NR_PAGES - 1);
> > +
> > +       if (end < pfn)
> > +               return 0;
> > +       if (end - pfn >= blockinfo->pages) {
> > +               blockinfo->base = pfn;
> > +               blockinfo->end = end;
> > +               return 1;
> > +       }
> > +       return 0;
> > +}
> > +
> > +static void __trim_zone(struct zone *zone, struct page_range *range)
> > +{
> > +       unsigned long pfn;
> > +       /*
> > +        * skip pages which dones'nt under the zone.
>
> typo dones'nt -> doesn't :)
>
will fix.

> > +        * There are some archs which zones are not in linear layout.
> > +        */
> > +       if (page_zone(pfn_to_page(range->base)) != zone) {
> > +               for (pfn = range->base;
> > +                       pfn < range->end;
> > +                       pfn += MAX_ORDER_NR_PAGES) {
> > +                       if (page_zone(pfn_to_page(pfn)) == zone)
> > +                               break;
> > +               }
> > +               range->base = min(pfn, range->end);
> > +       }
> > +       /* Here, range-> base is in the zone if range->base != range->end */
> > +       for (pfn = range->base;
> > +            pfn < range->end;
> > +            pfn += MAX_ORDER_NR_PAGES) {
> > +               if (zone != page_zone(pfn_to_page(pfn))) {
> > +                       pfn = pfn - MAX_ORDER_NR_PAGES;
> > +                       break;
> > +               }
> > +       }
> > +       range->end = min(pfn, range->end);
> > +       return;
>
> Remove return
>
Ah, ok.

> > +}
> > +
> > +/*
> > + * This function is for finding a contiguous memory block which has length
> > + * of pages and MOVABLE. If it finds, make the range of pages as ISOLATED
> > + * and return the first page's pfn.
> > + * This checks all pages in the returned range is free of Pg_LRU. To reduce
> > + * the risk of false-positive testing, lru_add_drain_all() should be called
> > + * before this function to reduce pages on pagevec for zones.
> > + */
> > +
> > +static unsigned long find_contig_block(unsigned long base,
> > +               unsigned long end, unsigned long pages,
> > +               int align_order, struct zone *zone)
> > +{
> > +       unsigned long pfn, pos;
> > +       struct page_range blockinfo;
> > +       int ret;
> > +
> > +       VM_BUG_ON(pages & (MAX_ORDER_NR_PAGES - 1));
> > +       VM_BUG_ON(base & ((1 << align_order) - 1));
> > +retry:
> > +       blockinfo.base = base;
> > +       blockinfo.end = end;
> > +       blockinfo.pages = pages;
> > +       blockinfo.align_order = align_order;
> > +       blockinfo.align_mask = (1 << align_order) - 1;
>
> We don't need this.
>
You mean the mask?

> > +       /*
> > +        * At first, check physical page layout and skip memory holes.
> > +        */
> > +       ret = walk_system_ram_range(base, end - base, &blockinfo,
> > +               __get_contig_block);
> > +       if (!ret)
> > +               return 0;
> > +       /* check contiguous pages in a zone */
> > +       __trim_zone(zone, &blockinfo);
> > +
> > +       /*
> > +        * Ok, we found contiguous memory chunk of size. Isolate it.
> > +        * We just search MAX_ORDER aligned range.
> > +        */
> > +       for (pfn = blockinfo.base; pfn + pages <= blockinfo.end;
> > +            pfn += (1 << align_order)) {
> > +               struct zone *z = page_zone(pfn_to_page(pfn));
> > +               if (z != zone)
> > +                       continue;
>
> Could we make sure that a range which passes __trim_zone has every pfn
> in the zone we want?
> Repeating the zone check here is rather annoying.
> I mean, let __get_contig_block or __trim_zone do the zone check
> so that we can remove the zone check here.

Ah, yes. I'll remove this.

>
> > +
> > +               spin_lock_irq(&z->lock);
> > +               pos = pfn;
> > +               /*
> > +                * Check the range only contains free pages or LRU pages.
> > +                */
> > +               while (pos < pfn + pages) {
> > +                       struct page *p;
> > +
> > +                       if (!pfn_valid_within(pos))
> > +                               break;
> > +                       p = pfn_to_page(pos);
> > +                       if (PageReserved(p))
> > +                               break;
> > +                       if (!page_count(p)) {
> > +                               if (!PageBuddy(p))
> > +                                       pos++;
> > +                               else
> > +                                       pos += (1 << page_order(p));
> > +                       } else if (PageLRU(p)) {
>
> Could we check get_pageblock_migratetype(page) == MIGRATE_MOVABLE
> here and bail out early?
>

I'm not sure that's a good idea. The pageblock type can be fragmented, and
even if a pageblock's type is not MOVABLE, all pages in the pageblock may be
free. Because PageLRU() is checked, all the required 'quick' checks are
already done, I think.


> > +                               pos++;
> > +                       } else
> > +                               break;
> > +               }
> > +               spin_unlock_irq(&z->lock);
> > +               if ((pos == pfn + pages)) {
> > +                       if (!start_isolate_page_range(pfn, pfn + pages))
> > +                               return pfn;
> > +               } else/* the chunk including "pos" should be skipped */
> > +                       pfn = pos & ~((1 << align_order) - 1);
> > +               cond_resched();
> > +       }
> > +
> > +       /* failed */
> > +       if (blockinfo.end + pages <= end) {
> > +               /* Move base address and find the next block of RAM. */
> > +               base = blockinfo.end;
> > +               goto retry;
> > +       }
> > +       return 0;
>
> If the base is 0, couldn't a successful allocation legitimately return pfn 0?
> On x86 with FLATMEM that can't happen, but I think it might be possible on
> some architectures. Just guessing.
>
> How about returning a negative value on failure and passing the first and
> last page pfns back through out parameters (base, end)?
>

Hmm, will add a check.

Thanks,
-Kame


> > +}
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim

2010-11-24 00:25:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/4] alloc_contig_pages() allocate big chunk memory using migration

On Mon, 22 Nov 2010 20:44:03 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, Nov 19, 2010 at 5:15 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Add an function to allocate contiguous memory larger than MAX_ORDER.
> > The main difference between usual page allocator is that this uses
> > memory offline technique (Isolate pages and migrate remaining pages.).
> >
> > I think this is not 100% solution because we can't avoid fragmentation,
> > but we have kernelcore= boot option and can create MOVABLE zone. That
> > helps us to allow allocate a contiguous range on demand.
>
> And later we can use compaction and reclaim, too.
> So I think this approach is the way we have to go.
>
> >
> > The new function is
> >
> >  alloc_contig_pages(base, end, nr_pages, alignment)
> >
> > This function will allocate contiguous pages of nr_pages from the range
> > [base, end). If [base, end) is bigger than nr_pages, some pfn which
> > meats alignment will be allocated. If alignment is smaller than MAX_ORDER,
>
> typo: meet
>
will fix.

> > it will be raised to be MAX_ORDER.
> >
> > __alloc_contig_pages() has much more arguments.
> >
> >
> > Some drivers allocates contig pages by bootmem or hiding some memory
> > from the kernel at boot. But if contig pages are necessary only in some
> > situation, kernelcore= boot option and using page migration is a choice.
> >
> > Changelog: 2010-11-19
> >  - removed no_search
> >  - removed some drain_ functions because they are heavy.
> >  - check -ENOMEM case
> >
> > Changelog: 2010-10-26
> >  - support gfp_t
> >  - support zonelist/nodemask
> >  - support [base, end)
> >  - support alignment
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> >  include/linux/page-isolation.h |   15 ++
> >  mm/page_alloc.c                |   29 ++++
> >  mm/page_isolation.c            |  242 +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 286 insertions(+)
> >
> > Index: mmotm-1117/mm/page_isolation.c
> > ===================================================================
> > --- mmotm-1117.orig/mm/page_isolation.c
> > +++ mmotm-1117/mm/page_isolation.c
> > @@ -5,6 +5,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/page-isolation.h>
> >  #include <linux/pageblock-flags.h>
> > +#include <linux/swap.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/migrate.h>
> >  #include <linux/memory_hotplug.h>
> > @@ -396,3 +397,244 @@ retry:
> >        }
> >        return 0;
> >  }
> > +
> > +/*
> > + * Comparing caller specified [user_start, user_end) with physical memory layout
> > + * [phys_start, phys_end). If no intersection is longer than nr_pages, return 1.
> > + * If there is an intersection, return 0 and fill range in [*start, *end)
>
> I understand the goal of the function,
> but the comment is rather awkward.
>

ok, I will rewrite.

> > + */
> > +static int
> > +__calc_search_range(unsigned long user_start, unsigned long user_end,
>
> Personally, I don't like the function name.
> How about "__adjust_search_range"?
> But I am not against this name strongly. :)
>
I will rename this.


> > +               unsigned long nr_pages,
> > +               unsigned long phys_start, unsigned long phys_end,
> > +               unsigned long *start, unsigned long *end)
> > +{
> > +       if ((user_start >= phys_end) || (user_end <= phys_start))
> > +               return 1;
> > +       if (user_start <= phys_start) {
> > +               *start = phys_start;
> > +               *end = min(user_end, phys_end);
> > +       } else {
> > +               *start = user_start;
> > +               *end = min(user_end, phys_end);
> > +       }
> > +       if (*end - *start < nr_pages)
> > +               return 1;
> > +       return 0;
> > +}
> > +
> > +
> > +/**
> > + * __alloc_contig_pages - allocate a contiguous physical pages
> > + * @base: the lowest pfn which caller wants.
> > + * @end:  the highest pfn which caller wants.
> > + * @nr_pages: the length of a chunk of pages to be allocated.
>
> the number of pages to be allocated.
>
ok.

> > + * @align_order: alignment of start address of returned chunk in order.
> > + *   Returned' page's order will be aligned to (1 << align_order).If smaller
> > + *   than MAX_ORDER, it's raised to MAX_ORDER.
> > + * @node: allocate near memory to the node, If -1, current node is used.
> > + * @gfpflag: used to specify what zone the memory should be from.
> > + * @nodemask: allocate memory within the nodemask.
> > + *
> > + * Search a memory range [base, end) and allocates physically contiguous
> > + * pages. If end - base is larger than nr_pages, a chunk in [base, end) will
> > + * be allocated
> > + *
> > + * This returns a page of the beginning of contiguous block. At failure, NULL
> > + * is returned.
> > + *
> > + * Limitation: at allocation, nr_pages may be increased to be aligned to
> > + * MAX_ORDER before searching a range. So, even if there is a enough chunk
> > + * for nr_pages, it may not be able to be allocated. Extra tail pages of
> > + * allocated chunk is returned to buddy allocator before returning the caller.
> > + */
> > +
> > +#define MIGRATION_RETRY        (5)
> > +struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
> > +                       unsigned long nr_pages, int align_order,
> > +                       int node, gfp_t gfpflag, nodemask_t *mask)
> > +{
> > +       unsigned long found, aligned_pages, start;
> > +       struct page *ret = NULL;
> > +       int migration_failed;
> > +       unsigned long align_mask;
> > +       struct zoneref *z;
> > +       struct zone *zone;
> > +       struct zonelist *zonelist;
> > +       enum zone_type highzone_idx = gfp_zone(gfpflag);
> > +       unsigned long zone_start, zone_end, rs, re, pos;
> > +
> > +       if (node == -1)
> > +               node = numa_node_id();
> > +
> > +       /* check unsupported flags */
> > +       if (gfpflag & __GFP_NORETRY)
> > +               return NULL;
> > +       if ((gfpflag & (__GFP_WAIT | __GFP_IO | __GFP_FS)) !=
> > +               (__GFP_WAIT | __GFP_IO | __GFP_FS))
> > +               return NULL;
>
> Why do we have to care about __GFP_IO|__GFP_FS?
> If you consider compaction/reclaim later, I am OK.
>
Because page migration uses GFP_HIGHUSER_MOVABLE now.
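
For reference (as of kernels around this time, if I read include/linux/gfp.h
correctly), GFP_HIGHUSER_MOVABLE already implies all of those bits:

	#define GFP_USER		(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
	#define GFP_HIGHUSER		(GFP_USER | __GFP_HIGHMEM)
	#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)

so a caller that clears __GFP_IO or __GFP_FS could not be satisfied by the
migration path anyway.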


> > +
> > +       if (gfpflag & __GFP_THISNODE)
> > +               zonelist = &NODE_DATA(node)->node_zonelists[1];
> > +       else
> > +               zonelist = &NODE_DATA(node)->node_zonelists[0];
> > +       /*
> > +        * Base/nr_page/end should be aligned to MAX_ORDER
> > +        */
> > +       found = 0;
> > +
> > +       if (align_order < MAX_ORDER)
> > +               align_order = MAX_ORDER;
> > +
> > +       align_mask = (1 << align_order) - 1;
> > +       /*
> > +        * We allocates MAX_ORDER aligned pages and cut tail pages later.
> > +        */
> > +       aligned_pages = ALIGN(nr_pages, (1 << MAX_ORDER));
> > +       /*
> > +        * If end - base == nr_pages, we can't search range. base must be
> > +        * aligned.
> > +        */
> > +       if ((end - base == nr_pages) && (base & align_mask))
> > +               return NULL;
> > +
> > +       base = ALIGN(base, (1 << align_order));
> > +       if ((end <= base) || (end - base < aligned_pages))
> > +               return NULL;
> > +
> > +       /*
> > +        * searching contig memory range within [pos, end).
> > +        * pos is updated at migration failure to find next chunk in zone.
> > +        * pos is reset to the base at searching next zone.
> > +        * (see for_each_zone_zonelist_nodemask in mmzone.h)
> > +        *
> > +        * Note: we cannot assume zones/nodes are in linear memory layout.
> > +        */
> > +       z = first_zones_zonelist(zonelist, highzone_idx, mask, &zone);
> > +       pos = base;
> > +retry:
> > +       if (!zone)
> > +               return NULL;
> > +
> > +       zone_start = ALIGN(zone->zone_start_pfn, 1 << align_order);
> > +       zone_end = zone->zone_start_pfn + zone->spanned_pages;
> > +
> > +       /* check [pos, end) is in this zone. */
> > +       if ((pos >= end) ||
> > +            (__calc_search_range(pos, end, aligned_pages,
> > +                       zone_start, zone_end, &rs, &re))) {
> > +next_zone:
> > +               /* go to the next zone */
> > +               z = next_zones_zonelist(++z, highzone_idx, mask, &zone);
> > +               /* reset the pos */
> > +               pos = base;
> > +               goto retry;
> > +       }
> > +       /* [pos, end) is trimmed to [rs, re) in this zone. */
> > +       pos = rs;
>
> The 'pos' isn't used anywhere below this point.
>
Ah, yes. I'll check what this was for and remove it.

Thanks,
-Kame

2010-11-24 00:42:33

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 0/4] big chunk memory allocator v4

On Tue, 23 Nov 2010 16:46:03 +0100
Michał Nazarewicz <[email protected]> wrote:

> A few things then:
>
> 1. As Felipe mentioned, on ARM it is often desired to have the memory
> mapped as non-cacheable, which most often means that the memory never
> reaches the page allocator. This means that alloc_contig_pages()
> would not be suitable for cases where one needs such memory.
>
> Or could this be overcome by adding the memory back as highmem? But
> then, it would force highmem support to be compiled in even if the
> platform does not really need it.
>
> 2. Device drivers should not know by themselves what ranges of memory to
> allocate from. Moreover, some device drivers could require allocating
> different buffers from different ranges. As such, this would require
> some management code on top of alloc_contig_pages().
>
> 3. When posting hwmem, Johan Mossberg mentioned that he'd like to see a
> notion of "pinning" chunks (so that unpinned chunks can be moved around,
> while hardware is not using them, to defragment memory). This would
> again require some management code on top of alloc_contig_pages().
>
> 4. I might be mistaken here, but the way I understand ZONE_MOVABLE to
> work is that it is cut off from the end of memory. Or am I talking
> nonsense? My concern is that at least one chip I'm working with requires
> allocations from different memory banks, which would basically mean that
> there would have to be two movable zones, i.e.:
>
> +-------------------+-------------------+
> |   Memory Bank #1  |   Memory Bank #2  |
> +---------+---------+---------+---------+
> |  normal | movable |  normal | movable |
> +---------+---------+---------+---------+
>
yes.

> So even though I'm personally somehow drawn by alloc_contig_pages()'s
> simplicity (compared to CMA at least), those quick thoughts make me think
> that alloc_contig_pages() would work rather as a backend (as Kamezawa
> mentioned) for some, maybe even tiny but still present, management code
> which would handle "marking memory fragments as ZONE_MOVABLE" (whatever
> that would involve) and deciding which memory ranges drivers can allocate
> from.
>
> I'm also wondering whether alloc_contig_pages()'s first-fit is suitable but
> that probably cannot be judged without some benchmarks.
>

I'll continue to update the patches; you can freely reuse my code and integrate
this set into yours. I am working on this firstly for EMBEDDED, but I want this
to be a _generic_ function for general-purpose architectures.
There may be people who want a 1G page on a host with tons of free memory.


Thanks,
-Kame