2005-01-13 01:39:07

by Mel Gorman

Subject: [RFC] Avoiding fragmentation through different allocator

Hi folks,

I was reading through the archives on threads involving page fragmentation
where high-order allocations are difficult to satisfy. The solutions I saw
proposed and implemented focused on moving pages around using the same
mechanisms as memory hotplug and NUMA node migration. From what I can
gather, this is both expensive and difficult. This patch is a proposal to
change the page allocator to avoid some of the nastier fragmentation problems
in the first place. The objective is that higher-order pages will be more
likely to be available and, if not, there will be an easy way of finding large
blocks to free up. It does not negate the defragmentation-through-migration
approach to the problem (Marcelo's work, I believe) but it should make that
approach's job a lot easier.

The patch is against 2.6.9, but as it is not ready for inclusion yet anyway,
I figured I would get opinions first before forward-porting or thinking
about any optimisations.

So... What the patch does. Allocations are divided up into three different
types (a short illustration of how allocations get tagged follows the list);

UserReclaimable - These are userspace pages that are easily reclaimable. Right
now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
well as disk-buffer pages into this category. These pages are trivially
reclaimed by writing the page out to swap or syncing with backing
storage

KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs (drastic, but you get the idea). We could
also add pages into this category that are known to be required only
for a short time, like buffers used with DMA

KernelNonReclaimable - These are pages allocated by the kernel that are not
trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type

Instead of having one global MAX_ORDER-sized array of free lists, there are
three, one for each type of allocation. Finally, there is a list of pages of
size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals
with.

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free lists for that type, in effect reserving it. Hence, over
time, pages of related types can be clustered together. This means
that if we wanted 2^MAX_ORDER pages, we could linearly scan a
block of pages allocated for UserReclaimable and page each of them out.
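
The routing back to the right pool on free is a cheap bitmap lookup. This
fragment is condensed from the free_pages_bulk() hunk in the patch below; the
per-GiB figure in the comment assumes 4KiB pages and the default MAX_ORDER
of 11:

	/*
	 * Two bits of zone->free_area_usemap per 2^MAX_ORDER block record
	 * which pool the block was split for, so a freed page always goes
	 * back to the free lists of that pool. At two bits per 8MiB block
	 * this costs roughly 32 bytes of usemap per 1GiB of memory.
	 */
	alloctype = get_pageblock_type(page);
	area = zone->free_area_lists[alloctype] + order;
	__free_pages_bulk(page, base, zone, area, order);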

The patch also implements fallback logic for when there are no 2^MAX_ORDER
pages available and there are no free pages of the desired type. The fallback
lists were chosen in a way that keeps the most easily reclaimable pages
together.
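
For reference, the fallback ordering is encoded as a small table in the patch;
reproduced here with comments, each row names the native type first, followed
by the pools tried once it and the global pool are empty:

int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES] = {
	/* KernNoRclm: try KernRclm, then UserRclm */
	{ ALLOC_KERNNORCLM, ALLOC_KERNRCLM, ALLOC_USERRCLM },
	/* KernRclm: try KernNoRclm, then UserRclm */
	{ ALLOC_KERNRCLM, ALLOC_KERNNORCLM, ALLOC_USERRCLM },
	/* UserRclm: try KernNoRclm, then KernRclm */
	{ ALLOC_USERRCLM, ALLOC_KERNNORCLM, ALLOC_KERNRCLM }
};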

I stress-tested this patch very heavily and it never oopsed, so I am
confident of its stability. What is left is to look at what the results of
this patch were, and I think they look promising in a number of respects. I
have graphs that do not translate to text very well, so I'll just point you
to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.

The two figures at the top show the normal zone with the normal allocator
and with the modified allocator. A value of 0 shows the page is free. 1 is
a user reclaimable page. 2 is a kernel reclaimable page and 3 is a
non-reclaimable page. The standard allocator has these all mixed in
together. The modified allocator has each of them grouped together.

The Normal zone is the main zone of interest, but I included the graph of
what HighMem looks like. As it is almost all userspace allocations anyway,
it does not present a defragmentation problem (just linearly scan pages,
and either page out or sync with backing storage and dump)

So.... the actual fragmentation results.

The results were not spectacular but still very interesting. Under heavy
stressing (updatedb + 4 simultaneous -j4 kernel compiles with an average load
of 15), fragmentation is consistently lower than with the standard allocator.
It could also be a lot better if there was some means of purging caches, user
pages and buffers, but that's in the future. For the moment, the only real
control I had was the buffer pages.

The figures below are the output of /proc/buddyinfo before and after deleting
the kernel trees that were compiled (flushing their buffer pages).
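
(For reference, column N of a /proc/buddyinfo line is the number of free
order-N blocks, i.e. blocks of 2^N contiguous pages. Assuming 4KiB pages, the
order-2 count of 892 in the first Normal line below means 892 free 16KiB
blocks, while the order-10 count of 14 at the end of the same line means
fourteen free 4MiB blocks.)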

Standard allocator - Before deleting trees
Node 0, zone DMA 3 1 4 4 4 1 2 0 1 1 2
Node 0, zone Normal 298 8 892 460 268 81 28 8 1 0 14
Node 0, zone HighMem 30 15 1 0 0 0 0 0 1 0 0

Standard allocator - After deleting trees
Node 0, zone DMA 3 1 4 4 4 1 2 0 1 1 2
Node 0, zone Normal 18644 13067 7911 4224 1583 351 52 10 1 0 14
Node 0, zone HighMem 370 343 218 157 3078 355 80 48 26 13 5


Modified allocator - Before deleting trees
Node 0, zone DMA 1 0 4 4 4 1 2 0 1 1 2
Node 0, zone Normal 658 230 113 578 176 73 23 9 10 4 13
Node 0, zone HighMem 40 10 4 0 1 1 1 1 0 0 0

Modified allocator - After deleting trees
Node 0, zone DMA 1 0 4 4 4 1 2 0 1 1 2
Node 0, zone Normal 15407 11695 7280 3858 1563 523 87 13 11 4 13
Node 0, zone HighMem 6816 3616 3407 3328 566 144 89 53 17 9 4


In both cases, deleting the trees really freed up HighMem, which is not a
surprise. The fragmentation there will really depend on when buffer pages were
allocated (I could make this close to perfect if I put only buffer pages into
UserRclm), so there is not a lot that can be done; let's look at Normal instead.

Before the trees were deleted, the modified allocator was in better shape
than the standard allocator at the higher orders, although arguably they
are in similar shape. However, after the trees were deleted, the modified
allocator was in far better shape. Deleting buffers barely made a difference
to high-order page availability with the standard allocator but really
improved it with the modified one

S: Node 0, zone Normal 18644 13067 7911 4224 1583 351 52 10 1 0 14
M: Node 0, zone Normal 15407 11695 7280 3858 1563 523 87 13 11 4 13
Delta: -3237 -1372 -631 -366 -20 +172 +35 +3 +10 +4 -1

Ok, order 10 lost out but I consider that just unlucky :) . From order 5
on, there were definite improvements. I could show what the figures look
like in each of the UserRclm, KernRclm and KernNoRclm pools, but what they
show is that the UserRclm pool really benefits and it is obvious that the
other two pools were used frequently for fallbacks under the heavy memory
pressure.

So.... bottom line, there are consistent improvements in fragmentation using
this allocator before anything drastic like linear page scanning or migration
of pages takes place. I have not measured it yet, but I do not believe the
overhead of the scheme is severe either.

Opinions/Feedback?

>>>Patch follows<<<
diff -rup linux-2.6.9-clean/fs/buffer.c linux-2.6.9-mbuddy/fs/buffer.c
--- linux-2.6.9-clean/fs/buffer.c 2004-10-18 22:54:32.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/buffer.c 2005-01-03 19:33:23.000000000 +0000
@@ -1203,7 +1203,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;

- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS | __GFP_USERRCLM);
if (!page)
return NULL;

@@ -3065,7 +3066,8 @@ static void recalc_bh_state(void)

struct buffer_head *alloc_buffer_head(int gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ gfp_flags|__GFP_KERNRCLM);
if (ret) {
preempt_disable();
__get_cpu_var(bh_accounting).nr++;
diff -rup linux-2.6.9-clean/fs/dcache.c linux-2.6.9-mbuddy/fs/dcache.c
--- linux-2.6.9-clean/fs/dcache.c 2004-10-18 22:53:21.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/dcache.c 2005-01-03 19:34:12.000000000 +0000
@@ -691,7 +691,8 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;

- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache,
+ GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;

diff -rup linux-2.6.9-clean/fs/ext2/super.c linux-2.6.9-mbuddy/fs/ext2/super.c
--- linux-2.6.9-clean/fs/ext2/super.c 2004-10-18 22:54:38.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ext2/super.c 2005-01-03 19:34:35.000000000 +0000
@@ -134,7 +134,7 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup linux-2.6.9-clean/fs/ext3/super.c linux-2.6.9-mbuddy/fs/ext3/super.c
--- linux-2.6.9-clean/fs/ext3/super.c 2004-10-18 22:55:07.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ext3/super.c 2005-01-03 19:34:47.000000000 +0000
@@ -442,7 +442,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;

- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup linux-2.6.9-clean/fs/ntfs/inode.c linux-2.6.9-mbuddy/fs/ntfs/inode.c
--- linux-2.6.9-clean/fs/ntfs/inode.c 2004-10-18 22:55:07.000000000 +0100
+++ linux-2.6.9-mbuddy/fs/ntfs/inode.c 2005-01-03 19:35:27.000000000 +0000
@@ -314,7 +314,7 @@ struct inode *ntfs_alloc_big_inode(struc

ntfs_debug("Entering.");
ni = (ntfs_inode *)kmem_cache_alloc(ntfs_big_inode_cache,
- SLAB_NOFS);
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -339,7 +339,8 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;

ntfs_debug("Entering.");
- ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache,
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup linux-2.6.9-clean/include/linux/gfp.h linux-2.6.9-mbuddy/include/linux/gfp.h
--- linux-2.6.9-clean/include/linux/gfp.h 2004-10-18 22:53:44.000000000 +0100
+++ linux-2.6.9-mbuddy/include/linux/gfp.h 2005-01-09 18:34:07.000000000 +0000
@@ -37,21 +37,24 @@ struct vm_area_struct;
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_KERNRCLM 0x8000 /* Kernel page that is easily reclaimable */
+#define __GFP_USERRCLM 0x10000 /* Easily reclaimed userspace page */

-#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 18 /* Room for 18 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
- __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+ __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
+ __GFP_KERNRCLM|__GFP_USERRCLM)

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USERRCLM)
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_USERRCLM)

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
@@ -130,4 +133,7 @@ extern void FASTCALL(free_cold_page(stru

void page_alloc_init(void);

+/* VM Regress: define to indicate tracing callbacks is enabled */
+#define TRACE_PAGE_ALLOCS
+
#endif /* __LINUX_GFP_H */
diff -rup linux-2.6.9-clean/include/linux/mmzone.h linux-2.6.9-mbuddy/include/linux/mmzone.h
--- linux-2.6.9-clean/include/linux/mmzone.h 2004-10-18 22:54:31.000000000 +0100
+++ linux-2.6.9-mbuddy/include/linux/mmzone.h 2005-01-09 15:19:44.000000000 +0000
@@ -19,6 +19,10 @@
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
+#define ALLOC_TYPES 3
+#define ALLOC_KERNNORCLM 0
+#define ALLOC_KERNRCLM 1
+#define ALLOC_USERRCLM 2

struct free_area {
struct list_head free_list;
@@ -162,9 +166,23 @@ struct zone {
int prev_priority;

/*
- * free areas of different sizes
+ * free areas of different sizes. There is one global free area that
+ * contains lists of pages of 2^MAX_ORDER in size. Once split, the
+ * buddy will be added to one of the free_area lists in free_area_lists.
+ * Each type is one of User Reclaimable, Kernel Reclaimable and
+ * Kernel Non-reclaimable. The idea is that pages of related use will
+ * be clustered together to reduce fragmentation.
*/
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER];
+ struct free_area free_area_global;
+
+ /*
+ * This map tracks what each 2^MAX_ORDER sized block has been used for.
+ * When a page is freed, its index within this bitmap is calculated
+ * using ((page - zone_mem_map) >> MAX_ORDER) * 2. This means that pages will
+ * always be freed into the correct list in free_area_lists
+ */
+ unsigned long *free_area_usemap;

/*
* wait_table -- the array holding the hash table
diff -rup linux-2.6.9-clean/mm/page_alloc.c linux-2.6.9-mbuddy/mm/page_alloc.c
--- linux-2.6.9-clean/mm/page_alloc.c 2004-10-18 22:53:11.000000000 +0100
+++ linux-2.6.9-mbuddy/mm/page_alloc.c 2005-01-12 14:45:20.000000000 +0000
@@ -40,8 +40,21 @@ unsigned long totalram_pages;
unsigned long totalhigh_pages;
long nr_swap_pages;
int numnodes = 1;
+int fallback_count=0;
+int drastic_fallback_count=0;
+int global_steal=0;
+int global_refill=0;
+int kernnorclm_count=0;
+int kernrclm_count=0;
+int userrclm_count=0;
int sysctl_lower_zone_protection = 0;

+int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES] = {
+ { ALLOC_KERNNORCLM, ALLOC_KERNRCLM, ALLOC_USERRCLM },
+ { ALLOC_KERNRCLM, ALLOC_KERNNORCLM, ALLOC_USERRCLM },
+ { ALLOC_USERRCLM, ALLOC_KERNNORCLM, ALLOC_KERNRCLM }
+};
+
EXPORT_SYMBOL(totalram_pages);
EXPORT_SYMBOL(nr_swap_pages);

@@ -53,6 +66,7 @@ struct zone *zone_table[1 << (ZONES_SHIF
EXPORT_SYMBOL(zone_table);

static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
+static char *type_names[ALLOC_TYPES] = { "KernNoRclm", "KernRclm", "UserRclm"};
int min_free_kbytes = 1024;

unsigned long __initdata nr_kernel_pages;
@@ -94,6 +108,48 @@ static void bad_page(const char *functio
page->mapping = NULL;
}

+/*
+ * Return the type of use that the 2^MAX_ORDER block of pages
+ * containing the given page has been reserved for
+ */
+static int get_pageblock_type(struct page *page) {
+ struct zone *zone = page_zone(page);
+ int pageidx = (page - zone->zone_mem_map) >> MAX_ORDER;
+ int bitidx = pageidx * 2;
+
+ /* Bit 1 will be set if the block is kernel reclaimable */
+ if (test_bit(bitidx,zone->free_area_usemap)) return ALLOC_KERNRCLM;
+
+ /* Bit 2 will be set if the block is user reclaimable */
+ if (test_bit(bitidx+1, zone->free_area_usemap)) return ALLOC_USERRCLM;
+
+ return ALLOC_KERNNORCLM;
+}
+
+static void set_pageblock_type(struct page *page, int type) {
+ int bit1, bit2;
+ struct zone *zone = page_zone(page);
+ int pageidx = (page - zone->zone_mem_map) >> MAX_ORDER;
+ int bitidx = pageidx * 2;
+ bit1 = bit2 = 0;
+
+ if (type == ALLOC_KERNRCLM) {
+ set_bit(bitidx, zone->free_area_usemap);
+ clear_bit(bitidx+1, zone->free_area_usemap);
+ return;
+ }
+
+ if (type == ALLOC_USERRCLM) {
+ clear_bit(bitidx, zone->free_area_usemap);
+ set_bit(bitidx+1, zone->free_area_usemap);
+ return;
+ }
+
+ clear_bit(bitidx, zone->free_area_usemap);
+ clear_bit(bitidx+1, zone->free_area_usemap);
+
+}
+
#ifndef CONFIG_HUGETLB_PAGE
#define prep_compound_page(page, order) do { } while (0)
#define destroy_compound_page(page, order) do { } while (0)
@@ -193,7 +249,9 @@ static inline void __free_pages_bulk (st
while (order < MAX_ORDER-1) {
struct page *buddy1, *buddy2;

+ /* MBUDDY TODO: Fix this bug check
BUG_ON(area >= zone->free_area + MAX_ORDER);
+ */
if (!__test_and_change_bit(index, area->map))
/*
* the buddy page is still allocated.
@@ -211,7 +269,16 @@ static inline void __free_pages_bulk (st
area++;
index >>= 1;
page_idx &= mask;
+
+
}
+
+ /* Switch to the global list if this is the MAX_ORDER */
+ if (order >= MAX_ORDER-1) {
+ area = &(zone->free_area_global);
+ global_refill++;
+ }
+
list_add(&(base + page_idx)->lru, &area->free_list);
}

@@ -253,14 +320,23 @@ free_pages_bulk(struct zone *zone, int c
struct free_area *area;
struct page *base, *page = NULL;
int ret = 0;
+ int alloctype;

base = zone->zone_mem_map;
- area = zone->free_area + order;
spin_lock_irqsave(&zone->lock, flags);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
while (!list_empty(list) && count--) {
page = list_entry(list->prev, struct page, lru);
+
+ /*
+ * Find what area this page belonged to. It is possible that
+ * the pages are always of the same type and this expensive
+ * check could be only performed once. Needs investigation
+ */
+ alloctype = get_pageblock_type(page);
+ area = zone->free_area_lists[alloctype] + order;
+
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
__free_pages_bulk(page, base, zone, area, order);
@@ -363,15 +439,37 @@ static void prep_new_page(struct page *p
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int flags)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
unsigned int index;
+ int global_split=0;
+
+ /* Select area to use based on gfp_flags */
+ int alloctype;
+ int retry_count=0;
+ if (flags & __GFP_USERRCLM) {
+ alloctype = ALLOC_USERRCLM;
+ userrclm_count++;
+ }
+ else if (flags & __GFP_KERNRCLM) {
+ alloctype = ALLOC_KERNRCLM;
+ kernrclm_count++;
+ } else {
+ alloctype = ALLOC_KERNNORCLM;
+ kernnorclm_count++;
+ }
+
+ /* Ok, pick the fallback order based on the type */
+ int *fallback_list = fallback_allocs[alloctype];

+retry:
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ alloctype = fallback_list[retry_count];
+ area = zone->free_area_lists[alloctype] + current_order;
+
if (list_empty(&area->free_list))
continue;

@@ -384,6 +482,34 @@ static struct page *__rmqueue(struct zon
return expand(zone, page, index, order, current_order, area);
}

+ /* Take from the global pool if this is the first attempt */
+ if (!global_split && !list_empty(&(zone->free_area_global.free_list))){
+ /*
+ * Remove a MAX_ORDER block from the global pool and add
+ * it to the list of desired alloc_type
+ */
+ page = list_entry(zone->free_area_global.free_list.next,
+ struct page, lru);
+ list_del(&page->lru);
+ list_add(&page->lru,
+ &(zone->free_area_lists[alloctype][MAX_ORDER-1].free_list));
+ global_steal++;
+ global_split=1;
+
+ /* Mark this block of pages as for use with this alloc type */
+ set_pageblock_type(page, alloctype);
+
+ goto retry;
+ }
+
+ /*
+ * Here, the alloc type lists have been depleted as well as the global
+ * pool, so fallback
+ */
+ retry_count++;
+ fallback_count++;
+ if (retry_count != ALLOC_TYPES) goto retry;
+
return NULL;
}

@@ -393,7 +519,8 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int gfp_flags)
{
unsigned long flags;
int i;
@@ -402,7 +529,7 @@ static int rmqueue_bulk(struct zone *zon

spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, gfp_flags);
if (page == NULL)
break;
allocated++;
@@ -438,7 +565,7 @@ int is_head_of_free_region(struct page *
{
struct zone *zone = page_zone(page);
unsigned long flags;
- int order;
+ int order,type;
struct list_head *curr;

/*
@@ -446,12 +573,16 @@ int is_head_of_free_region(struct page *
* suspend anyway, but...
*/
spin_lock_irqsave(&zone->lock, flags);
- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list)
+ for (type = 0 ; type < ALLOC_TYPES ; type++) {
+ for (order = MAX_ORDER - 1; order >= 0; --order) {
+ list_for_each(curr,
+ &zone->free_area_lists[type][order].free_list)
if (page == list_entry(curr, struct page, lru)) {
spin_unlock_irqrestore(&zone->lock, flags);
return 1 << order;
}
+ }
+ }
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
}
@@ -545,14 +676,15 @@ buffered_rmqueue(struct zone *zone, int
struct page *page = NULL;
int cold = !!(gfp_flags & __GFP_COLD);

- if (order == 0) {
+ if (order == 0 && (gfp_flags & __GFP_USERRCLM)) {
struct per_cpu_pages *pcp;

pcp = &zone->pageset[get_cpu()].pcp[cold];
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list,
+ gfp_flags);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -564,7 +696,7 @@ buffered_rmqueue(struct zone *zone, int

if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, gfp_flags);
spin_unlock_irqrestore(&zone->lock, flags);
}

@@ -1040,6 +1172,7 @@ void show_free_areas(void)
unsigned long inactive;
unsigned long free;
struct zone *zone;
+ int type;

for_each_zone(zone) {
show_node(zone);
@@ -1130,8 +1263,11 @@ void show_free_areas(void)
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
nr = 0;
- list_for_each(elem, &zone->free_area[order].free_list)
- ++nr;
+ for (type=0; type < ALLOC_TYPES; type++) {
+ list_for_each(elem,
+ &zone->free_area_lists[type][order].free_list)
+ ++nr;
+ }
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
}
@@ -1445,19 +1581,35 @@ unsigned long pages_to_bitmap_size(unsig
void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone, unsigned long size)
{
int order;
- for (order = 0; ; order++) {
- unsigned long bitmap_size;
+ int type;
+ struct free_area *area;
+ unsigned long bitmap_size;

- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- if (order == MAX_ORDER-1) {
- zone->free_area[order].map = NULL;
- break;
- }
+ /* Initialise the three size-ordered lists of free_areas */
+ for (type=0; type < ALLOC_TYPES; type++) {
+
+ for (order = 0; ; order++) {
+ area = zone->free_area_lists[type];
+
+ INIT_LIST_HEAD(&area[order].free_list);
+ if (order == MAX_ORDER-1) {
+ area[order].map = NULL;
+ break;
+ }

- bitmap_size = pages_to_bitmap_size(order, size);
- zone->free_area[order].map =
- (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+ bitmap_size = pages_to_bitmap_size(order, size);
+ area[order].map =
+ (unsigned long*)alloc_bootmem_node(pgdat,
+ bitmap_size);
+ }
}
+
+ /* Initialise the global pool of 2^MAX_ORDER blocks of pages */
+ INIT_LIST_HEAD(&zone->free_area_global.free_list);
+ bitmap_size = pages_to_bitmap_size(MAX_ORDER-1, size);
+ zone->free_area_global.map =
+ (unsigned long*)alloc_bootmem_node(pgdat,
+ bitmap_size);
}

#ifndef __HAVE_ARCH_MEMMAP_INIT
@@ -1575,6 +1727,11 @@ static void __init free_area_init_core(s
zone_start_pfn += size;

zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+ zone->free_area_usemap =
+ (unsigned long *)alloc_bootmem_node(pgdat,
+ (size >> MAX_ORDER) * 2);
+ memset((unsigned long *)zone->free_area_usemap, ALLOC_KERNNORCLM,
+ (size >> MAX_ORDER) * 2);
}
}

@@ -1653,25 +1810,83 @@ static int frag_show(struct seq_file *m,
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
+ int order, type;
+ struct list_head *elem;

+ /* Show global fragmentation statistics */
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;

spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order) {
- unsigned long nr_bufs = 0;
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ unsigned long nr_bufs = 0;
+ for (order = 0; order < MAX_ORDER-1; ++order) {
+ nr_bufs = 0;
+
+ for (type=0; type < ALLOC_TYPES; type++) {
+ list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+ ++nr_bufs;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+
+ /* Scan global list */
+ nr_bufs = 0;
+ list_for_each(elem, &(zone->free_area_global.free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ seq_putc(m, '\n');
+ }
+
+ /* Show statistics for each allocation type */
+ seq_printf(m, "\nPer-allocation-type statistics");
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!zone->present_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ unsigned long nr_bufs = 0;
+ for (type=0; type < ALLOC_TYPES; type++) {
+ seq_printf(m, "\nNode %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ type_names[type]);
struct list_head *elem;
+ for (order = 0; order < MAX_ORDER; ++order) {
+ nr_bufs = 0;

- list_for_each(elem, &(zone->free_area[order].free_list))
- ++nr_bufs;
- seq_printf(m, "%6lu ", nr_bufs);
+ list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
}
+
+ /* Scan global list */
+ seq_printf(m, "\n");
+ seq_printf(m, "Node %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ "MAX_ORDER");
+ nr_bufs = 0;
+ list_for_each(elem, &(zone->free_area_global.free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+
+ /* Show bean counters */
+ seq_printf(m, "\nGlobal beancounters\n");
+ seq_printf(m, "Global steals: %d\n", global_steal);
+ seq_printf(m, "Global refills: %d\n", global_refill);
+ seq_printf(m, "Fallback count: %d\n", fallback_count);
+ seq_printf(m, "KernNoRclm allocs: %d\n", kernnorclm_count);
+ seq_printf(m, "KernRclm allocs: %d\n", kernrclm_count);
+ seq_printf(m, "UserRclm allocs: %d\n", userrclm_count);
+
+
return 0;
}


2005-01-13 07:03:38

by Matt Mackall

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> I stress-tested this patch very heavily and it never oopsed so I am
> confident of it's stability, so what is left is to look at the results of
> this patch were and I think they look promising in a number of respects. I
> have graphs that do not translate to text very well, so I'll just point you
> to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.

This graph is rather hard to comprehend.

> The results were not spectacular but still very interesting. Under heavy
> stresing (updatedb + 4 simultaneous -j4 kernel compiles with avg load 15)
> fragmentation is consistently lower than the standard allocator. It could
> also be a lot better if there was some means of purging caches, userpages
> and buffers but thats in the future. For the moment, the only real control
> I had was the buffer pages.

You might stress higher order page allocation with a) 8k stacks turned
on b) UDP NFS with large read/write.

> Opinions/Feedback?

Looks interesting.

--
Mathematics is the supreme nostalgia of our time.

2005-01-13 07:20:32

by Trond Myklebust

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wednesday, 12.01.2005 at 23:03 (-0800), Matt Mackall wrote:

> You might stress higher order page allocation with a) 8k stacks turned
> on b) UDP NFS with large read/write.

b) Unless your network uses jumbo frames, UDP NFS should not be doing
higher order page allocation.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-01-13 07:32:36

by William Lee Irwin III

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> So... What the patch does. Allocations are divided up into three different
> types of allocations;
> UserReclaimable - These are userspace pages that are easily reclaimable. Right
> now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> well as disk-buffer pages into this category. These pages are trivially
> reclaimed by writing the page out to swap or syncing with backing
> storage
> KernelReclaimable - These are pages allocated by the kernel that are easily
> reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> These type of pages potentially could be reclaimed by dumping the
> caches and reaping the slabs (drastic, but you get the idea). We could
> also add pages into this category that are known to be only required
> for a short time like buffers used with DMA
> KernelNonReclaimable - These are pages that are allocated by the kernel that
> are not trivially reclaimed. For example, the memory allocated for a
> loaded module would be in this category. By default, allocations are
> considered to be of this type

I'd expect to do better with kernel/user discrimination only, having
address-ordering biases in opposite directions for each case.

-- wli

2005-01-13 10:11:41

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, 12 Jan 2005, William Lee Irwin III wrote:

> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > So... What the patch does. Allocations are divided up into three different
> > types of allocations;
> > UserReclaimable - These are userspace pages that are easily reclaimable. Right
> > now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> > well as disk-buffer pages into this category. These pages are trivially
> > reclaimed by writing the page out to swap or syncing with backing
> > storage
> > KernelReclaimable - These are pages allocated by the kernel that are easily
> > reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> > These type of pages potentially could be reclaimed by dumping the
> > caches and reaping the slabs (drastic, but you get the idea). We could
> > also add pages into this category that are known to be only required
> > for a short time like buffers used with DMA
> > KernelNonReclaimable - These are pages that are allocated by the kernel that
> > are not trivially reclaimed. For example, the memory allocated for a
> > loaded module would be in this category. By default, allocations are
> > considered to be of this type
>
> I'd expect to do better with kernel/user discrimination only, having
> address-ordering biases in opposite directions for each case.
>

I thought about this and was not 100% sure. I'll explain how I see it
before I go redo large portions of the patch.

1. Discriminating kernel/user will leave the kernel area fragmented much
worse than my approach. On the systems I tested, I found that a significant
portion of kernel memory was taken up by slabs like buffer_head,
ext3_inode_cache, ntfs_inode_cache, ntfs_big_inode_cache and
radix_tree_node. I did not want to mix these allocations, which
potentially are easy to get rid of, with allocations that are incredibly
difficult. (Totally aside, I also found that slab caches suffer from
terrible internal fragmentation, sometimes wasting up to 60% of their
memory, but that's a separate problem)

2. Address-ordering biases were also something I looked at early on, but
then figured it didn't matter. If I don't reserve 2^MAX_ORDER blocks,
the system will just get terribly fragmented under memory pressure when
they meet in the middle. If I do reserve, I just end up with a similar
layout to what I have now.

--
Mel Gorman

2005-01-13 10:22:21

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, 12 Jan 2005, Matt Mackall wrote:

> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > I stress-tested this patch very heavily and it never oopsed so I am
> > confident of it's stability, so what is left is to look at the results of
> > this patch were and I think they look promising in a number of respects. I
> > have graphs that do not translate to text very well, so I'll just point you
> > to http://www.csn.ul.ie/~mel/projects/mbuddy-results-1 instead.
>
> This graph rather hard to comprehend.
>

There are not a lot of ways to illustrate how pages are allocated
throughout an address space. If you have a better suggestion, I can update
the relevant tool.

> > The results were not spectacular but still very interesting. Under heavy
> > stresing (updatedb + 4 simultaneous -j4 kernel compiles with avg load 15)
> > fragmentation is consistently lower than the standard allocator. It could
> > also be a lot better if there was some means of purging caches, userpages
> > and buffers but thats in the future. For the moment, the only real control
> > I had was the buffer pages.
>
> You might stress higher order page allocation with a) 8k stacks turned
> on b) UDP NFS with large read/write.
>

CONFIG_4KSTACKS is not set on my system so I'm already using 8k stacks.
Based on Trond's mail, I checked my network and the MTU is 1500 bytes so I
don't think I can force high-order allocations there.

Also, while I agree this test would be important, I think it'll be way
more important when there is something in place that actually tries to
free up high-order blocks, so that a comparison can then be made.

--
Mel Gorman

2005-01-15 00:36:38

by Marcelo Tosatti

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
> On Wed, Jan 12, 2005 at 09:09:24PM +0000, Mel Gorman wrote:
> > So... What the patch does. Allocations are divided up into three different
> > types of allocations;
> > UserReclaimable - These are userspace pages that are easily reclaimable. Right
> > now, I'm putting all allocations of GFP_USER and GFP_HIGHUSER as
> > well as disk-buffer pages into this category. These pages are trivially
> > reclaimed by writing the page out to swap or syncing with backing
> > storage
> > KernelReclaimable - These are pages allocated by the kernel that are easily
> > reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
> > These type of pages potentially could be reclaimed by dumping the
> > caches and reaping the slabs (drastic, but you get the idea). We could
> > also add pages into this category that are known to be only required
> > for a short time like buffers used with DMA
> > KernelNonReclaimable - These are pages that are allocated by the kernel that
> > are not trivially reclaimed. For example, the memory allocated for a
> > loaded module would be in this category. By default, allocations are
> > considered to be of this type
>
> I'd expect to do better with kernel/user discrimination only, having
> address-ordering biases in opposite directions for each case.

What do you mean by "address-ordering biases in opposite directions for each case"?

Do you mean to have each case allocate from the top and bottom of the free list, respectively,
and in opposite address directions? What do you gain from that?

And what does that mean during a long period of VM stress?

2005-01-15 01:32:15

by William Lee Irwin III

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
>> I'd expect to do better with kernel/user discrimination only, having
>> address-ordering biases in opposite directions for each case.

On Fri, Jan 14, 2005 at 07:42:18PM -0200, Marcelo Tosatti wrote:
> What you mean with "address-ordering biases in opposite directions
> for each case" ?
> You mean to have each case allocate from the top and bottom of the
> free list, respectively, and in opposite address direction ? What you
> gain from that?
> And what that means during a long period of VM stress ?

It's one of the standard anti-fragmentation tactics. The large free
areas come from the middle, address ordering disposes of holes in the
used areas, and the areas at opposite ends reflect expected lifetimes.

It's more useful for cases where there is not an upper bound on the
size of an allocation (or power-of-two blocksizes). On second thought,
Mel's approach exploits both the bound and the power-of-two restriction
advantageously.


-- wli

2005-01-15 19:21:53

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Fri, 14 Jan 2005, William Lee Irwin III wrote:

> On Wed, Jan 12, 2005 at 11:31:46PM -0800, William Lee Irwin III wrote:
> >> I'd expect to do better with kernel/user discrimination only, having
> >> address-ordering biases in opposite directions for each case.
>
> On Fri, Jan 14, 2005 at 07:42:18PM -0200, Marcelo Tosatti wrote:
> > What you mean with "address-ordering biases in opposite directions
> > for each case" ?
> > You mean to have each case allocate from the top and bottom of the
> > free list, respectively, and in opposite address direction ? What you
> > gain from that?
> > And what that means during a long period of VM stress ?
>
> It's one of the standard anti-fragmentation tactics. The large free
> areas come from the middle, address ordering disposes of holes in the
> used areas, and the areas at opposite ends reflect expected lifetimes.
>
> It's more useful for cases where there is not an upper bound on the
> size of an allocation (or power-of-two blocksizes). On second thought,
> Mel's approach exploits both the bound and the power-of-two restriction
> advantageously.
>

I think so too and I reckon I have the figures to prove it. Patches with
test tools and figures are on the way. At the moment I am running the last
of the tests and getting the patches in order.

--
Mel Gorman

2005-01-17 16:48:48

by Tolentino, Matthew E

Subject: RE: [RFC] Avoiding fragmentation through different allocator

>I considered adding a new zone but I felt it would be a massive job for
>what I considered to be a simple problem. I think my approach is nice
>and isolated within the allocator itself and will be less likely to
>affect other code.

Just for clarity, I prefer this approach over adding zones,
hence my pursuit of something akin to it.

>One possibility is that we could say that the UserRclm and KernRclm pools
>are always eligible for hotplug and have hotplug banks only
>satisfy those
>allocations pushing KernNonRclm allocations to fixed banks. How is it
>currently known if a bank of memory is hotplug? Is there a
>node for each
>hotplug bank? If yes, we could flag those nodes to only
>satisfy UserRclm
>and KernRclm allocations and force fallback to other nodes.

The hardware/firmware has to tell the kernel in some way. In
my case it is ACPI that delineates between regions that may be
removed. No, there isn't a node for each bank of hot-plug
memory. The reason I was pursuing this was to be able to
avoid coarse granularity distinctions like that. Depending
on the platform, ACPI may provide only memory ranges (via
memory devices detailed in the namespace) for single node
systems or group memory ranges according to nodes via the
ACPI abstraction called containers. It's my understanding
that containers then have some mapping to nodes.

>The danger is
>that allocations would fail because non-hotplug banks were already full
>and pageout would not happen because the watermarks were satisfied.

Which implies a potential need for balancing between user/kernel
lists, no?

>(Bear in mind I can't test hotplug-related issues due to lack
>of suitable
>hardware)

Sure. I can, although most of this has been done via emulation
initially and then tested on real hardware soon afterwards.

>If you have already posted a version of the patch (you have
>feedback so I
>guess it's there somewhere), can you send me a link to the thread where
>you introduced your approach? It's possible that we just need
>to merge the
>ideas.

No, I hadn't posted it yet due to chasing a bug. However, perhaps
now I'll instead focus on adding the necessary hotplug support
into your patch, hence merging the hotplug requirements/ideas?

>It's because I consider all 2^MAX_ORDER pages in a zone to be
>equal where
>as I'm guessing you don't. Until they are split, there is
>nothing special
>about them. It is only when it is split that I want it reserved for a
>purpose.
>
>However, if we knew there were blocks that were hot-pluggable, we could
>just have a hotplug-global and non-hotplug-global pool. If
>it's a UserRclm
>or KernRclm allocation, split from hotplug-global, otherwise use
>non-hotplug-global. It'd increase the memory requirements of
>the patch a
>bit though.

Exactly. Perhaps this could just be isolated via the
CONFIG_MEMORY_HOTPLUG build option, thus not increasing the memory
requirements in the common case...

matt

2005-01-19 13:17:50

by Mel Gorman

Subject: RE: [RFC] Avoiding fragmentation through different allocator

On Mon, 17 Jan 2005, Tolentino, Matthew E wrote:

> >I considered adding a new zone but I felt it would be a massive job for
> >what I considered to be a simple problem. I think my approach is nice
> >and isolated within the allocator itself and will be less likely to
> >affect other code.
>
> Just for clarity, I prefer this approach over adding zones,
> hence my pursuit of something akin to it.
>

Ok, cool.

> >One possibility is that we could say that the UserRclm and KernRclm pools
> >are always eligible for hotplug and have hotplug banks only
> >satisfy those
> >allocations pushing KernNonRclm allocations to fixed banks. How is it
> >currently known if a bank of memory is hotplug? Is there a
> >node for each
> >hotplug bank? If yes, we could flag those nodes to only
> >satisfy UserRclm
> >and KernRclm allocations and force fallback to other nodes.
>
> The hardware/firmware has to tell the kernel in some way. In
> my case it is ACPI that delineates between regions that may be
> removed. No, there isn't a node for each bank of hot-plug
> memory. The reason I was pursuing this was to be able to
> avoid coarse granularity distinctions like that.

As there is not a node for each hotplug bank, this has to happen at the
zone level. Architectures have the option of defining their own
memmap_init() although only ia64 takes advantage of it. If we wanted to be
able to identify hotplug pages in an independent manner, we would
implement memmap_init() for hotplug and fill zone->free_area_usemap
accordingly. Currently the bitmap in there has two bits for each
2^MAX_ORDER block, which look like:

00 = Kernel non-reclaimable
10 = Kernel reclaimable
01 = User reclaimable

So, we could say 11 is for hotplug memory. Alternatively if it is possible
to have a system that consists entirely of hotplug memory, we could add a
third bit. As it is one bit per 2^MAX_ORDER pages in the system, it would
not be a big chunk of memory.
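
As an untested sketch (ALLOC_HOTPLUG being a hypothetical value; the rest
builds on get_pageblock_type() from the patch), the lookup could become
something like:

#define ALLOC_HOTPLUG 3	/* hypothetical: the 11 bit pattern */

static int get_pageblock_type(struct page *page)
{
	struct zone *zone = page_zone(page);
	int bitidx = ((page - zone->zone_mem_map) >> MAX_ORDER) * 2;
	int bit1 = test_bit(bitidx, zone->free_area_usemap);
	int bit2 = test_bit(bitidx + 1, zone->free_area_usemap);

	if (bit1 && bit2)
		return ALLOC_HOTPLUG;	/* 11: block is in a hotplug bank */
	if (bit1)
		return ALLOC_KERNRCLM;	/* 10: kernel reclaimable */
	if (bit2)
		return ALLOC_USERRCLM;	/* 01: user reclaimable */
	return ALLOC_KERNNORCLM;	/* 00: kernel non-reclaimable */
}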

The question is what to do with that information then. We can't just say
that User pages go to hotplug regions, as that would introduce two problems:
one of balancing, and a second of what happens when an unreclaimable page
is in a bank we want to move without page migration in place.

> >The danger is
> >that allocations would fail because non-hotplug banks were already full
> >and pageout would not happen because the watermarks were satisfied.
>
> Which implies a potential need for balancing between user/kernel
> lists, no?
>

If we're not careful, yes, and I have a gut feeling that says we should not
need to be balancing anything for hotplug.

> >If you have already posted a version of the patch (you have
> >feedback so I
> >guess it's there somewhere), can you send me a link to the thread where
> >you introduced your approach? It's possible that we just need
> >to merge the
> >ideas.
>
> No, I hadn't posted it yet due to chasing a bug. However, perhaps
> now I'll instead focus on adding the necessary hotplug support
> into your patch, hence merging the hotplug requirements/ideas?
>

I've no problem with that. It's just a case of how we use the information
exactly.

> >It's because I consider all 2^MAX_ORDER pages in a zone to be
> >equal where
> >as I'm guessing you don't. Until they are split, there is
> >nothing special
> >about them. It is only when it is split that I want it reserved for a
> >purpose.
> >
> >However, if we knew there were blocks that were hot-pluggable, we could
> >just have a hotplug-global and non-hotplug-global pool. If
> >it's a UserRclm
> >or KernRclm allocation, split from hotplug-global, otherwise use
> >non-hotplug-global. It'd increase the memory requirements of
> >the patch a
> >bit though.
>
> Exactly. Perhaps this could just be isolated via the
> CONFIG_MEMORY_HOTPLUG build option, thus not increasing the memory
> requirements in the common case...
>

I think so. The changes would be to remove pages from the global pool via
macros rather than directly, and to have hotplug and non-hotplug versions
of those macros (rough sketch below).
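
Something along these lines, perhaps (very rough sketch, all names
hypothetical):

#ifdef CONFIG_MEMORY_HOTPLUG
/* reclaimable allocations split blocks from the hotplug pool,
 * kernel non-reclaimable ones from the fixed pool */
#define global_pool(zone, alloctype)					\
	((alloctype) == ALLOC_KERNNORCLM				\
		? &(zone)->free_area_global_nonhotplug			\
		: &(zone)->free_area_global_hotplug)
#else
#define global_pool(zone, alloctype)	(&(zone)->free_area_global)
#endif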

We just need to think more about what happens when a kernel allocation comes
along, fixed memory is depleted, but there is plenty of hotplug memory
left.

--
Mel Gorman