2005-01-20 10:15:39

by Mel Gorman

Subject: [PATCH] Avoiding fragmentation through different allocator

Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage

Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
with offsets to 2.6.11-rc1-mm1

Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists

Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
anything to do with asynchronous IO

Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
allocation type
o In-code documentation

Version 1
o Initial release against 2.6.9

This patch divides allocations into three different types:

UserReclaimable - These are userspace pages that are easily reclaimable. Right
now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are
in this category. These pages are trivially reclaimed by writing
the page out to swap or syncing with backing storage

KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs

KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type

Instead of having one global MAX_ORDER-sized array of free lists, there are
three, one for each type of allocation. Finally, there is a list of pages of
size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals
with.

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if we wanted a 2^MAX_ORDER block of pages, we could linearly scan a block of
pages allocated for UserReclaimable and page each of them out.

Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
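
To make the selection and fallback policy concrete, here is a small standalone
sketch in C (an illustration only, not kernel code) that mirrors the
fallback_allocs table and the gfp-flag checks that __rmqueue does in the patch
below:

#include <stdio.h>

#define __GFP_KERNRCLM  0x10000  /* kernel page that is easily reclaimable */
#define __GFP_USERRCLM  0x20000  /* easily reclaimable userspace page */

enum { ALLOC_KERNNORCLM, ALLOC_KERNRCLM, ALLOC_USERRCLM, ALLOC_TYPES };

/* For each allocation type, the order in which the per-type free lists are
 * tried once the preferred lists and the global MAX_ORDER pool are empty */
static const int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES] = {
	{ ALLOC_KERNNORCLM, ALLOC_KERNRCLM,   ALLOC_USERRCLM },
	{ ALLOC_KERNRCLM,   ALLOC_KERNNORCLM, ALLOC_USERRCLM },
	{ ALLOC_USERRCLM,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM },
};

/* Classify a gfp mask the same way __rmqueue does */
static int alloc_type(unsigned int gfp_flags)
{
	if (gfp_flags & __GFP_USERRCLM)
		return ALLOC_USERRCLM;
	if (gfp_flags & __GFP_KERNRCLM)
		return ALLOC_KERNRCLM;
	return ALLOC_KERNNORCLM;	/* hardest to reclaim is the default */
}

int main(void)
{
	static const char *names[] = { "KernNoRclm", "KernRclm", "UserRclm" };
	int type = alloc_type(__GFP_USERRCLM);	/* e.g. a GFP_USER allocation */

	printf("preferred list: %s, fallbacks: %s then %s\n",
	       names[fallback_allocs[type][0]],
	       names[fallback_allocs[type][1]],
	       names[fallback_allocs[type][2]]);
	return 0;
}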

Three benchmark results are included. The first is the output of portions
of AIM9 for the vanilla allocator and the modified one:

root@monocle:~# grep _test aim9-vanilla-120.txt
7 page_test 120.00 9508 79.23333 134696.67 System Allocations & Pages/second
8 brk_test 120.01 3401 28.33931 481768.19 System Memory Allocations/second
9 jmp_test 120.00 498718 4155.98333 4155983.33 Non-local gotos/second
10 signal_test 120.01 11768 98.05850 98058.50 Signal Traps/second
11 exec_test 120.04 1585 13.20393 66.02 Program Loads/second
12 fork_test 120.04 1979 16.48617 1648.62 Task Creations/second
13 link_test 120.01 11174 93.10891 5865.86 Link/Unlink Pairs/second
root@monocle:~# grep _test aim9-mbuddyV3-120.txt
7 page_test 120.01 9660 80.49329 136838.60 System Allocations & Pages/second
8 brk_test 120.01 3409 28.40597 482901.42 System Memory Allocations/second
9 jmp_test 120.00 501533 4179.44167 4179441.67 Non-local gotos/second
10 signal_test 120.00 11677 97.30833 97308.33 Signal Traps/second
11 exec_test 120.05 1585 13.20283 66.01 Program Loads/second
12 fork_test 120.05 1889 15.73511 1573.51 Task Creations/second
13 link_test 120.01 11089 92.40063 5821.24 Link/Unlink Pairs/second

They show that the allocator performs roughly the same as the standard
allocator, so there is negligible slowdown from the extra complexity. The
second benchmark tested CPU cache usage to make sure it was not getting
clobbered. The test was to render a large PostScript file 10 times
and take the average. The result is:

==> gsbench-2.6.11-rc1Standard.txt <==
Average: 115.468 real, 115.092 user, 0.337 sys

==> gsbench-2.6.11-rc1MBuddy.txt <==
Average: 115.47 real, 115.136 user, 0.338 sys


So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following:

1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j4
5. Start compiling the primary tree
6. Sleep 3 minutes while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
- note, it only attempts 160 times, no matter how often it succeeds
- An allocation is attempted every 1/10th of a second
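
For illustration only, a kernel module performing the loop in step 7 might look
like the sketch below. This is not the actual test module used for these
results; in particular, freeing each successful block straight back is an
assumption made here just to keep the sketch short:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/delay.h>

static int __init highorder_test_init(void)
{
	int attempts = 160, success = 0, i;

	for (i = 0; i < attempts; i++) {
		/* Try for a 2^10 block of pages (4MB with 4K pages) */
		struct page *page = alloc_pages(GFP_KERNEL, 10);

		if (page) {
			success++;
			__free_pages(page, 10);	/* assumption: give it back */
		}
		msleep(100);	/* one attempt every 1/10th of a second */
	}

	printk(KERN_INFO "high-order test: %d/%d allocations succeeded\n",
	       success, attempts);
	return 0;
}

static void __exit highorder_test_exit(void)
{
}

module_init(highorder_test_init);
module_exit(highorder_test_exit);
MODULE_LICENSE("GPL");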

The results of the allocations under load were:

Vanilla 2.6.11-rc1
Attempted allocations: 160
Success allocs: 3
Failed allocs: 157
% Success: 1

2.6.11-rc1 with modified allocator
Attempted allocations: 160
Success allocs: 81
Failed allocs: 79
% Success: 50

The results show that the modified allocator runs at least as fast as the
normal allocator and has no adverse cache effects, but is far less fragmented
and better able to satisfy high-order allocations.

Signed-off-by: Mel Gorman <[email protected]>


diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/buffer.c linux-2.6.11-rc1-mbuddy/fs/buffer.c
--- linux-2.6.11-rc1-clean/fs/buffer.c 2005-01-12 04:01:23.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/fs/buffer.c 2005-01-20 09:43:46.000000000 +0000
@@ -1134,7 +1134,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;

- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS | __GFP_USERRCLM);
if (!page)
return NULL;

@@ -2997,7 +2998,8 @@ static void recalc_bh_state(void)

struct buffer_head *alloc_buffer_head(int gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ gfp_flags|__GFP_KERNRCLM);
if (ret) {
preempt_disable();
__get_cpu_var(bh_accounting).nr++;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/dcache.c linux-2.6.11-rc1-mbuddy/fs/dcache.c
--- linux-2.6.11-rc1-clean/fs/dcache.c 2005-01-12 04:00:09.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/fs/dcache.c 2005-01-20 09:43:46.000000000 +0000
@@ -715,7 +715,8 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;

- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache,
+ GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/ext2/super.c linux-2.6.11-rc1-mbuddy/fs/ext2/super.c
--- linux-2.6.11-rc1-clean/fs/ext2/super.c 2005-01-12 04:01:24.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/fs/ext2/super.c 2005-01-20 09:43:46.000000000 +0000
@@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/ext3/super.c linux-2.6.11-rc1-mbuddy/fs/ext3/super.c
--- linux-2.6.11-rc1-clean/fs/ext3/super.c 2005-01-12 04:02:11.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/fs/ext3/super.c 2005-01-20 09:43:46.000000000 +0000
@@ -434,7 +434,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;

- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/ntfs/inode.c linux-2.6.11-rc1-mbuddy/fs/ntfs/inode.c
--- linux-2.6.11-rc1-clean/fs/ntfs/inode.c 2005-01-12 04:01:45.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/fs/ntfs/inode.c 2005-01-20 09:43:46.000000000 +0000
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc

ntfs_debug("Entering.");
ni = (ntfs_inode *)kmem_cache_alloc(ntfs_big_inode_cache,
- SLAB_NOFS);
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -343,7 +343,8 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;

ntfs_debug("Entering.");
- ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache,
+ SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/include/linux/gfp.h linux-2.6.11-rc1-mbuddy/include/linux/gfp.h
--- linux-2.6.11-rc1-clean/include/linux/gfp.h 2005-01-12 04:00:35.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/include/linux/gfp.h 2005-01-20 09:43:46.000000000 +0000
@@ -38,21 +38,24 @@ struct vm_area_struct;
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
#define __GFP_ZERO 0x8000 /* Return zeroed page on success */
+#define __GFP_KERNRCLM 0x10000 /* Kernel page that is easily reclaimable */
+#define __GFP_USERRCLM 0x20000 /* Userspace page that is easily reclaimable */

-#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 18 /* Room for 18 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
- __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+ __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
+ __GFP_USERRCLM|__GFP_KERNRCLM)

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USERRCLM)
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_USERRCLM)

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/include/linux/mmzone.h linux-2.6.11-rc1-mbuddy/include/linux/mmzone.h
--- linux-2.6.11-rc1-clean/include/linux/mmzone.h 2005-01-12 04:01:17.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/include/linux/mmzone.h 2005-01-20 09:43:46.000000000 +0000
@@ -19,6 +19,10 @@
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
+#define ALLOC_TYPES 3
+#define ALLOC_KERNNORCLM 0
+#define ALLOC_KERNRCLM 1
+#define ALLOC_USERRCLM 2

struct free_area {
struct list_head free_list;
@@ -131,8 +135,37 @@ struct zone {
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];

+ /*
+ * There are ALLOC_TYPE number of MAX_ORDER free lists. Once a
+ * MAX_ORDER block of pages has been split for an allocation type,
+ * the whole block is reserved for that type of allocation. The
+ * types are User Reclaimable, Kernel Reclaimable and Kernel
+ * Non-reclaimable. The objective is to reduce fragmentation
+ * overall
+ */
+ struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER];
+
+ /*
+ * This is a list of page blocks of 2^MAX_ORDER. Once one of
+ * these are split, the buddy is added to the appropriate
+ * free_area_lists. When the buddies are later merged, they
+ * are placed back here
+ */
+ struct free_area free_area_global;
+
+ /*
+ * This map tracks what each 2^MAX_ORDER sized block has been used for.
+ * Each 2^MAX_ORDER block of pages has 2 bits in this map to remember
+ * what the block is for. When a page is freed, its index within this
+ * bitmap is calculated using (address >> MAX_ORDER) * 2 . This means
+ * that pages will always be freed into the correct list in
+ * free_area_lists
+ *
+ * The bits are set when a 2^MAX_ORDER block of pages is split
+ */
+
+ unsigned long *free_area_usemap;

ZONE_PADDING(_pad1_)

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/mm/page_alloc.c linux-2.6.11-rc1-mbuddy/mm/page_alloc.c
--- linux-2.6.11-rc1-clean/mm/page_alloc.c 2005-01-12 04:00:02.000000000 +0000
+++ linux-2.6.11-rc1-mbuddy/mm/page_alloc.c 2005-01-20 09:50:22.000000000 +0000
@@ -46,9 +46,30 @@ unsigned long totalhigh_pages;
long nr_swap_pages;
int sysctl_lower_zone_protection = 0;

+/* Bean counters for the per-type buddy allocator */
+int fallback_count[ALLOC_TYPES] = { 0, 0, 0};
+int global_steal=0;
+int global_refill=0;
+int kernnorclm_count=0;
+int kernrclm_count=0;
+int userrclm_count=0;
+
EXPORT_SYMBOL(totalram_pages);
EXPORT_SYMBOL(nr_swap_pages);

+/**
+ * The allocator tries to put allocations of the same type in the
+ * same 2^MAX_ORDER blocks of pages. When memory is low, this may
+ * not be possible so this describes what order they should fall
+ * back on
+ */
+int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES] = {
+ { ALLOC_KERNNORCLM, ALLOC_KERNRCLM, ALLOC_USERRCLM },
+ { ALLOC_KERNRCLM, ALLOC_KERNNORCLM, ALLOC_USERRCLM },
+ { ALLOC_USERRCLM, ALLOC_KERNNORCLM, ALLOC_KERNRCLM }
+};
+
+
/*
* Used by page_zone() to look up the address of the struct zone whose
* id is encoded in the upper bits of page->flags
@@ -57,6 +78,7 @@ struct zone *zone_table[1 << (ZONES_SHIF
EXPORT_SYMBOL(zone_table);

static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
+static char *type_names[ALLOC_TYPES] = { "KernNoRclm", "KernRclm", "UserRclm"};
int min_free_kbytes = 1024;

unsigned long __initdata nr_kernel_pages;
@@ -103,6 +125,46 @@ static void bad_page(const char *functio
tainted |= TAINT_BAD_PAGE;
}

+/*
+ * Return what type of allocation the 2^MAX_ORDER block of pages
+ * that the given page is part of is being used for
+ */
+static inline int get_pageblock_type(struct page *page) {
+ struct zone *zone = page_zone(page);
+ int bitidx = ((page - zone->zone_mem_map) >> MAX_ORDER) * 2;
+
+ /* Bit 1 will be set if the block is kernel reclaimable */
+ if (test_bit(bitidx,zone->free_area_usemap)) return ALLOC_KERNRCLM;
+
+ /* Bit 2 will be set if the block is user reclaimable */
+ if (test_bit(bitidx+1, zone->free_area_usemap)) return ALLOC_USERRCLM;
+
+ return ALLOC_KERNNORCLM;
+}
+
+static inline void set_pageblock_type(struct page *page,
+ struct zone *zone, int type) {
+ int bit1, bit2;
+ int bitidx = ((page - zone->zone_mem_map) >> MAX_ORDER) * 2;
+ bit1 = bit2 = 0;
+
+ if (type == ALLOC_KERNRCLM) {
+ set_bit(bitidx, zone->free_area_usemap);
+ clear_bit(bitidx+1, zone->free_area_usemap);
+ return;
+ }
+
+ if (type == ALLOC_USERRCLM) {
+ clear_bit(bitidx, zone->free_area_usemap);
+ set_bit(bitidx+1, zone->free_area_usemap);
+ return;
+ }
+
+ clear_bit(bitidx, zone->free_area_usemap);
+ clear_bit(bitidx+1, zone->free_area_usemap);
+
+}
+
#ifndef CONFIG_HUGETLB_PAGE
#define prep_compound_page(page, order) do { } while (0)
#define destroy_compound_page(page, order) do { } while (0)
@@ -231,6 +293,8 @@ static inline void __free_pages_bulk (st
unsigned long page_idx;
struct page *coalesced;
int order_size = 1 << order;
+ struct free_area *area;
+ struct free_area *freelist;

if (unlikely(order))
destroy_compound_page(page, order);
@@ -240,9 +304,11 @@ static inline void __free_pages_bulk (st
BUG_ON(page_idx & (order_size - 1));
BUG_ON(bad_range(zone, page));

+ /* Select the areas to allocate based on the allocation type */
+ freelist = zone->free_area_lists[get_pageblock_type(page)];
+
zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
- struct free_area *area;
struct page *buddy;
int buddy_idx;

@@ -254,16 +320,29 @@ static inline void __free_pages_bulk (st
break;
/* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = freelist + order;
area->nr_free--;
rmv_page_order(buddy);
page_idx &= buddy_idx;
order++;
}
+
+ /*
+ * If a MAX_ORDER block of pages is being freed, it is
+ * no longer reserved for a particular type of allocation
+ * so put it in the global list
+ */
+ if (order >= MAX_ORDER-1) {
+ area = &(zone->free_area_global);
+ global_refill++;
+ } else {
+ area = freelist + order;
+ }
+
coalesced = base + page_idx;
set_page_order(coalesced, order);
- list_add(&coalesced->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ list_add(&coalesced->lru, &area->free_list);
+ area->nr_free++;
}

static inline void free_pages_check(const char *function, struct page *page)
@@ -310,6 +389,7 @@ free_pages_bulk(struct zone *zone, int c
zone->pages_scanned = 0;
while (!list_empty(list) && count--) {
page = list_entry(list->prev, struct page, lru);
+
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
__free_pages_bulk(page, base, zone, order);
@@ -420,16 +500,43 @@ static void prep_new_page(struct page *p
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int flags)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
+ int global_split=0;
+ int *fallback_list;

- for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
- if (list_empty(&area->free_list))
+ /* Select area to use based on gfp_flags */
+ int alloctype;
+ int retry_count=0;
+ int startorder = order;
+ if (flags & __GFP_USERRCLM) {
+ alloctype = ALLOC_USERRCLM;
+ userrclm_count++;
+ }
+ else if (flags & __GFP_KERNRCLM) {
+ alloctype = ALLOC_KERNRCLM;
+ kernrclm_count++;
+ } else {
+ alloctype = ALLOC_KERNNORCLM;
+ kernnorclm_count++;
+ }
+
+ /* Ok, pick the fallback order based on the type */
+ fallback_list = fallback_allocs[alloctype];
+
+retry:
+ alloctype = fallback_list[retry_count];
+ area = zone->free_area_lists[alloctype] + startorder;
+ for (current_order = startorder;
+ current_order < MAX_ORDER; ++current_order) {
+
+ if (list_empty(&area->free_list)) {
+ area++;
continue;
+ }

page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
@@ -439,6 +546,36 @@ static struct page *__rmqueue(struct zon
return expand(zone, page, order, current_order, area);
}

+ /* Take from the global pool if this is the first attempt */
+ if (!global_split && !list_empty(&(zone->free_area_global.free_list))){
+ /*
+ * Remove a MAX_ORDER block from the global pool and add
+ * it to the list of desired alloc_type
+ */
+ page = list_entry(zone->free_area_global.free_list.next,
+ struct page, lru);
+ list_del(&page->lru);
+ list_add(&page->lru,
+ &(zone->free_area_lists[alloctype][MAX_ORDER-1].free_list));
+ global_steal++;
+ global_split=1;
+
+ /* Mark this block of pages as for use with this alloc type */
+ set_pageblock_type(page, zone, alloctype);
+ startorder = MAX_ORDER-1;
+
+ goto retry;
+ }
+
+ /*
+ * Here, the alloc type lists has been depleted as well as the global
+ * pool, so fallback
+ */
+ retry_count++;
+ startorder=order;
+ fallback_count[alloctype]++;
+ if (retry_count != ALLOC_TYPES) goto retry;
+
return NULL;
}

@@ -448,7 +585,8 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int gfp_flags)
{
unsigned long flags;
int i;
@@ -457,7 +595,7 @@ static int rmqueue_bulk(struct zone *zon

spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, gfp_flags);
if (page == NULL)
break;
allocated++;
@@ -493,7 +631,7 @@ static void __drain_pages(unsigned int c
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
- int order;
+ int order, type;
struct list_head *curr;

if (!zone->spanned_pages)
@@ -503,14 +641,17 @@ void mark_free_pages(struct zone *zone)
for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
- unsigned long start_pfn, i;
+ for (type=0; type < ALLOC_TYPES; type++) {

- start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
+ for (order = MAX_ORDER - 1; order >= 0; --order)
+ list_for_each(curr, &zone->free_area_lists[type][order].free_list) {
+ unsigned long start_pfn, i;
+
+ start_pfn = page_to_pfn(list_entry(curr, struct page, lru));

- for (i=0; i < (1<<order); i++)
- SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ for (i=0; i < (1<<order); i++)
+ SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -612,14 +753,15 @@ buffered_rmqueue(struct zone *zone, int
struct page *page = NULL;
int cold = !!(gfp_flags & __GFP_COLD);

- if (order == 0) {
+ if (order == 0 && (gfp_flags & __GFP_USERRCLM)) {
struct per_cpu_pages *pcp;

pcp = &zone->pageset[get_cpu()].pcp[cold];
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list,
+ gfp_flags);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -631,7 +773,7 @@ buffered_rmqueue(struct zone *zone, int

if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, gfp_flags);
spin_unlock_irqrestore(&zone->lock, flags);
}

@@ -658,6 +800,7 @@ int zone_watermark_ok(struct zone *z, in
{
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
+ struct free_area *kernnorclm, *kernrclm, *userrclm;
int o;

if (gfp_high)
@@ -667,9 +810,15 @@ int zone_watermark_ok(struct zone *z, in

if (free_pages <= min + z->protection[alloc_type])
return 0;
+ kernnorclm = z->free_area_lists[ALLOC_KERNNORCLM];
+ kernrclm = z->free_area_lists[ALLOC_KERNRCLM];
+ userrclm = z->free_area_lists[ALLOC_USERRCLM];
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (
+ kernnorclm[o].nr_free +
+ kernrclm[o].nr_free +
+ userrclm[o].nr_free) << o;

/* Require fewer higher order pages to be free */
min >>= 1;
@@ -1124,6 +1273,7 @@ void show_free_areas(void)
unsigned long inactive;
unsigned long free;
struct zone *zone;
+ int type;

for_each_zone(zone) {
show_node(zone);
@@ -1216,8 +1366,10 @@ void show_free_areas(void)

spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
- total += nr << order;
+ for (type=0; type < ALLOC_TYPES; type++) {
+ nr = zone->free_area_lists[type][order].nr_free;
+ total += nr << order;
+ }
printk("%lu*%lukB ", nr, K(1UL) << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1515,10 +1667,22 @@ void zone_init_free_lists(struct pglist_
unsigned long size)
{
int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ int type;
+ struct free_area *area;
+
+ /* Initialise the three size-ordered lists of free_areas */
+ for (type=0; type < ALLOC_TYPES; type++) {
+ for (order = 0; order < MAX_ORDER; order++) {
+ area = zone->free_area_lists[type];
+
+ INIT_LIST_HEAD(&area[order].free_list);
+ area[order].nr_free = 0;
+ }
}
+
+ /* Initialise the global pool of 2^size pages */
+ INIT_LIST_HEAD(&zone->free_area_global.free_list);
+ zone->free_area_global.nr_free=0;
}

#ifndef __HAVE_ARCH_MEMMAP_INIT
@@ -1539,6 +1703,7 @@ static void __init free_area_init_core(s
const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
int cpu, nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
+ unsigned long usemapsize;

pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
@@ -1637,6 +1802,22 @@ static void __init free_area_init_core(s
zone_start_pfn += size;

zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+
+ /* Calculate size of required bitmap */
+ /* - Number of MAX_ORDER blocks in the zone */
+ usemapsize = (size + (1 << MAX_ORDER)) >> MAX_ORDER;
+
+ /* - Two bits to record what type of block it is */
+ usemapsize = (usemapsize * 2 + 8) / 8;
+
+ zone->free_area_usemap =
+ (unsigned long *)alloc_bootmem_node(pgdat, usemapsize);
+
+ memset((unsigned long *)zone->free_area_usemap,
+ ALLOC_KERNNORCLM, usemapsize);
+
+ printk(KERN_DEBUG " %s zone: %lu pages, %lu real pages, usemap size:%lu\n",
+ zone_names[j], size, realsize, usemapsize);
}
}

@@ -1714,19 +1895,88 @@ static int frag_show(struct seq_file *m,
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
+ int order, type;
+ struct list_head *elem;
+ unsigned long nr_bufs = 0;

+ /* Show global fragmentation statistics */
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;

spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ for (order = 0; order < MAX_ORDER-1; ++order) {
+ nr_bufs = 0;
+
+ for (type=0; type < ALLOC_TYPES; type++) {
+ list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+ ++nr_bufs;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+
+ /* Scan global list */
+ nr_bufs = 0;
+ list_for_each(elem, &(zone->free_area_global.free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ seq_putc(m, '\n');
+ }
+
+ /* Show statistics for each allocation type */
+ seq_printf(m, "\nPer-allocation-type statistics");
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!zone->present_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (type=0; type < ALLOC_TYPES; type++) {
+ struct list_head *elem;
+ seq_printf(m, "\nNode %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ type_names[type]);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ nr_bufs = 0;
+
+ list_for_each(elem, &(zone->free_area_lists[type][order].free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+ }
+
+ /* Scan global list */
+ seq_printf(m, "\n");
+ seq_printf(m, "Node %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ "MAX_ORDER");
+ nr_bufs = 0;
+ list_for_each(elem, &(zone->free_area_global.free_list))
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+
+ /* Show bean counters */
+ seq_printf(m, "\nGlobal beancounters\n");
+ seq_printf(m, "Global steals: %d\n", global_steal);
+ seq_printf(m, "Global refills: %d\n", global_refill);
+ seq_printf(m, "KernNoRclm allocs: %d\n", kernnorclm_count);
+ seq_printf(m, "KernRclm allocs: %d\n", kernrclm_count);
+ seq_printf(m, "UserRclm allocs: %d\n", userrclm_count);
+ seq_printf(m, "%-10s Fallback count: %d\n", type_names[0],
+ fallback_count[0]);
+ seq_printf(m, "%-10s Fallback count: %d\n", type_names[1],
+ fallback_count[1]);
+ seq_printf(m, "%-10s Fallback count: %d\n", type_names[2],
+ fallback_count[2]);
+
+
+
return 0;
}


2005-01-21 18:00:04

by Marcelo Tosatti

Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:
> Changelog since V5
> o Fixed up gcc-2.95 errors
> o Fixed up whitespace damage
>
> Changelog since V4
> o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
> with offsets to 2.6.11-rc1-mm1
>
> Changelog since V3
> o inlined get_pageblock_type() and set_pageblock_type()
> o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
> o When taking from the global pool, do not scan all the low-order lists
>
> Changelog since V2
> o Do not to interfere with the "min" decay
> o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
> anything to do with asynchronous IO
>
> Changelog since V1
> o Update patch to 2.6.11-rc1
> o Cleaned up bug where memory was wasted on a large bitmap
> o Remove code that needed the binary buddy bitmaps
> o Update flags to avoid colliding with __GFP_ZERO changes
> o Extended fallback_count bean counters to show the fallback count for each
> allocation type
> o In-code documentation

Hi Mel,

I was thinking that it would be nice to have a set of high-order-intensive workloads,
and I wonder which high-order allocation paths most commonly fail.

Does it mostly depend on hardware, because most high-order allocations happen inside
device drivers? What are the kernel codepaths which try to do high-order allocations
and fall back if they fail?

To measure whether the cost of page migration offsets the benefit of being able to deliver
high-order allocations, we want a set of meaningful performance tests?

It's quite possible that not all unsatisfiable high-order allocations want to
force page migration (which is quite expensive in terms of CPU/cache). Only migrate on
__GFP_NOFAIL?

William, that same tradeoff exists for the zone balancing through migration idea
you propose...


2005-01-22 21:48:32

by Mel Gorman

Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Fri, 21 Jan 2005, Marcelo Tosatti wrote:

> On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:
> > <Changelog snipped>
>
> Hi Mel,
>
> I was thinking that it would be nice to have a set of high-order
> intensive workloads, and I wonder what are the most common high-order
> allocation paths which fail.
>

Agreed. As I am not fully sure what workloads require high-order
allocations, I updated VMRegress to keep track of the count of
allocations and released 0.11
(http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
use it to track allocations, do the following

1. Download and unpack vmregress
2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
The patch currently requires the modified allocator but I can fix that up
if people want it. Build and deploy the kernel
3. Build vmregress by
./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
(or whatever path is appropriate)
make
4. Load the modules with;
insmod src/code/vmregress_core.ko
insmod src/sense/trace_alloccount.ko

This will create a proc entry /proc/vmregress/trace_alloccount that looks
something like;

Allocations (V1)
-----------
KernNoRclm 997453 370 50 0 0 0 0 0 0 0 0
KernRclm 35279 0 0 0 0 0 0 0 0 0 0
UserRclm 9870808 0 0 0 0 0 0 0 0 0 0
Total 10903540 370 50 0 0 0 0 0 0 0 0

Frees
-----
KernNoRclm 590965 244 28 0 0 0 0 0 0 0 0
KernRclm 227100 60 5 0 0 0 0 0 0 0 0
UserRclm 7974200 73 17 0 0 0 0 0 0 0 0
Total 19695805 747 100 0 0 0 0 0 0 0 0

To blank the counters, use

echo 0 > /proc/vmregress/trace_alloccount

Whatever workload we come up with, this proc entry will tell us if it is
exercising high-order allocations right now.

> It mostly depends on hardware because most high-order allocations happen
> inside device drivers? What are the kernel codepaths which try to do
> high-order allocations and fallback if failed?
>

I'm not sure. I think that the paths we exercise right now will be largely
artificial. For example, you can force order-2 allocations by scping a
large file through localhost (because of the large MTU on that interface).
I have not come up with another meaningful workload that guarantees
high-order allocations yet.

> To measure whether the cost of page migration offsets the ability to be
> able to deliver high-order allocations we want a set of meaningful
> performance tests?
>

Bear in mind, there are more considerations. The allocator potentially
makes hotplug problems easier and could be easily tied into any
page-zeroing system. Some of your own benchmarks also implied that the
modified allocator helped some types of workloads, which is beneficial in
itself. The last consideration is HugeTLB pages, on which I am hoping William
will weigh in.

Right now, I believe that the pool of huge pages is of a fixed size
because of fragmentation difficulties. If we knew we could allocate huge
pages, this pool would not have to be fixed. Some applications will
heavily benefit from this. While databases are the obvious one,
applications with large heaps will also benefit like Java Virtual
Machines. I can dig up papers that measured this on Solaris although I
don't have them at hand right now.

We know right now that the overhead of this allocator is fairly low
(anyone got benchmarks to disagree?) but I understand that page migration
is relatively expensive. The allocator also does not have adverse
CPU+cache effects like migration does, and the concept is fairly simple.

> Its quite possible that not all unsatisfiable high-order allocations
> want to force page migration (which is quite expensive in terms of
> CPU/cache). Only migrate on __GFP_NOFAIL ?
>

I still believe with the allocator, we will only have to migrate in
exceptional circumstances.

> William, that same tradeoff exists for the zone balancing through
> migration idea you propose...
>

--
Mel Gorman

2005-01-23 01:34:19

by Marcelo Tosatti

Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Sat, Jan 22, 2005 at 09:48:20PM +0000, Mel Gorman wrote:
> On Fri, 21 Jan 2005, Marcelo Tosatti wrote:
>
> > On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:
> > > <Changelog snipped>
> >
> > Hi Mel,
> >
> > I was thinking that it would be nice to have a set of high-order
> > intensive workloads, and I wonder what are the most common high-order
> > allocation paths which fail.
> >
>
> Agreed. As I am not fully sure what workloads require high-order
> allocations, I updated VMRegress to keep track of the count of
> allocations and released 0.11
> (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> use it to track allocations, do the following
>
> 1. Download and unpack vmregress
> 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
> The patch currently requires the modified allocator but I can fix that up
> if people want it. Build and deploy the kernel
> 3. Build vmregress by
> ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
> (or whatever path is appropriate)
> make
> 4. Load the modules with;
> insmod src/code/vmregress_core.ko
> insmod src/sense/trace_alloccount.ko
>
> This will create a proc entry /proc/vmregress/trace_alloccount that looks
> something like;
>
> Allocations (V1)
> -----------
> KernNoRclm 997453 370 50 0 0 0 0 0 0 0 0
> KernRclm 35279 0 0 0 0 0 0 0 0 0 0
> UserRclm 9870808 0 0 0 0 0 0 0 0 0 0
> Total 10903540 370 50 0 0 0 0 0 0 0 0
>
> Frees
> -----
> KernNoRclm 590965 244 28 0 0 0 0 0 0 0 0
> KernRclm 227100 60 5 0 0 0 0 0 0 0 0
> UserRclm 7974200 73 17 0 0 0 0 0 0 0 0
> Total 19695805 747 100 0 0 0 0 0 0 0 0
>
> To blank the counters, use
>
> echo 0 > /proc/vmregress/trace_alloccount
>
> Whatever workload we come up with, this proc entry will tell us if it is
> exercising high-order allocations right now.

Great, excellent! Thanks.

I plan to spend some time testing and trying to understand the vmregress package
this week.

> > It mostly depends on hardware because most high-order allocations happen
> > inside device drivers? What are the kernel codepaths which try to do
> > high-order allocations and fallback if failed?
> >
>
> I'm not sure. I think that the paths we exercise right now will be largely
> artifical. For example, you can force order-2 allocations by scping a
> large file through localhost (because of the large MTU in that interface).
> I have not come up with another meaningful workload that guarentees
> high-order allocations yet.

Thoughts and criticism of the following ideas are very much appreciated:

In private conversation with wli (who helped me by providing this information) we can
conjecture the following:

Modern IO devices are capable of doing scatter/gather IO.

There is overhead associated with setting up and managing the scatter/gather tables.

The benefit of large physically contiguous blocks is the ability to avoid the SG
management overhead.

Now the question is: does the added overhead of allocating high-order blocks through migration
offset the overhead of SG IO? Quantifying that is interesting.

This depends on the driver implementation (how efficiently it's able to manage the SG IO tables) and
device/IO subsystem characteristics.

Also filesystems benefit from big physically contiguous blocks. Quoting wli
"they want bigger blocks and contiguous memory to match bigger blocks..."

I completely agree that your simplified allocator decreases fragmentation, which in turn
benefits the system overall.

This is an area which can be further improved - i.e. the efficiency in reducing fragmentation
is excellent.
I sincerely appreciate the work you are doing!

> > To measure whether the cost of page migration offsets the ability to be
> > able to deliver high-order allocations we want a set of meaningful
> > performance tests?
> >
>
> Bear in mind, there are more considerations. The allocator potentially
> makes hotplug problems easier and could be easily tied into any
> page-zeroing system. Some of your own benchmarks also implied that the
> modified allocator helped some types of workloads which is beneficial in
> itself.The last consideration is HugeTLB pages, which I am hoping William
> will weigh in.
>
> Right now, I believe that the pool of huge pages is of a fixed size
> because of fragmentation difficulties. If we knew we could allocate huge
> pages, this pool would not have to be fixed. Some applications will
> heavily benefit from this. While databases are the obvious one,
> applications with large heaps will also benefit like Java Virtual
> Machines. I can dig up papers that measured this on Solaris although I
> don't have them at hand right now.

Please.

> We know right now that the overhead of this allocator is fairly low
> (anyone got benchmarks to disagree) but I understand that page migration
> is relatively expensive. The allocator also does not have adverse
> CPU+cache affects like migration and the concept is fairly simple.

Agreed.

> > Its quite possible that not all unsatisfiable high-order allocations
> > want to force page migration (which is quite expensive in terms of
> > CPU/cache). Only migrate on __GFP_NOFAIL ?
> >
>
> I still believe with the allocator, we will only have to migrate in
> exceptional circumstances.

Agreed - best scenario is the guaranteed availability of high-order blocks, where
migration is not necessary.


2005-01-23 17:05:25

by Marcelo Tosatti

Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Sat, Jan 22, 2005 at 07:59:49PM -0200, Marcelo Tosatti wrote:
> On Sat, Jan 22, 2005 at 09:48:20PM +0000, Mel Gorman wrote:
> > On Fri, 21 Jan 2005, Marcelo Tosatti wrote:
> >
> > > On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:
> > > > <Changelog snipped>
> > >
> > > Hi Mel,
> > >
> > > I was thinking that it would be nice to have a set of high-order
> > > intensive workloads, and I wonder what are the most common high-order
> > > allocation paths which fail.
> > >
> >
> > Agreed. As I am not fully sure what workloads require high-order
> > allocations, I updated VMRegress to keep track of the count of
> > allocations and released 0.11
> > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > use it to track allocations, do the following
> >
> > 1. Download and unpack vmregress
> > 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
> > The patch currently requires the modified allocator but I can fix that up
> > if people want it. Build and deploy the kernel
> > 3. Build vmregress by
> > ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
> > (or whatever path is appropriate)
> > make
> > 4. Load the modules with;
> > insmod src/code/vmregress_core.ko
> > insmod src/sense/trace_alloccount.ko
> >
> > This will create a proc entry /proc/vmregress/trace_alloccount that looks
> > something like;
> >
> > Allocations (V1)
> > -----------
> > KernNoRclm 997453 370 50 0 0 0 0 0 0 0 0
> > KernRclm 35279 0 0 0 0 0 0 0 0 0 0
> > UserRclm 9870808 0 0 0 0 0 0 0 0 0 0
> > Total 10903540 370 50 0 0 0 0 0 0 0 0
> >
> > Frees
> > -----
> > KernNoRclm 590965 244 28 0 0 0 0 0 0 0 0
> > KernRclm 227100 60 5 0 0 0 0 0 0 0 0
> > UserRclm 7974200 73 17 0 0 0 0 0 0 0 0
> > Total 19695805 747 100 0 0 0 0 0 0 0 0
> >
> > To blank the counters, use
> >
> > echo 0 > /proc/vmregress/trace_alloccount
> >
> > Whatever workload we come up with, this proc entry will tell us if it is
> > exercising high-order allocations right now.
>
> Great, excellent! Thanks.
>
> I plan to spend some time testing and trying to understand the vmregress package
> this week.
>
> > > It mostly depends on hardware because most high-order allocations happen
> > > inside device drivers? What are the kernel codepaths which try to do
> > > high-order allocations and fallback if failed?
> > >
> >
> > I'm not sure. I think that the paths we exercise right now will be largely
> > artifical. For example, you can force order-2 allocations by scping a
> > large file through localhost (because of the large MTU in that interface).
> > I have not come up with another meaningful workload that guarentees
> > high-order allocations yet.
>
> Thoughts and criticism of the following ideas are very much appreciated:
>
> In private conversation with wli (who helped me providing this information) we can
> conjecture the following:
>
> Modern IO devices are capable of doing scatter/gather IO.
>
> There is overhead associated with setting up and managing the scatter/gather tables.
>
> The benefit of large physically contiguous blocks is the ability to avoid the SG
> management overhead.
>
> Now the question is: The added overhead of allocating high order blocks through migration
> offsets the overhead of SG IO ? Quantifying that is interesting.

What is the overhead of the SG IO management, and how big is the improvement without it?

Are block IO drivers trying to allocate big physical segments? I bet they are not, because the
"pool of huge pages" (as you say) is limited.

>
> This depends on the driver implementation (how efficiently its able to manage the SG IO tables) and
> device/IO subsystem characteristics.
>
> Also filesystems benefit from big physically contiguous blocks. Quoting wli
> "they want bigger blocks and contiguous memory to match bigger blocks..."
>
> I completly agree that your simplified allocator decreases fragmentation which in turn
> benefits the system overall.
>
> This is an area which can be further improved - ie efficiency in reducing fragmentation
> is excellent.
> I sincerely appreciate the work you are doing!
>
> > > To measure whether the cost of page migration offsets the ability to be
> > > able to deliver high-order allocations we want a set of meaningful
> > > performance tests?
> > >
> >
> > Bear in mind, there are more considerations. The allocator potentially
> > makes hotplug problems easier and could be easily tied into any
> > page-zeroing system. Some of your own benchmarks also implied that the
> > modified allocator helped some types of workloads which is beneficial in
> > itself.The last consideration is HugeTLB pages, which I am hoping William
> > will weigh in.
> >
> > Right now, I believe that the pool of huge pages is of a fixed size
> > because of fragmentation difficulties. If we knew we could allocate huge
> > pages, this pool would not have to be fixed. Some applications will
> > heavily benefit from this. While databases are the obvious one,
> > applications with large heaps will also benefit like Java Virtual
> > Machines. I can dig up papers that measured this on Solaris although I
> > don't have them at hand right now.
>
> Please.
>
> > We know right now that the overhead of this allocator is fairly low
> > (anyone got benchmarks to disagree) but I understand that page migration
> > is relatively expensive. The allocator also does not have adverse
> CPU+cache affects like migration and the concept is fairly simple.
>
> Agreed.
>
> > > Its quite possible that not all unsatisfiable high-order allocations
> > > want to force page migration (which is quite expensive in terms of
> > > CPU/cache). Only migrate on __GFP_NOFAIL ?
> > >
> >
> > I still believe with the allocator, we will only have to migrate in
> > exceptional circumstances.
>
> Agreed - best scenario is the guaranteed availability of high-order blocks, where
> migration is not necessary.

2005-01-24 13:28:58

by Mel Gorman

Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Sat, 22 Jan 2005, Marcelo Tosatti wrote:

> > > I was thinking that it would be nice to have a set of high-order
> > > intensive workloads, and I wonder what are the most common high-order
> > > allocation paths which fail.
> > >
> >
> > Agreed. As I am not fully sure what workloads require high-order
> > allocations, I updated VMRegress to keep track of the count of
> > allocations and released 0.11
> > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > use it to track allocations, do the following
> >
> > < VMRegress instructions snipped>
>
> Great, excellent! Thanks.
>
> I plan to spend some time testing and trying to understand the vmregress package
> this week.
>

The documentation is not in sync with the code as the package is fairly
large to maintain as a side-project. For the recent data I posted, the
interesting parts of the tools are:

1. bin/extfrag_stat.pl will display external fragmentation as a percentage
of each order. I can go more into the calculation of this if anyone is
interested. It does not require any special patches or modules

2. bin/intfrag_stat.pl will display internal fragmentation in the system.
Use the --man switch to get a list of all options. Linux occasionally
suffers badly from internal fragmentation but it's a problem for another
time

3. mapfrag_stat.pl is what I used to map where allocations are in the
address space. It requires the kernel patch in
kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
non-mbuddy version in there) before the vmregress kernel modules can be
loaded

4. extfrag_stat_overtime.pl tracks external fragmentation over time
although the figures are not very useful. It can also graph what
fragmentation for some orders are over time. The figures are not useful
because the fragmentation figures are based on free pages and do not
take into account the layout of the currently allocated pages.

5. The module in src/test/highalloc.ko is what I used to test high-order
allocations. It creates a proc entry /proc/vmregress/test_highalloc that
can be read or written. "echo Order Pages >
/proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages
"Pages" times.

The perl scripts are smart enough to load the modules they need at runtime
if the modules have been installed with "make install".

> > > It mostly depends on hardware because most high-order allocations happen
> > > inside device drivers? What are the kernel codepaths which try to do
> > > high-order allocations and fallback if failed?
> > >
> >
> > I'm not sure. I think that the paths we exercise right now will be largely
> > artifical. For example, you can force order-2 allocations by scping a
> > large file through localhost (because of the large MTU in that interface).
> > I have not come up with another meaningful workload that guarentees
> > high-order allocations yet.
>
> Thoughts and criticism of the following ideas are very much appreciated:
>
> In private conversation with wli (who helped me providing this information) we can
> conjecture the following:
>
> Modern IO devices are capable of doing scatter/gather IO.
>
> There is overhead associated with setting up and managing the
> scatter/gather tables.
>
> The benefit of large physically contiguous blocks is the ability to
> avoid the SG management overhead.
>

Do we get this benefit right now? I read through the path of
generic_file_readv(). If I am reading this correctly (first reading, so
may not be right), scatter/gather IO will always be using order-0 pages.
Is this really true?

From what I can see, the buffers being written to for readv() are all in
userspace so are going to be order-0 (unless hugetlb is in use, is that
the really interesting case?). For reading from the disk, the blocksize is
what will be important and we can't create a filesystem with blocksizes
greater than pagesize right now.

So, for scatter/gather to take advantage of contiguous blocks, is more
work required? If not, what am I missing?

> Also filesystems benefit from big physically contiguous blocks. Quoting
> wli "they want bigger blocks and contiguous memory to match bigger
> blocks..."
>

This I don't get... What filesystems support really large blocks? ext2/3
only support pagesize and reiser will create a filesystem with a blocksize
of 8192, but not mount it.

> I completly agree that your simplified allocator decreases fragmentation
> which in turn benefits the system overall.
>
> This is an area which can be further improved - ie efficiency in
> reducing fragmentation is excellent. I sincerely appreciate the work
> you are doing!
>

Thanks.

> > <Snip>
> >
> > Right now, I believe that the pool of huge pages is of a fixed size
> > because of fragmentation difficulties. If we knew we could allocate huge
> > pages, this pool would not have to be fixed. Some applications will
> > heavily benefit from this. While databases are the obvious one,
> > applications with large heaps will also benefit like Java Virtual
> > Machines. I can dig up papers that measured this on Solaris although I
> > don't have them at hand right now.
>
> Please.
>

I took a root through but did not find all the papers I was looking for.
What I did find was;

1. Big Heaps and Intimate Shared Memory (ISM)
http://java.sun.com/docs/hotspot/ism.html
This is one of many similar docs that recommend the use of large pages for
applications with large heaps like JVMs and databases. ISM is no longer
used much and has been replaced by Muliple Page Size Support (MPSS) . MPSS
is essentially the same thing but does not lock pages

2. Performance boost using ISM
http://www.gcocco.com/bench/jbb2k.html#ism_boost
The result is operations per second and ISM gives gains of between 8-10%.
The gains could be a result of either the locking of pages in memory or
the TLB gains; the article does not investigate which.

3. Optimising BEA Weblogic Server Performance on Solaris
http://www.bea.com/content/files/eworld/presentations/Tues_03_04_03/Extending_the_BEA_Platform/1239_WLS_Performance_on_Solaris.pdf
Ok, reaching a little here: see slides 19 and 20. They show what the TLB
miss rates were for 8KiB and 4MiB pages. There are 10 times as many TLB
misses with the smaller pages.

4. Multiple Page Size Support in the Linux Kernel (2002)
http://citeseer.ist.psu.edu/winwood02multiple.html
Page 11 and 12 show the experimental results they found when using large
pages although later parts of the paper talks about code and data locality
in JVMs which is also of interest.

5. Supporting Multiple Page Sizes in Solaris Operating System
http://www.sun.com/blueprints/0304/817-5918.pdf
Illustrates some of the tools Solaris has for measuring stuff like TLB
misses. Also talks about what system calls and libraries they have for
supporting large pages and using them with existing applications without
recompilation. If nothing else, this paper is interesting to compare how
Solaris provides large pages in comparison to Linux.

I could keep digging, but I think the bottom line is that having large
pages generally available rather than a fixed setting is desirable.

> > We know right now that the overhead of this allocator is fairly low
> > (anyone got benchmarks to disagree) but I understand that page migration
> > is relatively expensive. The allocator also does not have adverse
> > CPU+cache affects like migration and the concept is fairly simple.
>
> Agreed.
>
> > > Its quite possible that not all unsatisfiable high-order allocations
> > > want to force page migration (which is quite expensive in terms of
> > > CPU/cache). Only migrate on __GFP_NOFAIL ?
> > >
> >
> > I still believe with the allocator, we will only have to migrate in
> > exceptional circumstances.
>
> Agreed - best scenario is the guaranteed availability of high-order blocks, where
> migration is not necessary.
>

Right, although I also believe it will be a while before we really start
taking advantage of the high-order blocks as we've lived a long time with
the assumption that high-order allocations will fail.

--
Mel Gorman

2005-01-24 16:10:18

by Marcelo Tosatti

Subject: Re: [PATCH] Avoiding fragmentation through different allocator


James and Grant added to CC.

On Mon, Jan 24, 2005 at 01:28:47PM +0000, Mel Gorman wrote:
> On Sat, 22 Jan 2005, Marcelo Tosatti wrote:
>
> > > > I was thinking that it would be nice to have a set of high-order
> > > > intensive workloads, and I wonder what are the most common high-order
> > > > allocation paths which fail.
> > > >
> > >
> > > Agreed. As I am not fully sure what workloads require high-order
> > > allocations, I updated VMRegress to keep track of the count of
> > > allocations and released 0.11
> > > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > > use it to track allocations, do the following
> > >
> > > < VMRegress instructions snipped>
> >
> > Great, excellent! Thanks.
> >
> > I plan to spend some time testing and trying to understand the vmregress package
> > this week.
> >
>
> The documentation is not in sync with the code as the package is fairly
> large to maintain as a side-project. For the recent data I posted, The
> interesting parts of the tools are;
>
> 1. bin/extfrag_stat.pl will display external fragmentation as a percentage
> of each order. I can go more into the calculation of this if anyone is
> interested. It does not require any special patches or modules
>
> 2. bin/intfrag_stat.pl will display internal fragmentation in the system.
> Use the --man switch to get a list of all options. Linux occasionally
> suffers badly from internal fragmentation but it's a problem for another
> time
>
> 3. mapfrag_stat.pl is what I used to map where allocations are in the
> address space. It requires the kernel patch in
> kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
> non-mbuddy version in there) before the vmregress kernel modules can be
> loaded
>
> 4. extfrag_stat_overtime.pl tracks external fragmentation over time
> although the figures are not very useful. It can also graph what
> fragmentation for some orders are over time. The figures are not useful
> because the fragmentation figures are based on free pages and does not
> take into account the layout of the currently allocated pages.
>
> 5. The module in src/test/highalloc.ko is what I used to test high-order
> allocations. It creates a proc entry /proc/vmregress/test_highalloc that
> can be read or written. "echo Order Pages >
> /proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages
> "Pages" times.
>
> The perl scripts are smart enough to load the modules they need at runtime
> if the modules have been installed with "make install".

OK, thanks very much for the information - you might want to write this down
into a text file and add it to the tarball :)

> > > > It mostly depends on hardware because most high-order allocations happen
> > > > inside device drivers? What are the kernel codepaths which try to do
> > > > high-order allocations and fallback if failed?
> > > >
> > >
> > > I'm not sure. I think that the paths we exercise right now will be largely
> > > artificial. For example, you can force order-2 allocations by scp'ing a
> > > large file through localhost (because of the large MTU on that interface).
> > > I have not come up with another meaningful workload that guarantees
> > > high-order allocations yet.
> >
> > Thoughts and criticism of the following ideas are very much appreciated:
> >
> > In private conversation with wli (who helped me providing this information) we can
> > conjecture the following:
> >
> > Modern IO devices are capable of doing scatter/gather IO.
> >
> > There is overhead associated with setting up and managing the
> > scatter/gather tables.
> >
> > The benefit of large physically contiguous blocks is the ability to
> > avoid the SG management overhead.
> >
>
> Do we get this benefit right now?

Since the pages which compose IO operations are most likely sparse (not physically contiguous),
the driver+device has to perform scatter-gather IO on the pages.

The idea is that if we can have larger memory blocks, scatter-gather IO can use fewer SG list
elements (decreased CPU overhead, decreased device overhead, faster).

The best scenario is where only one SG element is required (i.e. one huge physically contiguous block).

Old devices and unprepared drivers which are not able to perform SG IO
suffer, being limited to sequential, small-sized operations.

I'm far from being a SCSI/ATA-knowledgeable person; the storage people can
help with expertise here.

Grant Grundler and James Bottomley have been working in this area; they might want to
add some comments to this discussion.

It seems HP (Grant et al.) has pursued using big pages on IA64 (64K) for this purpose.

> I read through the path of
> generic_file_readv(). If I am reading this correctly (first reading, so
> may not be right), scatter/gather IO will always be using order-0 pages.
> Is this really true?

Yes, it is.

I was referring to scatter/gather IO at the device driver level, not SG IO at
application level (readv/writev).

The thing is that virtually contiguous data buffers which are operated on with read/write,
aio_read/aio_write, etc. in fact become scatter-gather operations at the device
level if they are not physically contiguous.

> From what I can see, the buffers being written to for readv() are all in
> userspace so are going to be order-0 (unless hugetlb is in use, is that
> the really interesting case?).

Yes, this is a similar interesting case, but this is valid for all operations,
not only readv/writev.

> For reading from the disk, the blocksize is
> what will be important and we can't create a filesystem with blocksizes
> greater than pagesize right now.

But we have extents.

> So, for scatter/gather to take advantage of contiguous blocks, is more
> work required? If not, what am I missing?

I just asked James Bottomley and:

If two 4k pages are physically contiguous, we coalesce them (if the clustering flag is set).
We spent a while in 2.6 making this work correctly.
So if you pass a 128k contiguous region down to SG_IO (or any other SCSI command) and the adapter
supports clustering, it will go out as a single sg element.
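
In other words, the merging logic conceptually boils down to something like the
fragment below - a simplified standalone illustration of the idea, not the
actual block layer or SCSI code:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Count how many SG elements are needed to cover 'n' pages given their
 * physical addresses, merging pages that are physically adjacent
 * (which is what the clustering flag permits). */
static int count_sg_elements(const unsigned long *phys, int n)
{
        int i, segs;

        if (n <= 0)
                return 0;
        for (i = 1, segs = 1; i < n; i++)
                if (phys[i] != phys[i - 1] + PAGE_SIZE)
                        segs++;         /* not adjacent: start a new element */
        return segs;
}

int main(void)
{
        unsigned long contig[32], sparse[32];
        int i;

        for (i = 0; i < 32; i++) {
                contig[i] = 0x100000UL + i * PAGE_SIZE;         /* one 128k region */
                sparse[i] = 0x100000UL + i * 2 * PAGE_SIZE;     /* every other page */
        }
        printf("contiguous 128k buffer: %d SG element(s)\n",
               count_sg_elements(contig, 32));
        printf("scattered 128k buffer:  %d SG element(s)\n",
               count_sg_elements(sparse, 32));
        return 0;
}

So the same 128k of data costs either 1 or 32 SG elements depending purely on
physical contiguity, which is where an anti-fragmentation allocator can help.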

> > Also filesystems benefit from big physically contiguous blocks. Quoting
> > wli "they want bigger blocks and contiguous memory to match bigger
> > blocks..."
> >
>
> This I don't get... What filesystems support really large blocks? ext2/3
> only support pagesize and reiser will create a filesystem with a blocksize
> of 8192, but not mount it.

Yes, but ext2/3, JFS, XFS and probably others support large extents composed of
on-disk contiguous blocks:

From http://e2fsprogs.sourceforge.net/extensions-ext23/

"The ext2 filesystem uses direct, indirect, double indirect, and triple indirection
blocks to map file offsets to on-disk blocks, like most classical Unix filesystems.
Unfortunately, the direct/indirect block scheme is inefficient for large files.
This can be easily demonstrated by deleting a very large file, and noting how
long that operation can take. Fortunately, ext2 block allocation algorithms tend
to be very successful at avoiding fragmentation and in allocating contiguous data
blocks for files. For most Linux filesystems in production use today, the percentage
of non-contiguous files reported by e2fsck is generally less than 10%. This means
that in general, over 90% of the files on an ext2 filesystem only require a single
extent map to describe all of their data blocks."
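
Conceptually an extent is just a (logical block, physical block, length) triple,
so a large file laid out contiguously on disk needs one mapping entry instead of
thousands of indirect block pointers. Purely as an illustration (this is not the
on-disk format of any particular filesystem):

/* Illustrative only - not the ext2/3, JFS or XFS on-disk structures */
struct extent {
        unsigned long logical_start;    /* first file block covered */
        unsigned long physical_start;   /* first on-disk block it maps to */
        unsigned long length;           /* number of contiguous blocks */
};

/* A contiguous 1GB file with 4k blocks is a single entry:
 *   { .logical_start = 0, .physical_start = X, .length = 262144 }
 * while the direct/indirect scheme needs one block pointer per block. */

The point being that big physically contiguous memory makes it easier to build
IO that matches such large on-disk extents.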

> > I completely agree that your simplified allocator decreases fragmentation
> > which in turn benefits the system overall.
> >
> > This is an area which can be further improved - ie efficiency in
> > reducing fragmentation is excellent. I sincerely appreciate the work
> > you are doing!
> >
>
> Thanks.
>
> > > <Snip>
> > >
> > > Right now, I believe that the pool of huge pages is of a fixed size
> > > because of fragmentation difficulties. If we knew we could allocate huge
> > > pages, this pool would not have to be fixed. Some applications will
> > > heavily benefit from this. While databases are the obvious one,
> > > applications with large heaps will also benefit like Java Virtual
> > > Machines. I can dig up papers that measured this on Solaris although I
> > > don't have them at hand right now.
> >
> > Please.
> >
>
> I had a root through but did not find all the papers I was looking for.
> What I did find was:
>
> 1. Big Heaps and Intimate Shared Memory (ISM)
> http://java.sun.com/docs/hotspot/ism.html
> This is one of many similar docs that recommend the use of large pages for
> applications with large heaps like JVMs and databases. ISM is no longer
> used much and has been replaced by Multiple Page Size Support (MPSS). MPSS
> is essentially the same thing but does not lock pages
>
> 2. Performance boost using ISM
> http://www.gcocco.com/bench/jbb2k.html#ism_boost
> The result is in operations per second and ISM gives gains of 8-10%.
> The gains could be a result of either the locking of pages in memory or
> the TLB gains; the article does not investigate.
>
> 3. Optimising BEA Weblogic Server Performance on Solaris
> http://www.bea.com/content/files/eworld/presentations/Tues_03_04_03/Extending_the_BEA_Platform/1239_WLS_Performance_on_Solaris.pdf
> Ok, reaching a little here: see slides 19 and 20. They show what the TLB
> miss rates were for 8KiB and 4MiB pages. There are 10 times as many TLB
> misses with the smaller pages.
>
> 4. Multiple Page Size Support in the Linux Kernel (2002)
> http://citeseer.ist.psu.edu/winwood02multiple.html
> Pages 11 and 12 show the experimental results they found when using large
> pages, although later parts of the paper talk about code and data locality
> in JVMs, which is also of interest.
>
> 5. Supporting Multiple Page Sizes in Solaris Operating System
> http://www.sun.com/blueprints/0304/817-5918.pdf
> Illustrates some of the tools Solaris has for measuring stuff like TLB
> misses. It also talks about what system calls and libraries they have for
> supporting large pages and using them with existing applications without
> recompilation. If nothing else, this paper is interesting for comparing how
> Solaris provides large pages with how Linux does.
>
> I could keep digging, but I think the bottom line is that having large
> pages generally available rather than a fixed setting is desirable.

Definitely, yes. Thanks for the pointers.

> > > We know right now that the overhead of this allocator is fairly low
> > > (anyone got benchmarks to disagree) but I understand that page migration
> > > is relatively expensive. The allocator also does not have adverse
> > > CPU+cache effects like migration and the concept is fairly simple.
> >
> > Agreed.
> >
> > > > Its quite possible that not all unsatisfiable high-order allocations
> > > > want to force page migration (which is quite expensive in terms of
> > > > CPU/cache). Only migrate on __GFP_NOFAIL ?
> > > >
> > >
> > > I still believe with the allocator, we will only have to migrate in
> > > exceptional circumstances.
> >
> > Agreed - best scenario is the guaranteed availability of high-order blocks, where
> > migration is not necessary.
> >
>
> Right, although I also believe it will be a while before we really start
> taking advantage of the high-order blocks as we've lived a long time with
> the assumption that high-order allocations will fail.

It seems that the SCSI layer is already prepared to benefit from the high-order
blocks. Others will probably follow.

2005-01-24 16:45:33

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
> Since the pages which compose IO operations are most likely sparse (not physically contiguous),
> the driver+device has to perform scatter-gather IO on the pages.
>
> The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list
> elements (decreased CPU overhead, decreased device overhead, faster).
>
> Best scenario is where only one sg element is required (ie one huge physically contiguous block).
>
> Old devices/unprepared drivers which are not able to perform SG/IO
> suffer with sequential small sized operations.
>
> I'm far away from being a SCSI/ATA knowledgeable person, the storage people can
> help with expertise here.
>
> Grant Grundler and James Bottomley have been working on this area, they might want to
> add some comments to this discussion.
>
> It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose.

Well, the basic advice would be not to worry too much about
fragmentation from the point of view of I/O devices. They mostly all do
scatter gather (SG) onboard as an intelligent processing operation and
they're very good at it.

No one has ever really measured an effect of which we can say "this is due to the
card's SG engine". So, the rule we tend to follow is that if SG element
reduction comes for free, we take it. The issue that actually causes
problems isn't the reduction in processing overhead, it's that the
device's SG list is usually finite in size and so it's worth conserving
if we can; however it's mostly not worth conserving at the expense of
processor cycles.

The bottom line is that the I/O (block) subsystem is very efficient at
coalescing (both in block space and in physical memory space) and we've
got it to the point where it's about as efficient as it can be. If
you're going to give us better physical contiguity properties, we'll
take them, but if you spend extra cycles doing it, the chances are
you'll slow down the I/O throughput path.

James


2005-01-24 19:32:27

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Mon, Jan 24, 2005 at 10:44:12AM -0600, James Bottomley wrote:
> On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
> > Since the pages which compose IO operations are most likely sparse (not physically contiguous),
> > the driver+device has to perform scatter-gather IO on the pages.
> >
> > The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list
> > elements (decreased CPU overhead, decreased device overhead, faster).
> >
> > Best scenario is where only one sg element is required (ie one huge physically contiguous block).
> >
> > Old devices/unprepared drivers which are not able to perform SG/IO
> > suffer with sequential small sized operations.
> >
> > I'm far away from being a SCSI/ATA knowledgeable person, the storage people can
> > help with expertise here.
> >
> > Grant Grundler and James Bottomley have been working on this area, they might want to
> > add some comments to this discussion.
> >
> > It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose.
>
> Well, the basic advice would be not to worry too much about
> fragmentation from the point of view of I/O devices. They mostly all do
> scatter gather (SG) onboard as an intelligent processing operation and
> they're very good at it.

So is it valid to affirm that on average an operation with one SG element pointing to a 1MB
region is similar in speed to an operation with 16 SG elements each pointing to a 64K
region due to the efficient onboard SG processing?

> No one has ever really measured an effect we can say "This is due to the
> card's SG engine". So, the rule we tend to follow is that if SG element
> reduction comes for free, we take it. The issue that actually causes
> problems isn't the reduction in processing overhead, it's that the
> device's SG list is usually finite in size and so it's worth conserving
> if we can; however it's mostly not worth conserving at the expense of
> processor cycles.
>
> The bottom line is that the I/O (block) subsystem is very efficient at
> coalescing (both in block space and in physical memory space) and we've
> got it to the point where it's about as efficient as it can be. If
> you're going to give us better physical contiguity properties, we'll
> take them, but if you spend extra cycles doing it, the chances are
> you'll slow down the I/O throughput path.

OK! thanks.

2005-01-24 19:55:18

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Mon, Jan 24, 2005 at 10:29:52AM -0200, Marcelo Tosatti wrote:
> Grant Grundler and James Bottomley have been working on this area,
> they might want to add some comments to this discussion.
>
> It seems HP (Grant et all) has pursued using big pages on IA64 (64K)
> for this purpose.

Marcelo,
That might have been Alex Williamson... but the reason for 64K pages
is to reduce TLB thrashing, not faster IO.

On HP ZX1 boxes, SG performance is slightly better (max +5%) when going
through the IOMMU than when bypassing it. The IOMMU can perfectly
coalesce DMA pages but has a small CPU and DMA cost to do so as well.

Otherwise, I totally agree with James. IO devices do scatter-gather
pretty well and IO subsystems are tuned for page-size chunks or
smaller anyway.

...
> > I could keep digging, but I think the bottom line is that having large
> > pages generally available rather than a fixed setting is desirable.
>
> Definately, yes. Thanks for the pointers.

Big pages are good for the CPU TLB and that's where most of the
research has been done. I think IO devices have learned to cope
with the fact that a lot less has been (or can be, for many
workloads) done to coalesce IO pages.

grant

2005-01-24 20:43:10

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Mon, 2005-01-24 at 13:49 -0200, Marcelo Tosatti wrote:
> So is it valid to affirm that on average an operation with one SG element pointing to a 1MB
> region is similar in speed to an operation with 16 SG elements each pointing to a 64K
> region due to the efficient onboard SG processing?

It's within a few percent, yes. And the figures depend on how good the
I/O card is at it. I can imagine there are some wildly varying I/O
cards out there.

However, also remember that 1MB of I/O is getting beyond what's sensible
for a disc device anyway. The cable speed is much faster than the
platter speed, so the device takes the I/O into its cache as it services
it. If you overrun the cache it will burp (disconnect) and force a
reconnection to get the rest (effectively splitting the I/O up anyway).
This doesn't apply to arrays with huge caches, but it does to pretty
much everything else. The average disc cache size is only a megabyte or
so.

James


2005-01-24 20:56:30

by Steve Lord

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

James Bottomley wrote:

>
> Well, the basic advice would be not to worry too much about
> fragmentation from the point of view of I/O devices. They mostly all do
> scatter gather (SG) onboard as an intelligent processing operation and
> they're very good at it.
>
> No one has ever really measured an effect we can say "This is due to the
> card's SG engine". So, the rule we tend to follow is that if SG element
> reduction comes for free, we take it. The issue that actually causes
> problems isn't the reduction in processing overhead, it's that the
> device's SG list is usually finite in size and so it's worth conserving
> if we can; however it's mostly not worth conserving at the expense of
> processor cycles.
>

It depends on the device at the other end of the SCSI/fibre channel.
We have seen the processor in RAID devices get maxed out by Linux
when it is not maxed out by Windows. Windows tends to be more device
friendly (I hate to say it) by sending fewer, larger scatter-gather
elements than Linux does.

Running an LSI RAID over fibre channel with 4 ports, Windows was
able to sustain ~830 Mbytes/sec, basically channel speed, using
only 1500 commands a second. Linux peaked at 550 Mbytes/sec using
over 4000 SCSI commands to do it - the sustained rate was more
like 350 Mbytes/sec; I think at the end of the day Linux was
sending 128K per SCSI request. These numbers predate the current
Linux SCSI and IO code, and I do not have the hardware to rerun
them right now.

I realize this is one data point on one end of the scale, but I
just wanted to make the point that there are cases where it
does matter. Hopefully William's little change from last
year has helped out a lot.

Steve

2005-01-25 07:39:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

Steve Lord <[email protected]> writes:
>
> I realize this is one data point on one end of the scale, but I
> just wanted to make the point that there are cases where it
> does matter. Hopefully William's little change from last
> year has helped out a lot.

There are more datapoints:

e.g. performance on megaraid controllers (very popular because a big
PC vendor ships them) was always quite bad on Linux. Up to the point
that specific IO workloads run half as fast on a megaraid compared to
other controllers. I heard they do work better on Windows.

Also I did some experiments with coalescing SG lists in the Opteron IOMMU
some time ago. With an MPT Fusion controller, forcing all SG lists
through the IOMMU so that the SCSI controller only ever saw contiguous mappings,
I saw a ~5% improvement on some IO tests.

Unfortunately there are some problems that don't allow this to be enabled
unconditionally. But it gives strong evidence that the MPT Fusion prefers
shorter SG lists too.

So it seems to be worthwhile to optimize for shorter SG lists.

Ideally the Linux IO patterns would look similar to the Windows IO patterns,
then we could reuse all the optimizations the controller vendors
did for Windows :)

-Andi

2005-01-25 14:04:02

by Mukker, Atul

[permalink] [raw]
Subject: RE: [PATCH] Avoiding fragmentation through different allocator


> e.g. performance on megaraid controllers (very popular
> because a big PC vendor ships them) was always quite bad on
> Linux. Up to the point that specific IO workloads run half as
> fast on a megaraid compared to other controllers. I heard
> they do work better on Windows.
>
<snip>
> Ideally the Linux IO patterns would look similar to the
> Windows IO patterns, then we could reuse all the
> optimizations the controller vendors did for Windows :)

LSI would leave no stone unturned to make the performance better for
megaraid controllers under Linux. If you have some hard data comparing
performance against adapters from other vendors, please share it with
us. We would definitely strive to better it.

The megaraid driver is open source; do you see anything the driver can do
to improve performance? We would greatly appreciate any feedback in this
regard and would definitely incorporate it in the driver. The FW under Linux and
Windows is the same, so I do not see how the megaraid stack should perform
differently under Linux and Windows.

Thanks

Atul Mukker
Architect, Drivers and BIOS
LSI Logic Corporation

2005-01-25 14:18:47

by Steve Lord

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

Mukker, Atul wrote:

>
> LSI would leave no stone unturned to make the performance better for
> megaraid controllers under Linux. If you have some hard data in relation to
> comparison of performance for adapters from other vendors, please share with
> us. We would definitely strive to better it.
>
> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

It is not the driver per se, but the way the memory which is the I/O
source/target is presented to the driver. In Linux there is a good
chance it will have to use more scatter-gather elements to represent
the same amount of data.

Steve

2005-01-25 14:28:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

> It is not the driver per se, but the way the memory which is the I/O
> source/target is presented to the driver. In linux there is a good
> chance it will have to use more scatter gather elements to represent
> the same amount of data.

Note that a change made a few months ago, after seeing issues with
aacraid, means it's much more likely to see contiguous memory;
there were some numbers on linux-scsi and/or linux-kernel.

2005-01-25 14:49:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Tue, Jan 25, 2005 at 02:27:57PM +0000, Christoph Hellwig wrote:
> > It is not the driver per se, but the way the memory which is the I/O
> > source/target is presented to the driver. In linux there is a good
> > chance it will have to use more scatter gather elements to represent
> > the same amount of data.
>
> Note that a change made a few month ago after seeing issues with
> aacraid means it's much more likely to see contingous memory,
> there were some numbers on linux-scsi and/or linux-kernel.

But only at the beginning. IIRC, after a few days of uptime
and memory fragmentation it degenerates back to the old numbers.

Perhaps the recent anti-fragmentation work will help more.

-Andi

P.S.: On an AMD x86-64 box the theory can be tested relatively easily:
just boot with iommu=force,biomerge, which will use the IOMMU to merge
SG elements. I just don't recommend it for production because some errors
are not well handled.

2005-01-25 14:56:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
>
> > e.g. performance on megaraid controllers (very popular
> > because a big PC vendor ships them) was always quite bad on
> > Linux. Up to the point that specific IO workloads run half as
> > fast on a megaraid compared to other controllers. I heard
> > they do work better on Windows.
> >
> <snip>
> > Ideally the Linux IO patterns would look similar to the
> > Windows IO patterns, then we could reuse all the
> > optimizations the controller vendors did for Windows :)
>
> LSI would leave no stone unturned to make the performance better for
> megaraid controllers under Linux. If you have some hard data in relation to
> comparison of performance for adapters from other vendors, please share with
> us. We would definitely strive to better it.

Sorry for being vague on this. I don't have much hard data,
just an anecdote. The issue we saw was over a year ago
on a machine running an IO-intensive multi-process stress test
(I believe it was an AIM7 variant with some tweaked workfile). When the test
was moved to a machine with a megaraid controller it ran significantly
slower compared to the old setup with a non-RAID SCSI controller from
a different vendor. Unfortunately I no longer know the exact
type/firmware revision etc. of the megaraid that showed the problem.

If you have already fixed the issues then please accept my apologies.

> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

My understanding (which may be incomplete) of the issue is basically what
Steve said: something in the stack doesn't like the Linux IO patterns,
with their often relatively long SG lists, which are longer than in some
other popular OS. This is unlikely to be the Linux driver
(drivers tend to just pass the SG lists through without too much processing);
more likely it was the firmware or something below it.

-Andi

2005-01-25 16:15:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Tue, 25 Jan 2005, Andi Kleen wrote:

> On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
> >
> > > e.g. performance on megaraid controllers (very popular
> > > because a big PC vendor ships them) was always quite bad on
> > > Linux. Up to the point that specific IO workloads run half as
> > > fast on a megaraid compared to other controllers. I heard
> > > they do work better on Windows.
> > >
> > <snip>
> > > Ideally the Linux IO patterns would look similar to the
> > > Windows IO patterns, then we could reuse all the
> > > optimizations the controller vendors did for Windows :)
> >
> > LSI would leave no stone unturned to make the performance better for
> > megaraid controllers under Linux. If you have some hard data in relation to
> > comparison of performance for adapters from other vendors, please share with
> > us. We would definitely strive to better it.
>
> Sorry for being vague on this. I don't have much hard data on this,
> just telling an annecdote. The issue we saw was over a year ago
> and on a machine running an IO intensive multi process stress test
> (I believe it was an AIM7 variant with some tweaked workfile). When the test
> was moved to a machine with megaraid controller it ran significantly
> lower, compared to the old setup with a non RAID SCSI controller from
> a different vendor. I unfortunately don't know anymore the exact
> type/firmware revision etc. of the megaraid that showed the problem.
>

Ok, for me here, the bottom line is that decent hardware will not benefit
from help from the allocator. Worse, if the work required to provide
adjacent pages is high, it will even adversely affect throughput. I know as
well that to have physically contiguous pages in userspace would involve a
fair amount of overhead so even if we devise a system for providing them,
it would need to be a configurable option.

I will keep an eye out for a means of granting physically contiguous pages
to userspace in a lightweight manner, but I'm going to focus on general
availability of large pages for TLBs, on extending the system for a pool of
zero'd pages, and on how it can be adapted to help out the hotplug folks.

The system I have in mind for contiguous pages for userspace right now is
to extend the allocator API so that prefaulting and readahead will request
blocks of pages for userspace rather than a series of order-0 pages. So,
if we prefault 32 pages ahead, the allocator would have a new API that
would return 32 pages that are physically contiguous. That, in combination
with a forced IOMMU, may show whether Contiguous Pages For IO is worth it or not.
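
To make that concrete, I am thinking of something along the following lines
(hypothetical interface; the names are invented for illustration and none of
this exists in the tree):

#include <linux/mm.h>
#include <linux/gfp.h>

/*
 * Hypothetical helper: hand back "nr" physically contiguous order-0
 * pages for readahead/prefaulting, falling back to independent
 * order-0 allocations if a contiguous block is unavailable.
 * Assumes nr is a power of two for simplicity.
 */
static int alloc_page_block(struct page **pages, int nr, unsigned int gfp_mask)
{
        unsigned int order = get_order(nr * PAGE_SIZE);
        struct page *block = alloc_pages(gfp_mask, order);
        int i;

        if (block) {
                /* Would need the equivalent of splitting the high-order
                 * block so each order-0 page can be refcounted and freed
                 * independently - glossed over here. */
                for (i = 0; i < nr; i++)
                        pages[i] = block + i;
                return nr;
        }

        /* No contiguous block available - fall back to order-0 pages */
        for (i = 0; i < nr; i++) {
                pages[i] = alloc_page(gfp_mask);
                if (!pages[i])
                        break;
        }
        return i;
}

The readahead/prefault path would then map pages[0..nr-1] as usual, but when the
first branch succeeds the data happens to be physically contiguous and could go
out as a single SG element.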

This will take a while as I'll have to develop some mechanism for
measuring it while I'm at it, and I only work on this 2 days a week.

--
Mel Gorman

2005-01-25 18:50:16

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] Avoiding fragmentation through different allocator

On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

Just to second what Andi already stated: it's more likely the
Megaraid firmware could be better at fetching the SG lists.
This is a difficult problem since the firmware needs to work
well on so many different platforms/chipsets.

If LSI has time to turn more stones, get a PCI bus analyzer and filter
it to only capture CPU MMIO traffic and DMA traffic to/from some
"well known" SG lists (i.e. instrument the driver to print those to
the console). Then run AIM7 or a similar multithreaded workload.
A perfect PCI trace will show the device pulling in the SG list a
cacheline at a time after the CPU MMIO reads/writes to/from the card
that indicate a new transaction is ready to go.

Another stone LSI could turn is to verify the megaraid controller is
NOT contending with the CPU for cachelines used to build SG lists.
This is something the driver controls, but I only know how to measure
it on ia64 machines (with pfmon, caliper, or a similar tool).
If you want examples, see
http://iou.parisc-linux.org/ols2004/pfmon_for_iodorks.pdf

In case it's not clear from above, optimal IO flow means the device
is moving control data and streaming data in cacheline or bigger units.
If Megaraid is already doing that, then the PCI trace timing info
should point at where the latencies are.

hth,
grant