From: mel@csn.ul.ie (Mel Gorman)
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@osdl.org
Subject: Avoiding external fragmentation with a placement policy Version 11
Date: Sun, 22 May 2005 21:05:07 +0100 (IST)
Message-Id: <20050522200507.6ED7AECFC@skynet.csn.ul.ie>

Changelog since V10
o Important - all allocation types now use per-cpu caches like the standard
  allocator. Older versions may have trouble with large numbers of processors
o Removed all the additional buddy allocator statistic code
o Eliminated three zone fields that can be lived without
o Removed many unnecessary calculations

Changelog since V9
o Tightened which pools are used for fallbacks, so they are less likely to
  fragment
o Many micro-optimisations to match the performance of the standard
  allocator. The modified allocator is now faster than the standard allocator
  when built with gcc 3.3.5

Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks them up into the
  requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more hotplug-friendly
o Increased the size of the reserve for fallbacks from 10% to 12.5%

Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
  (thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
  kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved

Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch

Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage

Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
  with offsets to 2.6.11-rc1-mm1

Changelog since V3
o Inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to
  page_zone()
o When taking from the global pool, do not scan all the low-order lists

Changelog since V2
o Do not interfere with the "min" decay
o Update __GFP_BITS_SHIFT properly. The old value broke fsync and probably
  anything to do with asynchronous IO

Changelog since V1
o Updated patch to 2.6.11-rc1
o Cleaned up a bug where memory was wasted on a large bitmap
o Removed code that needed the binary buddy bitmaps
o Updated flags to avoid colliding with __GFP_ZERO changes
o Extended the fallback_count beancounters to show the fallback count for
  each allocation type
o In-code documentation

Version 1
o Initial release against 2.6.9

This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation in the
standard binary buddy allocator means that high-order allocations can rarely
be serviced.
This patch works by dividing allocations into three different types:

UserReclaimable - These are userspace pages that are easily reclaimable. Right
  now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are in this
  category. These pages are trivially reclaimed by writing the page out to
  swap or syncing with backing storage.

KernelReclaimable - These are pages allocated by the kernel that are easily
  reclaimed, such as inode caches, dcache and buffer_heads. These pages could
  potentially be reclaimed by dumping the caches and reaping the slabs.

KernelNonReclaimable - These are pages allocated by the kernel that are not
  trivially reclaimed. For example, the memory allocated for a loaded module
  would be in this category. By default, allocations are considered to be of
  this type.

Instead of having one global MAX_ORDER-sized array of free lists, there are
four: one for each of the three types of allocation and one that acts as a
12.5% reserve for fallbacks. Finally, there is a list of pages of size
2^MAX_ORDER, which is a global pool of the largest pages the kernel deals
with.

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free lists for that type, in effect reserving it. Hence, over
time, pages of the different types tend to be clustered together. This means
that if we wanted 2^MAX_ORDER pages, we could linearly scan a block of pages
allocated for UserReclaimable and page each of them out.

Fallback is used when there are no 2^MAX_ORDER pages available and there are
no free pages of the desired type. The fallback lists were chosen in a way
that keeps the most easily reclaimable pages together.

Three benchmark results are included, all based on a 2.6.12-rc3 kernel
compiled with gcc 3.3.5 (it is known that gcc 2.95.4 produces significantly
different results).
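Before the results, to make the mapping from GFP flags to pools concrete, the
following is a small standalone userspace sketch (illustration only, not part
of the patch) of how the two RCLM flags select one of the four free-list
arrays and of the order in which the other pools are searched on fallback.
The constant values and the fallback table mirror the definitions in the
patch below.

/*
 * Standalone illustration of the allocation-type indexing used by the
 * placement policy. Constants are copied from the patch; the program
 * itself is not kernel code.
 */
#include <stdio.h>

#define __GFP_KERNRCLM   0x20000u
#define __GFP_USERRCLM   0x40000u
#define __GFP_TYPE_SHIFT 17

#define ALLOC_TYPES      4
#define ALLOC_KERNNORCLM 0
#define ALLOC_KERNRCLM   1
#define ALLOC_USERRCLM   2
#define ALLOC_FALLBACK   3

/* Same fallback search order as fallback_allocs[] in the patch */
static int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES+1] = {
	{ALLOC_KERNNORCLM, ALLOC_FALLBACK,   ALLOC_KERNRCLM,   ALLOC_USERRCLM, -1},
	{ALLOC_KERNRCLM,   ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_USERRCLM, -1},
	{ALLOC_USERRCLM,   ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM, -1},
	{ALLOC_FALLBACK,   ALLOC_KERNNORCLM, ALLOC_KERNRCLM,   ALLOC_USERRCLM, -1}
};

static const char *names[ALLOC_TYPES] = {
	"KernNoRclm", "KernRclm", "UserRclm", "Fallback"
};

int main(void)
{
	/* A GFP_USER-style allocation carries __GFP_USERRCLM */
	unsigned int gfp_mask = __GFP_USERRCLM;
	int alloctype = (gfp_mask & (__GFP_USERRCLM | __GFP_KERNRCLM))
				>> __GFP_TYPE_SHIFT;
	int i;

	printf("allocation type index: %d (%s)\n", alloctype, names[alloctype]);
	printf("fallback search order:");
	for (i = 0; fallback_allocs[alloctype][i] != -1; i++)
		printf(" %s", names[fallback_allocs[alloctype][i]]);
	printf("\n");
	return 0;
}

The kernel side does the same masking in __alloc_pages() and the same shift in
__rmqueue(); the sketch only makes the resulting index and the search order
easy to see.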
The first is the output of portions of AIM9 for the vanilla allocator and the
modified one. (Tests run with bench-aim9.sh from VMRegress 0.14)

2.6.12-rc4-standard
------------------------------------------------------------------------------------------------------------
 Test   Test          Elapsed   Iteration   Iteration          Operation
Number  Name          Time (sec)    Count   Rate (loops/sec)   Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1  add_double      60.03      1536     25.58721     460569.72 Thousand Double Precision Additions/second
     2  add_float       60.01      2292     38.19363     458323.61 Thousand Single Precision Additions/second
     3  add_long        60.01      1415     23.57940    1414764.21 Thousand Long Integer Additions/second
     4  add_int         60.03      1417     23.60486    1416291.85 Thousand Integer Additions/second
     5  add_short       60.01      3555     59.24013    1421763.04 Thousand Short Integer Additions/second
     6  creat-clo       60.04      1116     18.58761      18587.61 File Creations and Closes/second
     7  page_test       60.01      4284     71.38810     121359.77 System Allocations & Pages/second
     8  brk_test        60.00      1608     26.80000     455600.00 System Memory Allocations/second
     9  jmp_test        60.00    250964   4182.73333    4182733.33 Non-local gotos/second
    10  signal_test     60.01      5466     91.08482      91084.82 Signal Traps/second
    11  exec_test       60.05       783     13.03913         65.20 Program Loads/second
    12  fork_test       60.01       925     15.41410       1541.41 Task Creations/second
    13  link_test       60.01      6074    101.21646       6376.64 Link/Unlink Pairs/second

2.6.12-rc3-mbuddy-v10
------------------------------------------------------------------------------------------------------------
 Test   Test          Elapsed   Iteration   Iteration          Operation
Number  Name          Time (sec)    Count   Rate (loops/sec)   Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1  add_double      60.03      1536     25.58721     460569.72 Thousand Double Precision Additions/second
     2  add_float       60.01      2292     38.19363     458323.61 Thousand Single Precision Additions/second
     3  add_long        60.01      1415     23.57940    1414764.21 Thousand Long Integer Additions/second
     4  add_int         60.00      1415     23.58333    1415000.00 Thousand Integer Additions/second
     5  add_short       60.02      3538     58.94702    1414728.42 Thousand Short Integer Additions/second
     6  creat-clo       60.01      1112     18.53024      18530.24 File Creations and Closes/second
     7  page_test       60.01      4354     72.55457     123342.78 System Allocations & Pages/second
     8  brk_test        60.03      1629     27.13643     461319.34 System Memory Allocations/second
     9  jmp_test        60.00    250858   4180.96667    4180966.67 Non-local gotos/second
    10  signal_test     60.01      5468     91.11815      91118.15 Signal Traps/second
    11  exec_test       60.07       786     13.08473         65.42 Program Loads/second
    12  fork_test       60.03       926     15.42562       1542.56 Task Creations/second
    13  link_test       60.01      6112    101.84969       6416.53 Link/Unlink Pairs/second

Difference in performance operations report generated by diff-aim9.sh
 Test   Test           standard     mbuddy-v10   Difference   % Change
------------------------------------------------------------------------------------------------------------
     1  add_double      460569.72     460569.72         0.00       0.00% Thousand Double Precision Additions/second
     2  add_float       458323.61     458323.61         0.00       0.00% Thousand Single Precision Additions/second
     3  add_long       1414764.21    1414764.21         0.00       0.00% Thousand Long Integer Additions/second
     4  add_int        1416291.85    1415000.00     -1291.85      -0.09% Thousand Integer Additions/second
     5  add_short      1421763.04    1414728.42     -7034.62      -0.49% Thousand Short Integer Additions/second
     7  page_test       121359.77     123342.78      1983.01       1.63% System Allocations & Pages/second
     8  brk_test        455600.00     461319.34      5719.34       1.26% System Memory Allocations/second
     9  jmp_test       4182733.33    4180966.67     -1766.66      -0.04% Non-local gotos/second
    10  signal_test      91084.82      91118.15        33.33       0.04% Signal Traps/second
    11  exec_test           65.20         65.42         0.22       0.34% Program Loads/second
    12  fork_test         1541.41       1542.56         1.15       0.07% Task Creations/second
    13  link_test         6376.64       6416.53        39.89       0.63% Link/Unlink Pairs/second

The aim9 results show that there are improvements for the common page-related
operations, so lower fragmentation can be provided without a performance loss.
The results are compiler dependent and there are variances of 1-2% between
compiler versions.

The second benchmark tested CPU cache usage to make sure it was not getting
clobbered. The test was to render a large postscript file 10 times and take
the average. The results are:

gsbench-2.6.12-rc4-standard      Average: 42.84  real, 42.754 user, 0.037 sys
gsbench-2.6.12-rc4-mbuddy-v11    Average: 42.907 real, 42.825 user, 0.037 sys

So there are no adverse cache effects.

The last test shows that the allocator can satisfy more high-order
allocations, especially under load, than the standard allocator. The test
performs the following:

1. Start updatedb running in the background
2. Load a kernel module that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j4
5. Start compiling the primary tree
6. Sleep 3 minutes while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of
   pages (an illustrative sketch of such a probe is included just before the
   patch)
   - Note that it only attempts 160 times, no matter how often it succeeds
   - An allocation is attempted every 1/10th of a second
   - Performance gets badly shot as this forces considerable amounts of
     pageout

The results of the allocations under load (load average around 25) were:

2.6.12-rc4 Standard
Order:                 10
Attempted allocations: 160
Success allocs:        8
Failed allocs:         152
% Success:             5

2.6.12-rc4 MBuddy V11
Order:                 10
Attempted allocations: 160
Success allocs:        48
Failed allocs:         112
% Success:             30

It is important to note that the standard allocator invoked the out-of-memory
killer so often that it killed almost all available processes, including X,
sshd and all instances of make and gcc. The kernel with the placement policy
never invoked the OOM killer.

The downside of the mbuddy allocator is that it takes a long time to free up
the MAX_ORDER-sized pages, as pages are freed in LRU order. At rest, after all
the compilations have finished, the modified allocator is known to be able to
allocate almost all 160 MAX_ORDER-1 blocks of pages, but it takes a few
attempts and some time, which makes it an unrealistic test. It does show,
however, that with a linear scanner freeing pages, we could almost guarantee
the availability of large pages.

The results show that the modified allocator is as fast as, and often faster
than, the normal allocator, has no adverse cache effects, is far less
fragmented and is better able to satisfy high-order allocations.
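The allocation probe in step 7 above comes from a VMRegress kernel module
whose source is not included in this mail. For illustration only, a
hypothetical module along the same lines might look like the sketch below;
whether the real test frees each successful block before the next attempt,
and which GFP flags it uses, are assumptions here.

/*
 * Hypothetical high-order allocation probe (illustration only, not the
 * VMRegress module used for the results above). It attempts an order-10
 * allocation 160 times, one attempt every 1/10th of a second, and reports
 * how many attempts succeeded. Each successful block is freed again
 * immediately, which is an assumption about the real test.
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/delay.h>

#define PROBE_ORDER	10
#define PROBE_ATTEMPTS	160

static int __init highorder_probe_init(void)
{
	int attempt, success = 0;
	struct page *page;

	for (attempt = 0; attempt < PROBE_ATTEMPTS; attempt++) {
		/* GFP_KERNEL is assumed; the flags used by the real test are not shown */
		page = alloc_pages(GFP_KERNEL, PROBE_ORDER);
		if (page) {
			success++;
			__free_pages(page, PROBE_ORDER);
		}
		msleep(100);	/* one attempt every 1/10th of a second */
	}

	printk(KERN_INFO "highorder probe: order %d, attempted %d, succeeded %d\n",
			PROBE_ORDER, PROBE_ATTEMPTS, success);
	return 0;
}

static void __exit highorder_probe_exit(void)
{
}

module_init(highorder_probe_init);
module_exit(highorder_probe_exit);
MODULE_LICENSE("GPL");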
Signed-off-by: Mel Gorman diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/fs/buffer.c linux-2.6.12-rc4-mbuddy-v11/fs/buffer.c --- linux-2.6.12-rc4-standard/fs/buffer.c 2005-05-22 15:29:15.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/fs/buffer.c 2005-05-22 15:34:59.000000000 +0100 @@ -1135,7 +1135,8 @@ grow_dev_page(struct block_device *bdev, struct page *page; struct buffer_head *bh; - page = find_or_create_page(inode->i_mapping, index, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, index, + GFP_NOFS | __GFP_USERRCLM); if (!page) return NULL; @@ -3052,7 +3053,8 @@ static void recalc_bh_state(void) struct buffer_head *alloc_buffer_head(unsigned int __nocast gfp_flags) { - struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags); + struct buffer_head *ret = kmem_cache_alloc(bh_cachep, + gfp_flags|__GFP_KERNRCLM); if (ret) { preempt_disable(); __get_cpu_var(bh_accounting).nr++; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/fs/dcache.c linux-2.6.12-rc4-mbuddy-v11/fs/dcache.c --- linux-2.6.12-rc4-standard/fs/dcache.c 2005-05-22 15:29:15.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/fs/dcache.c 2005-05-22 15:34:59.000000000 +0100 @@ -719,7 +719,8 @@ struct dentry *d_alloc(struct dentry * p struct dentry *dentry; char *dname; - dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); + dentry = kmem_cache_alloc(dentry_cache, + GFP_KERNEL|__GFP_KERNRCLM); if (!dentry) return NULL; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/fs/ext2/super.c linux-2.6.12-rc4-mbuddy-v11/fs/ext2/super.c --- linux-2.6.12-rc4-standard/fs/ext2/super.c 2005-05-22 15:29:15.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/fs/ext2/super.c 2005-05-22 15:34:59.000000000 +0100 @@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep; static struct inode *ext2_alloc_inode(struct super_block *sb) { struct ext2_inode_info *ei; - ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL); + ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL|__GFP_KERNRCLM); if (!ei) return NULL; #ifdef CONFIG_EXT2_FS_POSIX_ACL diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/fs/ext3/super.c linux-2.6.12-rc4-mbuddy-v11/fs/ext3/super.c --- linux-2.6.12-rc4-standard/fs/ext3/super.c 2005-05-22 15:29:15.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/fs/ext3/super.c 2005-05-22 15:34:59.000000000 +0100 @@ -432,7 +432,7 @@ static struct inode *ext3_alloc_inode(st { struct ext3_inode_info *ei; - ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS); + ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM); if (!ei) return NULL; #ifdef CONFIG_EXT3_FS_POSIX_ACL diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/fs/ntfs/inode.c linux-2.6.12-rc4-mbuddy-v11/fs/ntfs/inode.c --- linux-2.6.12-rc4-standard/fs/ntfs/inode.c 2005-05-22 15:29:16.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/fs/ntfs/inode.c 2005-05-22 15:34:59.000000000 +0100 @@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc ntfs_debug("Entering."); ni = (ntfs_inode *)kmem_cache_alloc(ntfs_big_inode_cache, - SLAB_NOFS); + SLAB_NOFS|__GFP_KERNRCLM); if (likely(ni != NULL)) { ni->state = 0; return VFS_I(ni); @@ -343,7 +343,8 @@ static inline ntfs_inode *ntfs_alloc_ext ntfs_inode *ni; ntfs_debug("Entering."); - ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS); + ni = (ntfs_inode *)kmem_cache_alloc(ntfs_inode_cache, + SLAB_NOFS|__GFP_KERNRCLM); if (likely(ni != 
NULL)) { ni->state = 0; return ni; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/include/linux/gfp.h linux-2.6.12-rc4-mbuddy-v11/include/linux/gfp.h --- linux-2.6.12-rc4-standard/include/linux/gfp.h 2005-05-22 15:29:20.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/include/linux/gfp.h 2005-05-22 17:52:50.000000000 +0100 @@ -39,7 +39,10 @@ struct vm_area_struct; #define __GFP_COMP 0x4000u /* Add compound page metadata */ #define __GFP_ZERO 0x8000u /* Return zeroed page on success */ #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */ +#define __GFP_KERNRCLM 0x20000u /* Kernel page that is easily reclaimable */ +#define __GFP_USERRCLM 0x40000u /* User is a userspace user */ +#define __GFP_TYPE_SHIFT 17 /* Translate RCLM flags to array index */ #define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */ #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) @@ -47,14 +50,14 @@ struct vm_area_struct; #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \ __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \ __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \ - __GFP_NOMEMALLOC) + __GFP_NOMEMALLOC|__GFP_KERNRCLM|__GFP_USERRCLM) #define GFP_ATOMIC (__GFP_HIGH) #define GFP_NOIO (__GFP_WAIT) #define GFP_NOFS (__GFP_WAIT | __GFP_IO) #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS) -#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS) -#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM) +#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_USERRCLM ) +#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_USERRCLM) /* Flag - indicates that the buffer will be suitable for DMA. Ignored on some platforms, used as appropriate on others */ diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/include/linux/mmzone.h linux-2.6.12-rc4-mbuddy-v11/include/linux/mmzone.h --- linux-2.6.12-rc4-standard/include/linux/mmzone.h 2005-05-22 15:29:20.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/include/linux/mmzone.h 2005-05-22 15:44:57.000000000 +0100 @@ -21,6 +21,16 @@ #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER #endif +/* Page allocations are divided into these types */ +#define ALLOC_TYPES 4 +#define ALLOC_KERNNORCLM 0 +#define ALLOC_KERNRCLM 1 +#define ALLOC_USERRCLM 2 +#define ALLOC_FALLBACK 3 + +/* Number of bits required to encode the type */ +#define BITS_PER_ALLOC_TYPE 2 + struct free_area { struct list_head free_list; unsigned long nr_free; @@ -43,12 +53,22 @@ struct zone_padding { #define ZONE_PADDING(name) #endif +/* + * Shared per-cpu lists would cause fragmentation over time + * The pcpu_list is to keep kernel and userrclm allocations + * apart while still allowing all allocation types to have + * per-cpu lists + */ +struct pcpu_list { + int count; + struct list_head list; +} ____cacheline_aligned_in_smp; + struct per_cpu_pages { - int count; /* number of pages in the list */ + struct pcpu_list pcpu_list[2]; /* 0: kernel 1: user */ int low; /* low watermark, refill needed */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ - struct list_head list; /* the list of pages */ }; struct per_cpu_pageset { @@ -128,8 +148,34 @@ struct zone { * free areas of different sizes */ spinlock_t lock; - struct free_area free_area[MAX_ORDER]; + /* + * The map tracks what each 2^MAX_ORDER-1 sized block is being used for + * using BITS_PER_ALLOC_TYPE number of bites. 
Bits are set when a large + * block is being split and rechecked during freeing of the page + */ + unsigned long *free_area_usemap; + + /* + * There are ALLOC_TYPE number of MAX_ORDER free lists. Once a + * MAX_ORDER block of pages has been split for an allocation type, + * the whole block is reserved for that type of allocation. + */ + struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER]; + + /* + * A percentage of a zone is reserved for falling back to. Without + * a fallback, memory will slowly fragment over time meaning the + * placement policy only delays the fragmentation problem, not + * fixes it + */ + unsigned long fallback_reserve; + + /* + * When negative, 2^MAX_ORDER-1 sized blocks of pages will be reserved + * for fallbacks + */ + long fallback_balance; ZONE_PADDING(_pad1_) @@ -212,6 +258,11 @@ struct zone { char *name; } ____cacheline_maxaligned_in_smp; +#define inc_reserve_count(zone, type) \ + type == ALLOC_FALLBACK ? zone->fallback_reserve++ : 0 +#define dec_reserve_count(zone, type) \ + (type == ALLOC_FALLBACK && zone->fallback_reserve) ? \ + zone->fallback_reserve-- : 0 /* * The "priority" of VM scanning is how much of the queues we will scan in one diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.12-rc4-standard/mm/page_alloc.c linux-2.6.12-rc4-mbuddy-v11/mm/page_alloc.c --- linux-2.6.12-rc4-standard/mm/page_alloc.c 2005-05-22 15:29:29.000000000 +0100 +++ linux-2.6.12-rc4-mbuddy-v11/mm/page_alloc.c 2005-05-22 15:46:29.000000000 +0100 @@ -64,6 +64,25 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z EXPORT_SYMBOL(totalram_pages); EXPORT_SYMBOL(nr_swap_pages); +/** + * The allocator tries to put allocations of the same type in the + * same 2^MAX_ORDER-1 blocks of pages. When memory is low, this may + * not be possible so this array describes what order allocations should + * fall back to + * + * The order of the fallback is chosen to keep the userrclm and kernrclm + * pools as low as fragmentation as possible. The FALLBACK zone is never + * directly used but acts as a reserve to allocate pages from when the + * normal pools are depleted. + * + */ +int fallback_allocs[ALLOC_TYPES][ALLOC_TYPES+1] = { + {ALLOC_KERNNORCLM,ALLOC_FALLBACK, ALLOC_KERNRCLM, ALLOC_USERRCLM,-1}, + {ALLOC_KERNRCLM, ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_USERRCLM,-1}, + {ALLOC_USERRCLM, ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_KERNRCLM,-1}, + {ALLOC_FALLBACK, ALLOC_KERNNORCLM,ALLOC_KERNRCLM, ALLOC_USERRCLM,-1} +}; + /* * Used by page_zone() to look up the address of the struct zone whose * id is encoded in the upper bits of page->flags @@ -118,6 +137,64 @@ static void bad_page(const char *functio tainted |= TAINT_BAD_PAGE; } +/* + * Return what type of page is being allocated from this 2^MAX_ORDER-1 block + * of pages. 
A bitmap is used as a char array performs slightly slower + */ +static inline unsigned int get_pageblock_type(struct zone *zone, struct page *page) { + int bitidx = (page_to_pfn(page) >> (MAX_ORDER-1)) * BITS_PER_ALLOC_TYPE; + unsigned int type = !!test_bit(bitidx, zone->free_area_usemap); + int i; + + for (i=1; i < BITS_PER_ALLOC_TYPE; i++) { + type = (type << 1) | (!!test_bit(bitidx+i, zone->free_area_usemap)); + } + +#ifdef CONFIG_DEBUG_PAGEALLOC + if (type >= ALLOC_TYPES) { + printk("\nBogus type in get_pageblock_type: %u\n", type); + BUG(); + } +#endif + return type; +} + +/* + * Reserve a block of pages for an allocation type + */ +static inline void set_pageblock_type(struct zone *zone, struct page *page, + int type) { + int bitidx = (page_to_pfn(page) >> (MAX_ORDER-1)) * BITS_PER_ALLOC_TYPE; + + /* Bits set match the alloc types defined in mmzone.h */ + switch (type) { + case ALLOC_KERNRCLM: + clear_bit(bitidx, zone->free_area_usemap); + set_bit(bitidx+1, zone->free_area_usemap); + break; + + case ALLOC_USERRCLM: + set_bit(bitidx, zone->free_area_usemap); + clear_bit(bitidx+1, zone->free_area_usemap); + break; + + default: + set_bit(bitidx, zone->free_area_usemap); + set_bit(bitidx+1, zone->free_area_usemap); + break; + } +} + +/* + * A percentage of a zone is reserved for allocations to fallback to. These + * macros calculate if a fallback reserve is already reserved and if it is + * needed + */ +#define need_min_fallback_reserve(zone) \ + (zone->free_pages >> (MAX_ORDER) < zone->fallback_reserve) +#define is_min_fallback_reserved(zone) \ + (zone->fallback_balance < 0) + #ifndef CONFIG_HUGETLB_PAGE #define prep_compound_page(page, order) do { } while (0) #define destroy_compound_page(page, order) do { } while (0) @@ -276,6 +353,8 @@ static inline void __free_pages_bulk (st { unsigned long page_idx; int order_size = 1 << order; + struct free_area *area; + struct free_area *freelist; if (unlikely(order)) destroy_compound_page(page, order); @@ -285,12 +364,14 @@ static inline void __free_pages_bulk (st BUG_ON(page_idx & (order_size - 1)); BUG_ON(bad_range(zone, page)); + /* Select the areas to place free pages on */ + freelist = zone->free_area_lists[get_pageblock_type(zone,page)]; + zone->free_pages += order_size; while (order < MAX_ORDER-1) { unsigned long combined_idx; - struct free_area *area; struct page *buddy; - + combined_idx = __find_combined_index(page_idx, order); buddy = __page_find_buddy(page, page_idx, order); @@ -299,16 +380,19 @@ static inline void __free_pages_bulk (st if (!page_is_buddy(buddy, order)) break; /* Move the buddy up one level. 
*/ list_del(&buddy->lru); - area = zone->free_area + order; + area = freelist + order; area->nr_free--; rmv_page_order(buddy); page = page + (combined_idx - page_idx); page_idx = combined_idx; order++; } + + if (unlikely(order == MAX_ORDER-1)) zone->fallback_balance++; set_page_order(page, order); - list_add(&page->lru, &zone->free_area[order].free_list); - zone->free_area[order].nr_free++; + area = freelist + order; + list_add_tail(&page->lru, &area->free_list); + area->nr_free++; } static inline void free_pages_check(const char *function, struct page *page) @@ -460,57 +544,211 @@ static void prep_new_page(struct page *p kernel_map_pages(page, 1 << order, 1); } +/* + * Find a list that has a 2^MAX_ORDER-1 block of pages available and + * return it + */ +static inline struct page* steal_largepage(struct zone *zone, int alloctype) { + struct page *page; + struct free_area *area; + int i=0; + + /* Search the other allocation type lists */ + while (i < ALLOC_TYPES) { + area = &(zone->free_area_lists[i][MAX_ORDER-1]); + if (!list_empty(&(area->free_list))) break; + if (++i == alloctype) i++; + } + if (i == ALLOC_TYPES) return NULL; + + /* + * Remove a MAX_ORDER block from the global pool and add + * it to the list of desired alloc_type + */ + page = list_entry(area->free_list.next, struct page, lru); + area->nr_free--; + + /* + * Reserve this whole block of pages. When the pool shrinks, a + * percentage will be reserved for fallbacks. + */ + if (!is_min_fallback_reserved(zone) && + need_min_fallback_reserve(zone)) { + alloctype = ALLOC_FALLBACK; + } + + set_pageblock_type(zone, page, alloctype); + dec_reserve_count(zone, i); + inc_reserve_count(zone, alloctype); + + return page; + +} /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order) +static struct page *__rmqueue(struct zone *zone, unsigned int order, int alloctype) { struct free_area * area; unsigned int current_order; struct page *page; + int *fallback_list; + int start_alloctype; - for (current_order = order; current_order < MAX_ORDER; ++current_order) { - area = zone->free_area + current_order; - if (list_empty(&area->free_list)) - continue; + alloctype >>= __GFP_TYPE_SHIFT; + + /* Search the list for the alloctype */ + area = zone->free_area_lists[alloctype] + order; + for (current_order = order; + current_order < MAX_ORDER; + current_order++, area++) { + if (list_empty(&area->free_list)) continue; page = list_entry(area->free_list.next, struct page, lru); - list_del(&page->lru); - rmv_page_order(page); area->nr_free--; - zone->free_pages -= 1UL << order; - return expand(zone, page, order, current_order, area); + goto remove_page; + } + + /* Allocate from the global list if the preferred free list is empty */ + if ((page = steal_largepage(zone, alloctype)) != NULL) { + area--; + current_order--; + goto remove_page; } + /* Ok, pick the fallback order based on the type */ + fallback_list = fallback_allocs[alloctype]; + start_alloctype = alloctype; + + /* + * Here, the alloc type lists has been depleted as well as the global + * pool, so fallback. 
When falling back, the largest possible block + * will be taken to keep the fallbacks clustered if possible + */ + while ((alloctype = *(++fallback_list)) != -1) { + + if (alloctype < 0 || alloctype >= ALLOC_TYPES) BUG(); + + /* Find a block to allocate */ + area = zone->free_area_lists[alloctype] + MAX_ORDER; + current_order=MAX_ORDER; + do { + current_order--; + area--; + if (!list_empty(&area->free_list)) { + + page = list_entry(area->free_list.next, struct page, lru); + area->nr_free--; + goto fallback_alloc; + } + + } while (current_order != order); + + } + return NULL; + +fallback_alloc: + /* + * If we are falling back, and the allocation is KERNNORCLM, + * then reserve any buddies for the KERNNORCLM pool. These + * allocations fragment the worst so this helps keep them + * in the one place + */ + if (start_alloctype == ALLOC_KERNNORCLM) { + area = zone->free_area_lists[ALLOC_KERNNORCLM] + current_order; + + /* Reserve the whole block if this is a large split */ + if (current_order >= MAX_ORDER / 2) { + int reserve_type=ALLOC_KERNNORCLM; + dec_reserve_count(zone, get_pageblock_type(zone,page)); + + /* + * Use this block for fallbacks if the + * minimum reserve is not being met + */ + if (!is_min_fallback_reserved(zone)) + reserve_type = ALLOC_FALLBACK; + + set_pageblock_type(zone, page, reserve_type); + inc_reserve_count(zone, reserve_type); + } + + } + +remove_page: + /* + * At this point, page is expected to be the page we are + * about to remove and the area->nr_free count should have been + * updated + */ + if (unlikely(current_order == MAX_ORDER-1)) zone->fallback_balance--; + list_del(&page->lru); + rmv_page_order(page); + zone->free_pages -= 1UL << order; + return expand(zone, page, order, current_order, area); } /* - * Obtain a specified number of elements from the buddy allocator, all under - * a single hold of the lock, for efficiency. Add them to the supplied list. - * Returns the number of new pages which were placed at *list. + * Obtain a specified number of order-0 elements from the buddy allocator, all + * under a single hold of the lock, for efficiency. Add them to the supplied + * list. An attempt is made to keep the allocatoed memory in physically + * contiguous blocks. Returns the number of new pages which were placed at + * *list. 
+ * */ -static int rmqueue_bulk(struct zone *zone, unsigned int order, - unsigned long count, struct list_head *list) +static int rmqueue_bulk(struct zone *zone, + unsigned long count, struct list_head *list, + int alloctype) { unsigned long flags; int i; - int allocated = 0; + unsigned long allocated = count; struct page *page; + unsigned long current_order= 0; + /* Find what order we should start allocating blocks at */ + current_order = ffs(count) - 1; + spin_lock_irqsave(&zone->lock, flags); - for (i = 0; i < count; ++i) { - page = __rmqueue(zone, order); - if (page == NULL) - break; - allocated++; - list_add_tail(&page->lru, list); + + /* + * Satisfy the request in as the largest possible physically + * contiguous block + */ + while (allocated) { + if ((1 << current_order) > allocated) + current_order--; + + /* Allocate a block at the current_order */ + page = __rmqueue(zone, current_order, alloctype); + if (page == NULL) { + if (current_order == 0) break; + current_order--; + continue; + } + allocated -= 1 << current_order; + + /* Move to the next block if order is already 0 */ + if (current_order == 0) { + list_add_tail(&page->lru, list); + continue; + } + + /* Split the large block into order-sized blocks */ + for (i = 1 << current_order; i != 0; i--) { + list_add_tail(&page->lru, list); + page++; + } } + spin_unlock_irqrestore(&zone->lock, flags); - return allocated; + return count - allocated; } + + #if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) static void __drain_pages(unsigned int cpu) { @@ -525,8 +763,11 @@ static void __drain_pages(unsigned int c struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - pcp->count -= free_pages_bulk(zone, pcp->count, - &pcp->list, 0); + pcp->pcpu_list[0].count -= free_pages_bulk(zone, pcp->pcpu_list[0].count, + &pcp->pcpu_list[0].list, 0); + + pcp->pcpu_list[1].count -= free_pages_bulk(zone, pcp->pcpu_list[1].count, + &pcp->pcpu_list[1].list, 0); } } } @@ -537,7 +778,7 @@ static void __drain_pages(unsigned int c void mark_free_pages(struct zone *zone) { unsigned long zone_pfn, flags; - int order; + int order, type; struct list_head *curr; if (!zone->spanned_pages) @@ -547,14 +788,17 @@ void mark_free_pages(struct zone *zone) for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn)); - for (order = MAX_ORDER - 1; order >= 0; --order) - list_for_each(curr, &zone->free_area[order].free_list) { - unsigned long start_pfn, i; + for (type=0; type < ALLOC_TYPES; type++) { - start_pfn = page_to_pfn(list_entry(curr, struct page, lru)); + for (order = MAX_ORDER - 1; order >= 0; --order) + list_for_each(curr, &zone->free_area_lists[type][order].free_list) { + unsigned long start_pfn, i; + + start_pfn = page_to_pfn(list_entry(curr, struct page, lru)); - for (i=0; i < (1<lock, flags); } @@ -607,6 +851,7 @@ static void fastcall free_hot_cold_page( struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; unsigned long flags; + struct pcpu_list *plist; arch_free_page(page, 0); @@ -616,11 +861,20 @@ static void fastcall free_hot_cold_page( page->mapping = NULL; free_pages_check(__FUNCTION__, page); pcp = &zone->pageset[get_cpu()].pcp[cold]; + + /* + * Strictly speaking, we should not be accessing the zone information + * here. 
In this case, it does not matter if the read is incorrect + */ + if (get_pageblock_type(zone, page) == ALLOC_USERRCLM) + plist = &pcp->pcpu_list[1]; + else + plist = &pcp->pcpu_list[0]; local_irq_save(flags); - if (pcp->count >= pcp->high) - pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0); - list_add(&page->lru, &pcp->list); - pcp->count++; + if (plist->count >= pcp->high) + plist->count -= free_pages_bulk(zone, pcp->batch, &plist->list, 0); + list_add(&page->lru, &plist->list); + plist->count++; local_irq_restore(flags); put_cpu(); } @@ -650,7 +904,7 @@ static inline void prep_zero_page(struct * or two. */ static struct page * -buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags) +buffered_rmqueue(struct zone *zone, int order, unsigned int __nocast gfp_flags, int alloctype) { unsigned long flags; struct page *page = NULL; @@ -658,16 +912,22 @@ buffered_rmqueue(struct zone *zone, int if (order == 0) { struct per_cpu_pages *pcp; + struct pcpu_list *plist; pcp = &zone->pageset[get_cpu()].pcp[cold]; local_irq_save(flags); - if (pcp->count <= pcp->low) - pcp->count += rmqueue_bulk(zone, 0, - pcp->batch, &pcp->list); - if (pcp->count) { - page = list_entry(pcp->list.next, struct page, lru); + + if (alloctype == __GFP_USERRCLM) plist = &pcp->pcpu_list[1]; + else plist = &pcp->pcpu_list[0]; + + if (plist->count <= pcp->low) + plist->count += rmqueue_bulk(zone, + pcp->batch, &plist->list, + alloctype); + if (plist->count) { + page = list_entry(plist->list.next, struct page, lru); list_del(&page->lru); - pcp->count--; + plist->count--; } local_irq_restore(flags); put_cpu(); @@ -675,7 +935,7 @@ buffered_rmqueue(struct zone *zone, int if (page == NULL) { spin_lock_irqsave(&zone->lock, flags); - page = __rmqueue(zone, order); + page = __rmqueue(zone, order, alloctype); spin_unlock_irqrestore(&zone->lock, flags); } @@ -702,6 +962,7 @@ int zone_watermark_ok(struct zone *z, in { /* free_pages my go negative - that's OK */ long min = mark, free_pages = z->free_pages - (1 << order) + 1; + struct free_area *kernnorclm, *kernrclm, *userrclm; int o; if (gfp_high) @@ -711,15 +972,25 @@ int zone_watermark_ok(struct zone *z, in if (free_pages <= min + z->lowmem_reserve[classzone_idx]) return 0; + kernnorclm = z->free_area_lists[ALLOC_KERNNORCLM]; + kernrclm = z->free_area_lists[ALLOC_KERNRCLM]; + userrclm = z->free_area_lists[ALLOC_USERRCLM]; for (o = 0; o < order; o++) { /* At the next order, this order's pages become unavailable */ - free_pages -= z->free_area[o].nr_free << o; + free_pages -= ( + kernnorclm->nr_free + + kernrclm->nr_free + + userrclm->nr_free) << o; /* Require fewer higher order pages to be free */ min >>= 1; if (free_pages <= min) return 0; + + kernnorclm++; + kernrclm++; + userrclm++; } return 1; } @@ -741,6 +1012,8 @@ __alloc_pages(unsigned int __nocast gfp_ int do_retry; int can_try_harder; int did_some_progress; + int alloctype; + int highorder_retry=3; might_sleep_if(wait); @@ -760,6 +1033,13 @@ __alloc_pages(unsigned int __nocast gfp_ classzone_idx = zone_idx(zones[0]); + /* + * Find what type of allocation this is. 
Later, this value will + * be shifted __GFP_TYPE_SHIFT bits to the right to give an + * index within the zones freelist + */ + alloctype = (gfp_mask & (__GFP_USERRCLM | __GFP_KERNRCLM)); + restart: /* Go through the zonelist once, looking for a zone with enough free */ for (i = 0; (z = zones[i]) != NULL; i++) { @@ -771,7 +1051,7 @@ __alloc_pages(unsigned int __nocast gfp_ if (!cpuset_zone_allowed(z)) continue; - page = buffered_rmqueue(z, order, gfp_mask); + page = buffered_rmqueue(z, order, gfp_mask, alloctype); if (page) goto got_pg; } @@ -795,7 +1075,7 @@ __alloc_pages(unsigned int __nocast gfp_ if (wait && !cpuset_zone_allowed(z)) continue; - page = buffered_rmqueue(z, order, gfp_mask); + page = buffered_rmqueue(z, order, gfp_mask, alloctype); if (page) goto got_pg; } @@ -809,7 +1089,8 @@ __alloc_pages(unsigned int __nocast gfp_ for (i = 0; (z = zones[i]) != NULL; i++) { if (!cpuset_zone_allowed(z)) continue; - page = buffered_rmqueue(z, order, gfp_mask); + page = buffered_rmqueue(z, order, gfp_mask, + alloctype); if (page) goto got_pg; } @@ -852,7 +1133,7 @@ rebalance: if (!cpuset_zone_allowed(z)) continue; - page = buffered_rmqueue(z, order, gfp_mask); + page = buffered_rmqueue(z, order, gfp_mask, alloctype); if (page) goto got_pg; } @@ -871,12 +1152,19 @@ rebalance: if (!cpuset_zone_allowed(z)) continue; - page = buffered_rmqueue(z, order, gfp_mask); + page = buffered_rmqueue(z, order, gfp_mask, alloctype); if (page) goto got_pg; } - out_of_memory(gfp_mask); + if (order < MAX_ORDER/2) out_of_memory(gfp_mask); + + /* + * Due to low fragmentation efforts, we should try a little + * harder to satisfy high order allocations + */ + if (order >= MAX_ORDER/2 && --highorder_retry > 0) + goto rebalance; goto restart; } @@ -893,6 +1181,8 @@ rebalance: do_retry = 1; if (gfp_mask & __GFP_NOFAIL) do_retry = 1; + if (order >= MAX_ORDER/2 && --highorder_retry > 0) + do_retry=1; } if (do_retry) { blk_congestion_wait(WRITE, HZ/50); @@ -1220,6 +1510,7 @@ void show_free_areas(void) unsigned long inactive; unsigned long free; struct zone *zone; + int type; for_each_zone(zone) { show_node(zone); @@ -1312,8 +1603,10 @@ void show_free_areas(void) spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < MAX_ORDER; order++) { - nr = zone->free_area[order].nr_free; - total += nr << order; + for (type=0; type < ALLOC_TYPES; type++) { + nr = zone->free_area_lists[type][order].nr_free; + total += nr << order; + } printk("%lu*%lukB ", nr, K(1UL) << order); } spin_unlock_irqrestore(&zone->lock, flags); @@ -1609,10 +1902,19 @@ void zone_init_free_lists(struct pglist_ unsigned long size) { int order; - for (order = 0; order < MAX_ORDER ; order++) { - INIT_LIST_HEAD(&zone->free_area[order].free_list); - zone->free_area[order].nr_free = 0; + int type; + struct free_area *area; + + /* Initialse the three size ordered lists of free_areas */ + for (type=0; type < ALLOC_TYPES; type++) { + for (order = 0; order < MAX_ORDER; order++) { + area = zone->free_area_lists[type]; + + INIT_LIST_HEAD(&area[order].free_list); + area[order].nr_free = 0; + } } + } #ifndef __HAVE_ARCH_MEMMAP_INIT @@ -1621,6 +1923,21 @@ void zone_init_free_lists(struct pglist_ #endif /* + * Calculate the size of the zone->usemap + */ +static unsigned long __init usemap_size(unsigned long zonesize) { + unsigned long usemapsize; + + /* - Number of MAX_ORDER blocks in the zone */ + usemapsize = (zonesize + (1 << (MAX_ORDER-1))) >> (MAX_ORDER-1); + + /* - BITS_PER_ALLOC_TYPE bits to record what type of block it is */ + usemapsize = (usemapsize * 
BITS_PER_ALLOC_TYPE + (sizeof(unsigned long)*8)) / 8; + + return L1_CACHE_ALIGN(usemapsize); +} + +/* * Set up the zone data structures: * - mark all pages reserved * - mark all memory queues empty @@ -1633,6 +1950,7 @@ static void __init free_area_init_core(s const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1); int cpu, nid = pgdat->node_id; unsigned long zone_start_pfn = pgdat->node_start_pfn; + unsigned long usemapsize; pgdat->nr_zones = 0; init_waitqueue_head(&pgdat->kswapd_wait); @@ -1659,6 +1977,11 @@ static void __init free_area_init_core(s spin_lock_init(&zone->lru_lock); zone->zone_pgdat = pgdat; zone->free_pages = 0; + zone->fallback_reserve = 0; + + /* Set the balance so about 12.5% will be used for fallbacks */ + zone->fallback_balance = (realsize >> (MAX_ORDER-1)) - + (realsize >> (MAX_ORDER+2)); zone->temp_priority = zone->prev_priority = DEF_PRIORITY; @@ -1692,18 +2015,22 @@ static void __init free_area_init_core(s struct per_cpu_pages *pcp; pcp = &zone->pageset[cpu].pcp[0]; /* hot */ - pcp->count = 0; + pcp->pcpu_list[0].count = 0; + pcp->pcpu_list[1].count = 0; pcp->low = 2 * batch; pcp->high = 6 * batch; pcp->batch = 1 * batch; - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->pcpu_list[0].list); + INIT_LIST_HEAD(&pcp->pcpu_list[1].list); pcp = &zone->pageset[cpu].pcp[1]; /* cold */ - pcp->count = 0; + pcp->pcpu_list[0].count = 0; + pcp->pcpu_list[1].count = 0; pcp->low = 0; pcp->high = 2 * batch; pcp->batch = 1 * batch; - INIT_LIST_HEAD(&pcp->list); + INIT_LIST_HEAD(&pcp->pcpu_list[0].list); + INIT_LIST_HEAD(&pcp->pcpu_list[1].list); } printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", zone_names[j], realsize, batch); @@ -1743,6 +2070,18 @@ static void __init free_area_init_core(s zone_start_pfn += size; zone_init_free_lists(pgdat, zone, zone->spanned_pages); + + usemapsize = usemap_size(size); + + zone->free_area_usemap = + (unsigned long *)alloc_bootmem_node(pgdat, + usemapsize); + + memset((unsigned long *)zone->free_area_usemap, + ALLOC_KERNNORCLM, usemapsize); + + printk(KERN_DEBUG " %s zone: %lu pages, %lu real pages, usemap size:%lu\n", + zone_names[j], size, realsize, usemapsize); } } @@ -1830,19 +2169,38 @@ static int frag_show(struct seq_file *m, struct zone *zone; struct zone *node_zones = pgdat->node_zones; unsigned long flags; - int order; + int order, type; + struct list_head *elem; + unsigned long nr_bufs = 0; + /* Show global fragmentation statistics */ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { if (!zone->present_pages) continue; spin_lock_irqsave(&zone->lock, flags); - seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); - for (order = 0; order < MAX_ORDER; ++order) - seq_printf(m, "%6lu ", zone->free_area[order].nr_free); - spin_unlock_irqrestore(&zone->lock, flags); - seq_putc(m, '\n'); - } + seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name); + for (order = 0; order < MAX_ORDER-1; ++order) { + nr_bufs = 0; + + for (type=0; type < ALLOC_TYPES; type++) { + list_for_each(elem, &(zone->free_area_lists[type][order].free_list)) + ++nr_bufs; + } + seq_printf(m, "%6lu ", nr_bufs); + } + + /* Scan global list */ + nr_bufs = 0; + for (type=0; type < ALLOC_TYPES; type++) { + nr_bufs += zone->free_area_lists[type][MAX_ORDER-1].nr_free; + } + seq_printf(m, "%6lu ", nr_bufs); + + spin_unlock_irqrestore(&zone->lock, flags); + seq_putc(m, '\n'); + } + return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/