Hi Andrew,
This is the latest release of the fragmentation avoidance patches with no
code changes since v18. If it is possible, I would like to get this into -mm,
so this patch is generated against the latest -mm tree 2.6.14-rc5-mm1 and
is known to apply cleanly. If there is another tree that should be diffed
against instead, just say so and I'll send another version.
Here are a few brief reasons why this set of patches is useful;
o Reduced fragmentation improves the chance a large order allocation succeeds
o General-purpose memory hotplug needs the page/memory groupings provided
o Reduces the number of badly-placed pages that page migration mechanism must
deal with. This also applies to any active page defragmentation mechanism.
o This patch is a pre-requisite for a linear scanning mechanism that could
be used to guarantee large-page allocations
Built and tested successfully on a single processor AMD machine, quad
processor Xeon machine and PPC64. Benchmarks are generated on the Xeon machine.
Changelog since v18
o Resync against 2.6.14-rc5-mm1
o 004_markfree dropped
o Documentation note added on the behavior of free_area.nr_free
Changelog since v17
o Update to 2.6.14-rc4-mm1
o Remove explicit casts where implicit casts were in place
o Change __GFP_USER to __GFP_EASYRCLM, RCLM_USER to RCLM_EASY and PCPU_USER to
PCPU_EASY
o Print a warning and return NULL if both RCLM flags are set in the GFP flags
o Reduce size of fallback_allocs
o Change magic number 64 to FREE_AREA_USEMAP_SIZE
o CodingStyle regressions cleanup
o Move sparsemen setup_usemap() out of header
o Changed fallback_balance to a mechanism that depended on zone->present_pages
to avoid hotplug problems later
o Many superflous parenthesis removed
Changlog since v16
o Variables using bit operations now are unsigned long. Note that when used
as indices, they are integers and cast to unsigned long when necessary.
This is because aim9 shows regressions when used as unsigned longs
throughout (~10% slowdown)
o 004_showfree added to provide more debugging information
o 008_stats dropped. Even with CONFIG_ALLOCSTATS disabled, it is causing
severe performance regressions. No explanation as to why
o for_each_rclmtype_order moved to header
o More coding style cleanups
Changelog since V14 (V15 not released)
o Update against 2.6.14-rc3
o Resync with Joel's work. All suggestions made on fix-ups to his last
set of patches should also be in here. e.g. __GFP_USER is still __GFP_USER
but is better commented.
o Large amount of CodingStyle, readability cleanups and corrections pointed
out by Dave Hansen.
o Fix CONFIG_NUMA error that corrupted per-cpu lists
o Patches broken out to have one-feature-per-patch rather than
more-code-per-patch
o Fix fallback bug where pages for RCLM_NORCLM end up on random other
free lists.
Changelog since V13
o Patches are now broken out
o Added per-cpu draining of userrclm pages
o Brought the patch more in line with memory hotplug work
o Fine-grained use of the __GFP_USER and __GFP_KERNRCLM flags
o Many coding-style corrections
o Many whitespace-damage corrections
Changelog since V12
o Minor whitespace damage fixed as pointed by Joel Schopp
Changelog since V11
o Mainly a redefiff against 2.6.12-rc5
o Use #defines for indexing into pcpu lists
o Fix rounding error in the size of usemap
Changelog since V10
o All allocation types now use per-cpu caches like the standard allocator
o Removed all the additional buddy allocator statistic code
o Elimated three zone fields that can be lived without
o Simplified some loops
o Removed many unnecessary calculations
Changelog since V9
o Tightened what pools are used for fallbacks, less likely to fragment
o Many micro-optimisations to have the same performance as the standard
allocator. Modified allocator now faster than standard allocator using
gcc 3.3.5
o Add counter for splits/coalescing
Changelog since V8
o rmqueue_bulk() allocates pages in large blocks and breaks it up into the
requested size. Reduces the number of calls to __rmqueue()
o Beancounters are now a configurable option under "Kernel Hacking"
o Broke out some code into inline functions to be more Hotplug-friendly
o Increased the size of reserve for fallbacks from 10% to 12.5%.
Changelog since V7
o Updated to 2.6.11-rc4
o Lots of cleanups, mainly related to beancounters
o Fixed up a miscalculation in the bitmap size as pointed out by Mike Kravetz
(thanks Mike)
o Introduced a 10% reserve for fallbacks. Drastically reduces the number of
kernnorclm allocations that go to the wrong places
o Don't trigger OOM when large allocations are involved
Changelog since V6
o Updated to 2.6.11-rc2
o Minor change to allow prezeroing to be a cleaner looking patch
Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage
Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
with offsets to 2.6.11-rc1-mm1
Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists
Changelog since V2
o Do not to interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
anything to do with asynchronous IO
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
allocation type
o In-code documentation
Version 1
o Initial release against 2.6.9
This patch is designed to reduce fragmentation in the standard buddy allocator
without impairing the performance of the allocator. High fragmentation in
the standard binary buddy allocator means that high-order allocations can
rarely be serviced. This patch works by dividing allocations into three
different types of allocations;
EasyReclaimable - These are userspace pages that are easily reclaimable. This
flag is set when it is known that the pages will be trivially reclaimed
by writing the page out to swap or syncing with backing storage
KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These type of pages potentially could be reclaimed by dumping the
caches and reaping the slabs
KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type
Instead of having one global MAX_ORDER-sized array of free lists,
there are four, one for each type of allocation and another reserve for
fallbacks.
Once a 2^MAX_ORDER block of pages it split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if 2^MAX_ORDER number of pages were required, the system could linearly scan
a block of pages allocated for EasyReclaimable and page each of them out.
Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
Three benchmark results are included all based on a 2.6.14-rc3 kernel
compiled with gcc 3.4 (it is known that gcc 2.95 produces different results).
The first is the output of portions of AIM9 for the vanilla allocator and
the modified one;
(Tests run with bench-aim9.sh from VMRegress 0.17)
2.6.14-rc5-mm1-clean
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 961 16.00600 16006.00 File Creations and Closes/second
2 page_test 60.02 4149 69.12696 117515.83 System Allocations & Pages/second
3 brk_test 60.04 1555 25.89940 440289.81 System Memory Allocations/second
4 jmp_test 60.00 250768 4179.46667 4179466.67 Non-local gotos/second
5 signal_test 60.01 4849 80.80320 80803.20 Signal Traps/second
6 exec_test 60.00 741 12.35000 61.75 Program Loads/second
7 fork_test 60.06 797 13.27006 1327.01 Task Creations/second
8 link_test 60.01 5269 87.80203 5531.53 Link/Unlink Pairs/second
2.6.14-rc3-mbuddy-v19
------------------------------------------------------------------------------------------------------------
Test Test Elapsed Iteration Iteration Operation
Number Name Time (sec) Count Rate (loops/sec) Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
1 creat-clo 60.04 954 15.88941 15889.41 File Creations and Closes/second
2 page_test 60.01 4133 68.87185 117082.15 System Allocations & Pages/second
3 brk_test 60.02 1546 25.75808 437887.37 System Memory Allocations/second
4 jmp_test 60.00 250797 4179.95000 4179950.00 Non-local gotos/second
5 signal_test 60.01 5121 85.33578 85335.78 Signal Traps/second
6 exec_test 60.00 743 12.38333 61.92 Program Loads/second
7 fork_test 60.05 806 13.42215 1342.21 Task Creations/second
8 link_test 60.00 5291 88.18333 5555.55 Link/Unlink Pairs/second
Difference in performance operations report generated by diff-aim9.sh
Clean mbuddy-v19
---------- ----------
1 creat-clo 16006.00 15889.41 -116.59 -0.73% File Creations and Closes/second
2 page_test 117515.83 117082.15 -433.68 -0.37% System Allocations & Pages/second
3 brk_test 440289.81 437887.37 -2402.44 -0.55% System Memory Allocations/second
4 jmp_test 4179466.67 4179950.00 483.33 0.01% Non-local gotos/second
5 signal_test 80803.20 85335.78 4532.58 5.61% Signal Traps/second
6 exec_test 61.75 61.92 0.17 0.28% Program Loads/second
7 fork_test 1327.01 1342.21 15.20 1.15% Task Creations/second
8 link_test 5531.53 5555.55 24.02 0.43% Link/Unlink Pairs/second
In this test, there were small regressions in the page_test. However, it
is known that different kernel configurations, compilers and even different
runs show similar varianes of +/- 3% .
The second benchmark tested the CPU cache usage to make sure it was not
getting clobbered. The test was to repeatedly render a large postscript file
10 times and get the average. The result is;
2.6.14-rc5-mm1-clean: Average: 43.254 real, 38.89 user, 0.042 sys
2.6.14-rc5-mm1-mbuddy-v19: Average: 43.212 real, 40.494 user, 0.044 sys
So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The test performs the following;
1. Start updatedb running in the background
2. Load kernel modules that tries to allocate high-order blocks on demand
3. Clean a kernel tree
4. Make 6 copies of the tree. As each copy finishes, a compile starts at -j2
5. Start compiling the primary tree
6. Sleep 1 minute while the 7 trees are being compiled
7. Use the kernel module to attempt 160 times to allocate a 2^10 block of pages
- note, it only attempts 160 times, no matter how often it succeeds
- An allocation is attempted every 1/10th of a second
- Performance will get badly shot as it forces considerable amounts of
pageout
The result of the allocations under load (load averaging 18) were;
2.6.14-rc5-mm1 Clean
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 30
Failed allocs: 130
DMA zone allocs: 0
Normal zone allocs: 7
HighMem zone allocs: 23
% Success: 18
2.6.14-rc5-mm1 MBuddy V19
Order: 10
Allocation type: HighMem
Attempted allocations: 160
Success allocs: 76
Failed allocs: 84
DMA zone allocs: 1
Normal zone allocs: 30
HighMem zone allocs: 45
% Success: 47
One thing that had to be changed in the 2.6.14-rc5-mm1 clean test was to
disable the OOM killer. During one test, the OOM killer had better results
but invoked the OOM killer a very large number of times to achieve it. The
patch with the placement policy never invoked the OOM killer.
The above results are not very dramatic but the affect is very noticeable when
the system is at rest after the test completes. After the test, the standard
allocator was able to allocate 45 order-10 pages and the modified allocator
allocated 159. The ability to allocate large pages under load depend heavily
on the decisions of kswapd so there can be large variances in results but
that is a separate problem. It is also known that the success of large
allocations is also dependant on the location of per-cpu pages but fixing
that problem is a separate issue.
The results show that the modified allocator has comparable speed, has no
adverse cache effects but is far less fragmented and in a better position
to satisfy high-order allocations.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
This patch adds two flags __GFP_EASYRCLM and __GFP_KERNRCLM that are used
to trap the type of allocation the caller is made. Allocations using
the __GFP_EASYRCLM flag are expected to be easily reclaimed by syncing
with backing storage (be it a file or swap) or cleaning the buffers and
discarding. Allocations using the __GFP_KERNRCLM flag belong to slab caches
that can be shrunk by the kernel.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/buffer.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c
--- linux-2.6.14-rc5-mm1-clean/fs/buffer.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/buffer.c 2005-10-30 13:34:50.000000000 +0000
@@ -1119,7 +1119,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;
- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS|__GFP_EASYRCLM);
if (!page)
return NULL;
@@ -3058,7 +3059,8 @@ static void recalc_bh_state(void)
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ gfp_flags|__GFP_KERNRCLM);
if (ret) {
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/compat.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c
--- linux-2.6.14-rc5-mm1-clean/fs/compat.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/compat.c 2005-10-30 13:34:50.000000000 +0000
@@ -1363,7 +1363,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/dcache.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c
--- linux-2.6.14-rc5-mm1-clean/fs/dcache.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/dcache.c 2005-10-30 13:34:50.000000000 +0000
@@ -878,7 +878,7 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;
- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/exec.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c
--- linux-2.6.14-rc5-mm1-clean/fs/exec.c 2005-10-30 13:19:59.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/exec.c 2005-10-30 13:34:50.000000000 +0000
@@ -237,7 +237,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_EASYRCLM);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext2/super.c 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext2/super.c 2005-10-30 13:34:50.000000000 +0000
@@ -141,7 +141,8 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+ SLAB_KERNEL|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c
--- linux-2.6.14-rc5-mm1-clean/fs/ext3/super.c 2005-10-30 13:20:00.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ext3/super.c 2005-10-30 13:34:50.000000000 +0000
@@ -444,7 +444,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;
- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_KERNRCLM);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/inode.c 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/inode.c 2005-10-30 13:34:50.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_EASYRCLM);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c
--- linux-2.6.14-rc5-mm1-clean/fs/ntfs/inode.c 2005-10-30 13:20:01.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/fs/ntfs/inode.c 2005-10-30 13:34:50.000000000 +0000
@@ -318,7 +318,7 @@ struct inode *ntfs_alloc_big_inode(struc
ntfs_inode *ni;
ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -343,7 +343,7 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;
ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS|__GFP_KERNRCLM);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h
--- linux-2.6.14-rc5-mm1-clean/include/asm-i386/page.h 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/asm-i386/page.h 2005-10-30 13:34:50.000000000 +0000
@@ -36,7 +36,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | __GFP_EASYRCLM, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/gfp.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/gfp.h 2005-10-30 13:34:50.000000000 +0000
@@ -50,14 +50,27 @@ struct vm_area_struct;
#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */
#define __GFP_VALID 0x80000000u /* valid GFP flags */
-#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
+/*
+ * Allocation type modifiers, these are required to be adjacent
+ * __GFP_EASYRCLM: Easily reclaimed pages like userspace or buffer pages
+ * __GFP_KERNRCLM: Short-lived or reclaimable kernel allocation
+ * Both bits off: Kernel non-reclaimable or very hard to reclaim
+ * __GFP_EASYRCLM and __GFP_KERNRCLM should not be specified at the same time
+ * RCLM_SHIFT (defined elsewhere) depends on the location of these bits
+ */
+#define __GFP_EASYRCLM 0x80000u /* User and other easily reclaimed pages */
+#define __GFP_KERNRCLM 0x100000u /* Kernel page that is reclaimable */
+#define __GFP_RCLM_BITS (__GFP_EASYRCLM|__GFP_KERNRCLM)
+
+#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL| \
+ __GFP_EASYRCLM|__GFP_KERNRCLM)
#define GFP_ATOMIC (__GFP_VALID | __GFP_HIGH)
#define GFP_NOIO (__GFP_VALID | __GFP_WAIT)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h
--- linux-2.6.14-rc5-mm1-clean/include/linux/highmem.h 2005-10-20 07:23:05.000000000 +0100
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/highmem.h 2005-10-30 13:34:50.000000000 +0000
@@ -47,7 +47,8 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, vaddr);
if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/memory.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c
--- linux-2.6.14-rc5-mm1-clean/mm/memory.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/memory.c 2005-10-30 13:34:50.000000000 +0000
@@ -1295,7 +1295,8 @@ static int do_wp_page(struct mm_struct *
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, address);
if (!new_page)
goto oom;
copy_user_highpage(new_page, old_page, address);
@@ -1858,7 +1859,8 @@ retry:
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/shmem.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c
--- linux-2.6.14-rc5-mm1-clean/mm/shmem.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/shmem.c 2005-10-30 13:34:50.000000000 +0000
@@ -906,7 +906,7 @@ shmem_alloc_page(unsigned long gfp, stru
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO | __GFP_EASYRCLM, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -921,7 +921,7 @@ shmem_swapin(struct shmem_inode_info *in
static inline struct page *
shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
{
- return alloc_page(gfp | __GFP_ZERO);
+ return alloc_page(gfp | __GFP_ZERO | __GFP_EASYRCLM);
}
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-clean/mm/swap_state.c linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c
--- linux-2.6.14-rc5-mm1-clean/mm/swap_state.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/swap_state.c 2005-10-30 13:34:50.000000000 +0000
@@ -341,7 +341,8 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_EASYRCLM,
+ vma, addr);
if (!new_page)
break; /* Out of memory */
}
It is not necessary to apply this patch to get all the anti-fragmentation
code. This patch adds a new config option called CONFIG_ALLOCSTATS. If
set, a number of new bean counters are added that are related to the
anti-fragmentation code. The information is exported via /proc/buddyinfo. This
is very useful when debugging why high-order pages are not available for
allocation.
Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/include/linux/mmzone.h 2005-10-30 13:38:56.000000000 +0000
@@ -193,6 +193,17 @@ struct zone {
/* Number of pages currently used for RCLM_FALLBACK */
unsigned long fallback_reserve;
+#ifdef CONFIG_ALLOCSTATS
+ /*
+ * These are beancounters that track how the placement policy
+ * of the buddy allocator is performing
+ */
+ unsigned long fallback_count[RCLM_TYPES];
+ unsigned long alloc_count[RCLM_TYPES];
+ unsigned long reserve_count[RCLM_TYPES];
+ unsigned long kernnorclm_full_steal;
+ unsigned long kernnorclm_partial_steal;
+#endif
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +303,17 @@ struct zone {
char *name;
} ____cacheline_maxaligned_in_smp;
+#ifdef CONFIG_ALLOCSTATS
+#define inc_fallback_count(zone, type) zone->fallback_count[type]++
+#define inc_alloc_count(zone, type) zone->alloc_count[type]++
+#define inc_kernnorclm_partial_steal(zone) zone->kernnorclm_partial_steal++
+#define inc_kernnorclm_full_steal(zone) zone->kernnorclm_full_steal++
+#else
+#define inc_fallback_count(zone, type) do {} while (0)
+#define inc_alloc_count(zone, type) do {} while (0)
+#define inc_kernnorclm_partial_steal(zone) do {} while (0)
+#define inc_kernnorclm_full_steal(zone) do {} while (0)
+#endif
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
@@ -319,12 +341,19 @@ static inline void inc_reserve_count(str
{
if (type == RCLM_FALLBACK)
zone->fallback_reserve += PAGES_PER_MAXORDER;
+#ifdef CONFIG_ALLOCSTATS
+ zone->reserve_count[type]++;
+#endif
}
static inline void dec_reserve_count(struct zone *zone, int type)
{
if (type == RCLM_FALLBACK && zone->fallback_reserve)
zone->fallback_reserve -= PAGES_PER_MAXORDER;
+#ifdef CONFIG_ALLOCSTATS
+ if (zone->reserve_count[type] > 0)
+ zone->reserve_count[type]--;
+#endif
}
/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug
--- linux-2.6.14-rc5-mm1-006_percpu/lib/Kconfig.debug 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/lib/Kconfig.debug 2005-10-30 13:38:56.000000000 +0000
@@ -77,6 +77,17 @@ config SCHEDSTATS
application, you can say N to avoid the very slight overhead
this adds.
+config ALLOCSTATS
+ bool "Collection buddy allocator statistics"
+ depends on DEBUG_KERNEL && PROC_FS
+ help
+ If you say Y here, additional code will be inserted into the
+ page allocator routines to collect statistics on the allocator
+ behavior and provide them in /proc/buddyinfo. These stats are
+ useful for measuring fragmentation in the buddy allocator. If
+ you are not debugging or measuring the allocator, you can say N
+ to avoid the slight overhead this adds.
+
config DEBUG_SLAB
bool "Debug memory allocations"
depends on DEBUG_KERNEL
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000
+++ linux-2.6.14-rc5-mm1-007_stats/mm/page_alloc.c 2005-10-30 13:38:56.000000000 +0000
@@ -187,6 +187,11 @@ EXPORT_SYMBOL(zone_table);
static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+#ifdef CONFIG_ALLOCSTATS
+static char *type_names[RCLM_TYPES] = { "KernNoRclm", "EasyRclm",
+ "KernRclm", "Fallback"};
+#endif /* CONFIG_ALLOCSTATS */
+
unsigned long __initdata nr_kernel_pages;
unsigned long __initdata nr_all_pages;
@@ -684,6 +689,9 @@ fallback_buddy_reserve(int start_allocty
dec_reserve_count(zone, get_pageblock_type(zone,page));
set_pageblock_type(zone, page, reserve_type);
inc_reserve_count(zone, reserve_type);
+ inc_kernnorclm_full_steal(zone);
+ } else {
+ inc_kernnorclm_partial_steal(zone);
}
return area;
}
@@ -726,6 +734,15 @@ fallback_alloc(int alloctype, struct zon
current_order, area);
}
+
+ /*
+ * If the current alloctype is RCLM_FALLBACK, it means
+ * that the requested pool and fallback pool are both
+ * depleted and we are falling back to other pools.
+ * At this point, pools are starting to get fragmented
+ */
+ if (alloctype == RCLM_FALLBACK)
+ inc_fallback_count(zone, start_alloctype);
}
return NULL;
@@ -742,6 +759,8 @@ static struct page *__rmqueue(struct zon
unsigned int current_order;
struct page *page;
+ inc_alloc_count(zone, alloctype);
+
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &zone->free_area_lists[alloctype][current_order];
if (list_empty(&area->free_list))
@@ -2373,6 +2392,9 @@ static __devinit void init_currently_emp
memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+#ifdef CONFIG_ALLOCSTATS
+ zone->reserve_count[RCLM_NORCLM] = zone->present_pages >> (MAX_ORDER-1);
+#endif /* CONFIG_ALLOCSTATS */
}
/*
@@ -2528,6 +2550,18 @@ static int frag_show(struct seq_file *m,
int order, t;
struct free_area *area;
unsigned long nr_bufs = 0;
+#ifdef CONFIG_ALLOCSTATS
+ int i;
+ unsigned long kernnorclm_full_steal = 0;
+ unsigned long kernnorclm_partial_steal = 0;
+ unsigned long reserve_count[RCLM_TYPES];
+ unsigned long fallback_count[RCLM_TYPES];
+ unsigned long alloc_count[RCLM_TYPES];
+
+ memset(reserve_count, 0, sizeof(reserve_count));
+ memset(fallback_count, 0, sizeof(fallback_count));
+ memset(alloc_count, 0, sizeof(alloc_count));
+#endif
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
@@ -2548,6 +2582,86 @@ static int frag_show(struct seq_file *m,
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
+
+#ifdef CONFIG_ALLOCSTATS
+ /* Show statistics for each allocation type */
+ seq_printf(m, "\nPer-allocation-type statistics");
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!zone->present_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (t = 0; t < RCLM_TYPES; t++) {
+ struct list_head *elem;
+ seq_printf(m, "\nNode %d, zone %8s, type %10s ",
+ pgdat->node_id, zone->name,
+ type_names[t]);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ nr_bufs = 0;
+
+ list_for_each(elem, &zone->free_area_lists[t][order].free_list)
+ ++nr_bufs;
+ seq_printf(m, "%6lu ", nr_bufs);
+ }
+ }
+
+ /* Scan global list */
+ seq_printf(m, "\n");
+ seq_printf(m, "Node %d, zone %8s, type %10s",
+ pgdat->node_id, zone->name,
+ "MAX_ORDER");
+ nr_bufs = 0;
+ for (t = 0; t < RCLM_TYPES; t++) {
+ nr_bufs +=
+ zone->free_area_lists[t][MAX_ORDER-1].nr_free;
+ }
+ seq_printf(m, "%6lu ", nr_bufs);
+ seq_printf(m, "\n");
+
+ seq_printf(m, "%s Zone beancounters\n", zone->name);
+ seq_printf(m, "Fallback reserve: %lu (%lu blocks)\n",
+ zone->fallback_reserve,
+ zone->fallback_reserve >> (MAX_ORDER-1));
+ seq_printf(m, "Fallback needed: %lu (%lu blocks)\n",
+ zone->present_pages >> 3,
+ (zone->present_pages >> 3) >> (MAX_ORDER-1));
+ seq_printf(m, "Partial steal: %lu\n",
+ zone->kernnorclm_partial_steal);
+ seq_printf(m, "Full steal: %lu\n",
+ zone->kernnorclm_full_steal);
+
+ kernnorclm_partial_steal += zone->kernnorclm_partial_steal;
+ kernnorclm_full_steal += zone->kernnorclm_full_steal;
+ seq_putc(m, '\n');
+
+ for (i = 0; i< RCLM_TYPES; i++) {
+ seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n",
+ type_names[i],
+ zone->alloc_count[i],
+ zone->reserve_count[i],
+ zone->fallback_count[i]);
+ alloc_count[i] += zone->alloc_count[i];
+ reserve_count[i] += zone->reserve_count[i];
+ fallback_count[i] += zone->fallback_count[i];
+ }
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
+
+ /* Show bean counters */
+ seq_printf(m, "\nGlobal beancounters\n");
+ seq_printf(m, "Partial steal: %lu\n", kernnorclm_partial_steal);
+ seq_printf(m, "Full steal: %lu\n", kernnorclm_full_steal);
+
+ for (i = 0; i< RCLM_TYPES; i++) {
+ seq_printf(m, "%-10s Allocs: %-10lu Reserve: %-10lu Fallbacks: %-10lu\n",
+ type_names[i],
+ alloc_count[i],
+ reserve_count[i],
+ fallback_count[i]);
+ }
+#endif /* CONFIG_ALLOCSTATS */
return 0;
}
This patch adds a "usemap" to the allocator. When a PAGE_PER_MAXORDER block
of pages (i.e. 2^(MAX_ORDER-1)) is split, the usemap is updated with the
type of allocation when splitting. This information is used in an
anti-fragmentation patch to group related allocation types together.
The __GFP_EASYRCLM and __GFP_KERNRCLM bits are used to enumerate three allocation
types;
RCLM_NORLM: These are kernel allocations that cannot be reclaimed
on demand.
RCLM_EASY: These are pages allocated with __GFP_EASYRCLM flag set. They are
considered to be user and other easily reclaimed pages such
as buffers
RCLM_KERN: Allocated for the kernel but for caches that can be reclaimed
on demand.
gfpflags_to_rclmtype() converts gfp_flags to their corresponding RCLM_TYPE
by masking out irrelevant bits and shifting the result right by RCLM_SHIFT.
Compile-time checks are made on RCLM_SHIFT to ensure gfpflags_to_rclmtype()
keeps working. ffz() could be used to avoid static checks, but it would be
runtime overhead for a compile-time constant.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mm.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mm.h 2005-10-30 13:35:31.000000000 +0000
@@ -529,6 +529,12 @@ static inline void set_page_links(struct
extern struct page *mem_map;
#endif
+/*
+ * Return what type of page this 2^(MAX_ORDER-1) block of pages is being
+ * used for. Return value is one of the RCLM_X types
+ */
+extern int get_pageblock_type(struct zone *zone, struct page *page);
+
static inline void *lowmem_page_address(struct page *page)
{
return __va(page_to_pfn(page) << PAGE_SHIFT);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/include/linux/mmzone.h 2005-10-30 13:20:05.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h 2005-10-30 13:35:31.000000000 +0000
@@ -21,6 +21,17 @@
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
+#define PAGES_PER_MAXORDER (1 << (MAX_ORDER-1))
+
+/*
+ * The two bit field __GFP_RECLAIMBITS enumerates the following types of
+ * page reclaimability.
+ */
+#define RCLM_NORCLM 0
+#define RCLM_EASY 1
+#define RCLM_KERN 2
+#define RCLM_TYPES 3
+#define BITS_PER_RCLM_TYPE 2
struct free_area {
struct list_head free_list;
@@ -146,6 +157,13 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];
+#ifndef CONFIG_SPARSEMEM
+ /*
+ * The map tracks what each 2^MAX_ORDER-1 sized block is being used for.
+ * Each PAGES_PER_MAXORDER block of pages use BITS_PER_RCLM_TYPE bits
+ */
+ unsigned long *free_area_usemap;
+#endif
ZONE_PADDING(_pad1_)
@@ -501,9 +519,14 @@ extern struct pglist_data contig_page_da
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION-1))
+#define FREE_AREA_BITS 64
+
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
+#if ((SECTION_SIZE_BITS - MAX_ORDER) * BITS_PER_RCLM_TYPE) > FREE_AREA_BITS
+#error free_area_usemap is not big enough
+#endif
struct page;
struct mem_section {
@@ -516,6 +539,7 @@ struct mem_section {
* before using it wrong.
*/
unsigned long section_mem_map;
+ DECLARE_BITMAP(free_area_usemap, FREE_AREA_BITS);
};
#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -584,6 +608,18 @@ static inline struct mem_section *__pfn_
return __nr_to_section(pfn_to_section_nr(pfn));
}
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+ unsigned long pfn)
+{
+ return &__pfn_to_section(pfn)->free_area_usemap[0];
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+ pfn &= (PAGES_PER_SECTION-1);
+ return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
+
#define pfn_to_page(pfn) \
({ \
unsigned long __pfn = (pfn); \
@@ -621,6 +657,17 @@ void sparse_init(void);
#else
#define sparse_init() do {} while (0)
#define sparse_index_init(_sec, _nid) do {} while (0)
+static inline unsigned long *pfn_to_usemap(struct zone *zone,
+ unsigned long pfn)
+{
+ return zone->free_area_usemap;
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+ pfn = pfn - zone->zone_start_pfn;
+ return (pfn >> (MAX_ORDER-1)) * BITS_PER_RCLM_TYPE;
+}
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-001_antidefrag_flags/mm/page_alloc.c 2005-10-30 13:20:06.000000000 +0000
+++ linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c 2005-10-30 13:35:31.000000000 +0000
@@ -69,6 +69,99 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
EXPORT_SYMBOL(totalram_pages);
/*
+ * RCLM_SHIFT is the number of bits that a gfp_mask has to be shifted right
+ * to have just the __GFP_EASYRCLM and __GFP_KERNRCLM bits. The static check
+ * is made afterwards in case the GFP flags are not updated without updating
+ * this number
+ */
+#define RCLM_SHIFT 19
+#if (__GFP_EASYRCLM >> RCLM_SHIFT) != RCLM_EASY
+#error __GFP_EASYRCLM not mapping to RCLM_EASY
+#endif
+#if (__GFP_KERNRCLM >> RCLM_SHIFT) != RCLM_KERN
+#error __GFP_KERNRCLM not mapping to RCLM_KERN
+#endif
+
+/*
+ * This function maps gfpflags to their RCLM_TYPE. It makes assumptions
+ * on the location of the GFP flags.
+ */
+static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
+{
+ unsigned long rclmbits = gfp_flags & __GFP_RCLM_BITS;
+
+ /* Specifying both RCLM flags makes no sense */
+ if (unlikely(rclmbits == __GFP_RCLM_BITS)) {
+ printk(KERN_WARNING "Multiple RCLM GFP flags specified\n");
+ dump_stack();
+ return RCLM_TYPES;
+ }
+
+ return rclmbits >> RCLM_SHIFT;
+}
+
+/*
+ * copy_bits - Copy bits between bitmaps
+ * @dstaddr: The destination bitmap to copy to
+ * @srcaddr: The source bitmap to copy from
+ * @sindex_dst: The start bit index within the destination map to copy to
+ * @sindex_src: The start bit index within the source map to copy from
+ * @nr: The number of bits to copy
+ *
+ * Note that this method is slow and makes no guarantees for atomicity.
+ * It depends on being called with the zone spinlock held to ensure data
+ * safety
+ */
+static inline void copy_bits(unsigned long *dstaddr,
+ unsigned long *srcaddr,
+ int sindex_dst,
+ int sindex_src,
+ int nr)
+{
+ /*
+ * Written like this to take advantage of arch-specific
+ * set_bit() and clear_bit() functions
+ */
+ for (nr = nr - 1; nr >= 0; nr--) {
+ int bit = test_bit(sindex_src + nr, srcaddr);
+ if (bit)
+ set_bit(sindex_dst + nr, dstaddr);
+ else
+ clear_bit(sindex_dst + nr, dstaddr);
+ }
+}
+
+int get_pageblock_type(struct zone *zone, struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long type = 0;
+ unsigned long *usemap;
+ int bitidx;
+
+ bitidx = pfn_to_bitidx(zone, pfn);
+ usemap = pfn_to_usemap(zone, pfn);
+
+ copy_bits(&type, usemap, 0, bitidx, BITS_PER_RCLM_TYPE);
+
+ return type;
+}
+
+/* Reserve a block of pages for an allocation type */
+static inline void set_pageblock_type(struct zone *zone, struct page *page,
+ int type)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long *usemap;
+ unsigned long ltype = type;
+ int bitidx;
+
+ bitidx = pfn_to_bitidx(zone, pfn);
+ usemap = pfn_to_usemap(zone, pfn);
+
+ copy_bits(usemap, <ype, bitidx, 0, BITS_PER_RCLM_TYPE);
+}
+
+/*
* Used by page_zone() to look up the address of the struct zone whose
* id is encoded in the upper bits of page->flags
*/
@@ -498,7 +591,8 @@ static void prep_new_page(struct page *p
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+ int alloctype)
{
struct free_area * area;
unsigned int current_order;
@@ -514,6 +608,14 @@ static struct page *__rmqueue(struct zon
rmv_page_order(page);
area->nr_free--;
zone->free_pages -= 1UL << order;
+
+ /*
+ * If splitting a large block, record what the block is being
+ * used for in the usemap
+ */
+ if (current_order == MAX_ORDER-1)
+ set_pageblock_type(zone, page, alloctype);
+
return expand(zone, page, order, current_order, area);
}
@@ -526,7 +628,8 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int alloctype)
{
unsigned long flags;
int i;
@@ -535,7 +638,7 @@ static int rmqueue_bulk(struct zone *zon
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, alloctype);
if (page == NULL)
break;
allocated++;
@@ -719,6 +822,11 @@ buffered_rmqueue(struct zone *zone, int
unsigned long flags;
struct page *page = NULL;
int cold = !!(gfp_flags & __GFP_COLD);
+ int alloctype = gfpflags_to_rclmtype(gfp_flags);
+
+ /* If the alloctype is RCLM_TYPES, the gfp_flags make no sense */
+ if (alloctype == RCLM_TYPES)
+ return NULL;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -727,7 +835,8 @@ buffered_rmqueue(struct zone *zone, int
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list,
+ alloctype);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -739,7 +848,7 @@ buffered_rmqueue(struct zone *zone, int
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, alloctype);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -1866,6 +1975,38 @@ inline void setup_pageset(struct per_cpu
INIT_LIST_HEAD(&pcp->list);
}
+#ifndef CONFIG_SPARSEMEM
+#define roundup(x, y) ((((x)+((y)-1))/(y))*(y))
+/*
+ * Calculate the size of the zone->usemap in bytes rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 RCLM_TYPE worth of bits per MAX_ORDER-1, finally round up
+ * what is now in bits to nearest long in bits, then return it in bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+ unsigned long usemapsize;
+
+ usemapsize = roundup(zonesize, PAGES_PER_MAXORDER);
+ usemapsize = usemapsize >> (MAX_ORDER-1);
+ usemapsize *= BITS_PER_RCLM_TYPE;
+ usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+ return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize)
+{
+ unsigned long usemapsize = usemap_size(zonesize);
+ zone->free_area_usemap = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->free_area_usemap, RCLM_NORCLM, usemapsize);
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
#ifdef CONFIG_NUMA
/*
* Boot pageset table. One per cpu which is going to be used for all
@@ -2079,6 +2220,7 @@ static void __init free_area_init_core(s
zonetable_add(zone, nid, j, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
zone_start_pfn += size;
+ setup_usemap(pgdat, zone, size);
}
}
The freelists for each allocation type can slowly become corrupted due to
the per-cpu list. Consider what happens when the following happens
1. A 2^(MAX_ORDER-1) list is reserved for __GFP_EASYRCLM pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation
Now, a kernel page is in the middle of a __GFP_EASYRCLM page. This means
that over long periods of the time, the anti-fragmentation scheme slowly
degrades to the standard allocator.
This patch divides the per-cpu lists into Kernel and User lists. RCLM_NORCLM
and RCLM_KERN use the Kernel list and RCLM_EASY uses the user list. Strictly
speaking, there should be three lists but as little effort is made to reclaim
RCLM_KERN pages, it is not worth the overhead *yet*.
Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-006_percpu/include/linux/mmzone.h 2005-10-30 13:38:14.000000000 +0000
@@ -60,12 +60,21 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif
+/*
+ * Indices into pcpu_list
+ * PCPU_KERNEL: For RCLM_NORCLM and RCLM_KERN allocations
+ * PCPU_EASY: For RCLM_EASY allocations
+ */
+#define PCPU_KERNEL 0
+#define PCPU_EASY 1
+#define PCPU_TYPES 2
+
struct per_cpu_pages {
- int count; /* number of pages in the list */
+ int count[PCPU_TYPES]; /* Number of pages on each list */
int low; /* low watermark, refill needed */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
- struct list_head list; /* the list of pages */
+ struct list_head list[PCPU_TYPES]; /* the lists of pages */
};
struct per_cpu_pageset {
@@ -80,6 +89,10 @@ struct per_cpu_pageset {
#endif
} ____cacheline_aligned_in_smp;
+/* Helpers for per_cpu_pages */
+#define pset_count(pset) (pset.count[PCPU_KERNEL] + pset.count[PCPU_EASY])
+#define for_each_pcputype(pindex) \
+ for (pindex = 0; pindex < PCPU_TYPES; pindex++)
#ifdef CONFIG_NUMA
#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
#else
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000
+++ linux-2.6.14-rc5-mm1-006_percpu/mm/page_alloc.c 2005-10-30 13:38:14.000000000 +0000
@@ -792,7 +792,7 @@ static int rmqueue_bulk(struct zone *zon
void drain_remote_pages(void)
{
struct zone *zone;
- int i;
+ int i, pindex;
unsigned long flags;
local_irq_save(flags);
@@ -808,9 +808,16 @@ void drain_remote_pages(void)
struct per_cpu_pages *pcp;
pcp = &pset->pcp[i];
- if (pcp->count)
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
+ for_each_pcputype(pindex) {
+ if (!pcp->count[pindex])
+ continue;
+
+ /* Try remove all pages from the pcpu list */
+ pcp->count[pindex] -=
+ free_pages_bulk(zone,
+ pcp->count[pindex],
+ &pcp->list[pindex], 0);
+ }
}
}
local_irq_restore(flags);
@@ -821,7 +828,7 @@ void drain_remote_pages(void)
static void __drain_pages(unsigned int cpu)
{
struct zone *zone;
- int i;
+ int i, pindex;
for_each_zone(zone) {
struct per_cpu_pageset *pset;
@@ -831,8 +838,16 @@ static void __drain_pages(unsigned int c
struct per_cpu_pages *pcp;
pcp = &pset->pcp[i];
- pcp->count -= free_pages_bulk(zone, pcp->count,
- &pcp->list, 0);
+ for_each_pcputype(pindex) {
+ if (!pcp->count[pindex])
+ continue;
+
+ /* Try remove all pages from the pcpu list */
+ pcp->count[pindex] -=
+ free_pages_bulk(zone,
+ pcp->count[pindex],
+ &pcp->list[pindex], 0);
+ }
}
}
}
@@ -911,6 +926,7 @@ static void fastcall free_hot_cold_page(
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ int pindex;
arch_free_page(page, 0);
@@ -920,11 +936,21 @@ static void fastcall free_hot_cold_page(
page->mapping = NULL;
free_pages_check(__FUNCTION__, page);
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
+
+ /*
+ * Strictly speaking, we should not be accessing the zone information
+ * here. In this case, it does not matter if the read is incorrect
+ */
+ if (get_pageblock_type(zone, page) == RCLM_EASY)
+ pindex = PCPU_EASY;
+ else
+ pindex = PCPU_KERNEL;
local_irq_save(flags);
- list_add(&page->lru, &pcp->list);
- pcp->count++;
- if (pcp->count >= pcp->high)
- pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
+ list_add(&page->lru, &pcp->list[pindex]);
+ pcp->count[pindex]++;
+ if (pcp->count[pindex] >= pcp->high)
+ pcp->count[pindex] -= free_pages_bulk(zone, pcp->batch,
+ &pcp->list[pindex], 0);
local_irq_restore(flags);
put_cpu();
}
@@ -967,17 +993,23 @@ buffered_rmqueue(struct zone *zone, int
if (order == 0) {
struct per_cpu_pages *pcp;
+ int pindex = PCPU_KERNEL;
+ if (alloctype == RCLM_EASY)
+ pindex = PCPU_EASY;
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
local_irq_save(flags);
- if (pcp->count <= pcp->low)
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list,
- alloctype);
- if (pcp->count) {
- page = list_entry(pcp->list.next, struct page, lru);
+ if (pcp->count[pindex] <= pcp->low)
+ pcp->count[pindex] += rmqueue_bulk(zone,
+ 0, pcp->batch,
+ &(pcp->list[pindex]),
+ alloctype);
+
+ if (pcp->count[pindex]) {
+ page = list_entry(pcp->list[pindex].next,
+ struct page, lru);
list_del(&page->lru);
- pcp->count--;
+ pcp->count[pindex]--;
}
local_irq_restore(flags);
put_cpu();
@@ -1678,7 +1710,7 @@ void show_free_areas(void)
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch,
- pageset->pcp[temperature].count);
+ pset_count(pageset->pcp[temperature]));
}
}
@@ -2135,18 +2167,22 @@ inline void setup_pageset(struct per_cpu
struct per_cpu_pages *pcp;
pcp = &p->pcp[0]; /* hot */
- pcp->count = 0;
+ pcp->count[PCPU_KERNEL] = 0;
+ pcp->count[PCPU_EASY] = 0;
pcp->low = 0;
- pcp->high = 6 * batch;
+ pcp->high = 3 * batch;
pcp->batch = max(1UL, 1 * batch);
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]);
+ INIT_LIST_HEAD(&pcp->list[PCPU_EASY]);
pcp = &p->pcp[1]; /* cold*/
- pcp->count = 0;
+ pcp->count[PCPU_KERNEL] = 0;
+ pcp->count[PCPU_EASY] = 0;
pcp->low = 0;
- pcp->high = 2 * batch;
+ pcp->high = batch;
pcp->batch = max(1UL, batch/2);
- INIT_LIST_HEAD(&pcp->list);
+ INIT_LIST_HEAD(&pcp->list[PCPU_KERNEL]);
+ INIT_LIST_HEAD(&pcp->list[PCPU_EASY]);
}
#ifndef CONFIG_SPARSEMEM
@@ -2574,7 +2610,7 @@ static int zoneinfo_show(struct seq_file
pageset = zone_pcp(zone, i);
for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
- if (pageset->pcp[j].count)
+ if (pset_count(pageset->pcp[j]))
break;
}
if (j == ARRAY_SIZE(pageset->pcp))
@@ -2587,7 +2623,7 @@ static int zoneinfo_show(struct seq_file
"\n high: %i"
"\n batch: %i",
i, j,
- pageset->pcp[j].count,
+ pset_count(pageset->pcp[j]),
pageset->pcp[j].low,
pageset->pcp[j].high,
pageset->pcp[j].batch);
This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. The zone->free_area
list is broken into three free lists for each RCLM_TYPE.
This section of the patch looks superflous but it is to surpress a compiler
warning. Suggestions to make this better looking are welcome.
- struct free_area * area;
+ struct free_area * area = NULL;
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-002_usemap/include/linux/mmzone.h 2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h 2005-10-30 13:36:16.000000000 +0000
@@ -33,6 +33,10 @@
#define RCLM_TYPES 3
#define BITS_PER_RCLM_TYPE 2
+#define for_each_rclmtype_order(type, order) \
+ for (order = 0; order < MAX_ORDER; order++) \
+ for (type = 0; type < RCLM_TYPES; type++)
+
struct free_area {
struct list_head free_list;
unsigned long nr_free;
@@ -155,7 +159,6 @@ struct zone {
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
- struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
@@ -165,6 +168,8 @@ struct zone {
unsigned long *free_area_usemap;
#endif
+ struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-002_usemap/mm/page_alloc.c 2005-10-30 13:35:31.000000000 +0000
+++ linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c 2005-10-30 13:36:16.000000000 +0000
@@ -352,6 +352,15 @@ __find_combined_index(unsigned long page
}
/*
+ * Return the free list for a given page within a zone
+ */
+static inline struct free_area *__page_find_freelist(struct zone *zone,
+ struct page *page)
+{
+ return zone->free_area_lists[get_pageblock_type(zone, page)];
+}
+
+/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is free &&
@@ -398,6 +407,8 @@ static inline void __free_pages_bulk (st
{
unsigned long page_idx;
int order_size = 1 << order;
+ struct free_area *area;
+ struct free_area *freelist;
if (unlikely(order))
destroy_compound_page(page, order);
@@ -407,10 +418,11 @@ static inline void __free_pages_bulk (st
BUG_ON(page_idx & (order_size - 1));
BUG_ON(bad_range(zone, page));
+ freelist = __page_find_freelist(zone, page);
+
zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
- struct free_area *area;
struct page *buddy;
combined_idx = __find_combined_index(page_idx, order);
@@ -421,7 +433,7 @@ static inline void __free_pages_bulk (st
if (!page_is_buddy(buddy, order))
break; /* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = &freelist[order];
area->nr_free--;
rmv_page_order(buddy);
page = page + (combined_idx - page_idx);
@@ -429,8 +441,8 @@ static inline void __free_pages_bulk (st
order++;
}
set_page_order(page, order);
- list_add(&page->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ list_add_tail(&page->lru, &freelist[order].free_list);
+ freelist[order].nr_free++;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -587,6 +599,45 @@ static void prep_new_page(struct page *p
kernel_map_pages(page, 1 << order, 1);
}
+/*
+ * Find a list that has a 2^MAX_ORDER-1 block of pages available and
+ * return it
+ */
+struct page *steal_maxorder_block(struct zone *zone, int alloctype)
+{
+ struct page *page;
+ struct free_area *area = NULL;
+ int i;
+
+ for(i = 0; i < RCLM_TYPES; i++) {
+ if (i == alloctype)
+ continue;
+
+ area = &zone->free_area_lists[i][MAX_ORDER-1];
+ if (!list_empty(&area->free_list))
+ break;
+ }
+ if (i == RCLM_TYPES)
+ return NULL;
+
+ page = list_entry(area->free_list.next, struct page, lru);
+ area->nr_free--;
+
+ set_pageblock_type(zone, page, alloctype);
+
+ return page;
+}
+
+static inline struct page *
+remove_page(struct zone *zone, struct page *page, unsigned int order,
+ unsigned int current_order, struct free_area *area)
+{
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_pages -= 1UL << order;
+ return expand(zone, page, order, current_order, area);
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
@@ -594,31 +645,25 @@ static void prep_new_page(struct page *p
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int alloctype)
{
- struct free_area * area;
+ struct free_area * area = NULL;
unsigned int current_order;
struct page *page;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = &zone->free_area_lists[alloctype][current_order];
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
area->nr_free--;
- zone->free_pages -= 1UL << order;
-
- /*
- * If splitting a large block, record what the block is being
- * used for in the usemap
- */
- if (current_order == MAX_ORDER-1)
- set_pageblock_type(zone, page, alloctype);
-
- return expand(zone, page, order, current_order, area);
+ return remove_page(zone, page, order, current_order, area);
}
+ /* Allocate a MAX_ORDER block */
+ page = steal_maxorder_block(zone, alloctype);
+ if (page != NULL)
+ return remove_page(zone, page, order, MAX_ORDER-1, area);
+
return NULL;
}
@@ -704,9 +749,9 @@ static void __drain_pages(unsigned int c
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
- int order;
+ int order, t;
+ unsigned long start_pfn, i;
struct list_head *curr;
-
if (!zone->spanned_pages)
return;
@@ -714,14 +759,12 @@ void mark_free_pages(struct zone *zone)
for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
- unsigned long start_pfn, i;
-
+ for_each_rclmtype_order(t, order) {
+ list_for_each(curr,&zone->free_area_lists[t][order].free_list) {
start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
-
for (i=0; i < (1<<order); i++)
SetPageNosaveFree(pfn_to_page(start_pfn+i));
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -876,6 +919,7 @@ int zone_watermark_ok(struct zone *z, in
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
int o;
+ struct free_area *kernnorclm, *kernrclm, *easyrclm;
if (gfp_high)
min -= min / 2;
@@ -884,15 +928,22 @@ int zone_watermark_ok(struct zone *z, in
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
goto out_failed;
+ kernnorclm = z->free_area_lists[RCLM_NORCLM];
+ easyrclm = z->free_area_lists[RCLM_EASY];
+ kernrclm = z->free_area_lists[RCLM_KERN];
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (kernnorclm->nr_free + kernrclm->nr_free +
+ easyrclm->nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
if (free_pages <= min)
goto out_failed;
+ kernnorclm++;
+ easyrclm++;
+ kernrclm++;
}
return 1;
@@ -1496,6 +1547,7 @@ void show_free_areas(void)
unsigned long inactive;
unsigned long free;
struct zone *zone;
+ int type;
for_each_zone(zone) {
show_node(zone);
@@ -1575,7 +1627,9 @@ void show_free_areas(void)
}
for_each_zone(zone) {
- unsigned long nr, flags, order, total = 0;
+ unsigned long nr = 0;
+ unsigned long total = 0;
+ unsigned long flags,order;
show_node(zone);
printk("%s: ", zone->name);
@@ -1585,10 +1639,18 @@ void show_free_areas(void)
}
spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
+ for_each_rclmtype_order(type, order) {
+ nr += zone->free_area_lists[type][order].nr_free;
total += nr << order;
- printk("%lu*%lukB ", nr, K(1UL) << order);
+
+ /*
+ * If type had reached RCLM_TYPE, the free pages
+ * for this order have been summed up
+ */
+ if (type == RCLM_TYPES-1) {
+ printk("%lu*%lukB ", nr, K(1UL) << order);
+ nr = 0;
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
printk("= %lukB\n", K(total));
@@ -1899,9 +1961,14 @@ void zone_init_free_lists(struct pglist_
unsigned long size)
{
int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ int type;
+ struct free_area *area;
+
+ /* Initialse the three size ordered lists of free_areas */
+ for_each_rclmtype_order(type, order) {
+ area = &(zone->free_area_lists[type][order]);
+ INIT_LIST_HEAD(&area->free_list);
+ area->nr_free = 0;
}
}
@@ -2314,16 +2381,26 @@ static int frag_show(struct seq_file *m,
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
- int order;
+ int order, t;
+ struct free_area *area;
+ unsigned long nr_bufs = 0;
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
continue;
spin_lock_irqsave(&zone->lock, flags);
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ for_each_rclmtype_order(t, order) {
+ area = &(zone->free_area_lists[t][order]);
+ nr_bufs += area->nr_free;
+
+ if (t == RCLM_TYPES-1) {
+ seq_printf(m, "%6lu ", nr_bufs);
+ nr_bufs = 0;
+ }
+ }
+
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
This patch implements fallback logic. In the event there is no 2^(MAX_ORDER-1)
blocks of pages left, this will help the system decide what list to use. The
highlights of the patch are;
o Define a RCLM_FALLBACK type for fallbacks
o Use a percentage of each zone for fallbacks. When a reserved pool of pages
is depleted, it will try and use RCLM_FALLBACK before using anything else.
This greatly reduces the amount of fallbacks causing fragmentation without
needing complex balancing algorithms
o Add a fallback_reserve that records how much of the zone is currently used
for allocations falling back to RCLM_FALLBACK
o Adds a fallback_allocs[] array that determines the order of freelists are
used for each allocation type
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h
--- linux-2.6.14-rc5-mm1-003_fragcore/include/linux/mmzone.h 2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/include/linux/mmzone.h 2005-10-30 13:36:56.000000000 +0000
@@ -30,7 +30,8 @@
#define RCLM_NORCLM 0
#define RCLM_EASY 1
#define RCLM_KERN 2
-#define RCLM_TYPES 3
+#define RCLM_FALLBACK 3
+#define RCLM_TYPES 4
#define BITS_PER_RCLM_TYPE 2
#define for_each_rclmtype_order(type, order) \
@@ -168,8 +169,17 @@ struct zone {
unsigned long *free_area_usemap;
#endif
+ /*
+ * With allocation fallbacks, the nr_free count for each RCLM_TYPE must
+ * be added together to get the correct count of free pages for a given
+ * order. Individually, the nr_free count in a free_area may not match
+ * the number of pages in the free_list.
+ */
struct free_area free_area_lists[RCLM_TYPES][MAX_ORDER];
+ /* Number of pages currently used for RCLM_FALLBACK */
+ unsigned long fallback_reserve;
+
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
@@ -292,6 +302,17 @@ struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
};
+static inline void inc_reserve_count(struct zone *zone, int type)
+{
+ if (type == RCLM_FALLBACK)
+ zone->fallback_reserve += PAGES_PER_MAXORDER;
+}
+
+static inline void dec_reserve_count(struct zone *zone, int type)
+{
+ if (type == RCLM_FALLBACK && zone->fallback_reserve)
+ zone->fallback_reserve -= PAGES_PER_MAXORDER;
+}
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-003_fragcore/mm/page_alloc.c 2005-10-30 13:36:16.000000000 +0000
+++ linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c 2005-10-30 13:36:56.000000000 +0000
@@ -54,6 +54,22 @@ unsigned long totalhigh_pages __read_mos
long nr_swap_pages;
/*
+ * fallback_allocs contains the fallback types for low memory conditions
+ * where the preferred alloction type if not available.
+ */
+int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
+ {RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN, RCLM_EASY, RCLM_TYPES},
+ {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
+ {RCLM_KERN, RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
+};
+
+/* Returns 1 if the needed percentage of the zone is reserved for fallbacks */
+static inline int min_fallback_reserved(struct zone *zone)
+{
+ return zone->fallback_reserve >= zone->present_pages >> 3;
+}
+
+/*
* results with 256, 32 in the lowmem_reserve sysctl:
* 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
* 1G machine -> (16M dma, 784M normal, 224M high)
@@ -623,7 +639,12 @@ struct page *steal_maxorder_block(struct
page = list_entry(area->free_list.next, struct page, lru);
area->nr_free--;
+ if (!min_fallback_reserved(zone))
+ alloctype = RCLM_FALLBACK;
+
set_pageblock_type(zone, page, alloctype);
+ dec_reserve_count(zone, i);
+ inc_reserve_count(zone, alloctype);
return page;
}
@@ -638,6 +659,78 @@ remove_page(struct zone *zone, struct pa
return expand(zone, page, order, current_order, area);
}
+/*
+ * If we are falling back, and the allocation is KERNNORCLM,
+ * then reserve any buddies for the KERNNORCLM pool. These
+ * allocations fragment the worst so this helps keep them
+ * in the one place
+ */
+static inline struct free_area *
+fallback_buddy_reserve(int start_alloctype, struct zone *zone,
+ unsigned int current_order, struct page *page,
+ struct free_area *area)
+{
+ if (start_alloctype != RCLM_NORCLM)
+ return area;
+
+ area = &zone->free_area_lists[RCLM_NORCLM][current_order];
+
+ /* Reserve the whole block if this is a large split */
+ if (current_order >= MAX_ORDER / 2) {
+ int reserve_type = RCLM_NORCLM;
+ if (!min_fallback_reserved(zone))
+ reserve_type = RCLM_FALLBACK;
+
+ dec_reserve_count(zone, get_pageblock_type(zone,page));
+ set_pageblock_type(zone, page, reserve_type);
+ inc_reserve_count(zone, reserve_type);
+ }
+ return area;
+}
+
+static struct page *
+fallback_alloc(int alloctype, struct zone *zone, unsigned int order)
+{
+ int *fallback_list;
+ int start_alloctype = alloctype;
+ struct free_area *area;
+ unsigned int current_order;
+ struct page *page;
+ int i;
+
+ /* Ok, pick the fallback order based on the type */
+ BUG_ON(alloctype >= RCLM_TYPES);
+ fallback_list = fallback_allocs[alloctype];
+
+ /*
+ * Here, the alloc type lists has been depleted as well as the global
+ * pool, so fallback. When falling back, the largest possible block
+ * will be taken to keep the fallbacks clustered if possible
+ */
+ for (i = 0; fallback_list[i] != RCLM_TYPES; i++) {
+ alloctype = fallback_list[i];
+
+ /* Find a block to allocate */
+ area = &zone->free_area_lists[alloctype][MAX_ORDER-1];
+ for (current_order = MAX_ORDER - 1; current_order > order;
+ current_order--, area--) {
+ if (list_empty(&area->free_list))
+ continue;
+
+ page = list_entry(area->free_list.next,
+ struct page, lru);
+ area->nr_free--;
+ area = fallback_buddy_reserve(start_alloctype, zone,
+ current_order, page, area);
+ return remove_page(zone, page, order,
+ current_order, area);
+
+ }
+ }
+
+ return NULL;
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
@@ -664,7 +757,8 @@ static struct page *__rmqueue(struct zon
if (page != NULL)
return remove_page(zone, page, order, MAX_ORDER-1, area);
- return NULL;
+ /* Try falling back */
+ return fallback_alloc(alloctype, zone, order);
}
/*
@@ -2270,6 +2364,7 @@ static void __init free_area_init_core(s
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->fallback_reserve = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
Fragmentation avoidance patches increase our chances of satisfying high
order allocations. So this patch takes more than one iteration at trying
to fulfill those allocations because, unlike before, the extra iterations
are often useful.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c
--- linux-2.6.14-rc5-mm1-004_fallback/mm/page_alloc.c 2005-10-30 13:36:56.000000000 +0000
+++ linux-2.6.14-rc5-mm1-005_largealloc_tryharder/mm/page_alloc.c 2005-10-30 13:37:34.000000000 +0000
@@ -1127,6 +1127,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
int do_retry;
int can_try_harder;
int did_some_progress;
+ int highorder_retry = 3;
might_sleep_if(wait);
@@ -1275,7 +1276,17 @@ rebalance:
goto got_pg;
}
- out_of_memory(gfp_mask, order);
+ if (order < MAX_ORDER / 2)
+ out_of_memory(gfp_mask, order);
+
+ /*
+ * Due to low fragmentation efforts, we try a little
+ * harder to satisfy high order allocations and only
+ * go OOM for low-order allocations
+ */
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ goto rebalance;
+
goto restart;
}
@@ -1292,6 +1303,8 @@ rebalance:
do_retry = 1;
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
+ if (order >= MAX_ORDER/2 && --highorder_retry > 0)
+ do_retry = 1;
}
if (do_retry) {
blk_congestion_wait(WRITE, HZ/50);
On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
> Here are a few brief reasons why this set of patches is useful;
>
> o Reduced fragmentation improves the chance a large order allocation succeeds
> o General-purpose memory hotplug needs the page/memory groupings provided
> o Reduces the number of badly-placed pages that page migration mechanism must
> deal with. This also applies to any active page defragmentation mechanism.
I can say that this patch set makes hotplug memory remove be of
value on ppc64. My system has 6GB of memory and I would 'load
it up' to the point where it would just start to swap and let it
run for an hour. Without these patches, it was almost impossible
to find a section that could be offlined. With the patches, I
can consistently reduce memory to somewhere between 512MB and 1GB.
Of course, results will vary based on workload. Also, this is
most advantageous for memory hotlug on ppc64 due to relatively
small section size (16MB) as compared to the page grouping size
(8MB). A more general purpose solution is needed for memory hotplug
support on architectures with larger section sizes.
Just another data point,
--
Mike
Mike Kravetz wrote:
> On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
>
>>Here are a few brief reasons why this set of patches is useful;
>>
>>o Reduced fragmentation improves the chance a large order allocation succeeds
>>o General-purpose memory hotplug needs the page/memory groupings provided
>>o Reduces the number of badly-placed pages that page migration mechanism must
>> deal with. This also applies to any active page defragmentation mechanism.
>
>
> I can say that this patch set makes hotplug memory remove be of
> value on ppc64. My system has 6GB of memory and I would 'load
> it up' to the point where it would just start to swap and let it
> run for an hour. Without these patches, it was almost impossible
> to find a section that could be offlined. With the patches, I
> can consistently reduce memory to somewhere between 512MB and 1GB.
> Of course, results will vary based on workload. Also, this is
> most advantageous for memory hotlug on ppc64 due to relatively
> small section size (16MB) as compared to the page grouping size
> (8MB). A more general purpose solution is needed for memory hotplug
> support on architectures with larger section sizes.
>
> Just another data point,
Despite what people were trying to tell me at Ottawa, this patch
set really does add quite a lot of complexity to the page
allocator, and it seems to be increasingly only of benefit to
dynamically allocating hugepages and memory hot unplug.
If that is the case, do we really want to make such sacrifices
for the huge machines that want these things? What about just
making an extra zone for easy-to-reclaim things to live in?
This could possibly even be resized at runtime according to
demand with the memory hotplug stuff (though I haven't been
following that).
Don't take this as criticism of the actual implementation or its
effectiveness.
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Mike Kravetz wrote:
> > On Sun, Oct 30, 2005 at 06:33:55PM +0000, Mel Gorman wrote:
> >
> >>Here are a few brief reasons why this set of patches is useful;
> >>
> >>o Reduced fragmentation improves the chance a large order allocation succeeds
> >>o General-purpose memory hotplug needs the page/memory groupings provided
> >>o Reduces the number of badly-placed pages that page migration mechanism must
> >> deal with. This also applies to any active page defragmentation mechanism.
> >
> >
> > I can say that this patch set makes hotplug memory remove be of
> > value on ppc64. My system has 6GB of memory and I would 'load
> > it up' to the point where it would just start to swap and let it
> > run for an hour. Without these patches, it was almost impossible
> > to find a section that could be offlined. With the patches, I
> > can consistently reduce memory to somewhere between 512MB and 1GB.
> > Of course, results will vary based on workload. Also, this is
> > most advantageous for memory hotlug on ppc64 due to relatively
> > small section size (16MB) as compared to the page grouping size
> > (8MB). A more general purpose solution is needed for memory hotplug
> > support on architectures with larger section sizes.
> >
> > Just another data point,
>
> Despite what people were trying to tell me at Ottawa, this patch
> set really does add quite a lot of complexity to the page
> allocator, and it seems to be increasingly only of benefit to
> dynamically allocating hugepages and memory hot unplug.
Remember that Rohit is seeing ~10% variation between runs of scientific
software, and that his patch to use higher-order pages to preload the
percpu-pages magazines fixed that up. I assume this means that it provided
up to 10% speedup, which is a lot.
But the patch caused page allocator fragmentation and several reports of
gigE Tx buffer allocation failures, so I dropped it.
We think that Mel's patches will allow us to reintroduce Rohit's
optimisation.
> If that is the case, do we really want to make such sacrifices
> for the huge machines that want these things? What about just
> making an extra zone for easy-to-reclaim things to live in?
>
> This could possibly even be resized at runtime according to
> demand with the memory hotplug stuff (though I haven't been
> following that).
>
> Don't take this as criticism of the actual implementation or its
> effectiveness.
>
But yes, adding additional complexity is a black mark, and these patches
add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>>Despite what people were trying to tell me at Ottawa, this patch
>>set really does add quite a lot of complexity to the page
>>allocator, and it seems to be increasingly only of benefit to
>>dynamically allocating hugepages and memory hot unplug.
>
>
> Remember that Rohit is seeing ~10% variation between runs of scientific
> software, and that his patch to use higher-order pages to preload the
> percpu-pages magazines fixed that up. I assume this means that it provided
> up to 10% speedup, which is a lot.
>
OK, I wasn't aware of this. I wonder what other approaches we could
try to add a bit of colour to our pages? I bet something simple like
trying to hand out alternate odd/even pages per task might help.
> But the patch caused page allocator fragmentation and several reports of
> gigE Tx buffer allocation failures, so I dropped it.
>
> We think that Mel's patches will allow us to reintroduce Rohit's
> optimisation.
>
>
>>If that is the case, do we really want to make such sacrifices
>>for the huge machines that want these things? What about just
>>making an extra zone for easy-to-reclaim things to live in?
>>
>>This could possibly even be resized at runtime according to
>>demand with the memory hotplug stuff (though I haven't been
>>following that).
>>
>>Don't take this as criticism of the actual implementation or its
>>effectiveness.
>>
>
>
> But yes, adding additional complexity is a black mark, and these patches
> add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
>
They do look quite fine. They seem to get their claws pretty deep
into page reclaim, but I guess that is to be expected if we want
to increase readahead smarts much more.
However, I'm hoping bits of that can be merged at a time, and
interfaces and page reclaim stuff can be discussed and the best
option taken. No such luck with these patches AFAIKS - simply
adding another level of page groups, and another level of
heuristics to the page allocator is going to hurt. By definition.
I do wonder why zones can't be used... though I'm sure there are
good reasons.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Mon, 31 Oct 2005, Nick Piggin wrote:
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
>
> > > Despite what people were trying to tell me at Ottawa, this patch
> > > set really does add quite a lot of complexity to the page
> > > allocator, and it seems to be increasingly only of benefit to
> > > dynamically allocating hugepages and memory hot unplug.
> >
> >
> > Remember that Rohit is seeing ~10% variation between runs of scientific
> > software, and that his patch to use higher-order pages to preload the
> > percpu-pages magazines fixed that up. I assume this means that it provided
> > up to 10% speedup, which is a lot.
> >
>
> OK, I wasn't aware of this. I wonder what other approaches we could
> try to add a bit of colour to our pages? I bet something simple like
> trying to hand out alternate odd/even pages per task might help.
>
Reading through the kernel archives, it appears that any page colouring
scheme was getting rejected because it slowed up workloads like kernel
compilers that were not very cache sensitive. Where an approach didn't
suffer from that problem, there was disagreement over whether there was a
general performance improvement or not.
I recall Rohit's patch from an earlier -mm. Without knowing anything about
his test, I am guessing he is getting cheap page colouring by preloading
the per-cpu cache with contiguous pages and his workload is faulting in
the batch of pages immediately by doing something like linearly reading a
large array. Hence, the mappings of his workload are getting the right
colour pages. This makes his workload a "lucky" workload. The general
benefit of preloading the percpu magazines is that there is a chance the
allocator only has to be called once, not pcp->batch times.
An odd/even allocation scheme could be provided by having two free_lists
in a free_area. One list for the "left buddy" and the other list for the
"right buddy". However, at best, that would provide two colours. I'm not
sure how much benefit it would give for the cost of more linked lists.
> > gigE Tx buffer allocation failures, so I dropped it.
> >
> > We think that Mel's patches will allow us to reintroduce Rohit's
> > optimisation.
> >
> >
> > > If that is the case, do we really want to make such sacrifices
> > > for the huge machines that want these things? What about just
> > > making an extra zone for easy-to-reclaim things to live in?
> > >
> > > This could possibly even be resized at runtime according to
> > > demand with the memory hotplug stuff (though I haven't been
> > > following that).
> > >
> > > Don't take this as criticism of the actual implementation or its
> > > effectiveness.
> > >
> >
> >
> > But yes, adding additional complexity is a black mark, and these patches
> > add quite a bit. (Ditto the fine-looking adaptive readahead patches, btw).
> >
>
> They do look quite fine. They seem to get their claws pretty deep
> into page reclaim, but I guess that is to be expected if we want
> to increase readahead smarts much more.
>
> However, I'm hoping bits of that can be merged at a time, and
> interfaces and page reclaim stuff can be discussed and the best
> option taken. No such luck with these patches AFAIKS - simply
> adding another level of page groups, and another level of
> heuristics to the page allocator is going to hurt. By definition.
> I do wonder why zones can't be used... though I'm sure there are
> good reasons.
>
Granted, the patch set does add complexity even though I tried to keep it
as simple as possible. Benchmarks were posted with each patchset to show
that it was not suffering in real performance even if the code is a bit
less approachable.
Doing something similar with zones is an old idea and brought up
specifically for memory hotplug. In implementations, the zone was called
ZONE_HOTREMOVABLE or something similar. In my opinion, replicating the
effect of this set of patches with zones introduces it's own set of
headaches and ends up being far more complicated. Hopefully, someone will
point out if I am missing historical context here, am rehashing old
arguments or am just plain wrong :)
To replicate the functionality of these patches with zones would require
two additional zones for NormalEasy and HighmemEasy (I suck at naming
things). The plus side is that once the zone fallback lists are updated,
the page allocator remains more or less the same as it is today. Then the
headaches start.
Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a
fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
we are allocating PTEs from high memory, we could fallback to the Normal
zone even if highmem pages are available because the HighMem zone was out
of pages. It will require very different fallback logic to say that
HighMem allocations can also use HighMemEasy rather than falling back to
Normal.
Problem 2: Setting the zone size will be a very difficult tunable to get
right. Right off, we are are introducing a tunable which will make
foreheads furrow. If the tunable is set wrong, system performance will
suffer and we could see situations where kernel allocations fail because
it's zone got depleted.
Problem 3: To get rid of the tunable, we could try resizing the zones
dynamically but that will be hard. Obviously, the zones are going to be
physically adjacent to each other. To resize the zone, the pages at one
end of the zone will need to be free. Shrinking the NormalEasy zone would
be easy enough, but shrinking the Normal zone with kernel pages in it
would be considerably harder, if not outright impossible. One page in the
wrong place will mean the zone cannot be resized
Problem 4: Page reclaim would have two new zones to deal with bringing
with it a new set of zone balancing problems. That brings it's own special
brand of fun.
There may be more problems but these 4 are fairly important. This patchset
does not suffer from the same problems.
Problem 1: This patchset has a fallback list for each allocation type. So
EasyRclm allocations can just as easily use an area reserved for kernel
allocations and vice versa. Obviously we don't like when this happens, but
when it does, things start fragmenting rather than breaking.
Problem 2: The number of pages that get reserved for each type grows and
shrinks on demand. There is no tunable and no need for one.
Problem 3: Problem doesn't exist for this patchset
Problem 4: Problem doesn't exist for this patchset.
Bottom line, using zones will be more complex than this set of patches and
bring a lot of tricky issues with it.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> I recall Rohit's patch from an earlier -mm. Without knowing anything about
> his test, I am guessing he is getting cheap page colouring by preloading
> the per-cpu cache with contiguous pages and his workload is faulting in
> the batch of pages immediately by doing something like linearly reading a
> large array. Hence, the mappings of his workload are getting the right
> colour pages. This makes his workload a "lucky" workload. The general
> benefit of preloading the percpu magazines is that there is a chance the
> allocator only has to be called once, not pcp->batch times.
>
Or we could introduce a new allocation mechanism for anon pages that
passes the vaddr to the allocator, and tries to get an odd/even page
according to the vaddr.
> An odd/even allocation scheme could be provided by having two free_lists
> in a free_area. One list for the "left buddy" and the other list for the
> "right buddy". However, at best, that would provide two colours. I'm not
> sure how much benefit it would give for the cost of more linked lists.
>
2 colours should be a good first order improvement because you will
no longer have adjacent pages of the same colour.
It would definitely be cheaper than fragmentation avoidance + higher
order batch loading.
> To replicate the functionality of these patches with zones would require
> two additional zones for NormalEasy and HighmemEasy (I suck at naming
> things). The plus side is that once the zone fallback lists are updated,
> the page allocator remains more or less the same as it is today. Then the
> headaches start.
>
> Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a
> fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
> we are allocating PTEs from high memory, we could fallback to the Normal
> zone even if highmem pages are available because the HighMem zone was out
> of pages. It will require very different fallback logic to say that
> HighMem allocations can also use HighMemEasy rather than falling back to
> Normal.
>
Just be a different set of GFP flags. Your patches obviously also have
some ordering imposed.... pagecache would want HighMemEasy, HighMem,
NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
Note that if you do need to make some changes to the zone allocator, then
IMO that is far preferable to add a new layer of things-that-are-blocks-of-
-memory-but-not-zones, complete with their own balancing and other heuristics.
> Problem 2: Setting the zone size will be a very difficult tunable to get
> right. Right off, we are are introducing a tunable which will make
> foreheads furrow. If the tunable is set wrong, system performance will
> suffer and we could see situations where kernel allocations fail because
> it's zone got depleted.
>
But even so, when you do automatic resizing, you seem to be adding a
fundamental weak point in fragmentation avoidance.
> Problem 3: To get rid of the tunable, we could try resizing the zones
> dynamically but that will be hard. Obviously, the zones are going to be
> physically adjacent to each other. To resize the zone, the pages at one
> end of the zone will need to be free. Shrinking the NormalEasy zone would
> be easy enough, but shrinking the Normal zone with kernel pages in it
> would be considerably harder, if not outright impossible. One page in the
> wrong place will mean the zone cannot be resized
>
OK, maybe it is hard ;) Do they really need to be resized, then?
Isn't the big memory hotunplug push aimed at virtual machines and
hypervisors anyway? In which case one would presumably have some
memory that "must" be reclaimable, in which case we can't expand
non-Easy zones into that memory anyway.
> Problem 4: Page reclaim would have two new zones to deal with bringing
> with it a new set of zone balancing problems. That brings it's own special
> brand of fun.
>
> There may be more problems but these 4 are fairly important. This patchset
> does not suffer from the same problems.
>
If page reclaim can't deal with 5 zones then it is going to have problems
somewhere at 3 and needs to be fixed. I don't see how your patches get
around this fun by simply introducing their own balancing and fallback
heuristics.
> Problem 1: This patchset has a fallback list for each allocation type. So
> EasyRclm allocations can just as easily use an area reserved for kernel
> allocations and vice versa. Obviously we don't like when this happens, but
> when it does, things start fragmenting rather than breaking.
>
> Problem 2: The number of pages that get reserved for each type grows and
> shrinks on demand. There is no tunable and no need for one.
>
> Problem 3: Problem doesn't exist for this patchset
>
> Problem 4: Problem doesn't exist for this patchset.
>
> Bottom line, using zones will be more complex than this set of patches and
> bring a lot of tricky issues with it.
>
Maybe zones don't do exactly what you need, but I think they're better
than you think ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Tue, 1 Nov 2005, Nick Piggin wrote:
> Mel Gorman wrote:
>
> > I recall Rohit's patch from an earlier -mm. Without knowing anything about
> > his test, I am guessing he is getting cheap page colouring by preloading
> > the per-cpu cache with contiguous pages and his workload is faulting in
> > the batch of pages immediately by doing something like linearly reading a
> > large array. Hence, the mappings of his workload are getting the right
> > colour pages. This makes his workload a "lucky" workload. The general
> > benefit of preloading the percpu magazines is that there is a chance the
> > allocator only has to be called once, not pcp->batch times.
> >
>
> Or we could introduce a new allocation mechanism for anon pages that
> passes the vaddr to the allocator, and tries to get an odd/even page
> according to the vaddr.
>
We could, but it is a different problem than what this set of patches are
trying to address. I'll add page colouring to the end of the todo list in
case I get stuck for something to do.
> > An odd/even allocation scheme could be provided by having two free_lists
> > in a free_area. One list for the "left buddy" and the other list for the
> > "right buddy". However, at best, that would provide two colours. I'm not
> > sure how much benefit it would give for the cost of more linked lists.
> >
>
> 2 colours should be a good first order improvement because you will
> no longer have adjacent pages of the same colour.
>
> It would definitely be cheaper than fragmentation avoidance + higher
> order batch loading.
>
Ok, but the page colours would also need to be in the per-cpu lists this
new api that supplies vaddrs always takes the spinlock for the free lists.
I don't believe it would be cheaper and any benefit would only show up on
benchmarks that are cache sensitive. Judging by previous discussions on
page colouring in the mail archives, Linus will happily kick the approach
full of holes.
As for current performance, the Aim9 benchmarks show that the
fragmentation avoidance does not have a major performance penalty. A run
of the patches in the -mm tree should find out if there are performance
regressions on other machine types.
>
> > To replicate the functionality of these patches with zones would require
> > two additional zones for NormalEasy and HighmemEasy (I suck at naming
> > things). The plus side is that once the zone fallback lists are updated,
> > the page allocator remains more or less the same as it is today. Then the
> > headaches start.
> >
> > Problem 1: Zone fallback lists are "one-way" and per-node. Lets assume a
> > fallback list of HighMemEasy, HighMem, NormalEasy, Normal, DMA. Assuming
> > we are allocating PTEs from high memory, we could fallback to the Normal
> > zone even if highmem pages are available because the HighMem zone was out
> > of pages. It will require very different fallback logic to say that
> > HighMem allocations can also use HighMemEasy rather than falling back to
> > Normal.
> >
>
> Just be a different set of GFP flags. Your patches obviously also have
> some ordering imposed.... pagecache would want HighMemEasy, HighMem,
> NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
>
As well as a different set of GFP flags, we would also need new zone
fallback logic which will hit the __alloc_pages() path. It will be adding
more complexity to the allocator and we're replacing one type of
complexity with another.
> Note that if you do need to make some changes to the zone allocator, then
> IMO that is far preferable to add a new layer of things-that-are-blocks-of-
> -memory-but-not-zones, complete with their own balancing and other heuristics.
>
Thing is, with my approach, the very worst that happens is that it
fragments just as bad as the normal allocator. With a zone-based approach,
the worst that happens is that the kernel zone is too small, kernel caches
do not grow to a suitable size and overall system performance degrades.
> > Problem 2: Setting the zone size will be a very difficult tunable to get
> > right. Right off, we are are introducing a tunable which will make
> > foreheads furrow. If the tunable is set wrong, system performance will
> > suffer and we could see situations where kernel allocations fail because
> > it's zone got depleted.
> >
>
> But even so, when you do automatic resizing, you seem to be adding a
> fundamental weak point in fragmentation avoidance.
>
The sizing I do is when a large block is split. Then the region is just
marked for a particular allocation type. This is very simple. The second
resizing that occurs is when a kernel allocation "steal" easyrclm pages. I
do not like the fact that we steal in this fashion but the alternative is
to teach kswapd how to reclaim easyrclm pages from other areas. I view
this as "future work" but if it was done, the "steal" mechanism would go
away.
> > Problem 3: To get rid of the tunable, we could try resizing the zones
> > dynamically but that will be hard. Obviously, the zones are going to be
> > physically adjacent to each other. To resize the zone, the pages at one
> > end of the zone will need to be free. Shrinking the NormalEasy zone would
> > be easy enough, but shrinking the Normal zone with kernel pages in it
> > would be considerably harder, if not outright impossible. One page in the
> > wrong place will mean the zone cannot be resized
> >
>
> OK, maybe it is hard ;) Do they really need to be resized, then?
>
I think we would need to, yes. If the size of the region is wrong, bad
things are likely to happen. If the kernel page zone is too small, it'll
be under pressure even though there is memory available elsewhere. If it's
too large, then it will get fragmented and high order allocations will
fail.
> Isn't the big memory hotunplug push aimed at virtual machines and
> hypervisors anyway? In which case one would presumably have some
> memory that "must" be reclaimable, in which case we can't expand
> non-Easy zones into that memory anyway.
>
I believe that is the case for hotplug all right, but not the case where
we just want to satisfy high order allocations in a reasonably reliable
fashion. In that case, it would be nice to reclaim an easyrclm region.
It has already been reported by Mike Kravetz that memory remove works a
whole lot better on PPC64 with this patch than without it. Memory hotplug
remove was not the problem I was trying to solve, but I consider the fact
that it is helped to be a big plus. So, even though it is possible that
this approach still gets fragmented under some workloads, we know that, in
general, it does a pretty good job.
> > Problem 4: Page reclaim would have two new zones to deal with bringing
> > with it a new set of zone balancing problems. That brings it's own special
> > brand of fun.
> >
> > There may be more problems but these 4 are fairly important. This patchset
> > does not suffer from the same problems.
> >
>
> If page reclaim can't deal with 5 zones then it is going to have problems
> somewhere at 3 and needs to be fixed. I don't see how your patches get
> around this fun by simply introducing their own balancing and fallback
> heuristics.
>
If my approach gets the sizes of areas all wrong, it will fragment. If the
zone-based approach gets the sizes of areas wrong, system performance
degrades. I prefer the failure scenario of my approach :).
> > Problem 1: This patchset has a fallback list for each allocation type. So
> > EasyRclm allocations can just as easily use an area reserved for kernel
> > allocations and vice versa. Obviously we don't like when this happens, but
> > when it does, things start fragmenting rather than breaking.
> >
> > Problem 2: The number of pages that get reserved for each type grows and
> > shrinks on demand. There is no tunable and no need for one.
> >
> > Problem 3: Problem doesn't exist for this patchset
> >
> > Problem 4: Problem doesn't exist for this patchset.
> >
> > Bottom line, using zones will be more complex than this set of patches and
> > bring a lot of tricky issues with it.
> >
>
> Maybe zones don't do exactly what you need, but I think they're better
> than you think ;)
>
You may be right, but I still think that my approach is simpler and less
likely to introduce horrible balancing problems.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Tue, 1 Nov 2005, Nick Piggin wrote:
> Ok, but the page colours would also need to be in the per-cpu lists this
> new api that supplies vaddrs always takes the spinlock for the free lists.
> I don't believe it would be cheaper and any benefit would only show up on
> benchmarks that are cache sensitive. Judging by previous discussions on
> page colouring in the mail archives, Linus will happily kick the approach
> full of holes.
>
OK, but I'm just pointing out that improving page colouring doesn't
require contiguous pages.
> As for current performance, the Aim9 benchmarks show that the
> fragmentation avoidance does not have a major performance penalty. A run
> of the patches in the -mm tree should find out if there are performance
> regressions on other machine types.
>
But I can see that there will be penalties. Cache misses, branches,
etc. Obviously any new feature or more sophisticated behaviour is
going to require that but they obviously need good justification.
>>Just be a different set of GFP flags. Your patches obviously also have
>>some ordering imposed.... pagecache would want HighMemEasy, HighMem,
>>NormalEasy, Normal, DMA; ptes will want HighMem, Normal, DMA.
>>
>
>
> As well as a different set of GFP flags, we would also need new zone
> fallback logic which will hit the __alloc_pages() path. It will be adding
> more complexity to the allocator and we're replacing one type of
> complexity with another.
>
It is complexity that is mostly already handled for us with the zones
logic. Picking out a couple of small points that zones don't get exactly
right isn't a good basis to come up with a completely new zoneing layer.
>
>>Note that if you do need to make some changes to the zone allocator, then
>>IMO that is far preferable to add a new layer of things-that-are-blocks-of-
>>-memory-but-not-zones, complete with their own balancing and other heuristics.
>>
>
>
> Thing is, with my approach, the very worst that happens is that it
> fragments just as bad as the normal allocator. With a zone-based approach,
> the worst that happens is that the kernel zone is too small, kernel caches
> do not grow to a suitable size and overall system performance degrades.
>
If you don't need to guarantee higher order allocations, then there is
no problem with our current approach. If you do then you simply need to
make a sacrifice.
>
>>>Problem 2: Setting the zone size will be a very difficult tunable to get
>>>right. Right off, we are are introducing a tunable which will make
>>>foreheads furrow. If the tunable is set wrong, system performance will
>>>suffer and we could see situations where kernel allocations fail because
>>>it's zone got depleted.
>>>
>>
>>But even so, when you do automatic resizing, you seem to be adding a
>>fundamental weak point in fragmentation avoidance.
>>
>
>
> The sizing I do is when a large block is split. Then the region is just
> marked for a particular allocation type. This is very simple. The second
> resizing that occurs is when a kernel allocation "steal" easyrclm pages. I
> do not like the fact that we steal in this fashion but the alternative is
> to teach kswapd how to reclaim easyrclm pages from other areas. I view
> this as "future work" but if it was done, the "steal" mechanism would go
> away.
>
Weak point, as in: gets fragmented.
>
>>>Problem 3: To get rid of the tunable, we could try resizing the zones
>>>dynamically but that will be hard. Obviously, the zones are going to be
>>>physically adjacent to each other. To resize the zone, the pages at one
>>>end of the zone will need to be free. Shrinking the NormalEasy zone would
>>>be easy enough, but shrinking the Normal zone with kernel pages in it
>>>would be considerably harder, if not outright impossible. One page in the
>>>wrong place will mean the zone cannot be resized
>>>
>>
>>OK, maybe it is hard ;) Do they really need to be resized, then?
>>
>
>
> I think we would need to, yes. If the size of the region is wrong, bad
> things are likely to happen. If the kernel page zone is too small, it'll
> be under pressure even though there is memory available elsewhere. If it's
> too large, then it will get fragmented and high order allocations will
> fail.
>
But people will just have to get it right then. If they want to be able
to hot unplug 10G of memory, or allocate 4G of hugepages on demand, then
they simply need to specify their requirements. Not too difficult? It is
really nice to be able to place some burden on huge servers and mainframes,
because they have people administering and tuning them full-time. It
allows us to not penalise small servers and desktops.
>
>>Isn't the big memory hotunplug push aimed at virtual machines and
>>hypervisors anyway? In which case one would presumably have some
>>memory that "must" be reclaimable, in which case we can't expand
>>non-Easy zones into that memory anyway.
>>
>
>
> I believe that is the case for hotplug all right, but not the case where
> we just want to satisfy high order allocations in a reasonably reliable
> fashion. In that case, it would be nice to reclaim an easyrclm region.
>
As I've said before, I think this is a false hope and we need to
move away from higher order allocations.
> It has already been reported by Mike Kravetz that memory remove works a
> whole lot better on PPC64 with this patch than without it. Memory hotplug
> remove was not the problem I was trying to solve, but I consider the fact
> that it is helped to be a big plus. So, even though it is possible that
> this approach still gets fragmented under some workloads, we know that, in
> general, it does a pretty good job.
>
Sure, but using zones would work too, and on the plus side you would
be able to specify exactly how much removable memory to be.
>>
>>Maybe zones don't do exactly what you need, but I think they're better
>>than you think ;)
>>
>
>
> You may be right, but I still think that my approach is simpler and less
> likely to introduce horrible balancing problems.
>
Simpler? We already have zones though. They are a complexity we need to
deal with already. I really can't see how you can use the simpler argument
in favour of your patches ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com