I spent yesterday rebasing the memory compaction code and doing some
additional work on it. It was previously against 2.6.21 but the VM has
changed a bit since then so there are a number of snarl points, places where
it can be improved and places where it may be outright wrong because of
core changes. As a result, I've dropped any acks I had and am starting over.
This basically works on X86-64 flatmem and on qemu-i386. It still needs to
be tested for other architectures, SPARSEMEM and on machine configurations
with memory holes in a zone. I'm posting this now before it's fully ready
because I'm offline all of next week and didn't want to delay it two weeks
when there is something that can be looked at now.
===== CUT HERE =====
This is a prototype of a memory compaction mechanism that reduces external
fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
pageblocks. The term "compaction" was chosen as there are is a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was slub
"defragmentation" (really a form of targeted reclaim). Hence, this is called
"compaction" to distinguish it from other forms of defragmentation.
In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area. A compaction run completes within a
zone when the two scanners meet.
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of two ways. It may be triggered
explicitly by writing a node number to /proc/sys/vm/compact_node. When a
process fails to allocate a high-order page, it may compact memory in an
attempt to satisfy the allocation instead of entering direct reclaim. Explicit
compaction does not finish until the two scanners meet and direct compaction
ends if a suitable page becomes available that would meet watermarks.
The series is in 7 patches
Patch 1 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 2 exports a "unusable free space index" via /proc/pagetypeinfo. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 3 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 4 is the compaction mechanism although it's unreachable code at this
point
Patch 5 allows the triggering of memory compaction from /proc to aid
debugging and observe its impact. It always performs a full
compaction.
Patch 6 tries "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 7 temporarily disables compaction if an allocation failure occurs
after compaction.
I did not test with CONFIG_COMPACTION not set so there might be gremlins
there. Testing of compaction was primitive and represents one of the
easiest cases that can be faced for lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel a X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
For the test, I enabled debugging, preempt, the sleep watchdog and lockdep
but nothing nasty popped out. The results were;
2.6.33-rc2 Vanilla
Starting page count: 0
Requesting at each increment: 50 huge pages
1: 50 pages Success time:0.11 rclm:16883 cblock:0 csuccess:0 alloc: 50/50
2: 100 pages Success time:0.10 rclm:13752 cblock:0 csuccess:0 alloc: 50/50
3: 150 pages Success time:0.05 rclm:13303 cblock:0 csuccess:0 alloc: 50/50
4: 200 pages Success time:0.09 rclm:11257 cblock:0 csuccess:0 alloc: 50/50
5: 250 pages Success time:0.07 rclm:14319 cblock:0 csuccess:0 alloc: 50/50
6: 300 pages Success time:0.05 rclm:11158 cblock:0 csuccess:0 alloc: 50/50
7: 350 pages Success time:0.07 rclm:12244 cblock:0 csuccess:0 alloc: 50/50
8: 400 pages Success time:0.14 rclm:8553 cblock:0 csuccess:0 alloc: 50/50
9: 450 pages Success time:0.02 rclm:236 cblock:0 csuccess:0 alloc: 50/50
10: 500 pages Success time:0.03 rclm:142 cblock:0 csuccess:0 alloc: 50/50
11: 550 pages Success time:0.03 rclm:183 cblock:0 csuccess:0 alloc: 50/50
12: 600 pages Success time:0.01 rclm:330 cblock:0 csuccess:0 alloc: 50/50
13: 650 pages Success time:0.01 rclm:182 cblock:0 csuccess:0 alloc: 50/50
14: 700 pages Success time:0.02 rclm:215 cblock:0 csuccess:0 alloc: 50/50
15: 750 pages Success time:0.00 rclm:0 cblock:0 csuccess:0 alloc: 50/50
16: 800 pages Success time:0.02 rclm:0 cblock:0 csuccess:0 alloc: 50/50
17: 850 pages Success time:0.01 rclm:85 cblock:0 csuccess:0 alloc: 50/50
18: 867 pages Success time:0.42 rclm:116 cblock:0 csuccess:0 alloc: 17/50
19: 869 pages Success time:0.81 rclm:85 cblock:0 csuccess:0 alloc: 2/50
20: 870 pages Success time:1.62 rclm:170 cblock:0 csuccess:0 alloc: 1/50
21: 879 pages Success time:0.31 rclm:106 cblock:0 csuccess:0 alloc: 9/50
22: 879 pages Failed time:0.22 rclm:104 cblock:0 csuccess:0
23: 880 pages Success time:1.11 rclm:143 cblock:0 csuccess:0 alloc: 1/50
24: 880 pages Failed time:0.71 rclm:264 cblock:0 csuccess:0
25: 881 pages Success time:1.36 rclm:206 cblock:0 csuccess:0 alloc: 1/50
26: 881 pages Failed time:0.75 rclm:176 cblock:0 csuccess:0
27: 881 pages Failed time:0.94 rclm:284 cblock:0 csuccess:0
28: 881 pages Failed time:0.25 rclm:112 cblock:0 csuccess:0
29: 881 pages Failed time:1.48 rclm:318 cblock:0 csuccess:0
30: 881 pages Failed time:0.96 rclm:206 cblock:0 csuccess:0
Final page count: 881
Total pages reclaimed: 105132
Total blocks compacted: 0
Total compact pages alloced: 0
2.6.33-rc2 Compaction V1
Starting page count: 0
Requesting at each increment: 50 huge pages
1: 50 pages Success time:0.12 rclm:0 cblock:180 csuccess:43 alloc: 50/50
2: 100 pages Success time:0.04 rclm:9976 cblock:24 csuccess:6 alloc: 50/50
3: 150 pages Success time:0.05 rclm:995 cblock:144 csuccess:35 alloc: 50/50
4: 200 pages Success time:0.07 rclm:9054 cblock:60 csuccess:12 alloc: 50/50
5: 250 pages Success time:0.05 rclm:8096 cblock:60 csuccess:12 alloc: 50/50
6: 300 pages Success time:0.04 rclm:4855 cblock:39 csuccess:9 alloc: 50/50
7: 350 pages Success time:0.04 rclm:6375 cblock:23 csuccess:6 alloc: 50/50
8: 400 pages Success time:0.02 rclm:6656 cblock:6 csuccess:4 alloc: 50/50
9: 450 pages Success time:0.04 rclm:3943 cblock:117 csuccess:26 alloc: 50/50
10: 500 pages Success time:0.04 rclm:1534 cblock:136 csuccess:30 alloc: 50/50
11: 527 pages Success time:0.02 rclm:1021 cblock:37 csuccess:6 alloc: 27/50
12: 577 pages Success time:0.10 rclm:6566 cblock:55 csuccess:9 alloc: 50/50
13: 627 pages Success time:0.02 rclm:0 cblock:19 csuccess:19 alloc: 50/50
14: 677 pages Success time:0.01 rclm:0 cblock:5 csuccess:13 alloc: 50/50
15: 727 pages Success time:0.00 rclm:0 cblock:0 csuccess:5 alloc: 50/50
16: 777 pages Success time:0.01 rclm:0 cblock:7 csuccess:12 alloc: 50/50
17: 827 pages Success time:0.01 rclm:0 cblock:6 csuccess:14 alloc: 50/50
18: 877 pages Success time:0.11 rclm:0 cblock:26 csuccess:20 alloc: 50/50
19: 912 pages Success time:18.90 rclm:5958 cblock:218 csuccess:9 alloc: 35/50
20: 913 pages Success time:9.99 rclm:2668 cblock:114 csuccess:1 alloc: 1/50
21: 915 pages Success time:18.20 rclm:4338 cblock:96 csuccess:1 alloc: 2/50
22: 917 pages Success time:6.54 rclm:1827 cblock:42 csuccess:0 alloc: 2/50
23: 917 pages Failed time:4.82 rclm:1327 cblock:54 csuccess:0
24: 919 pages Success time:17.97 rclm:4109 cblock:132 csuccess:2 alloc: 2/50
25: 919 pages Failed time:29.67 rclm:5681 cblock:118 csuccess:0
26: 919 pages Failed time:32.81 rclm:7248 cblock:100 csuccess:0
27: 921 pages Success time:57.01 rclm:12690 cblock:179 csuccess:1 alloc: 2/50
28: 921 pages Failed time:33.72 rclm:7413 cblock:115 csuccess:0
29: 921 pages Failed time:25.91 rclm:5845 cblock:126 csuccess:0
30: 921 pages Failed time:0.48 rclm:334 cblock:41 csuccess:0
31: 921 pages Failed time:0.06 rclm:103 cblock:15 csuccess:0
32: 921 pages Failed time:0.36 rclm:341 cblock:58 csuccess:0
Final page count: 921
Total pages reclaimed: 118953
Total blocks compacted: 2352
Total compact pages alloced: 295
The time differences are marginal but bear in mind that this is an ideal
case of mostly unmapped buffer pages. On nice set of results is between
allocations 13-18 where no pages were reclaimed, some compaction occured
and 300 huge pages were allocated in 0.16 seconds. Furthermore, compaction
allocated a high higher percentage of memory (91% of RAM as huge pages).
The downside appears to be that the compaction kernel reclaimed even more
pages than the vanilla kernel. However, take the cut-off point of 880 pages
that both kernels succeeded. The vanilla kernel had reclaimed 105132 pages
at that point. The kernel with compaction had reclaimed 59071, less than
half of what the vanilla kernel reclaimed. i.e. the bulk of pages reclaimed
with the compaction kernel were to get from 87% of memory allocated to 91%
as huge pages.
These results would appear to be an encouraging enough start.
Comments?
include/linux/compaction.h | 26 +++
include/linux/mm.h | 1 +
include/linux/mmzone.h | 7 +
include/linux/swap.h | 5 +
include/linux/vmstat.h | 2 +
kernel/sysctl.c | 11 +
mm/Kconfig | 12 +-
mm/Makefile | 1 +
mm/compaction.c | 508 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 74 +++++++
mm/vmscan.c | 5 -
mm/vmstat.c | 179 ++++++++++++++++
12 files changed, 825 insertions(+), 6 deletions(-)
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c
Unusuable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest but not necessarily so it is exported
on a per-order and per-zone basis via /proc/pagetypeinfo.
The index is normally calculated as a value between 0 and 1 which is
obviously unsuitable within the kernel. Instead, the first three decimal
places are used as a value between 0 and 1000 for an integer approximation.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmstat.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 99 insertions(+), 0 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..e1ea2d5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
return 0;
}
+
+struct config_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes to attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+ unsigned int suitable_order,
+ struct config_page_info *info)
+{
+ unsigned int order;
+
+ info->free_pages = 0;
+ info->free_blocks_total = 0;
+ info->free_blocks_suitable = 0;
+
+ for (order = 0; order < MAX_ORDER; order++) {
+ unsigned long blocks;
+
+ /* Count number of free blocks */
+ blocks = zone->free_area[order].nr_free;
+ info->free_blocks_total += blocks;
+
+ /* Count free base pages */
+ info->free_pages += blocks << order;
+
+ /* Count the suitable free blocks */
+ if (order >= suitable_order)
+ info->free_blocks_suitable += blocks <<
+ (order - suitable_order);
+ }
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+int unusable_free_index(struct zone *zone,
+ unsigned int order,
+ struct config_page_info *info)
+{
+ /* No free memory is interpreted as all free memory is unusable */
+ if (info->free_pages == 0)
+ return 100;
+
+ /*
+ * Index should be a value between 0 and 1. Return a value to 3
+ * decimal places.
+ *
+ * 0 => no fragmentation
+ * 1 => high fragmentation
+ */
+ return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
+
+}
+
+static void pagetypeinfo_showunusable_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct config_page_info info;
+
+ seq_printf(m, "Node %4d, zone %8s %19s",
+ pgdat->node_id,
+ zone->name, " ");
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\nUnusable free space index at order\n");
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
+ pagetypeinfo_showunusable(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
return 0;
--
1.6.5
CONFIG_MIGRATION currently depends on CONFIG_NUMA. The current users of
page migration such as sys_move_pages(), sys_migrate_pages() and cpuset
process migration are ordinarily only beneficial on NUMA.
As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.
TODO
o After this patch is applied, the migration core is available but it
also makes NUMA-specific features available. This is too much
exposure so revisit this.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/Kconfig | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 17b8947..1d8e2b2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -168,12 +168,22 @@ config SPLIT_PTLOCK_CPUS
default "4"
#
+# support for memory compaction
+config COMPACTION
+ bool "Allow for memory compaction"
+ def_bool y
+ select MIGRATION
+ depends on EXPERIMENTAL && HUGETLBFS
+ help
+ Allows the compaction of memory for the allocation of huge pages.
+
+#
# support for page migration
#
config MIGRATION
bool "Page migration"
def_bool y
- depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
+ depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
help
Allows the migration of the physical location of pages of processes
while the virtual addresses are not changed. This is useful for
--
1.6.5
Fragmentation index is a value that makes sense when an allocation of a
given size would fail. the index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1). For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/pagetypeinfo.
The index is normally calculated as a value between 0 and 1 which is
obviously unsuitable within the kernel. Instead, the first three decimal
places are used as a value between 0 and 1000 for an integer approximation.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmstat.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 63 insertions(+), 0 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1ea2d5..560ee08 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
}
/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(struct zone *zone,
+ unsigned int order,
+ struct config_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index only makes sense when a request would fail */
+ if (info->free_blocks_suitable)
+ return -1;
+
+ /*
+ * Index is between 0 and 1 so return within 3 decimal places
+ *
+ * 0 => allocation would fail due to lack of memory
+ * 1 => allocation would fail due to fragmentation
+ */
+ return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
*/
@@ -516,6 +545,39 @@ int unusable_free_index(struct zone *zone,
}
+static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct config_page_info info;
+
+ seq_printf(m, "Node %4d, zone %8s %19s",
+ pgdat->node_id,
+ zone->name, " ");
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\nFragmentation index at order\n");
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showunusable_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -657,6 +719,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
pagetypeinfo_showunusable(m, pgdat);
+ pagetypeinfo_showfragmentation(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
return 0;
--
1.6.5
This patch adds a proc file /proc/sys/vm/compact_node. When a NID is written
to the file, each zone in that node is compacted. This should be done with
debugfs but this was what was available to rebase quickly and I suspect
debugfs either did not exist or was in development during the first
implementation.
If this interface is to exist in the long term, it needs to be thought
about carefully. For the moment, it's handy to have to test compaction
under a controlled setting.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 5 ++++
kernel/sysctl.c | 11 +++++++++
mm/compaction.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 0 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..5965ef2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,9 @@
#define COMPACT_INCOMPLETE 0
#define COMPACT_COMPLETE 1
+#ifdef CONFIG_MIGRATION
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_MIGRATION */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..6202e95 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,7 @@
#include <linux/ftrace.h>
#include <linux/slow-work.h>
#include <linux/perf_event.h>
+#include <linux/compaction.h>
#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -80,6 +81,7 @@ extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
+extern int sysctl_compact_node;
extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
@@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = drop_caches_sysctl_handler,
},
+#ifdef CONFIG_MIGRATION
+ {
+ .procname = "compact_node",
+ .data = &sysctl_compact_node,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = sysctl_compaction_handler,
+ },
+#endif /* CONFIG_MIGRATION */
{
.procname = "min_free_kbytes",
.data = &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index d36760a..a8bcae2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
#include <linux/migrate.h>
#include <linux/compaction.h>
#include <linux/mm_inline.h>
+#include <linux/sysctl.h>
#include "internal.h"
/*
@@ -338,3 +339,54 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+/* Compact all zones within a node */
+int compact_node(int nid)
+{
+ int zoneid;
+ pg_data_t *pgdat;
+ struct zone *zone;
+
+ if (nid < 0 || nid > nr_node_ids || !node_online(nid))
+ return -EINVAL;
+ pgdat = NODE_DATA(nid);
+
+ /* Flush pending updates to the LRU lists */
+ lru_add_drain_all();
+
+ printk(KERN_INFO "Compacting memory in node %d\n", nid);
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+ struct compact_control cc;
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+ printk(KERN_INFO "Compaction of node %d complete\n", nid);
+
+ return 0;
+}
+
+/* This is global and fierce ugly but it's straight-forward */
+int sysctl_compact_node;
+
+/* This is the entry point for compacting nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, buffer, length, ppos);
+ if (write)
+ return compact_node(sysctl_compact_node);
+
+ return 0;
+}
--
1.6.5
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within making them available for the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.
TODO
o Rethink block selection for migration. When originally developed,
there was a deliberate effort to place MIGRATE_UNMOVABLE pages
to the lower PFNs so the block selection made a lot of sense.
However, that is no longer the case so rethink.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 8 +
include/linux/mm.h | 1 +
include/linux/swap.h | 5 +
include/linux/vmstat.h | 1 +
mm/Makefile | 1 +
mm/compaction.c | 340 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 37 +++++
mm/vmscan.c | 5 -
mm/vmstat.c | 5 +
9 files changed, 398 insertions(+), 5 deletions(-)
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE 0
+#define COMPACT_COMPLETE 1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2265f28..757ab86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -332,6 +332,7 @@ void put_page(struct page *page);
void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
/*
* Compound pages have a destructor function. Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..7e7181b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
__lru_cache_add(page, LRU_ACTIVE_FILE);
}
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
+#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
+
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ee03bba..d7f7236 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
obj-$(CONFIG_SMP) += percpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..d36760a
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,340 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2009 Mel Gorman <[email protected]>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+ struct list_head freepages; /* List of free pages to migrate to */
+ struct list_head migratepages; /* List of pages being migrated */
+ unsigned long nr_freepages; /* Number of isolated free pages */
+ unsigned long nr_migratepages; /* Number of pages to migrate */
+ unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long migrate_pfn; /* isolate_migratepages search base */
+ struct zone *zone;
+};
+
+static int release_freepages(struct zone *zone, struct list_head *freelist)
+{
+ struct page *page, *next;
+ int count = 0;
+
+ list_for_each_entry_safe(page, next, freelist, lru) {
+ list_del(&page->lru);
+ __free_page(page);
+ count++;
+ }
+
+ return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+ unsigned long blockpfn,
+ struct list_head *freelist)
+{
+ unsigned long zone_end_pfn, end_pfn;
+ int total_isolated = 0;
+
+ /* Get the last PFN we should scan for free pages at */
+ zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ end_pfn = blockpfn + pageblock_nr_pages;
+ if (end_pfn > zone_end_pfn)
+ end_pfn = zone_end_pfn;
+
+ /* Isolate free pages. This assumes the block is valid */
+ for (; blockpfn < end_pfn; blockpfn++) {
+ struct page *page;
+ int isolated, i;
+
+ if (!pfn_valid_within(blockpfn))
+ continue;
+
+ page = pfn_to_page(blockpfn);
+ if (!PageBuddy(page))
+ continue;
+
+ /* Found a free page, break it into order-0 pages */
+ isolated = split_free_page(page);
+ total_isolated += isolated;
+ for (i = 0; i < isolated; i++) {
+ list_add(&page->lru, freelist);
+ page++;
+ blockpfn += isolated - 1;
+ }
+ }
+
+ return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+ /* If the page is a large free page, then allow migration */
+ if (PageBuddy(page) && page_order(page) >= pageblock_order)
+ return 1;
+
+ /* If the block is MIGRATE_MOVABLE, allow migration */
+ if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+ return 1;
+
+ /* Otherwise skip the block */
+ return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ struct page *page;
+ unsigned long high_pfn, low_pfn, pfn;
+ int nr_freepages = cc->nr_freepages;
+ struct list_head *freelist = &cc->freepages;
+
+ pfn = cc->free_pfn;
+ low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+ high_pfn = low_pfn;
+
+ /*
+ * Isolate free pages until enough are available to migrate the
+ * pages on cc->migratepages. We stop searching if the migrate
+ * and free page scanners meet or enough free pages are isolated.
+ */
+ spin_lock_irq(&zone->lock);
+ for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+ pfn -= pageblock_nr_pages) {
+ int isolated;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ /* Check for overlapping nodes/zones */
+ page = pfn_to_page(pfn);
+ if (page_zone(page) != zone)
+ continue;
+
+ /* Check the block is suitable for migration */
+ if (!suitable_migration_target(page))
+ continue;
+
+ /* Found a block suitable for isolating free pages from */
+ isolated = isolate_freepages_block(zone, pfn, freelist);
+ nr_freepages += isolated;
+
+ /*
+ * Record the highest PFN we isolated pages from. When next
+ * looking for free pages, the search will restart here as
+ * page migration may have returned some pages to the allocator
+ */
+ if (isolated)
+ high_pfn = max(high_pfn, pfn);
+ }
+ spin_unlock_irq(&zone->lock);
+
+ cc->free_pfn = high_pfn;
+ cc->nr_freepages = nr_freepages;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ unsigned long low_pfn, end_pfn;
+ struct list_head *migratelist;
+ enum lru_list lru_src;
+
+ low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
+ migratelist = &cc->migratepages;
+
+ /* Do not scan outside zone boundaries */
+ if (low_pfn < zone->zone_start_pfn)
+ low_pfn = zone->zone_start_pfn;
+
+ /* Setup to scan one block but not past where we are migrating to */
+ end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+ cc->migrate_pfn = end_pfn;
+ VM_BUG_ON(end_pfn > cc->free_pfn);
+
+ if (!pfn_valid(low_pfn))
+ return 0;
+
+ migrate_prep();
+
+ /* Time to isolate some pages for migration */
+ spin_lock_irq(&zone->lru_lock);
+ for (; low_pfn < end_pfn; low_pfn++) {
+ struct page *page;
+ if (!pfn_valid_within(low_pfn))
+ continue;
+
+ /* Get the page and skip if free */
+ page = pfn_to_page(low_pfn);
+ if (PageBuddy(page)) {
+ low_pfn += (1 << page_order(page)) - 1;
+ continue;
+ }
+
+ if (!PageLRU(page) || PageUnevictable(page))
+ continue;
+
+ /* Try isolate the page */
+ lru_src = page_lru(page);
+ switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
+ case 0:
+ list_move(&page->lru, migratelist);
+ mem_cgroup_del_lru(page);
+ cc->nr_migratepages++;
+ break;
+
+ case -EBUSY:
+ /*
+ * else it is being freed elsewhere. The
+ * problem is that we are not really sure where
+ * it came from in the first place
+ * XXX: Verify the putback logic is ok. This was
+ * all written before LRU lists were split
+ */
+ list_move(&page->lru, &zone->lru[lru_src].list);
+ mem_cgroup_rotate_lru_list(page, page_lru(page));
+ continue;
+
+ default:
+ BUG();
+ }
+ }
+ spin_unlock_irq(&zone->lru_lock);
+
+ return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data,
+ int **result)
+{
+ struct compact_control *cc = (struct compact_control *)data;
+ struct page *freepage;
+
+ VM_BUG_ON(cc == NULL);
+
+ /* Isolate free pages if necessary */
+ if (list_empty(&cc->freepages)) {
+ isolate_freepages(cc->zone, cc);
+
+ if (list_empty(&cc->freepages))
+ return NULL;
+ }
+
+ freepage = list_entry(cc->freepages.next, struct page, lru);
+ list_del(&freepage->lru);
+ cc->nr_freepages--;
+
+ return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+ int nr_migratepages = 0;
+ int nr_freepages = 0;
+ struct page *page;
+ list_for_each_entry(page, &cc->migratepages, lru)
+ nr_migratepages++;
+ list_for_each_entry(page, &cc->freepages, lru)
+ nr_freepages++;
+
+ cc->nr_migratepages = nr_migratepages;
+ cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+ struct compact_control *cc)
+{
+ /* Compaction run completes if the migrate and free scanner meet */
+ if (cc->free_pfn <= cc->migrate_pfn)
+ return COMPACT_COMPLETE;
+
+ return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+ int ret = COMPACT_INCOMPLETE;
+
+ /*
+ * Setup to move all movable pages to the end of the zone
+ * XXX: This could be improved upon. In the event compaction
+ * is being successful quickly but called often, there
+ * is a likelihood of scanning the same blocks as sources
+ * and targets frequently. Might be worth caching the
+ * last migrate_pfn to reduce scan times.
+ */
+ cc->migrate_pfn = zone->zone_start_pfn;
+ cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+ cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+ unsigned long nr_migrate, nr_remaining;
+ if (!isolate_migratepages(zone, cc))
+ continue;
+
+ nr_migrate = cc->nr_migratepages;
+ migrate_pages(&cc->migratepages, compaction_alloc,
+ (unsigned long)cc, 0);
+ update_nr_listpages(cc);
+ nr_remaining = cc->nr_migratepages;
+
+ count_vm_event(COMPACTBLOCKS);
+ count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+ if (nr_remaining)
+ count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+ }
+
+ /* Release free pages and check accounting */
+ cc->nr_freepages -= release_freepages(zone, &cc->freepages);
+ VM_BUG_ON(cc->nr_freepages != 0);
+
+ /*
+ * Release LRU pages not migrated
+ * XXX: Page migration at this point tries fairly hard to move
+ * pages as it is but if migration fails, pages are left
+ * on cc->migratepages for more passes. This might cause
+ * multiple useless failures. Watch compact_pagemigrate_failed
+ * in /proc/vmstat. If it grows a lot, then putback should
+ * happen after each failed migration
+ */
+ if (!list_empty(&cc->migratepages))
+ putback_lru_pages(&cc->migratepages);
+
+ return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e9f5cc..462431a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1167,6 +1167,43 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}
+/* Similar to split_page except the page is already free */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ unsigned long watermark;
+ struct zone *zone;
+
+ BUG_ON(!PageBuddy(page));
+
+ zone = page_zone(page);
+ order = page_order(page);
+
+ /* Obey watermarks or the system could deadlock */
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ return 0;
+
+ /* Remove page from free list */
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+
+ /* Set the migratetype on the assumption it's for migration */
+ if (order >= pageblock_order - 1) {
+ struct page *endpage = page + (1 << order) - 1;
+ for (; page < endpage; page += pageblock_nr_pages)
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ }
+
+ return 1 << order;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 885207a..2ad8603 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -803,11 +803,6 @@ keep:
return nr_reclaimed;
}
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
-#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
-
/*
* Attempt to remove the specified page from its LRU. Only take this page
* if it is of the appropriate PageActive status. Pages which are being
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 560ee08..41585bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
"allocstall",
"pgrotated",
+
+ "compact_blocks_moved",
+ "compact_pages_moved",
+ "compact_pagemigrate_failed",
+
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
--
1.6.5
The fragmentation index may indicate that a failure it due to external
fragmentation, a compaction run complete and an allocation failure still
fail. There are two obvious reasons as to why
o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met
In the event of compaction and allocation failure, this patch prevents
compaction happening for a short interval. It's only recorded on the
preferred zone but that should be enough coverage. This could have been
implemented similar to the zonelist_cache but the increased size of the
zonelist did not appear to be justified.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 15 ++++++++++++++-
2 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..1d6ccbe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -328,6 +328,13 @@ struct zone {
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
+#ifdef CONFIG_MIGRATION
+ /*
+ * If a compaction fails, do not try compaction again until
+ * jiffies is after the value of compact_resume
+ */
+ unsigned long compact_resume;
+#endif
ZONE_PADDING(_pad1_)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7275afb..9c86606 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1729,7 +1729,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
/* Try memory compaction for high-order allocations before reclaim */
- if (order) {
+ if (order && time_after(jiffies, preferred_zone->compact_resume)) {
*did_some_progress = try_to_compact_pages(zonelist,
order, gfp_mask, nodemask);
if (*did_some_progress != COMPACT_INCOMPLETE) {
@@ -1748,6 +1748,19 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
+
+ /*
+ * On failure, avoid compaction for a short time.
+ * XXX: This is very unsatisfactory. The failure
+ * to compact has nothing to do with time
+ * and everything to do with the requested
+ * order, the number of free pages and
+ * watermarks. How to wait on that is more
+ * unclear, but the answer would apply to
+ * other areas where the VM waits based on
+ * time.
+ */
+ preferred_zone->compact_resume = jiffies + HZ/50;
}
}
--
1.6.5
Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation. With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.
Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 17 ++++++-
include/linux/vmstat.h | 1 +
mm/compaction.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 24 +++++++++
mm/vmstat.c | 16 +++++-
5 files changed, 170 insertions(+), 4 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5965ef2..247d497 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,13 +1,26 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H
-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
#define COMPACT_INCOMPLETE 0
-#define COMPACT_COMPLETE 1
+#define COMPACT_PARTIAL 1
+#define COMPACT_COMPLETE 2
#ifdef CONFIG_MIGRATION
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return COMPACT_INCOMPLETE;
+}
+
#endif /* CONFIG_MIGRATION */
#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d7f7236..0ea7a38 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+ COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index a8bcae2..00953e4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -28,6 +28,9 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+
+ unsigned int order; /* order a direct compactor needs */
+ int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
};
@@ -280,10 +283,31 @@ static void update_nr_listpages(struct compact_control *cc)
static inline int compact_finished(struct zone *zone,
struct compact_control *cc)
{
+ unsigned int order;
+ unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn)
return COMPACT_COMPLETE;
+ /* Compaction run is not finished if the watermark is not met */
+ if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ return COMPACT_INCOMPLETE;
+
+ if (cc->order == -1)
+ return COMPACT_INCOMPLETE;
+
+ /* Direct compactor: Is a suitable page free? */
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && zone->free_area[order].nr_free)
+ return COMPACT_PARTIAL;
+ }
+
return COMPACT_INCOMPLETE;
}
@@ -339,6 +363,98 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+static inline unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask)
+{
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ .order = order,
+ .migratetype = allocflags_to_migratetype(gfp_mask),
+ .zone = zone,
+ };
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ int may_enter_fs = gfp_mask & __GFP_FS;
+ int may_perform_io = gfp_mask & __GFP_IO;
+ unsigned long watermark;
+ struct zoneref *z;
+ struct zone *zone;
+ int rc = COMPACT_INCOMPLETE;
+
+ /* Check whether it is worth even starting compaction */
+ if (order == 0 || !may_enter_fs || !may_perform_io)
+ return rc;
+
+ /*
+ * XXX: We will not stall in migratepages if the necessary
+ * conditions are not met but direct reclaim seems to
+ * account for a stall just because it tries. Confirm
+ * this is true.
+ */
+ count_vm_event(COMPACTSTALL);
+
+ /* Compact each zone in the list */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+ nodemask) {
+ int fragindex;
+ int status;
+
+ /* Watermarks for order-0 must be met for compaction */
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ continue;
+
+ /*
+ * fragmentation index determines if allocation failures are
+ * due to low memory or external fragmentation
+ *
+ * index of -1 implies allocations might succeed depending
+ * on watermarks
+ * index < 500 implies alloc failure is due to lack of memory
+ *
+ * XXX: The choice of 500 is arbitrary. Reinvestigate
+ * appropriately to determine a sensible default.
+ * and what it means when watermarks are also taken
+ * into account. Consider making it a sysctl
+ */
+ fragindex = fragmentation_index(zone, order);
+ if (fragindex >= 0 && fragindex <= 500)
+ continue;
+
+ if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+ rc = COMPACT_PARTIAL;
+ break;
+ }
+
+ status = compact_zone_order(zone, order, gfp_mask);
+ rc = max(status, rc);
+
+ if (zone_watermark_ok(zone, order, watermark, 0, 0))
+ break;
+ }
+
+ return rc;
+}
+
+
/* Compact all zones within a node */
int compact_node(int nid)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 462431a..7275afb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/memory.h>
+#include <linux/compaction.h>
#include <trace/events/kmem.h>
#include <asm/tlbflush.h>
@@ -1727,6 +1728,29 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
+ /* Try memory compaction for high-order allocations before reclaim */
+ if (order) {
+ *did_some_progress = try_to_compact_pages(zonelist,
+ order, gfp_mask, nodemask);
+ if (*did_some_progress != COMPACT_INCOMPLETE) {
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page) {
+ __count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+
+ /*
+ * It's bad if compaction run occurs and fails.
+ * The most likely reason is that pages exist,
+ * but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);
+ }
+ }
+
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 41585bf..1598fdc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -500,7 +500,7 @@ static void fill_contig_page_info(struct zone *zone,
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-int fragmentation_index(struct zone *zone,
+static int __fragmentation_index(struct zone *zone,
unsigned int order,
struct config_page_info *info)
{
@@ -522,6 +522,15 @@ int fragmentation_index(struct zone *zone,
return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
}
+/* Same as __fragmentation index but allocs config_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+ struct config_page_info info;
+
+ fill_contig_page_info(zone, order, &info);
+ return __fragmentation_index(zone, order, &info);
+}
+
/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
@@ -558,7 +567,7 @@ static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
zone->name, " ");
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ seq_printf(m, "%6d ", __fragmentation_index(zone, order, &info));
}
seq_putc(m, '\n');
@@ -856,6 +865,9 @@ static const char * const vmstat_text[] = {
"compact_blocks_moved",
"compact_pages_moved",
"compact_pagemigrate_failed",
+ "compact_stall",
+ "compact_fail",
+ "compact_success",
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
--
1.6.5
On Wed, 2010-01-06 at 16:26 +0000, Mel Gorman wrote:
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +int unusable_free_index(struct zone *zone,
> + unsigned int order,
> + struct config_page_info *info)
> +{
> + /* No free memory is interpreted as all free memory is unusable */
> + if (info->free_pages == 0)
> + return 100;
Should the above be 1000?
--
Thanks,
Adam
On Wed, Jan 06, 2010 at 11:10:48AM -0600, Adam Litke wrote:
> On Wed, 2010-01-06 at 16:26 +0000, Mel Gorman wrote:
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +int unusable_free_index(struct zone *zone,
> > + unsigned int order,
> > + struct config_page_info *info)
> > +{
> > + /* No free memory is interpreted as all free memory is unusable */
> > + if (info->free_pages == 0)
> > + return 100;
>
> Should the above be 1000?
>
Yes. Fortunately, the value is not actually used by any of the code.
It's for consumption by people or tools.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 06, 2010 at 04:26:06PM +0000, Mel Gorman wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
>
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
>
> TODO
> o Rethink block selection for migration. When originally developed,
> there was a deliberate effort to place MIGRATE_UNMOVABLE pages
> to the lower PFNs so the block selection made a lot of sense.
> However, that is no longer the case so rethink.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/compaction.h | 8 +
> include/linux/mm.h | 1 +
> include/linux/swap.h | 5 +
> include/linux/vmstat.h | 1 +
> mm/Makefile | 1 +
> mm/compaction.c | 340 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 37 +++++
> mm/vmscan.c | 5 -
> mm/vmstat.c | 5 +
> 9 files changed, 398 insertions(+), 5 deletions(-)
> create mode 100644 include/linux/compaction.h
> create mode 100644 mm/compaction.c
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> new file mode 100644
> index 0000000..6201371
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE 0
> +#define COMPACT_COMPLETE 1
> +
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2265f28..757ab86 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -332,6 +332,7 @@ void put_page(struct page *page);
> void put_pages_list(struct list_head *pages);
>
> void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>
> /*
> * Compound pages have a destructor function. Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..7e7181b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> __lru_cache_add(page, LRU_ACTIVE_FILE);
> }
>
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> +
> /* linux/mm/vmscan.c */
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ee03bba..d7f7236 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_COMPACTION) += compaction.o
> obj-$(CONFIG_SMP) += percpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> diff --git a/mm/compaction.c b/mm/compaction.c
> new file mode 100644
> index 0000000..d36760a
> --- /dev/null
> +++ b/mm/compaction.c
> @@ -0,0 +1,340 @@
> +/*
> + * linux/mm/compaction.c
> + *
> + * Memory compaction for the reduction of external fragmentation. Note that
> + * this heavily depends upon page migration to do all the real heavy
> + * lifting
> + *
> + * Copyright IBM Corp. 2009 Mel Gorman <[email protected]>
> + */
> +#include <linux/swap.h>
> +#include <linux/migrate.h>
> +#include <linux/compaction.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +/*
> + * compact_control is used to track pages being migrated and the free pages
> + * they are being migrated to during memory compaction. The free_pfn starts
> + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> + * are moved to the end of a zone during a compaction run and the run
> + * completes when free_pfn <= migrate_pfn
> + */
> +struct compact_control {
> + struct list_head freepages; /* List of free pages to migrate to */
> + struct list_head migratepages; /* List of pages being migrated */
> + unsigned long nr_freepages; /* Number of isolated free pages */
> + unsigned long nr_migratepages; /* Number of pages to migrate */
> + unsigned long free_pfn; /* isolate_freepages search base */
> + unsigned long migrate_pfn; /* isolate_migratepages search base */
> + struct zone *zone;
> +};
> +
> +static int release_freepages(struct zone *zone, struct list_head *freelist)
> +{
> + struct page *page, *next;
> + int count = 0;
> +
> + list_for_each_entry_safe(page, next, freelist, lru) {
> + list_del(&page->lru);
> + __free_page(page);
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
> + unsigned long blockpfn,
> + struct list_head *freelist)
> +{
> + unsigned long zone_end_pfn, end_pfn;
> + int total_isolated = 0;
> +
> + /* Get the last PFN we should scan for free pages at */
> + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + end_pfn = blockpfn + pageblock_nr_pages;
> + if (end_pfn > zone_end_pfn)
> + end_pfn = zone_end_pfn;
> +
> + /* Isolate free pages. This assumes the block is valid */
> + for (; blockpfn < end_pfn; blockpfn++) {
> + struct page *page;
> + int isolated, i;
> +
> + if (!pfn_valid_within(blockpfn))
> + continue;
> +
> + page = pfn_to_page(blockpfn);
> + if (!PageBuddy(page))
> + continue;
> +
> + /* Found a free page, break it into order-0 pages */
> + isolated = split_free_page(page);
> + total_isolated += isolated;
> + for (i = 0; i < isolated; i++) {
> + list_add(&page->lru, freelist);
> + page++;
> + blockpfn += isolated - 1;
*sigh*
This increment is wrong. It pushs blockpfn forward too fast and takes fewer
free pages than it should from that pageblock.
> + }
> + }
> +
> + return total_isolated;
> +}
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
> +{
> + /* If the page is a large free page, then allow migration */
> + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> + return 1;
> +
> + /* If the block is MIGRATE_MOVABLE, allow migration */
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> + return 1;
> +
> + /* Otherwise skip the block */
> + return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
> + */
> +static void isolate_freepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + struct page *page;
> + unsigned long high_pfn, low_pfn, pfn;
> + int nr_freepages = cc->nr_freepages;
> + struct list_head *freelist = &cc->freepages;
> +
> + pfn = cc->free_pfn;
> + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> + high_pfn = low_pfn;
> +
> + /*
> + * Isolate free pages until enough are available to migrate the
> + * pages on cc->migratepages. We stop searching if the migrate
> + * and free page scanners meet or enough free pages are isolated.
> + */
> + spin_lock_irq(&zone->lock);
> + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> + pfn -= pageblock_nr_pages) {
> + int isolated;
> +
> + if (!pfn_valid(pfn))
> + continue;
> +
> + /* Check for overlapping nodes/zones */
> + page = pfn_to_page(pfn);
> + if (page_zone(page) != zone)
> + continue;
> +
> + /* Check the block is suitable for migration */
> + if (!suitable_migration_target(page))
> + continue;
> +
> + /* Found a block suitable for isolating free pages from */
> + isolated = isolate_freepages_block(zone, pfn, freelist);
> + nr_freepages += isolated;
> +
> + /*
> + * Record the highest PFN we isolated pages from. When next
> + * looking for free pages, the search will restart here as
> + * page migration may have returned some pages to the allocator
> + */
> + if (isolated)
> + high_pfn = max(high_pfn, pfn);
> + }
> + spin_unlock_irq(&zone->lock);
> +
> + cc->free_pfn = high_pfn;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + unsigned long low_pfn, end_pfn;
> + struct list_head *migratelist;
> + enum lru_list lru_src;
> +
> + low_pfn = ALIGN(cc->migrate_pfn, pageblock_nr_pages);
> + migratelist = &cc->migratepages;
> +
> + /* Do not scan outside zone boundaries */
> + if (low_pfn < zone->zone_start_pfn)
> + low_pfn = zone->zone_start_pfn;
> +
> + /* Setup to scan one block but not past where we are migrating to */
> + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> + cc->migrate_pfn = end_pfn;
> + VM_BUG_ON(end_pfn > cc->free_pfn);
> +
> + if (!pfn_valid(low_pfn))
> + return 0;
> +
> + migrate_prep();
> +
> + /* Time to isolate some pages for migration */
> + spin_lock_irq(&zone->lru_lock);
> + for (; low_pfn < end_pfn; low_pfn++) {
> + struct page *page;
> + if (!pfn_valid_within(low_pfn))
> + continue;
> +
> + /* Get the page and skip if free */
> + page = pfn_to_page(low_pfn);
> + if (PageBuddy(page)) {
> + low_pfn += (1 << page_order(page)) - 1;
> + continue;
> + }
> +
> + if (!PageLRU(page) || PageUnevictable(page))
> + continue;
> +
> + /* Try isolate the page */
> + lru_src = page_lru(page);
> + switch (__isolate_lru_page(page, ISOLATE_BOTH, 0)) {
> + case 0:
> + list_move(&page->lru, migratelist);
> + mem_cgroup_del_lru(page);
> + cc->nr_migratepages++;
> + break;
> +
> + case -EBUSY:
> + /*
> + * else it is being freed elsewhere. The
> + * problem is that we are not really sure where
> + * it came from in the first place
> + * XXX: Verify the putback logic is ok. This was
> + * all written before LRU lists were split
> + */
> + list_move(&page->lru, &zone->lru[lru_src].list);
> + mem_cgroup_rotate_lru_list(page, page_lru(page));
> + continue;
> +
> + default:
> + BUG();
> + }
> + }
> + spin_unlock_irq(&zone->lru_lock);
> +
> + return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> + unsigned long data,
> + int **result)
> +{
> + struct compact_control *cc = (struct compact_control *)data;
> + struct page *freepage;
> +
> + VM_BUG_ON(cc == NULL);
> +
> + /* Isolate free pages if necessary */
> + if (list_empty(&cc->freepages)) {
> + isolate_freepages(cc->zone, cc);
> +
> + if (list_empty(&cc->freepages))
> + return NULL;
> + }
> +
> + freepage = list_entry(cc->freepages.next, struct page, lru);
> + list_del(&freepage->lru);
> + cc->nr_freepages--;
> +
> + return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> + int nr_migratepages = 0;
> + int nr_freepages = 0;
> + struct page *page;
> + list_for_each_entry(page, &cc->migratepages, lru)
> + nr_migratepages++;
> + list_for_each_entry(page, &cc->freepages, lru)
> + nr_freepages++;
> +
> + cc->nr_migratepages = nr_migratepages;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> + struct compact_control *cc)
> +{
> + /* Compaction run completes if the migrate and free scanner meet */
> + if (cc->free_pfn <= cc->migrate_pfn)
> + return COMPACT_COMPLETE;
> +
> + return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> + int ret = COMPACT_INCOMPLETE;
> +
> + /*
> + * Setup to move all movable pages to the end of the zone
> + * XXX: This could be improved upon. In the event compaction
> + * is being successful quickly but called often, there
> + * is a likelihood of scanning the same blocks as sources
> + * and targets frequently. Might be worth caching the
> + * last migrate_pfn to reduce scan times.
> + */
> + cc->migrate_pfn = zone->zone_start_pfn;
> + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> + cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> + unsigned long nr_migrate, nr_remaining;
> + if (!isolate_migratepages(zone, cc))
> + continue;
> +
> + nr_migrate = cc->nr_migratepages;
> + migrate_pages(&cc->migratepages, compaction_alloc,
> + (unsigned long)cc, 0);
> + update_nr_listpages(cc);
> + nr_remaining = cc->nr_migratepages;
> +
> + count_vm_event(COMPACTBLOCKS);
> + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> + if (nr_remaining)
> + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> + }
> +
> + /* Release free pages and check accounting */
> + cc->nr_freepages -= release_freepages(zone, &cc->freepages);
> + VM_BUG_ON(cc->nr_freepages != 0);
> +
> + /*
> + * Release LRU pages not migrated
> + * XXX: Page migration at this point tries fairly hard to move
> + * pages as it is but if migration fails, pages are left
> + * on cc->migratepages for more passes. This might cause
> + * multiple useless failures. Watch compact_pagemigrate_failed
> + * in /proc/vmstat. If it grows a lot, then putback should
> + * happen after each failed migration
> + */
> + if (!list_empty(&cc->migratepages))
> + putback_lru_pages(&cc->migratepages);
> +
> + return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4e9f5cc..462431a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1167,6 +1167,43 @@ void split_page(struct page *page, unsigned int order)
> set_page_refcounted(page + i);
> }
>
> +/* Similar to split_page except the page is already free */
> +int split_free_page(struct page *page)
> +{
> + unsigned int order;
> + unsigned long watermark;
> + struct zone *zone;
> +
> + BUG_ON(!PageBuddy(page));
> +
> + zone = page_zone(page);
> + order = page_order(page);
> +
> + /* Obey watermarks or the system could deadlock */
> + watermark = low_wmark_pages(zone) + (1 << order);
> + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> + return 0;
> +
> + /* Remove page from free list */
> + list_del(&page->lru);
> + zone->free_area[order].nr_free--;
> + rmv_page_order(page);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> + /* Split into individual pages */
> + set_page_refcounted(page);
> + split_page(page, order);
> +
> + /* Set the migratetype on the assumption it's for migration */
> + if (order >= pageblock_order - 1) {
> + struct page *endpage = page + (1 << order) - 1;
> + for (; page < endpage; page += pageblock_nr_pages)
> + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> + }
> +
> + return 1 << order;
> +}
> +
> /*
> * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> * we cheat by calling it from here, in the order > 0 path. Saves a branch
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 885207a..2ad8603 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -803,11 +803,6 @@ keep:
> return nr_reclaimed;
> }
>
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> -
> /*
> * Attempt to remove the specified page from its LRU. Only take this page
> * if it is of the appropriate PageActive status. Pages which are being
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 560ee08..41585bf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -852,6 +852,11 @@ static const char * const vmstat_text[] = {
> "allocstall",
>
> "pgrotated",
> +
> + "compact_blocks_moved",
> + "compact_pages_moved",
> + "compact_pagemigrate_failed",
> +
> #ifdef CONFIG_HUGETLB_PAGE
> "htlb_buddy_alloc_success",
> "htlb_buddy_alloc_fail",
> --
> 1.6.5
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 06, 2010 at 05:50:32PM +0000, Mel Gorman wrote:
> > + /* Found a free page, break it into order-0 pages */
> > + isolated = split_free_page(page);
> > + total_isolated += isolated;
> > + for (i = 0; i < isolated; i++) {
> > + list_add(&page->lru, freelist);
> > + page++;
>
>
> > + blockpfn += isolated - 1;
>
> *sigh*
>
> This increment is wrong. It pushs blockpfn forward too fast and takes fewer
> free pages than it should from that pageblock.
>
I should point out that the consequence of this is that pages are not
being isolated properly and the free scanner moves too quickly. This
both aborts the compaction early and the end result is not as compact.
With this stupidity repaired, the results become
Starting page count: 0
Requesting at each increment: 50 huge pages
1: 50 pages Success time:0.09 rclm:8192 cblock:59 csuccess:20 alloc: 50/50
2: 100 pages Success time:0.09 rclm:3924 cblock:80 csuccess:28 alloc: 50/50
3: 150 pages Success time:0.08 rclm:4088 cblock:110 csuccess:26 alloc: 50/50
4: 200 pages Success time:0.15 rclm:4749 cblock:148 csuccess:32 alloc: 50/50
5: 250 pages Success time:0.05 rclm:7142 cblock:56 csuccess:17 alloc: 50/50
6: 300 pages Success time:0.08 rclm:6361 cblock:38 csuccess:11 alloc: 50/50
7: 350 pages Success time:0.04 rclm:6220 cblock:40 csuccess:11 alloc: 50/50
8: 400 pages Success time:0.01 rclm:1037 cblock:12 csuccess:11 alloc: 50/50
9: 450 pages Success time:0.01 rclm:203 cblock:13 csuccess:14 alloc: 50/50
10: 500 pages Success time:0.06 rclm:32 cblock:137 csuccess:33 alloc: 50/50
11: 550 pages Success time:0.03 rclm:0 cblock:67 csuccess:16 alloc: 50/50
12: 600 pages Success time:0.01 rclm:0 cblock:11 csuccess:11 alloc: 50/50
13: 650 pages Success time:0.00 rclm:0 cblock:0 csuccess:2 alloc: 50/50
14: 700 pages Success time:0.01 rclm:0 cblock:0 csuccess:5 alloc: 50/50
15: 750 pages Success time:0.00 rclm:0 cblock:0 csuccess:0 alloc: 50/50
16: 800 pages Success time:0.00 rclm:0 cblock:3 csuccess:17 alloc: 50/50
17: 850 pages Success time:0.01 rclm:0 cblock:6 csuccess:3 alloc: 50/50
18: 895 pages Success time:0.30 rclm:70 cblock:62 csuccess:16 alloc: 45/50
19: 900 pages Success time:0.39 rclm:415 cblock:78 csuccess:2 alloc: 5/50
20: 900 pages Failed time:0.09 rclm:53 cblock:48 csuccess:0
21: 902 pages Success time:0.04 rclm:0 cblock:21 csuccess:1 alloc: 2/50
22: 902 pages Failed time:0.04 rclm:0 cblock:21 csuccess:0
23: 902 pages Failed time:0.09 rclm:1017 cblock:47 csuccess:0
24: 903 pages Success time:0.04 rclm:0 cblock:22 csuccess:0 alloc: 1/50
25: 903 pages Failed time:0.58 rclm:464 cblock:67 csuccess:0
26: 911 pages Success time:0.06 rclm:0 cblock:22 csuccess:1 alloc: 8/50
27: 911 pages Failed time:0.04 rclm:0 cblock:21 csuccess:0
28: 917 pages Success time:0.86 rclm:224 cblock:94 csuccess:2 alloc: 6/50
29: 918 pages Success time:0.04 rclm:0 cblock:23 csuccess:1 alloc: 1/50
30: 918 pages Failed time:0.30 rclm:413 cblock:48 csuccess:0
31: 919 pages Success time:0.19 rclm:486 cblock:52 csuccess:0 alloc: 1/50
32: 919 pages Failed time:0.04 rclm:0 cblock:14 csuccess:0
33: 919 pages Failed time:0.23 rclm:425 cblock:52 csuccess:0
34: 919 pages Failed time:0.04 rclm:0 cblock:14 csuccess:0
35: 919 pages Failed time:0.15 rclm:420 cblock:48 csuccess:0
36: 919 pages Failed time:0.04 rclm:0 cblock:14 csuccess:0
Final page count: 919
Total pages reclaimed: 45935
Total blocks compacted: 1548
Total compact pages alloced: 280
So that it works much more as expected. Reclaims are way down, allocation
success rates are still very high.
Credit goes to Eric Munson who pointed the bug out to be on IRC.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman <[email protected]> writes:
Haven't reviewed the full thing, but one thing I noticed below:
> +
> + /*
> + * Isolate free pages until enough are available to migrate the
> + * pages on cc->migratepages. We stop searching if the migrate
> + * and free page scanners meet or enough free pages are isolated.
> + */
> + spin_lock_irq(&zone->lock);
Won't that cause very long lock hold times on large zones?
Presumably you need some kind of lock break heuristic.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Jan 06, 2010 at 10:37:22PM +0100, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
>
> Haven't reviewed the full thing, but one thing I noticed below:
>
> > +
> > + /*
> > + * Isolate free pages until enough are available to migrate the
> > + * pages on cc->migratepages. We stop searching if the migrate
> > + * and free page scanners meet or enough free pages are isolated.
> > + */
> > + spin_lock_irq(&zone->lock);
>
> Won't that cause very long lock hold times on large zones?
Good question. The amount of memory unavailable and the duration should
be bounded.
isolate_migratepages only considers a pageblock of pages, the maximum of
which will be MAX_ORDER_NR_PAGES so ordinarily you would expect the hold
time to be fairly short - even on large zones.
The one exception is if migration of too many of these pages are failing. The
pages are not immediately put back on the LRU list. In a really bad scenario,
too many free pages could indeed get isolated. I comment on this problem
although from another perspective here
* XXX: Page migration at this point tries fairly hard to move
* pages as it is but if migration fails, pages are left
* on cc->migratepages for more passes. This might cause
* multiple useless failures. Watch
* compact_pagemigrate_failed
* in /proc/vmstat. If it grows a lot, then putback should
* happen after each failed migration
So, in theory in a worst case scenario, it could grow too much. The
solution would be to put pages that fail to migrate back on the LRU
list. That would keep the length of time zone->lock is held low.
Even in that worst case scenario, there is a limit to how many pages will
be removed from the free lists. When isolating free pages, split_free_page
is called and one of the checks it makes is
/* Obey watermarks or the system could deadlock */
watermark = low_wmark_pages(zone) + (1 << order);
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;
i.e. it shouldn't be isolating pages if watermarks get messed up. If
enough free pages are not available, migration should fail, compaction
therefore fails and all the pages get put back.
Bottom line, I do not expect it to be bad. I'm much more concerned about
zone->lock getting hammered by isolating free pages, then giving them
back because page migration keeps failing and freeing the isolated pages
back to the lists.
> Presumably you need some kind of lock break heuristic.
>
The heuristic I'm going for is "never be taking too many pages".
Just in case though, I'll put in a
WARN_ON_ONCE(nr_migratepages > MAX_ORDER_NR_PAGES * 3);
in isolate_free_pages. If that warning triggers, it likely means the
lock is being held too long.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 6, 2010 at 8:26 AM, Mel Gorman <[email protected]> wrote:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..e1ea2d5 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> ? ? ? ?return 0;
> ?}
>
> +
> +struct config_page_info {
> + ? ? ? unsigned long free_pages;
> + ? ? ? unsigned long free_blocks_total;
> + ? ? ? unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes to attempt to estimate
s/makes to/makes no/ ?
Tim
On Wed, 6 Jan 2010, Mel Gorman wrote:
> CONFIG_MIGRATION currently depends on CONFIG_NUMA. The current users of
> page migration such as sys_move_pages(), sys_migrate_pages() and cpuset
> process migration are ordinarily only beneficial on NUMA.
>
> As memory compaction will operate within a zone and is useful on both NUMA
> and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> user selects CONFIG_COMPACTION as an option.
>
> TODO
> o After this patch is applied, the migration core is available but it
> also makes NUMA-specific features available. This is too much
> exposure so revisit this.
>
CONFIG_MIGRATION is no longer strictly dependent on CONFIG_NUMA since
ARCH_ENABLE_MEMORY_HOTREMOVE has allowed it to be configured for UMA
machines. All strictly NUMA features in the migration core should be
isolated under its #ifdef CONFIG_NUMA (sys_move_pages()) in mm/migrate.c
or by simply not compiling mm/mempolicy.c (sys_migrate_pages()), so this
patch looks fine as is (although the "help" text for CONFIG_MIGRATION
could be updated to reflect that it's useful for both memory hot-remove
and now compaction).
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/Kconfig | 12 +++++++++++-
> 1 files changed, 11 insertions(+), 1 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 17b8947..1d8e2b2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -168,12 +168,22 @@ config SPLIT_PTLOCK_CPUS
> default "4"
>
> #
> +# support for memory compaction
> +config COMPACTION
> + bool "Allow for memory compaction"
> + def_bool y
> + select MIGRATION
> + depends on EXPERIMENTAL && HUGETLBFS
> + help
> + Allows the compaction of memory for the allocation of huge pages.
> +
> +#
> # support for page migration
> #
> config MIGRATION
> bool "Page migration"
> def_bool y
> - depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
> + depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
> help
> Allows the migration of the physical location of pages of processes
> while the virtual addresses are not changed. This is useful for
On Wed, 6 Jan 2010, Mel Gorman wrote:
> This patch adds a proc file /proc/sys/vm/compact_node. When a NID is written
> to the file, each zone in that node is compacted. This should be done with
> debugfs but this was what was available to rebase quickly and I suspect
> debugfs either did not exist or was in development during the first
> implementation.
>
> If this interface is to exist in the long term, it needs to be thought
> about carefully. For the moment, it's handy to have to test compaction
> under a controlled setting.
>
With Lee's work on mempolicy-constrained hugepage allocations, there is a
use-case for this explicit trigger to be exported via sysfs in the
longterm: we should be able to determine how successful the allocation of
hugepages will be on a particular node without actually allocating them
depending on the degree of external fragmentation to form our mempolicy.
Since node-targeted hugepage allocation and freeing now exists in the
kernel and compaction is used primary for the former, I don't see why it
shouldn't be exposed.
I like the (slightly racy) interface that avoids having a trigger for each
node under /sys/devices/system/node as well.
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/compaction.h | 5 ++++
> kernel/sysctl.c | 11 +++++++++
> mm/compaction.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 68 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 6201371..5965ef2 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -5,4 +5,9 @@
> #define COMPACT_INCOMPLETE 0
> #define COMPACT_COMPLETE 1
>
> +#ifdef CONFIG_MIGRATION
> +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos);
> +#endif /* CONFIG_MIGRATION */
> +
> #endif /* _LINUX_COMPACTION_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a68b24..6202e95 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -50,6 +50,7 @@
> #include <linux/ftrace.h>
> #include <linux/slow-work.h>
> #include <linux/perf_event.h>
> +#include <linux/compaction.h>
>
> #include <asm/uaccess.h>
> #include <asm/processor.h>
> @@ -80,6 +81,7 @@ extern int pid_max;
> extern int min_free_kbytes;
> extern int pid_max_min, pid_max_max;
> extern int sysctl_drop_caches;
> +extern int sysctl_compact_node;
> extern int percpu_pagelist_fraction;
> extern int compat_log;
> extern int latencytop_enabled;
> @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> .mode = 0644,
> .proc_handler = drop_caches_sysctl_handler,
> },
> +#ifdef CONFIG_MIGRATION
> + {
> + .procname = "compact_node",
> + .data = &sysctl_compact_node,
> + .maxlen = sizeof(int),
> + .mode = 0644,
This should only need 0200?
> + .proc_handler = sysctl_compaction_handler,
> + },
> +#endif /* CONFIG_MIGRATION */
> {
> .procname = "min_free_kbytes",
> .data = &min_free_kbytes,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d36760a..a8bcae2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -11,6 +11,7 @@
> #include <linux/migrate.h>
> #include <linux/compaction.h>
> #include <linux/mm_inline.h>
> +#include <linux/sysctl.h>
> #include "internal.h"
>
> /*
> @@ -338,3 +339,54 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> return ret;
> }
>
> +/* Compact all zones within a node */
> +int compact_node(int nid)
> +{
> + int zoneid;
> + pg_data_t *pgdat;
> + struct zone *zone;
> +
> + if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> + return -EINVAL;
> + pgdat = NODE_DATA(nid);
> +
> + /* Flush pending updates to the LRU lists */
> + lru_add_drain_all();
> +
> + printk(KERN_INFO "Compacting memory in node %d\n", nid);
> + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> + struct compact_control cc;
> +
> + zone = &pgdat->node_zones[zoneid];
> + if (!populated_zone(zone))
> + continue;
> +
> + cc.nr_freepages = 0;
> + cc.nr_migratepages = 0;
> + cc.zone = zone;
> + INIT_LIST_HEAD(&cc.freepages);
> + INIT_LIST_HEAD(&cc.migratepages);
> +
> + compact_zone(zone, &cc);
> +
> + VM_BUG_ON(!list_empty(&cc.freepages));
> + VM_BUG_ON(!list_empty(&cc.migratepages));
> + }
> + printk(KERN_INFO "Compaction of node %d complete\n", nid);
If it's exposed through sysfs, the printk's should probably be demoted to
pr_debug().
> +
> + return 0;
> +}
> +
> +/* This is global and fierce ugly but it's straight-forward */
> +int sysctl_compact_node;
> +
> +/* This is the entry point for compacting nodes via /proc/sys/vm */
> +int sysctl_compaction_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos)
> +{
> + proc_dointvec(table, write, buffer, length, ppos);
> + if (write)
> + return compact_node(sysctl_compact_node);
> +
> + return 0;
> +}
On Thu, 7 Jan 2010, David Rientjes wrote:
> CONFIG_MIGRATION is no longer strictly dependent on CONFIG_NUMA since
> ARCH_ENABLE_MEMORY_HOTREMOVE has allowed it to be configured for UMA
> machines. All strictly NUMA features in the migration core should be
> isolated under its #ifdef CONFIG_NUMA (sys_move_pages()) in mm/migrate.c
> or by simply not compiling mm/mempolicy.c (sys_migrate_pages()), so this
> patch looks fine as is (although the "help" text for CONFIG_MIGRATION
> could be updated to reflect that it's useful for both memory hot-remove
> and now compaction).
Correct.
Reviewed-by: Christoph Lameter <[email protected]>
On Thu, 7 Jan 2010, David Rientjes wrote:
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 6201371..5965ef2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -5,4 +5,9 @@
> > #define COMPACT_INCOMPLETE 0
> > #define COMPACT_COMPLETE 1
> >
> > +#ifdef CONFIG_MIGRATION
> > +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *length, loff_t *ppos);
> > +#endif /* CONFIG_MIGRATION */
> > +
> > #endif /* _LINUX_COMPACTION_H */
This should be CONFIG_COMPACTION since mm/compaction.c won't be compiled
without it; the later additions to this ifdef, fragmentation_index() and
try_to_compact_pages(), can also be under CONFIG_COMPACTION since neither
are used outside of the compaction core directly (__fragmentation_index()
from vmstat uses its wrapped function at file scope).
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a68b24..6202e95 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -50,6 +50,7 @@
> > #include <linux/ftrace.h>
> > #include <linux/slow-work.h>
> > #include <linux/perf_event.h>
> > +#include <linux/compaction.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/processor.h>
> > @@ -80,6 +81,7 @@ extern int pid_max;
> > extern int min_free_kbytes;
> > extern int pid_max_min, pid_max_max;
> > extern int sysctl_drop_caches;
> > +extern int sysctl_compact_node;
> > extern int percpu_pagelist_fraction;
> > extern int compat_log;
> > extern int latencytop_enabled;
> > @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> > .mode = 0644,
> > .proc_handler = drop_caches_sysctl_handler,
> > },
> > +#ifdef CONFIG_MIGRATION
> > + {
> > + .procname = "compact_node",
> > + .data = &sysctl_compact_node,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
>
> This should only need 0200?
>
This needs to be CONFIG_COMPACTION as well, we won't have the handler
without mm/compaction.c.
On Wed, 6 Jan 2010, Mel Gorman wrote:
> The fragmentation index may indicate that a failure it due to external
> fragmentation, a compaction run complete and an allocation failure still
> fail. There are two obvious reasons as to why
>
> o Page migration cannot move all pages so fragmentation remains
> o A suitable page may exist but watermarks are not met
>
> In the event of compaction and allocation failure, this patch prevents
> compaction happening for a short interval. It's only recorded on the
> preferred zone but that should be enough coverage. This could have been
> implemented similar to the zonelist_cache but the increased size of the
> zonelist did not appear to be justified.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mmzone.h | 7 +++++++
> mm/page_alloc.c | 15 ++++++++++++++-
> 2 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..1d6ccbe 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -328,6 +328,13 @@ struct zone {
> unsigned long *pageblock_flags;
> #endif /* CONFIG_SPARSEMEM */
>
> +#ifdef CONFIG_MIGRATION
> + /*
> + * If a compaction fails, do not try compaction again until
> + * jiffies is after the value of compact_resume
> + */
> + unsigned long compact_resume;
> +#endif
CONFIG_COMPACTION?
>
> ZONE_PADDING(_pad1_)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7275afb..9c86606 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1729,7 +1729,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> cond_resched();
>
> /* Try memory compaction for high-order allocations before reclaim */
> - if (order) {
> + if (order && time_after(jiffies, preferred_zone->compact_resume)) {
> *did_some_progress = try_to_compact_pages(zonelist,
> order, gfp_mask, nodemask);
> if (*did_some_progress != COMPACT_INCOMPLETE) {
> @@ -1748,6 +1748,19 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> * but not enough to satisfy watermarks.
> */
> count_vm_event(COMPACTFAIL);
> +
> + /*
> + * On failure, avoid compaction for a short time.
> + * XXX: This is very unsatisfactory. The failure
> + * to compact has nothing to do with time
> + * and everything to do with the requested
> + * order, the number of free pages and
> + * watermarks. How to wait on that is more
> + * unclear, but the answer would apply to
> + * other areas where the VM waits based on
> + * time.
> + */
> + preferred_zone->compact_resume = jiffies + HZ/50;
> }
> }
>
This will need to be moved to (another) inline function dependent on
CONFIG_COMPACTION since we don't have zone->compact_resume without it;
it's probably better to seperate the function out rather than add #ifdef's
within __alloc_pages_direct_reclaim().
On Thu, Jan 07, 2010 at 01:46:03PM -0800, David Rientjes wrote:
> On Wed, 6 Jan 2010, Mel Gorman wrote:
>
> > CONFIG_MIGRATION currently depends on CONFIG_NUMA. The current users of
> > page migration such as sys_move_pages(), sys_migrate_pages() and cpuset
> > process migration are ordinarily only beneficial on NUMA.
> >
> > As memory compaction will operate within a zone and is useful on both NUMA
> > and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> > user selects CONFIG_COMPACTION as an option.
> >
> > TODO
> > o After this patch is applied, the migration core is available but it
> > also makes NUMA-specific features available. This is too much
> > exposure so revisit this.
> >
>
> CONFIG_MIGRATION is no longer strictly dependent on CONFIG_NUMA since
> ARCH_ENABLE_MEMORY_HOTREMOVE has allowed it to be configured for UMA
> machines. All strictly NUMA features in the migration core should be
> isolated under its #ifdef CONFIG_NUMA (sys_move_pages()) in mm/migrate.c
> or by simply not compiling mm/mempolicy.c (sys_migrate_pages()), so this
> patch looks fine as is (although the "help" text for CONFIG_MIGRATION
> could be updated to reflect that it's useful for both memory hot-remove
> and now compaction).
>
That does appear to be the case, thanks. I had not double-checked
closely and it was somewhat of a problem when the series was first
developed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 13, 2010 at 03:23:44PM -0800, David Rientjes wrote:
> On Thu, 7 Jan 2010, David Rientjes wrote:
>
> > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > > index 6201371..5965ef2 100644
> > > --- a/include/linux/compaction.h
> > > +++ b/include/linux/compaction.h
> > > @@ -5,4 +5,9 @@
> > > #define COMPACT_INCOMPLETE 0
> > > #define COMPACT_COMPLETE 1
> > >
> > > +#ifdef CONFIG_MIGRATION
> > > +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > > + void __user *buffer, size_t *length, loff_t *ppos);
> > > +#endif /* CONFIG_MIGRATION */
> > > +
> > > #endif /* _LINUX_COMPACTION_H */
>
> This should be CONFIG_COMPACTION since mm/compaction.c won't be compiled
> without it; the later additions to this ifdef, fragmentation_index() and
> try_to_compact_pages(), can also be under CONFIG_COMPACTION since neither
> are used outside of the compaction core directly (__fragmentation_index()
> from vmstat uses its wrapped function at file scope).
>
True. It's corrected now.
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 8a68b24..6202e95 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -50,6 +50,7 @@
> > > #include <linux/ftrace.h>
> > > #include <linux/slow-work.h>
> > > #include <linux/perf_event.h>
> > > +#include <linux/compaction.h>
> > >
> > > #include <asm/uaccess.h>
> > > #include <asm/processor.h>
> > > @@ -80,6 +81,7 @@ extern int pid_max;
> > > extern int min_free_kbytes;
> > > extern int pid_max_min, pid_max_max;
> > > extern int sysctl_drop_caches;
> > > +extern int sysctl_compact_node;
> > > extern int percpu_pagelist_fraction;
> > > extern int compat_log;
> > > extern int latencytop_enabled;
> > > @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> > > .mode = 0644,
> > > .proc_handler = drop_caches_sysctl_handler,
> > > },
> > > +#ifdef CONFIG_MIGRATION
> > > + {
> > > + .procname = "compact_node",
> > > + .data = &sysctl_compact_node,
> > > + .maxlen = sizeof(int),
> > > + .mode = 0644,
> >
> > This should only need 0200?
> >
>
> This needs to be CONFIG_COMPACTION as well, we won't have the handler
> without mm/compaction.c.
>
Both corrected.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Jan 07, 2010 at 02:00:16PM -0800, David Rientjes wrote:
> On Wed, 6 Jan 2010, Mel Gorman wrote:
>
> > This patch adds a proc file /proc/sys/vm/compact_node. When a NID is written
> > to the file, each zone in that node is compacted. This should be done with
> > debugfs but this was what was available to rebase quickly and I suspect
> > debugfs either did not exist or was in development during the first
> > implementation.
> >
> > If this interface is to exist in the long term, it needs to be thought
> > about carefully. For the moment, it's handy to have to test compaction
> > under a controlled setting.
> >
>
> With Lee's work on mempolicy-constrained hugepage allocations, there is a
> use-case for this explicit trigger to be exported via sysfs in the
> longterm:
True, although the per-node structures are only available on NUMA making
it necessary to have two interfaces. The per-node one is handy enough
because it would be just
/sys/devices/system/node/nodeX/compact_node
When written to, this node is compacted by the writing process
But there does not appear to be a "good" way of having a non-NUMA
interface. /sys/devices/system/node does not exist .... Does anyone
remember why !NUMA does not have a /sys/devices/system/node/node0? Is
there a good reason or was there just no point?
> we should be able to determine how successful the allocation of
> hugepages will be on a particular node without actually allocating them
> depending on the degree of external fragmentation to form our mempolicy.
You can guess to some extent but can't know without scanning the pages.
A simple guess would be based on the number of MIGRATE_MOVABLE pageblocks
and the number of LRU pages that are in the node. Something like;
1. Count how many MIGRATE_MOVABLE pageblocks there currently are in the
zone (examination of a bitmap)
2. Assuming there is little or no mixing of pageblocks (because min_free_kbytes
was tuned properly), calculate how many MIGRATE_MOVABLE pageblocks
are required for all the LRU pages in the zone
3. The number of possible hugepages is the difference in the two
pageblock counts.
> Since node-targeted hugepage allocation and freeing now exists in the
> kernel and compaction is used primary for the former, I don't see why it
> shouldn't be exposed.
>
> I like the (slightly racy) interface that avoids having a trigger for each
> node under /sys/devices/system/node as well.
>
So, would you be happy with a proc-interface like this and also having a
sysfs-based trigger for NUMA? Or do we need to come up with a
sysfs-based trigger that works for both NUMA and !NUMA?
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/compaction.h | 5 ++++
> > kernel/sysctl.c | 11 +++++++++
> > mm/compaction.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 68 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 6201371..5965ef2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -5,4 +5,9 @@
> > #define COMPACT_INCOMPLETE 0
> > #define COMPACT_COMPLETE 1
> >
> > +#ifdef CONFIG_MIGRATION
> > +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *length, loff_t *ppos);
> > +#endif /* CONFIG_MIGRATION */
> > +
> > #endif /* _LINUX_COMPACTION_H */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a68b24..6202e95 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -50,6 +50,7 @@
> > #include <linux/ftrace.h>
> > #include <linux/slow-work.h>
> > #include <linux/perf_event.h>
> > +#include <linux/compaction.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/processor.h>
> > @@ -80,6 +81,7 @@ extern int pid_max;
> > extern int min_free_kbytes;
> > extern int pid_max_min, pid_max_max;
> > extern int sysctl_drop_caches;
> > +extern int sysctl_compact_node;
> > extern int percpu_pagelist_fraction;
> > extern int compat_log;
> > extern int latencytop_enabled;
> > @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> > .mode = 0644,
> > .proc_handler = drop_caches_sysctl_handler,
> > },
> > +#ifdef CONFIG_MIGRATION
> > + {
> > + .procname = "compact_node",
> > + .data = &sysctl_compact_node,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
>
> This should only need 0200?
>
> > + .proc_handler = sysctl_compaction_handler,
> > + },
> > +#endif /* CONFIG_MIGRATION */
> > {
> > .procname = "min_free_kbytes",
> > .data = &min_free_kbytes,
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index d36760a..a8bcae2 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -11,6 +11,7 @@
> > #include <linux/migrate.h>
> > #include <linux/compaction.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/sysctl.h>
> > #include "internal.h"
> >
> > /*
> > @@ -338,3 +339,54 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > return ret;
> > }
> >
> > +/* Compact all zones within a node */
> > +int compact_node(int nid)
> > +{
> > + int zoneid;
> > + pg_data_t *pgdat;
> > + struct zone *zone;
> > +
> > + if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> > + return -EINVAL;
> > + pgdat = NODE_DATA(nid);
> > +
> > + /* Flush pending updates to the LRU lists */
> > + lru_add_drain_all();
> > +
> > + printk(KERN_INFO "Compacting memory in node %d\n", nid);
> > + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> > + struct compact_control cc;
> > +
> > + zone = &pgdat->node_zones[zoneid];
> > + if (!populated_zone(zone))
> > + continue;
> > +
> > + cc.nr_freepages = 0;
> > + cc.nr_migratepages = 0;
> > + cc.zone = zone;
> > + INIT_LIST_HEAD(&cc.freepages);
> > + INIT_LIST_HEAD(&cc.migratepages);
> > +
> > + compact_zone(zone, &cc);
> > +
> > + VM_BUG_ON(!list_empty(&cc.freepages));
> > + VM_BUG_ON(!list_empty(&cc.migratepages));
> > + }
> > + printk(KERN_INFO "Compaction of node %d complete\n", nid);
>
> If it's exposed through sysfs, the printk's should probably be demoted to
> pr_debug().
>
I deleted them altogether. The point originally was to measure how long
compaction took. It can just as easily be done by running
time echo 0 > compact_node
Thanks
> > +
> > + return 0;
> > +}
> > +
> > +/* This is global and fierce ugly but it's straight-forward */
> > +int sysctl_compact_node;
> > +
> > +/* This is the entry point for compacting nodes via /proc/sys/vm */
> > +int sysctl_compaction_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *length, loff_t *ppos)
> > +{
> > + proc_dointvec(table, write, buffer, length, ppos);
> > + if (write)
> > + return compact_node(sysctl_compact_node);
> > +
> > + return 0;
> > +}
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 13, 2010 at 03:28:16PM -0800, David Rientjes wrote:
> On Wed, 6 Jan 2010, Mel Gorman wrote:
>
> > The fragmentation index may indicate that a failure it due to external
> > fragmentation, a compaction run complete and an allocation failure still
> > fail. There are two obvious reasons as to why
> >
> > o Page migration cannot move all pages so fragmentation remains
> > o A suitable page may exist but watermarks are not met
> >
> > In the event of compaction and allocation failure, this patch prevents
> > compaction happening for a short interval. It's only recorded on the
> > preferred zone but that should be enough coverage. This could have been
> > implemented similar to the zonelist_cache but the increased size of the
> > zonelist did not appear to be justified.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/mmzone.h | 7 +++++++
> > mm/page_alloc.c | 15 ++++++++++++++-
> > 2 files changed, 21 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 30fe668..1d6ccbe 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -328,6 +328,13 @@ struct zone {
> > unsigned long *pageblock_flags;
> > #endif /* CONFIG_SPARSEMEM */
> >
> > +#ifdef CONFIG_MIGRATION
> > + /*
> > + * If a compaction fails, do not try compaction again until
> > + * jiffies is after the value of compact_resume
> > + */
> > + unsigned long compact_resume;
> > +#endif
>
> CONFIG_COMPACTION?
>
Yep
> >
> > ZONE_PADDING(_pad1_)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7275afb..9c86606 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1729,7 +1729,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > cond_resched();
> >
> > /* Try memory compaction for high-order allocations before reclaim */
> > - if (order) {
> > + if (order && time_after(jiffies, preferred_zone->compact_resume)) {
> > *did_some_progress = try_to_compact_pages(zonelist,
> > order, gfp_mask, nodemask);
> > if (*did_some_progress != COMPACT_INCOMPLETE) {
> > @@ -1748,6 +1748,19 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > * but not enough to satisfy watermarks.
> > */
> > count_vm_event(COMPACTFAIL);
> > +
> > + /*
> > + * On failure, avoid compaction for a short time.
> > + * XXX: This is very unsatisfactory. The failure
> > + * to compact has nothing to do with time
> > + * and everything to do with the requested
> > + * order, the number of free pages and
> > + * watermarks. How to wait on that is more
> > + * unclear, but the answer would apply to
> > + * other areas where the VM waits based on
> > + * time.
> > + */
> > + preferred_zone->compact_resume = jiffies + HZ/50;
> > }
> > }
> >
>
> This will need to be moved to (another) inline function dependent on
> CONFIG_COMPACTION since we don't have zone->compact_resume without it;
> it's probably better to seperate the function out rather than add #ifdef's
> within __alloc_pages_direct_reclaim().
>
Moved to an inline called defer_compaction()
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, 20 Jan 2010, Mel Gorman wrote:
> True, although the per-node structures are only available on NUMA making
> it necessary to have two interfaces. The per-node one is handy enough
> because it would be just
>
> /sys/devices/system/node/nodeX/compact_node
> When written to, this node is compacted by the writing process
>
> But there does not appear to be a "good" way of having a non-NUMA
> interface. /sys/devices/system/node does not exist .... Does anyone
> remember why !NUMA does not have a /sys/devices/system/node/node0? Is
> there a good reason or was there just no point?
We could create a fake node0 for the !NUMA case I guess? Dont see a major
reason why not to do it aside from scripts that may check for the presence
of the file to switch to a "NUMA" mode.
On Wed, 20 Jan 2010, Mel Gorman wrote:
> > With Lee's work on mempolicy-constrained hugepage allocations, there is a
> > use-case for this explicit trigger to be exported via sysfs in the
> > longterm:
>
> True, although the per-node structures are only available on NUMA making
> it necessary to have two interfaces. The per-node one is handy enough
> because it would be just
>
> /sys/devices/system/node/nodeX/compact_node
> When written to, this node is compacted by the writing process
>
> But there does not appear to be a "good" way of having a non-NUMA
> interface. /sys/devices/system/node does not exist .... Does anyone
> remember why !NUMA does not have a /sys/devices/system/node/node0? Is
> there a good reason or was there just no point?
>
There doesn't seem to be a usecase for a fake node0 sysfs entry since it
would be a duplication of procfs.
I think it would be best to create a global /proc/sys/vm/compact trigger
that would walk all "compactable" zones system-wide and then a per-node
/sys/devices/system/node/nodeX/compact trigger for that particular node,
both with permissions 0200.
It would be helpful to be able to determine what is "compactable" at the
same time by adding both global and per-node "compact_order" tunables that
would default to HUGETLB_PAGE_ORDER. Then, the corresponding "compact"
trigger would only do work if fill_contig_page_info() shows
!free_blocks_suitable for either all zones (global trigger) or each zone
in the node's zonelist (per-node trigger).
On Wed, Jan 20, 2010 at 12:12:55PM -0600, Christoph Lameter wrote:
> On Wed, 20 Jan 2010, Mel Gorman wrote:
>
> > True, although the per-node structures are only available on NUMA making
> > it necessary to have two interfaces. The per-node one is handy enough
> > because it would be just
> >
> > /sys/devices/system/node/nodeX/compact_node
> > When written to, this node is compacted by the writing process
> >
> > But there does not appear to be a "good" way of having a non-NUMA
> > interface. /sys/devices/system/node does not exist .... Does anyone
> > remember why !NUMA does not have a /sys/devices/system/node/node0? Is
> > there a good reason or was there just no point?
>
> We could create a fake node0 for the !NUMA case I guess?
I would like to but I have the same concerns as you about programs or scripts
assuming the existence of /sys/devices/system/node/ imples NUMA.
> Dont see a major
> reason why not to do it aside from scripts that may check for the presence
> of the file to switch to a "NUMA" mode.
>
That would suck royally and unfortunately it's partly the case with libnuma
at least. Well, not the library itself but one of the utilities.
numa_available() is implemented by checking the return value of get_mempolicy()
so it's ok.
It checks the max configured node by parsing the contents of the
/sys/devices/system/node/ directory so that should also be ok as long as
the UMA node is 0.
However, the numastat script is a perl script that makes assumptions on
NUMA versus UMA depending on the existence of the sysfs directory. If it
exists, it parses numastat. While this would be faked as well, we're
talking about adding a fair amount of fakery in there and still end up
with a behaviour change. Previously, the script would have identified
the system was not NUMA aware and afterwards, it prints out meaningless
values.
Not sure how great an option that is :(
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
Sorry, I haven't read this patch at all.
> The time differences are marginal but bear in mind that this is an ideal
> case of mostly unmapped buffer pages. On nice set of results is between
> allocations 13-18 where no pages were reclaimed, some compaction occured
> and 300 huge pages were allocated in 0.16 seconds. Furthermore, compaction
> allocated a high higher percentage of memory (91% of RAM as huge pages).
>
> The downside appears to be that the compaction kernel reclaimed even more
> pages than the vanilla kernel. However, take the cut-off point of 880 pages
> that both kernels succeeded. The vanilla kernel had reclaimed 105132 pages
> at that point. The kernel with compaction had reclaimed 59071, less than
> half of what the vanilla kernel reclaimed. i.e. the bulk of pages reclaimed
> with the compaction kernel were to get from 87% of memory allocated to 91%
> as huge pages.
>
> These results would appear to be an encouraging enough start.
>
> Comments?
I think "Total pages reclaimed" increasing is not good thing ;)
Honestly, I haven't understand why your patch increase reclaimed and
the exactly meaning of the your tool's rclm field.
Can you share your mesurement script? May I run the same test?
I like this patch, but I don't like increasing reclaim. I'd like to know
this patch require any vmscan change and/or its change mitigate the issue.
Thanks.
On Thu, Jan 21, 2010 at 12:12:11PM +0900, KOSAKI Motohiro wrote:
> Hi Mel,
>
> Sorry, I haven't read this patch at all.
>
> > The time differences are marginal but bear in mind that this is an ideal
> > case of mostly unmapped buffer pages. On nice set of results is between
> > allocations 13-18 where no pages were reclaimed, some compaction occured
> > and 300 huge pages were allocated in 0.16 seconds. Furthermore, compaction
> > allocated a high higher percentage of memory (91% of RAM as huge pages).
> >
> > The downside appears to be that the compaction kernel reclaimed even more
> > pages than the vanilla kernel. However, take the cut-off point of 880 pages
> > that both kernels succeeded. The vanilla kernel had reclaimed 105132 pages
> > at that point. The kernel with compaction had reclaimed 59071, less than
> > half of what the vanilla kernel reclaimed. i.e. the bulk of pages reclaimed
> > with the compaction kernel were to get from 87% of memory allocated to 91%
> > as huge pages.
> >
> > These results would appear to be an encouraging enough start.
> >
> > Comments?
>
> I think "Total pages reclaimed" increasing is not good thing ;)
First, I made a mistake in the patch. With the bug fixed, they're
reduced. See the post later in the thread
http://lkml.org/lkml/2010/1/6/215
> Honestly, I haven't understand why your patch increase reclaimed and
> the exactly meaning of the your tool's rclm field.
>
> Can you share your mesurement script? May I run the same test?
>
Unfortunately at the moment it's part of a mini-testgrid setup I run out
of the house. It doesn't lend itself to being stand-alone. I'll break it
out as part of the next release.
> I like this patch, but I don't like increasing reclaim. I'd like to know
> this patch require any vmscan change and/or its change mitigate the issue.
>
With the bug repaired, reclaims go from 105132 to 45935 with more huge
pages allocated so right now, no special action is required.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Jan 20, 2010 at 12:48:05PM -0800, David Rientjes wrote:
> On Wed, 20 Jan 2010, Mel Gorman wrote:
>
> > > With Lee's work on mempolicy-constrained hugepage allocations, there is a
> > > use-case for this explicit trigger to be exported via sysfs in the
> > > longterm:
> >
> > True, although the per-node structures are only available on NUMA making
> > it necessary to have two interfaces. The per-node one is handy enough
> > because it would be just
> >
> > /sys/devices/system/node/nodeX/compact_node
> > When written to, this node is compacted by the writing process
> >
> > But there does not appear to be a "good" way of having a non-NUMA
> > interface. /sys/devices/system/node does not exist .... Does anyone
> > remember why !NUMA does not have a /sys/devices/system/node/node0? Is
> > there a good reason or was there just no point?
> >
>
> There doesn't seem to be a usecase for a fake node0 sysfs entry since it
> would be a duplication of procfs.
>
Indeed.
> I think it would be best to create a global /proc/sys/vm/compact trigger
> that would walk all "compactable" zones system-wide
Easily done.
> and then a per-node
> /sys/devices/system/node/nodeX/compact trigger for that particular node,
> both with permissions 0200.
>
Will work on this as an additional patch. It should be straight-forward
with the only care needing to be taken around memory hotplug as usual.
> It would be helpful to be able to determine what is "compactable" at the
> same time by adding both global and per-node "compact_order" tunables that
> would default to HUGETLB_PAGE_ORDER.
Well, rather than having a separate tunable, writing a number to
/proc/sys/vm/compact could indicate the order if that trigger is now
working system-wide. Would that be suitable?
> Then, the corresponding "compact"
> trigger would only do work if fill_contig_page_info() shows
> !free_blocks_suitable for either all zones (global trigger) or each zone
> in the node's zonelist (per-node trigger).
>
Do you see a need for proc to act like this? I'm wondering if
try_to_compact_pages() already does what you're looking for except no
process is required to muck around in /proc or /sys.
I somewhat saw the /proc and /sys tunables being used for either debugging or
by a job scheduler that compacted one or more nodes before a new job started.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, 21 Jan 2010, Mel Gorman wrote:
> > It would be helpful to be able to determine what is "compactable" at the
> > same time by adding both global and per-node "compact_order" tunables that
> > would default to HUGETLB_PAGE_ORDER.
>
> Well, rather than having a separate tunable, writing a number to
> /proc/sys/vm/compact could indicate the order if that trigger is now
> working system-wide. Would that be suitable?
>
Do you think you'll eventually find a need to call try_to_compact_pages()
with a higher order than the one passed to the page allocator to limit
"compaction thrashing" from fragmented frees to a zone where we're
constantly compacting order-1 pages, for instance? I agree that memory
compaction should always be used before direct reclaim for higher order
allocations, but it may be more efficient to define a compact_min_order,
tunable from userspace, that would avoid the need for constant order-1
compactions from subsequent page allocations.
If that's a possibility, we may find a need for "compact_order", now
renamed "compact_min_order", outside of the explicit trigger.
> > > Comments?
> >
> > I think "Total pages reclaimed" increasing is not good thing ;)
>
> First, I made a mistake in the patch. With the bug fixed, they're
> reduced. See the post later in the thread
> http://lkml.org/lkml/2010/1/6/215
Oh, I see. I'm glad the issue was alredy fixed.
> > Honestly, I haven't understand why your patch increase reclaimed and
> > the exactly meaning of the your tool's rclm field.
> >
> > Can you share your mesurement script? May I run the same test?
>
> Unfortunately at the moment it's part of a mini-testgrid setup I run out
> of the house. It doesn't lend itself to being stand-alone. I'll break it
> out as part of the next release.
surely good news :)
> > I like this patch, but I don't like increasing reclaim. I'd like to know
> > this patch require any vmscan change and/or its change mitigate the issue.
> >
>
> With the bug repaired, reclaims go from 105132 to 45935 with more huge
> pages allocated so right now, no special action is required.
ok. thanks.
On Wed, 6 Jan 2010, Mel Gorman wrote:
> Unusuable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmstat.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 99 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6051fba..e1ea2d5 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> return 0;
> }
>
> +
> +struct config_page_info {
> + unsigned long free_pages;
> + unsigned long free_blocks_total;
> + unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes to attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> + unsigned int suitable_order,
> + struct config_page_info *info)
There's a descrepency between the name of the function and the name of the
struct, I think they were probably both meant to be contig_page_info.
> +{
> + unsigned int order;
> +
> + info->free_pages = 0;
> + info->free_blocks_total = 0;
> + info->free_blocks_suitable = 0;
> +
> + for (order = 0; order < MAX_ORDER; order++) {
> + unsigned long blocks;
> +
> + /* Count number of free blocks */
> + blocks = zone->free_area[order].nr_free;
> + info->free_blocks_total += blocks;
> +
> + /* Count free base pages */
> + info->free_pages += blocks << order;
> +
> + /* Count the suitable free blocks */
> + if (order >= suitable_order)
> + info->free_blocks_suitable += blocks <<
> + (order - suitable_order);
> + }
> +}
> +
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +int unusable_free_index(struct zone *zone,
> + unsigned int order,
> + struct config_page_info *info)
Should be static?
> +{
> + /* No free memory is interpreted as all free memory is unusable */
> + if (info->free_pages == 0)
> + return 100;
> +
> + /*
> + * Index should be a value between 0 and 1. Return a value to 3
> + * decimal places.
> + *
> + * 0 => no fragmentation
> + * 1 => high fragmentation
> + */
> + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> +
This value is only for userspace consumption via /proc/pagetypeinfo, so
I'm wondering why it needs to be exported as an index. Other than a loss
of precision, wouldn't this be easier to understand (especially when
coupled with the free page counts already exported) if it were multipled
by 100 rather than 1000 and shown as a percent of _usable_ free memory at
each order? Otherwise, we're left doing this "free - unusuable"
calculation while the number of unusuable pages at an order isn't
necessarily of great interest as a vanilla value.
> +}
> +
> +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> + pg_data_t *pgdat, struct zone *zone)
> +{
> + unsigned int order;
> +
> + /* Alloc on stack as interrupts are disabled for zone walk */
> + struct config_page_info info;
> +
> + seq_printf(m, "Node %4d, zone %8s %19s",
> + pgdat->node_id,
> + zone->name, " ");
> + for (order = 0; order < MAX_ORDER; ++order) {
> + fill_contig_page_info(zone, order, &info);
It's a shame we can't keep this data for the fragmentation index exported
subsequently in patch 3.
> + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> + }
> +
> + seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display unusable free space index
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> +{
> + pg_data_t *pgdat = (pg_data_t *)arg;
> +
> + seq_printf(m, "\nUnusable free space index at order\n");
> + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> +
> + return 0;
> +}
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> seq_putc(m, '\n');
> pagetypeinfo_showfree(m, pgdat);
> + pagetypeinfo_showunusable(m, pgdat);
> pagetypeinfo_showblockcount(m, pgdat);
>
> return 0;
/proc/pagetypeinfo isn't documented, but that's been fine until now
because all of the fields dealing with "free pages" and "number of blocks"
are easily understood. That changes now because there is no clear
understanding of "fragmentation index" in userspace, so we'll probably
need some kind of memory compaction documentation eventually.
On Thu, Jan 28, 2010 at 02:27:39PM -0800, David Rientjes wrote:
> On Wed, 6 Jan 2010, Mel Gorman wrote:
>
> > Unusuable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/pagetypeinfo.
> >
> > The index is normally calculated as a value between 0 and 1 which is
> > obviously unsuitable within the kernel. Instead, the first three decimal
> > places are used as a value between 0 and 1000 for an integer approximation.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmstat.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 99 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 6051fba..e1ea2d5 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -451,6 +451,104 @@ static int frag_show(struct seq_file *m, void *arg)
> > return 0;
> > }
> >
> > +
> > +struct config_page_info {
> > + unsigned long free_pages;
> > + unsigned long free_blocks_total;
> > + unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes to attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > + unsigned int suitable_order,
> > + struct config_page_info *info)
>
> There's a descrepency between the name of the function and the name of the
> struct, I think they were probably both meant to be contig_page_info.
>
Fixed
> > +{
> > + unsigned int order;
> > +
> > + info->free_pages = 0;
> > + info->free_blocks_total = 0;
> > + info->free_blocks_suitable = 0;
> > +
> > + for (order = 0; order < MAX_ORDER; order++) {
> > + unsigned long blocks;
> > +
> > + /* Count number of free blocks */
> > + blocks = zone->free_area[order].nr_free;
> > + info->free_blocks_total += blocks;
> > +
> > + /* Count free base pages */
> > + info->free_pages += blocks << order;
> > +
> > + /* Count the suitable free blocks */
> > + if (order >= suitable_order)
> > + info->free_blocks_suitable += blocks <<
> > + (order - suitable_order);
> > + }
> > +}
> > +
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +int unusable_free_index(struct zone *zone,
> > + unsigned int order,
> > + struct config_page_info *info)
>
> Should be static?
>
Fixed
> > +{
> > + /* No free memory is interpreted as all free memory is unusable */
> > + if (info->free_pages == 0)
> > + return 100;
> > +
> > + /*
> > + * Index should be a value between 0 and 1. Return a value to 3
> > + * decimal places.
> > + *
> > + * 0 => no fragmentation
> > + * 1 => high fragmentation
> > + */
> > + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > +
>
> This value is only for userspace consumption via /proc/pagetypeinfo, so
> I'm wondering why it needs to be exported as an index. Other than a loss
> of precision, wouldn't this be easier to understand (especially when
> coupled with the free page counts already exported) if it were multipled
> by 100 rather than 1000 and shown as a percent of _usable_ free memory at
> each order?
I find it easier to understand either way, but that's hardly a surprise.
The 1000 is because of the loss of precision. I can make it a 100 but I
don't think it makes much of a difference.
> Otherwise, we're left doing this "free - unusuable"
> calculation while the number of unusuable pages at an order isn't
> necessarily of great interest as a vanilla value.
>
> > +}
> > +
> > +static void pagetypeinfo_showunusable_print(struct seq_file *m,
> > + pg_data_t *pgdat, struct zone *zone)
> > +{
> > + unsigned int order;
> > +
> > + /* Alloc on stack as interrupts are disabled for zone walk */
> > + struct config_page_info info;
> > +
> > + seq_printf(m, "Node %4d, zone %8s %19s",
> > + pgdat->node_id,
> > + zone->name, " ");
> > + for (order = 0; order < MAX_ORDER; ++order) {
> > + fill_contig_page_info(zone, order, &info);
>
> It's a shame we can't keep this data for the fragmentation index exported
> subsequently in patch 3.
>
It could. When I did first, it made things messier and the patches less
clear-cut so I kept it simple as it wasn't performance-critical.
> > + seq_printf(m, "%6d ", unusable_free_index(zone, order, &info));
> > + }
> > +
> > + seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display unusable free space index
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int pagetypeinfo_showunusable(struct seq_file *m, void *arg)
> > +{
> > + pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > + seq_printf(m, "\nUnusable free space index at order\n");
> > + walk_zones_in_node(m, pgdat, pagetypeinfo_showunusable_print);
> > +
> > + return 0;
> > +}
> > +
> > static void pagetypeinfo_showfree_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > {
> > @@ -558,6 +656,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> > seq_printf(m, "Pages per block: %lu\n", pageblock_nr_pages);
> > seq_putc(m, '\n');
> > pagetypeinfo_showfree(m, pgdat);
> > + pagetypeinfo_showunusable(m, pgdat);
> > pagetypeinfo_showblockcount(m, pgdat);
> >
> > return 0;
>
> /proc/pagetypeinfo isn't documented, but that's been fine until now
> because all of the fields dealing with "free pages" and "number of blocks"
> are easily understood. That changes now because there is no clear
> understanding of "fragmentation index" in userspace, so we'll probably
> need some kind of memory compaction documentation eventually.
>
Agreed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, 5 Feb 2010, Mel Gorman wrote:
> > > + /*
> > > + * Index should be a value between 0 and 1. Return a value to 3
> > > + * decimal places.
> > > + *
> > > + * 0 => no fragmentation
> > > + * 1 => high fragmentation
> > > + */
> > > + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > > +
> >
> > This value is only for userspace consumption via /proc/pagetypeinfo, so
> > I'm wondering why it needs to be exported as an index. Other than a loss
> > of precision, wouldn't this be easier to understand (especially when
> > coupled with the free page counts already exported) if it were multipled
> > by 100 rather than 1000 and shown as a percent of _usable_ free memory at
> > each order?
>
> I find it easier to understand either way, but that's hardly a surprise.
> The 1000 is because of the loss of precision. I can make it a 100 but I
> don't think it makes much of a difference.
>
This suggestion was coupled with the subsequent note that there is no
documentation of what "unusuable free space index" is, except by the
implementation itself. Since the value isn't used by the kernel, I think
exporting the value as a percent would be easier understood by the user
without looking up the semantics. I don't have strong feelings either
way, however.
On Fri, Feb 05, 2010 at 01:40:21PM -0800, David Rientjes wrote:
> On Fri, 5 Feb 2010, Mel Gorman wrote:
>
> > > > + /*
> > > > + * Index should be a value between 0 and 1. Return a value to 3
> > > > + * decimal places.
> > > > + *
> > > > + * 0 => no fragmentation
> > > > + * 1 => high fragmentation
> > > > + */
> > > > + return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
> > > > +
> > >
> > > This value is only for userspace consumption via /proc/pagetypeinfo, so
> > > I'm wondering why it needs to be exported as an index. Other than a loss
> > > of precision, wouldn't this be easier to understand (especially when
> > > coupled with the free page counts already exported) if it were multipled
> > > by 100 rather than 1000 and shown as a percent of _usable_ free memory at
> > > each order?
> >
> > I find it easier to understand either way, but that's hardly a surprise.
> > The 1000 is because of the loss of precision. I can make it a 100 but I
> > don't think it makes much of a difference.
> >
>
> This suggestion was coupled with the subsequent note that there is no
> documentation of what "unusuable free space index" is, except by the
> implementation itself. Since the value isn't used by the kernel, I think
> exporting the value as a percent would be easier understood by the user
> without looking up the semantics. I don't have strong feelings either
> way, however.
>
I'm writing documentation. I'm keeping with the 1000 value because a) I
like the precision and b) the fragmentation index is not related to
percentages and I think having one as a percentage and the other as an
index would cause confusion. Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab