2010-04-20 21:01:18

by Mel Gorman

Subject: [PATCH 0/14] Memory Compaction v8

This series is based on the patches in mmotm-2010-04-15-14-42.patch but
rebased against 2.6.34-rc4 and collapsed. The cumulative fixes were getting
tricky to follow and this series should give a clearer view of what is
going on. It should be possible to drop all the compaction-related patches
in mmotm and drop this series directly in its place.

Changelog since mmotm
o Have defer_compaction use an exponential backoff instead of time
o Rename COMPACT_INCOMPLETE and document the return values

Changelog since V7 (these are the changes made while in mmotm)
o Have compaction depend on CONFIG_HUGETLB_PAGE instead of CONFIG_HUGETLBFS
o Move extfrag_index and unusable_index to debugfs
o Many minor cleanups in the core patch
o Drop unsafe optimisation on PageBuddy as zone->lock is not held
o Call kernel_map_pages after split_free_page
o Update documentation on compact_memory proc entry
o Initialise cc->zone properly when compacting all of memory
o Do not display vmstats related to compaction when disabled

Changelog since V6
o Avoid accessing anon_vma when migrating unmapped PageSwapCache pages

Changelog since V5
o Rebase to mmotm-2010-03-24-14-48
o Add more reviewed-by's
o Correct one spelling in vmstat.c and some leader clarifications
o Split the LRU isolation modes into a separate path
o Correct a NID change
o Call migrate_prep less frequently
o Remove unnecessary inlining
o Do not interfere with memory hot-remove
o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER
o page_mapped instead of page_mapcount and allow swapcache to migrate
o Avoid too many pages being isolated for migration
o Handle PageSwapCache pages during migration

Changelog since V4
o Remove unnecessary check for PageLRU and PageUnevictable
o Fix isolated accounting
o Close race window between page_mapcount and rcu_read_lock
o Added a lot more Reviewed-by tags

Changelog since V3
o Document sysfs entries (subsequently merged independently)
o COMPACTION should depend on MMU
o Comment updates
o Ensure proc/sysfs triggering of compaction fully completes
o Rename anon_vma refcount to external_refcount
o Rebase to mmotm on top of 2.6.34-rc1

Changelog since V2
o Move unusable and fragmentation indices to separate proc files
o Express indices as being between 0 and 1
o Update copyright notice for compaction.c
o Avoid infinite loop when split free page fails
o Init compact_resume at least once (impacted x86 testing)
o Fewer pages are isolated during compaction.
o LRU lists are no longer rotated when page is busy
o NR_ISOLATED_* is updated to avoid isolating too many pages
o Update zone LRU stats correctly when isolating pages
o Reference count anon_vma instead of insufficient locking with
use-after-free races in memory compaction
o Watch for unmapped anon pages during migration
o Remove unnecessary parameters on a few functions
o Add Reviewed-by's. Note that I didn't add the Acks and Reviewed-bys
for the proc patches as they have been split out into separate
files and I don't know if the Acks are still valid.

Changelog since V1
o Update help blurb on CONFIG_MIGRATION
o Max unusable free space index is 100, not 1000
o Move blockpfn forward properly during compaction
o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
o Permissions on /proc and /sys files should be 0200
o Reduce verbosity
o Compact all nodes when triggered via /proc
o Add per-node compaction via sysfs
o Move defer_compaction out-of-line
o Fix lock oddities in rmap_walk_anon
o Add documentation

This patchset is a memory compaction mechanism that reduces external
fragmentation by moving GFP_MOVABLE pages into a smaller number of
pageblocks. The term "compaction" was chosen because there are a number
of mechanisms, not mutually exclusive, that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation, as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.

In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area. A compaction run completes within a
zone when the two scanners meet.
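
To illustrate just the scanner movement, the following is a toy
userspace sketch with made-up numbers. Unlike the real code in
mm/compaction.c, it advances both scanners in lockstep and does no
isolation or migration; it only shows how a run terminates when the
two positions converge.

#include <stdio.h>

int main(void)
{
	/* Illustrative values only; in the kernel these come from the zone */
	unsigned long zone_start_pfn = 0, zone_end_pfn = 4096;
	unsigned long area = 512;	/* stand-in for pageblock_nr_pages */
	unsigned long migrate_pfn = zone_start_pfn;
	unsigned long free_pfn = zone_end_pfn;

	while (migrate_pfn < free_pfn) {
		/* real code: isolate movable pages in [migrate_pfn, migrate_pfn + area) */
		/* real code: isolate free pages working back from free_pfn as needed */
		/* real code: migrate_pages() from the migrate list to the free pages */
		migrate_pfn += area;	/* advance once the area has been migrated */
		free_pfn -= area;	/* advance once the area is out of free pages */
	}
	printf("scanners met: compaction run for this zone is complete\n");
	return 0;
}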

This method is a bit primitive but easy to understand. Greater
sophistication would require maintaining counters on a per-pageblock
basis, which would have a big impact on allocator fast-paths just to
improve compaction, a poor trade-off.

It also does not try to relocate virtually contiguous pages so that
they become physically contiguous. However, assuming transparent
hugepages were in use, a hypothetical khugepaged might reuse the
compaction code to isolate free pages, split them and relocate
userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory,
which compacts all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact, where N is
the node ID to be compacted. Finally, when a process fails to allocate a
high-order page, it may compact memory in an attempt to satisfy the
allocation instead of entering direct reclaim. Explicit compaction does
not finish until the two scanners meet; direct compaction ends as soon
as a suitable page that meets the watermarks becomes available.
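
As a usage illustration only (this program is not part of the series and
error handling is minimal), the two explicit triggers can be driven from
userspace like so; both files ignore the written value and compaction
starts on the write itself:

#include <fcntl.h>
#include <unistd.h>

/* Write an arbitrary value to a compaction trigger file (needs root) */
static int trigger(const char *path)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, "1", 1);
	close(fd);
	return ret == 1 ? 0 : -1;
}

int main(void)
{
	trigger("/proc/sys/vm/compact_memory");             /* all of memory */
	trigger("/sys/devices/system/node/node0/compact");  /* node 0 only */
	return 0;
}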

The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.

Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports an "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so it can be dropped if requested.
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trigger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.

Testing of compaction was in three stages. For the tests, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing
nasty popped out. min_free_kbytes was tuned as recommended by hugeadm to
help fragmentation avoidance and high-order allocations. The tests were
run on X86, X86-64 and PPC64.

The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax

The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
a) Start updatedb
b) Create X files of pagesize*128 in size in parallel. Wait
until the files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective is to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running

At this point, the system is quiet, memory is full but it's full of
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded, so you'd expect lumpy reclaim
to have no significant advantage over compaction, but this is at
the POC stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note that
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size.

X86                      vanilla  compaction
Final page count             913         916  (attempted 1002)
pages reclaimed            68296        9791

X86-64                   vanilla  compaction
Final page count             901         902  (attempted 1002)
Total pages reclaimed     112599       53234

PPC64                    vanilla  compaction
Final page count              93          94  (attempted 110)
Total pages reclaimed     103216       61838

There was not a dramatic improvement in success rates, but one would not
be expected in this case either. What is important is that fewer pages
were reclaimed in all cases, reducing the amount of IO required to
satisfy a huge page allocation.

The second set of tests was performance related - kernbench, netperf,
iozone and sysbench. None showed anything remarkable.

The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of unmovable and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue.

                                        vanilla  compaction
Percentage of request allocated X86          98          99
Percentage of request allocated X86-64       95          98
Percentage of request allocated PPC64        55          70

Documentation/ABI/testing/sysfs-devices-node | 7 +
Documentation/sysctl/vm.txt | 26 +-
drivers/base/node.c | 3 +
include/linux/compaction.h | 89 ++++
include/linux/mm.h | 1 +
include/linux/mmzone.h | 9 +
include/linux/rmap.h | 27 +-
include/linux/swap.h | 6 +
include/linux/vmstat.h | 4 +
kernel/sysctl.c | 25 +
mm/Kconfig | 18 +-
mm/Makefile | 1 +
mm/compaction.c | 605 ++++++++++++++++++++++++++
mm/ksm.c | 4 +-
mm/migrate.c | 54 ++-
mm/page_alloc.c | 111 +++++
mm/rmap.c | 10 +-
mm/vmscan.c | 5 -
mm/vmstat.c | 299 ++++++++++++--
19 files changed, 1246 insertions(+), 58 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-devices-node
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c


2010-04-20 21:01:20

by Mel Gorman

Subject: [PATCH 08/14] mm: Move definition for LRU isolation modes to a header

Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
Memory compaction needs access to these modes for isolating pages for
migration. This patch exports them.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
include/linux/swap.h | 5 +++++
mm/vmscan.c | 5 -----
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 94ec325..32af03c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
__lru_cache_add(page, LRU_ACTIVE_FILE);
}

+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
+#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
+
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1070f83..8bdd85c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
return nr_reclaimed;
}

-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
-#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
-
/*
* Attempt to remove the specified page from its LRU. Only take this page
* if it is of the appropriate PageActive status. Pages which are being
--
1.6.5

2010-04-20 21:01:25

by Mel Gorman

Subject: [PATCH 14/14] mm,compaction: Defer compaction using an exponential backoff when compaction fails

The fragmentation index may indicate that a failure is due to external
fragmentation, but after a compaction run completes it is still possible
for an allocation to fail. There are two obvious reasons why:

o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for (1 << compact_defer_shift)
subsequent attempts. If the next compaction attempt also fails,
compact_defer_shift is increased, up to a maximum of 6. If compaction
succeeds, the defer counters are reset again.
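For example, the first failure defers up to 1 << 1 = 2 further
compaction attempts, the next consecutive failure up to 1 << 2 = 4 and
so on, up to a ceiling of 1 << 6 = 64 deferred attempts; a single
success resets both counters.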

The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone. To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache. This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/mmzone.h | 9 +++++++++
mm/page_alloc.c | 5 ++++-
3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 3719325..5ac5155 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,6 +22,36 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* Do not skip compaction more than 64 times */
+#define COMPACT_MAX_DEFER_SHIFT 6
+
+/*
+ * Compaction is deferred when compaction fails to result in a page
+ * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
+ */
+static inline void defer_compaction(struct zone *zone)
+{
+ zone->compact_considered = 0;
+ zone->compact_defer_shift++;
+
+ if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
+ zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
+}
+
+/* Returns true if compaction should be skipped this time */
+static inline bool compaction_deferred(struct zone *zone)
+{
+ unsigned long defer_limit = 1UL << zone->compact_defer_shift;
+
+ /* Avoid possible overflow */
+ if (++zone->compact_considered > defer_limit)
+ zone->compact_considered = defer_limit;
+
+ return zone->compact_considered < (1UL << zone->compact_defer_shift);
+}
+
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -29,6 +59,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
return COMPACT_CONTINUE;
}

+static inline void defer_compaction(struct zone *zone)
+{
+}
+
+static inline bool compaction_deferred(struct zone *zone)
+{
+ return 1;
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..fd55f72 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -321,6 +321,15 @@ struct zone {
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

+#ifdef CONFIG_COMPACTION
+ /*
+ * On compaction failure, 1<<compact_defer_shift compactions
+ * are skipped before trying again. The number attempted since
+ * last failure is tracked with compact_considered.
+ */
+ unsigned int compact_considered;
+ unsigned int compact_defer_shift;
+#endif

ZONE_PADDING(_pad1_)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1de363e..51497ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1772,7 +1772,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
{
struct page *page;

- if (!order)
+ if (!order || compaction_deferred(preferred_zone))
return NULL;

*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
@@ -1788,6 +1788,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags, preferred_zone,
migratetype);
if (page) {
+ preferred_zone->compact_considered = 0;
+ preferred_zone->compact_defer_shift = 0;
__count_vm_event(COMPACTSUCCESS);
return page;
}
@@ -1798,6 +1800,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
+ defer_compaction(preferred_zone);

cond_resched();
}
--
1.6.5

2010-04-20 21:01:30

by Mel Gorman

Subject: [PATCH 13/14] mm,compaction: Add a tunable that decides when memory should be compacted and when it should be reclaimed

The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation index: if the index is below 500, memory
will not be compacted. This choice is arbitrary and not based on data.
To help optimise the system and set a sensible default for this value,
this patch adds a sysctl called extfrag_threshold. The kernel will only
compact memory if the fragmentation index is above extfrag_threshold.
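For example, with the default of 500, a zone whose fragmentation index
for the requested order is 700 would be compacted, while a zone with an
index of 300 would fall back to direct reclaim; an index of -1 means the
allocation is expected to succeed once watermarks are met, so neither is
required.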

[[email protected]: Fix build errors when proc fs is not configured]
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/vm.txt | 16 +++++++++++++++-
include/linux/compaction.h | 3 +++
kernel/sysctl.c | 15 +++++++++++++++
mm/compaction.c | 12 +++++++++++-
4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 3b3fa1b..6274970 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_ratio
- dirty_writeback_centisecs
- drop_caches
+- extfrag_threshold
- hugepages_treat_as_movable
- hugetlb_shm_group
- laptop_mode
@@ -149,6 +149,20 @@ user should run `sync' first.

==============================================================

+extfrag_threshold
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what
+the fragmentation index for each order is in each zone in the system. Values
+tending towards 0 imply allocations would fail due to lack of memory,
+values towards 1000 imply failures are due to fragmentation and -1 implies
+that the allocation will succeed as long as watermarks are met.
+
+The kernel will not compact memory in a zone if the
+fragmentation index is <= extfrag_threshold. The default value is 500.
+
+==============================================================
+
hugepages_treat_as_movable

This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index eed40ec..3719325 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -15,6 +15,9 @@
extern int sysctl_compact_memory;
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+extern int sysctl_extfrag_threshold;
+extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos);

extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 987d6cf..43dc29d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -262,6 +262,11 @@ static int min_sched_shares_ratelimit = 100000; /* 100 usec */
static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
#endif

+#ifdef CONFIG_COMPACTION
+static int min_extfrag_threshold;
+static int max_extfrag_threshold = 1000;
+#endif
+
static struct ctl_table kern_table[] = {
{
.procname = "sched_child_runs_first",
@@ -1130,6 +1135,16 @@ static struct ctl_table vm_table[] = {
.mode = 0200,
.proc_handler = sysctl_compaction_handler,
},
+ {
+ .procname = "extfrag_threshold",
+ .data = &sysctl_extfrag_threshold,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = sysctl_extfrag_handler,
+ .extra1 = &min_extfrag_threshold,
+ .extra2 = &max_extfrag_threshold,
+ },
+
#endif /* CONFIG_COMPACTION */
{
.procname = "min_free_kbytes",
diff --git a/mm/compaction.c b/mm/compaction.c
index 06aed42..bd13560 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -433,6 +433,8 @@ static unsigned long compact_zone_order(struct zone *zone,
return compact_zone(zone, &cc);
}

+int sysctl_extfrag_threshold = 500;
+
/**
* try_to_compact_pages - Direct compact to satisfy a high-order allocation
* @zonelist: The zonelist used for the current allocation
@@ -491,7 +493,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
* Only compact if a failure would be due to fragmentation.
*/
fragindex = fragmentation_index(zone, order);
- if (fragindex >= 0 && fragindex <= 500)
+ if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
continue;

if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
@@ -572,6 +574,14 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
return 0;
}

+int sysctl_extfrag_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+ return 0;
+}
+
#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
ssize_t sysfs_compact_node(struct sys_device *dev,
struct sysdev_attribute *attr,
--
1.6.5

2010-04-20 21:02:14

by Mel Gorman

Subject: [PATCH 11/14] mm,compaction: Add /sys trigger for per-node memory compaction

Add a per-node sysfs file called compact. When the file is written to,
each zone in that node is compacted. The intention is that this would be
used by something like a job scheduler in a batch system before a job
starts so that the job can allocate the maximum number of hugepages
without significant start-up cost.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-node | 7 +++++++
drivers/base/node.c | 3 +++
include/linux/compaction.h | 16 ++++++++++++++++
mm/compaction.c | 23 +++++++++++++++++++++++
4 files changed, 49 insertions(+), 0 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..453a210
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What: /sys/devices/system/node/nodeX/compact
+Date: February 2010
+Contact: Mel Gorman <[email protected]>
+Description:
+ When this file is written to, all memory within that node
+ will be compacted. When it completes, memory will be freed
+ into blocks which have as many contiguous pages as possible
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 057979a..2bdd8a9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -9,6 +9,7 @@
#include <linux/memory.h>
#include <linux/node.h>
#include <linux/hugetlb.h>
+#include <linux/compaction.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/nodemask.h>
@@ -246,6 +247,8 @@ int register_node(struct node *node, int num, struct node *parent)
scan_unevictable_register_node(node);

hugetlb_register_node(node);
+
+ compaction_register_node(node);
}
return error;
}
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5723888..ba98cfe 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -12,4 +12,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
#endif /* CONFIG_COMPACTION */

+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 5e4444c..83b5de6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
#include <linux/mm_inline.h>
#include <linux/backing-dev.h>
#include <linux/sysctl.h>
+#include <linux/sysfs.h>
#include "internal.h"

/*
@@ -453,3 +454,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,

return 0;
}
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ compact_node(dev->id);
+
+ return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+ return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+ return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
--
1.6.5

2010-04-20 21:01:56

by Mel Gorman

Subject: [PATCH 12/14] mm,compaction: Direct compact when a high-order allocation fails

Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages and satisfy the allocation. With this patch, it is
determined whether the allocation failed due to external fragmentation
rather than low memory and, if so, the calling process will compact
memory until a suitable page is freed. Compaction by moving pages in
memory is considerably cheaper than paging out to disk and works where
there are locked pages or no swap. If compaction fails to free a page of
a suitable size, then reclaim will still occur.

Direct compaction returns as soon as possible. As each block is
compacted, it is checked whether a suitable page has been freed and, if
so, compaction returns.

[[email protected]: Fix build errors]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
include/linux/compaction.h | 24 ++++++++--
include/linux/vmstat.h | 1 +
mm/compaction.c | 117 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 63 +++++++++++++++++++++++
mm/vmstat.c | 16 +++++-
5 files changed, 215 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ba98cfe..eed40ec 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,15 +1,31 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H

-/* Return values for compact_zone() */
-#define COMPACT_CONTINUE 0
-#define COMPACT_PARTIAL 1
-#define COMPACT_COMPLETE 2
+/* Return values for compact_zone() and try_to_compact_pages() */
+/* compaction didn't start as it was not possible or direct reclaim was more suitable */
+#define COMPACT_SKIPPED 0
+/* compaction should continue to another pageblock */
+#define COMPACT_CONTINUE 1
+/* direct compaction partially compacted a zone and there are suitable pages */
+#define COMPACT_PARTIAL 2
+/* The full zone was compacted */
+#define COMPACT_COMPLETE 3

#ifdef CONFIG_COMPACTION
extern int sysctl_compact_memory;
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return COMPACT_CONTINUE;
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b421d1b..7f43ccd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -45,6 +45,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+ COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#endif
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/compaction.c b/mm/compaction.c
index 83b5de6..06aed42 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,6 +35,8 @@ struct compact_control {
unsigned long nr_anon;
unsigned long nr_file;

+ unsigned int order; /* order a direct compactor needs */
+ int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
};

@@ -341,6 +343,9 @@ static void update_nr_listpages(struct compact_control *cc)
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
+ unsigned int order;
+ unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

@@ -348,6 +353,24 @@ static int compact_finished(struct zone *zone,
if (cc->free_pfn <= cc->migrate_pfn)
return COMPACT_COMPLETE;

+ /* Compaction run is not finished if the watermark is not met */
+ if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ return COMPACT_CONTINUE;
+
+ if (cc->order == -1)
+ return COMPACT_CONTINUE;
+
+ /* Direct compactor: Is a suitable page free? */
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && zone->free_area[order].nr_free)
+ return COMPACT_PARTIAL;
+ }
+
return COMPACT_CONTINUE;
}

@@ -394,6 +417,99 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}

+static unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask)
+{
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ .order = order,
+ .migratetype = allocflags_to_migratetype(gfp_mask),
+ .zone = zone,
+ };
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ int may_enter_fs = gfp_mask & __GFP_FS;
+ int may_perform_io = gfp_mask & __GFP_IO;
+ unsigned long watermark;
+ struct zoneref *z;
+ struct zone *zone;
+ int rc = COMPACT_SKIPPED;
+
+ /*
+ * Check whether it is worth even starting compaction. The order check is
+ * made because an assumption is made that the page allocator can satisfy
+ * the "cheaper" orders without taking special steps
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
+ return rc;
+
+ count_vm_event(COMPACTSTALL);
+
+ /* Compact each zone in the list */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+ nodemask) {
+ int fragindex;
+ int status;
+
+ /*
+ * Watermarks for order-0 must be met for compaction. Note
+ * the 2UL. This is because during migration, copies of
+ * pages need to be allocated and for a short time, the
+ * footprint is higher
+ */
+ watermark = low_wmark_pages(zone) + (2UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ continue;
+
+ /*
+ * fragmentation index determines if allocation failures are
+ * due to low memory or external fragmentation
+ *
+ * index of -1 implies allocations might succeed depending
+ * on watermarks
+ * index towards 0 implies failure is due to lack of memory
+ * index towards 1000 implies failure is due to fragmentation
+ *
+ * Only compact if a failure would be due to fragmentation.
+ */
+ fragindex = fragmentation_index(zone, order);
+ if (fragindex >= 0 && fragindex <= 500)
+ continue;
+
+ if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+ rc = COMPACT_PARTIAL;
+ break;
+ }
+
+ status = compact_zone_order(zone, order, gfp_mask);
+ rc = max(status, rc);
+
+ if (zone_watermark_ok(zone, order, watermark, 0, 0))
+ break;
+ }
+
+ return rc;
+}
+
+
/* Compact all zones within a node */
static int compact_node(int nid)
{
@@ -412,6 +528,7 @@ static int compact_node(int nid)
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
+ .order = -1,
};

zone = &pgdat->node_zones[zoneid];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f29d070..1de363e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/memory.h>
+#include <linux/compaction.h>
#include <trace/events/kmem.h>
#include <linux/ftrace_event.h>

@@ -1761,6 +1762,59 @@ out:
return page;
}

+#ifdef CONFIG_COMPACTION
+/* Try memory compaction for high-order allocations before reclaim */
+static struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ int migratetype, unsigned long *did_some_progress)
+{
+ struct page *page;
+
+ if (!order)
+ return NULL;
+
+ *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
+ nodemask);
+ if (*did_some_progress != COMPACT_SKIPPED) {
+
+ /* Page migration frees to the PCP lists but we want merging */
+ drain_pages(get_cpu());
+ put_cpu();
+
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page) {
+ __count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+
+ /*
+ * It's bad if compaction run occurs and fails.
+ * The most likely reason is that pages exist,
+ * but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);
+
+ cond_resched();
+ }
+
+ return NULL;
+}
+#else
+static inline struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ int migratetype, unsigned long *did_some_progress)
+{
+ return NULL;
+}
+#endif /* CONFIG_COMPACTION */
+
/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1947,6 +2001,15 @@ rebalance:
if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
goto nopage;

+ /* Try direct compaction */
+ page = __alloc_pages_direct_compact(gfp_mask, order,
+ zonelist, high_zoneidx,
+ nodemask,
+ alloc_flags, preferred_zone,
+ migratetype, &did_some_progress);
+ if (page)
+ goto got_pg;
+
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c6aacf5..7759941 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -429,7 +429,7 @@ static void fill_contig_page_info(struct zone *zone,
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
{
unsigned long requested = 1UL << order;

@@ -448,6 +448,15 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
*/
return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
}
+
+/* Same as __fragmentation index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+ struct contig_page_info info;
+
+ fill_contig_page_info(zone, order, &info);
+ return __fragmentation_index(order, &info);
+}
#endif

#if defined(CONFIG_PROC_FS) || defined(CONFIG_COMPACTION)
@@ -771,6 +780,9 @@ static const char * const vmstat_text[] = {
"compact_blocks_moved",
"compact_pages_moved",
"compact_pagemigrate_failed",
+ "compact_stall",
+ "compact_fail",
+ "compact_success",
#endif

#ifdef CONFIG_HUGETLB_PAGE
@@ -1136,7 +1148,7 @@ static void extfrag_show_print(struct seq_file *m,
zone->name);
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- index = fragmentation_index(order, &info);
+ index = __fragmentation_index(order, &info);
seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
}

--
1.6.5

2010-04-20 21:01:15

by Mel Gorman

Subject: [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/rmap.h | 23 +++++++++++++++++++++++
mm/migrate.c | 12 ++++++++++++
mm/rmap.c | 10 +++++-----
3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
#ifdef CONFIG_KSM
atomic_t ksm_refcount;
#endif
+#ifdef CONFIG_MIGRATION
+ atomic_t migrate_refcount;
+#endif
/*
* NOTE: the LSB of the head.next is set by
* mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
return 0;
}
#endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+ atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return 0;
+}
+#endif /* CONFIG_MIGRATE */

static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 5938db5..b768a1d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -543,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rcu_locked = 0;
int charge = 0;
struct mem_cgroup *mem = NULL;
+ struct anon_vma *anon_vma = NULL;

if (!newpage)
return -ENOMEM;
@@ -599,6 +600,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (PageAnon(page)) {
rcu_read_lock();
rcu_locked = 1;
+ anon_vma = page_anon_vma(page);
+ atomic_inc(&anon_vma->migrate_refcount);
}

/*
@@ -638,6 +641,15 @@ skip_unmap:
if (rc)
remove_migration_ptes(page, page);
rcu_unlock:
+
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+ int empty = list_empty(&anon_vma->head);
+ spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
+ }
+
if (rcu_locked)
rcu_read_unlock();
uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index c59eaea..4bad2c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -249,7 +249,8 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
list_del(&anon_vma_chain->same_anon_vma);

/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+ empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+ !migrate_refcount(anon_vma);
spin_unlock(&anon_vma->lock);

if (empty)
@@ -274,6 +275,7 @@ static void anon_vma_ctor(void *data)

spin_lock_init(&anon_vma->lock);
ksm_refcount_init(anon_vma);
+ migrate_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}

@@ -1365,10 +1367,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
/*
* Note: remove_migration_ptes() cannot use page_lock_anon_vma()
* because that depends on page_mapped(); but not all its usages
- * are holding mmap_sem, which also gave the necessary guarantee
- * (that this anon_vma's slab has not already been destroyed).
- * This needs to be reviewed later: avoiding page_lock_anon_vma()
- * is risky, and currently limits the usefulness of rmap_walk().
+ * are holding mmap_sem. Users without mmap_sem are required to
+ * take a reference count to prevent the anon_vma disappearing
*/
anon_vma = page_anon_vma(page);
if (!anon_vma)
--
1.6.5

2010-04-20 21:02:31

by Mel Gorman

Subject: [PATCH 10/14] mm,compaction: Add /proc trigger for memory compaction

This patch adds a proc file /proc/sys/vm/compact_memory. When an
arbitrary value is written to the file, all zones are compacted. The
expected user of such a trigger is a job scheduler that prepares the
system before the target application runs.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
Documentation/sysctl/vm.txt | 10 +++++++
include/linux/compaction.h | 6 ++++
kernel/sysctl.c | 10 +++++++
mm/compaction.c | 61 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 56366a5..3b3fa1b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:

- block_dump
+- compact_memory
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -64,6 +65,15 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.

==============================================================

+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
+all zones are compacted such that free memory is available in contiguous
+blocks where possible. This can be important for example in the allocation of
+huge pages although processes will also directly compact memory as required.
+
+==============================================================
+
dirty_background_bytes

Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 465ca88..5723888 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -6,4 +6,10 @@
#define COMPACT_PARTIAL 1
#define COMPACT_COMPLETE 2

+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f25eb..987d6cf 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -37,6 +37,7 @@
#include <linux/highuid.h>
#include <linux/writeback.h>
#include <linux/ratelimit.h>
+#include <linux/compaction.h>
#include <linux/hugetlb.h>
#include <linux/initrd.h>
#include <linux/key.h>
@@ -1121,6 +1122,15 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = drop_caches_sysctl_handler,
},
+#ifdef CONFIG_COMPACTION
+ {
+ .procname = "compact_memory",
+ .data = &sysctl_compact_memory,
+ .maxlen = sizeof(int),
+ .mode = 0200,
+ .proc_handler = sysctl_compaction_handler,
+ },
+#endif /* CONFIG_COMPACTION */
{
.procname = "min_free_kbytes",
.data = &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 1b80dd8..5e4444c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
#include <linux/compaction.h>
#include <linux/mm_inline.h>
#include <linux/backing-dev.h>
+#include <linux/sysctl.h>
#include "internal.h"

/*
@@ -392,3 +393,63 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}

+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+ int zoneid;
+ pg_data_t *pgdat;
+ struct zone *zone;
+
+ if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+ return -EINVAL;
+ pgdat = NODE_DATA(nid);
+
+ /* Flush pending updates to the LRU lists */
+ lru_add_drain_all();
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ };
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+
+ return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+ int nid;
+
+ for_each_online_node(nid)
+ compact_node(nid);
+
+ return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ if (write)
+ return compact_nodes();
+
+ return 0;
+}
--
1.6.5

2010-04-20 21:02:33

by Mel Gorman

Subject: [PATCH 09/14] mm,compaction: Memory compaction core

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.

[[email protected]: Fix unsafe optimisation]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
include/linux/compaction.h | 9 +
include/linux/mm.h | 1 +
include/linux/swap.h | 1 +
include/linux/vmstat.h | 3 +
mm/Makefile | 1 +
mm/compaction.c | 394 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 45 +++++
mm/vmstat.c | 7 +
8 files changed, 461 insertions(+), 0 deletions(-)
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..465ca88
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_CONTINUE 0
+#define COMPACT_PARTIAL 1
+#define COMPACT_COMPLETE 2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3058bbc..eb21256 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -334,6 +334,7 @@ void put_page(struct page *page);
void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);

/*
* Compound pages have a destructor function. Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 32af03c..97e623b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
};

#define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX

#define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
#define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..b421d1b 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_COMPACTION
+ COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+#endif
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/Makefile b/mm/Makefile
index 6c2a73a..8982504 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
+obj-$(CONFIG_COMPACTION) += compaction.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..1b80dd8
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,394 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <[email protected]>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+ struct list_head freepages; /* List of free pages to migrate to */
+ struct list_head migratepages; /* List of pages being migrated */
+ unsigned long nr_freepages; /* Number of isolated free pages */
+ unsigned long nr_migratepages; /* Number of pages to migrate */
+ unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long migrate_pfn; /* isolate_migratepages search base */
+
+ /* Account for isolated anon and file pages */
+ unsigned long nr_anon;
+ unsigned long nr_file;
+
+ struct zone *zone;
+};
+
+static unsigned long release_freepages(struct list_head *freelist)
+{
+ struct page *page, *next;
+ unsigned long count = 0;
+
+ list_for_each_entry_safe(page, next, freelist, lru) {
+ list_del(&page->lru);
+ __free_page(page);
+ count++;
+ }
+
+ return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static unsigned long isolate_freepages_block(struct zone *zone,
+ unsigned long blockpfn,
+ struct list_head *freelist)
+{
+ unsigned long zone_end_pfn, end_pfn;
+ int total_isolated = 0;
+ struct page *cursor;
+
+ /* Get the last PFN we should scan for free pages at */
+ zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
+
+ /* Find the first usable PFN in the block to initialise page cursor */
+ for (; blockpfn < end_pfn; blockpfn++) {
+ if (pfn_valid_within(blockpfn))
+ break;
+ }
+ cursor = pfn_to_page(blockpfn);
+
+ /* Isolate free pages. This assumes the block is valid */
+ for (; blockpfn < end_pfn; blockpfn++, cursor++) {
+ int isolated, i;
+ struct page *page = cursor;
+
+ if (!pfn_valid_within(blockpfn))
+ continue;
+
+ if (!PageBuddy(page))
+ continue;
+
+ /* Found a free page, break it into order-0 pages */
+ isolated = split_free_page(page);
+ total_isolated += isolated;
+ for (i = 0; i < isolated; i++) {
+ list_add(&page->lru, freelist);
+ page++;
+ }
+
+ /* If a page was split, advance to the end of it */
+ if (isolated) {
+ blockpfn += isolated - 1;
+ cursor += isolated - 1;
+ }
+ }
+
+ return total_isolated;
+}
+
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+
+ int migratetype = get_pageblock_migratetype(page);
+
+ /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+ if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+ return false;
+
+ /* If the page is a large free page, then allow migration */
+ if (PageBuddy(page) && page_order(page) >= pageblock_order)
+ return true;
+
+ /* If the block is MIGRATE_MOVABLE, allow migration */
+ if (migratetype == MIGRATE_MOVABLE)
+ return true;
+
+ /* Otherwise skip the block */
+ return false;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from and then isolate them.
+ */
+static void isolate_freepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ struct page *page;
+ unsigned long high_pfn, low_pfn, pfn;
+ unsigned long flags;
+ int nr_freepages = cc->nr_freepages;
+ struct list_head *freelist = &cc->freepages;
+
+ pfn = cc->free_pfn;
+ low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+ high_pfn = low_pfn;
+
+ /*
+ * Isolate free pages until enough are available to migrate the
+ * pages on cc->migratepages. We stop searching if the migrate
+ * and free page scanners meet or enough free pages are isolated.
+ */
+ spin_lock_irqsave(&zone->lock, flags);
+ for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+ pfn -= pageblock_nr_pages) {
+ unsigned long isolated;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ /*
+ * Check for overlapping nodes/zones. It's possible on some
+ * configurations to have a setup like
+ * node0 node1 node0
+ * i.e. it's possible that all pages within a zones range of
+ * pages do not belong to a single zone.
+ */
+ page = pfn_to_page(pfn);
+ if (page_zone(page) != zone)
+ continue;
+
+ /* Check the block is suitable for migration */
+ if (!suitable_migration_target(page))
+ continue;
+
+ /* Found a block suitable for isolating free pages from */
+ isolated = isolate_freepages_block(zone, pfn, freelist);
+ nr_freepages += isolated;
+
+ /*
+ * Record the highest PFN we isolated pages from. When next
+ * looking for free pages, the search will restart here as
+ * page migration may have returned some pages to the allocator
+ */
+ if (isolated)
+ high_pfn = max(high_pfn, pfn);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ /* split_free_page does not map the pages */
+ list_for_each_entry(page, freelist, lru) {
+ arch_alloc_page(page, 0);
+ kernel_map_pages(page, 1, 1);
+ }
+
+ cc->free_pfn = high_pfn;
+ cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+static void acct_isolated(struct zone *zone, struct compact_control *cc)
+{
+ struct page *page;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+
+ list_for_each_entry(page, &cc->migratepages, lru) {
+ int lru = page_lru_base_type(page);
+ count[lru]++;
+ }
+
+ cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/* Similar to reclaim, but different enough that they don't share logic */
+static bool too_many_isolated(struct zone *zone)
+{
+
+ unsigned long inactive, isolated;
+
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_ANON);
+ isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+ zone_page_state(zone, NR_ISOLATED_ANON);
+
+ return isolated > inactive;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ unsigned long low_pfn, end_pfn;
+ struct list_head *migratelist = &cc->migratepages;
+
+ /* Do not scan outside zone boundaries */
+ low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+
+ /* Only scan within a pageblock boundary */
+ end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+ /* Do not cross the free scanner or scan within a memory hole */
+ if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+ cc->migrate_pfn = end_pfn;
+ return 0;
+ }
+
+ /*
+ * Ensure that there are not too many pages isolated from the LRU
+ * list by either parallel reclaimers or compaction. If there are,
+ * delay for some time until fewer pages are isolated
+ */
+ while (unlikely(too_many_isolated(zone))) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+ if (fatal_signal_pending(current))
+ return 0;
+ }
+
+ /* Time to isolate some pages for migration */
+ spin_lock_irq(&zone->lru_lock);
+ for (; low_pfn < end_pfn; low_pfn++) {
+ struct page *page;
+ if (!pfn_valid_within(low_pfn))
+ continue;
+
+ /* Get the page and skip if free */
+ page = pfn_to_page(low_pfn);
+ if (PageBuddy(page))
+ continue;
+
+ /* Try isolate the page */
+ if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+ continue;
+
+ /* Successfully isolated */
+ del_page_from_lru_list(zone, page, page_lru(page));
+ list_add(&page->lru, migratelist);
+ mem_cgroup_del_lru(page);
+ cc->nr_migratepages++;
+
+ /* Avoid isolating too much */
+ if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+ break;
+ }
+
+ acct_isolated(zone, cc);
+
+ spin_unlock_irq(&zone->lru_lock);
+ cc->migrate_pfn = low_pfn;
+
+ return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data,
+ int **result)
+{
+ struct compact_control *cc = (struct compact_control *)data;
+ struct page *freepage;
+
+ /* Isolate free pages if necessary */
+ if (list_empty(&cc->freepages)) {
+ isolate_freepages(cc->zone, cc);
+
+ if (list_empty(&cc->freepages))
+ return NULL;
+ }
+
+ freepage = list_entry(cc->freepages.next, struct page, lru);
+ list_del(&freepage->lru);
+ cc->nr_freepages--;
+
+ return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+ int nr_migratepages = 0;
+ int nr_freepages = 0;
+ struct page *page;
+
+ list_for_each_entry(page, &cc->migratepages, lru)
+ nr_migratepages++;
+ list_for_each_entry(page, &cc->freepages, lru)
+ nr_freepages++;
+
+ cc->nr_migratepages = nr_migratepages;
+ cc->nr_freepages = nr_freepages;
+}
+
+static int compact_finished(struct zone *zone,
+ struct compact_control *cc)
+{
+ if (fatal_signal_pending(current))
+ return COMPACT_PARTIAL;
+
+ /* Compaction run completes if the migrate and free scanner meet */
+ if (cc->free_pfn <= cc->migrate_pfn)
+ return COMPACT_COMPLETE;
+
+ return COMPACT_CONTINUE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+ int ret;
+
+ /* Setup to move all movable pages to the end of the zone */
+ cc->migrate_pfn = zone->zone_start_pfn;
+ cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+ cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ migrate_prep();
+
+ while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
+ unsigned long nr_migrate, nr_remaining;
+
+ if (!isolate_migratepages(zone, cc))
+ continue;
+
+ nr_migrate = cc->nr_migratepages;
+ migrate_pages(&cc->migratepages, compaction_alloc,
+ (unsigned long)cc, 0);
+ update_nr_listpages(cc);
+ nr_remaining = cc->nr_migratepages;
+
+ count_vm_event(COMPACTBLOCKS);
+ count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+ if (nr_remaining)
+ count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+ /* Release LRU pages not migrated */
+ if (!list_empty(&cc->migratepages)) {
+ putback_lru_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
+ }
+
+ }
+
+ /* Release free pages and check accounting */
+ cc->nr_freepages -= release_freepages(&cc->freepages);
+ VM_BUG_ON(cc->nr_freepages != 0);
+
+ return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a4fb0a..f29d070 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,51 @@ void split_page(struct page *page, unsigned int order)
}

/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_pages() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ unsigned long watermark;
+ struct zone *zone;
+
+ BUG_ON(!PageBuddy(page));
+
+ zone = page_zone(page);
+ order = page_order(page);
+
+ /* Obey watermarks as if the page was being allocated */
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ return 0;
+
+ /* Remove page from free list */
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+
+ if (order >= pageblock_order - 1) {
+ struct page *endpage = page + (1 << order) - 1;
+ for (; page < endpage; page += pageblock_nr_pages)
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ }
+
+ return 1 << order;
+}
+
+/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 23a5899..c6aacf5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -766,6 +766,13 @@ static const char * const vmstat_text[] = {
"allocstall",

"pgrotated",
+
+#ifdef CONFIG_COMPACTION
+ "compact_blocks_moved",
+ "compact_pages_moved",
+ "compact_pagemigrate_failed",
+#endif
+
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
--
1.6.5

2010-04-20 21:03:00

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 07/14] mm: Export fragmentation index via debugfs

The fragmentation index is only meaningful if an allocation would fail and
indicates what the failure is due to. A value of -1, such as in many of the
example values below, indicates that the allocation would succeed. If it
would fail, the value is between 0 and 1. A value tending towards 0 implies
the allocation failed due to a lack of memory. A value tending towards 1
implies it failed due to external fragmentation.

For the most part, the huge page size will be the size of interest but not
necessarily, so the index is exported on a per-order and per-zone basis via
/sys/kernel/debug/extfrag/extfrag_index

> cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
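
As an illustration of how these values are derived (this sketch is not part of
the patch; it restates the fixed-point calculation of fragmentation_index()
from the hunk below as a standalone program, and the sample inputs are made up):

#include <stdio.h>

/* Mirrors the kernel's contig_page_info for illustration */
struct contig_page_info {
	unsigned long free_pages;		/* free base pages in the zone */
	unsigned long free_blocks_total;	/* free blocks of any order */
	unsigned long free_blocks_suitable;	/* free blocks >= requested order */
};

/*
 * Index scaled by 1000, as in the patch:
 *   -1000 => allocation would succeed
 *   ~0    => failure due to lack of memory
 *   ~1000 => failure due to external fragmentation
 */
static int fragmentation_index(unsigned int order, struct contig_page_info *info)
{
	unsigned long long requested = 1ULL << order;

	if (!info->free_blocks_total)
		return 0;

	if (info->free_blocks_suitable)
		return -1000;

	return 1000 - (int)((1000 + (info->free_pages * 1000ULL / requested)) /
				info->free_blocks_total);
}

int main(void)
{
	/* 1000 free pages scattered as order-0 blocks, order-4 requested */
	struct contig_page_info frag = { 1000, 1000, 0 };
	/* only two order-3 blocks free in the whole zone, order-4 requested */
	struct contig_page_info oom  = { 16, 2, 0 };

	printf("%d %d\n", fragmentation_index(4, &frag),	/* 937 -> 0.937 */
			  fragmentation_index(4, &oom));	/* 0   -> 0.000 */
	return 0;
}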

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/vmstat.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index d3e0fa1..23a5899 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -16,6 +16,7 @@
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
+#include <linux/math64.h>

#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -420,6 +421,33 @@ static void fill_contig_page_info(struct zone *zone,
(order - suitable_order);
}
}
+
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index only makes sense when a request would fail */
+ if (info->free_blocks_suitable)
+ return -1000;
+
+ /*
+ * Index is between 0 and 1 so return within 3 decimal places
+ *
+ * 0 => allocation would fail due to lack of memory
+ * 1 => allocation would fail due to fragmentation
+ */
+ return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
+}
#endif

#if defined(CONFIG_PROC_FS) || defined(CONFIG_COMPACTION)
@@ -1087,6 +1115,58 @@ static const struct file_operations unusable_file_ops = {
.release = seq_release,
};

+static void extfrag_show_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+ int index;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %d, zone %8s ",
+ pgdat->node_id,
+ zone->name);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ index = fragmentation_index(order, &info);
+ seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+ return 0;
+}
+
+static const struct seq_operations extfrag_op = {
+ .start = frag_start,
+ .next = frag_next,
+ .stop = frag_stop,
+ .show = extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+ .open = extfrag_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
static int __init extfrag_debug_init(void)
{
extfrag_debug_root = debugfs_create_dir("extfrag", NULL);
@@ -1097,6 +1177,10 @@ static int __init extfrag_debug_init(void)
extfrag_debug_root, NULL, &unusable_file_ops))
return -ENOMEM;

+ if (!debugfs_create_file("extfrag_index", 0444,
+ extfrag_debug_root, NULL, &extfrag_file_ops))
+ return -ENOMEM;
+
return 0;
}

--
1.6.5

2010-04-20 21:03:19

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 06/14] mm: Export unusable free space index via debugfs

The unusable free space index measures how much of the available free
memory cannot be used to satisfy an allocation of a given size and is
a value between 0 and 1. The higher the value, the more of free memory
is unusable and by implication, the worse the external fragmentation
is. For the most part, the huge page size will be the size of interest
but not necessarily so it is exported on a per-order and per-zone basis
via /sys/kernel/debug/extfrag/unusable_index.

> cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
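
As an illustration of how the figures above are derived (this sketch is not
part of the patch; it restates unusable_free_index() from the hunk below as a
standalone program, with made-up sample inputs):

#include <stdio.h>

/* Subset of the kernel's contig_page_info needed for this index */
struct contig_page_info {
	unsigned long free_pages;		/* free base pages in the zone */
	unsigned long free_blocks_suitable;	/* free blocks >= requested order */
};

/* Unusable free space index scaled by 1000, i.e. 0.000 .. 1.000 */
static int unusable_free_index(unsigned int order, struct contig_page_info *info)
{
	/* No free memory is interpreted as all free memory being unusable */
	if (info->free_pages == 0)
		return 1000;

	/* Fraction of free pages sitting in blocks too small for the request */
	return (int)(((info->free_pages -
			(info->free_blocks_suitable << order)) * 1000ULL) /
			info->free_pages);
}

int main(void)
{
	/* 1024 pages free, of which two order-8 blocks (512 pages) suit order-8 */
	struct contig_page_info info = { 1024, 2 };
	int index = unusable_free_index(8, &info);

	/* Prints 0.500: half of the free memory is unusable at order-8 */
	printf("%d.%03d\n", index / 1000, index % 1000);
	return 0;
}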

[[email protected]: Fix allnoconfig]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/vmstat.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 149 insertions(+), 1 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..d3e0fa1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -379,7 +379,50 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z)
}
#endif

-#ifdef CONFIG_PROC_FS
+#ifdef CONFIG_COMPACTION
+struct contig_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+ unsigned int suitable_order,
+ struct contig_page_info *info)
+{
+ unsigned int order;
+
+ info->free_pages = 0;
+ info->free_blocks_total = 0;
+ info->free_blocks_suitable = 0;
+
+ for (order = 0; order < MAX_ORDER; order++) {
+ unsigned long blocks;
+
+ /* Count number of free blocks */
+ blocks = zone->free_area[order].nr_free;
+ info->free_blocks_total += blocks;
+
+ /* Count free base pages */
+ info->free_pages += blocks << order;
+
+ /* Count the suitable free blocks */
+ if (order >= suitable_order)
+ info->free_blocks_suitable += blocks <<
+ (order - suitable_order);
+ }
+}
+#endif
+
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_COMPACTION)
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

@@ -432,7 +475,9 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
spin_unlock_irqrestore(&zone->lock, flags);
}
}
+#endif

+#ifdef CONFIG_PROC_FS
static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
@@ -954,3 +999,106 @@ static int __init setup_vmstat(void)
return 0;
}
module_init(setup_vmstat)
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)
+#include <linux/debugfs.h>
+
+static struct dentry *extfrag_debug_root;
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+ struct contig_page_info *info)
+{
+ /* No free memory is interpreted as all free memory is unusable */
+ if (info->free_pages == 0)
+ return 1000;
+
+ /*
+ * Index should be a value between 0 and 1. Return a value to 3
+ * decimal places.
+ *
+ * 0 => no fragmentation
+ * 1 => high fragmentation
+ */
+ return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+ int index;
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %d, zone %8s ",
+ pgdat->node_id,
+ zone->name);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ index = unusable_free_index(order, &info);
+ seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ *
+ * The unusable free space index measures how much of the available free
+ * memory cannot be used to satisfy an allocation of a given size and is a
+ * value between 0 and 1. The higher the value, the more of free memory is
+ * unusable and by implication, the worse the external fragmentation is. This
+ * can be expressed as a percentage by multiplying by 100.
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ /* check memoryless node */
+ if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ return 0;
+
+ walk_zones_in_node(m, pgdat, unusable_show_print);
+
+ return 0;
+}
+
+static const struct seq_operations unusable_op = {
+ .start = frag_start,
+ .next = frag_next,
+ .stop = frag_stop,
+ .show = unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+ .open = unusable_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init extfrag_debug_init(void)
+{
+ extfrag_debug_root = debugfs_create_dir("extfrag", NULL);
+ if (!extfrag_debug_root)
+ return -ENOMEM;
+
+ if (!debugfs_create_file("unusable_index", 0444,
+ extfrag_debug_root, NULL, &unusable_file_ops))
+ return -ENOMEM;
+
+ return 0;
+}
+
+module_init(extfrag_debug_init);
+#endif
--
1.6.5

2010-04-20 21:01:13

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 02/14] mm,migration: Share the anon_vma ref counts between KSM and page migration

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/rmap.h | 50 ++++++++++++++++++--------------------------------
mm/ksm.c | 4 ++--
mm/migrate.c | 4 ++--
mm/rmap.c | 6 ++----
4 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
*/
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
-#ifdef CONFIG_KSM
- atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
- atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+ /*
+ * The external_refcount is taken by either KSM or page migration
+ * to take a reference to an anon_vma when there is no
+ * guarantee that the vma of page tables will exist for
+ * the duration of the operation. A caller that takes
+ * the reference is responsible for clearing up the
+ * anon_vma if they are the last user on release
+ */
+ atomic_t external_refcount;
#endif
/*
* NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
};

#ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
{
- atomic_set(&anon_vma->ksm_refcount, 0);
+ atomic_set(&anon_vma->external_refcount, 0);
}

-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
{
- return atomic_read(&anon_vma->ksm_refcount);
+ return atomic_read(&anon_vma->external_refcount);
}
#else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
{
}

-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
{
return 0;
}
#endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
- atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
- return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
- return 0;
-}
-#endif /* CONFIG_MIGRATE */

static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..3666d43 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
struct anon_vma *anon_vma)
{
rmap_item->anon_vma = anon_vma;
- atomic_inc(&anon_vma->ksm_refcount);
+ atomic_inc(&anon_vma->external_refcount);
}

static void drop_anon_vma(struct rmap_item *rmap_item)
{
struct anon_vma *anon_vma = rmap_item->anon_vma;

- if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+ if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
int empty = list_empty(&anon_vma->head);
spin_unlock(&anon_vma->lock);
if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index b768a1d..42a3d24 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -601,7 +601,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
rcu_read_lock();
rcu_locked = 1;
anon_vma = page_anon_vma(page);
- atomic_inc(&anon_vma->migrate_refcount);
+ atomic_inc(&anon_vma->external_refcount);
}

/*
@@ -643,7 +643,7 @@ skip_unmap:
rcu_unlock:

/* Drop an anon_vma reference if we took one */
- if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+ if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
int empty = list_empty(&anon_vma->head);
spin_unlock(&anon_vma->lock);
if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 4bad2c5..85f203e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -249,8 +249,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
list_del(&anon_vma_chain->same_anon_vma);

/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
- !migrate_refcount(anon_vma);
+ empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
spin_unlock(&anon_vma->lock);

if (empty)
@@ -274,8 +273,7 @@ static void anon_vma_ctor(void *data)
struct anon_vma *anon_vma = data;

spin_lock_init(&anon_vma->lock);
- ksm_refcount_init(anon_vma);
- migrate_refcount_init(anon_vma);
+ anonvma_external_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}

--
1.6.5

2010-04-20 21:03:33

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 05/14] mm: Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA, so the restriction makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

[[email protected]: Depend on CONFIG_HUGETLB_PAGE]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/Kconfig | 18 +++++++++++++++---
1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..a275a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
default "4"

#
+# support for memory compaction
+config COMPACTION
+ bool "Allow for memory compaction"
+ def_bool y
+ select MIGRATION
+ depends on EXPERIMENTAL && HUGETLB_PAGE && MMU
+ help
+ Allows the compaction of memory for the allocation of huge pages.
+
+#
# support for page migration
#
config MIGRATION
@@ -180,9 +190,11 @@ config MIGRATION
depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
help
Allows the migration of the physical location of pages of processes
- while the virtual addresses are not changed. This is useful for
- example on NUMA systems to put pages nearer to the processors accessing
- the page.
+ while the virtual addresses are not changed. This is useful in
+ two situations. The first is on NUMA systems to put pages nearer
to the processors accessing them. The second is when allocating huge
+ pages as migration can relocate pages to satisfy a huge page
+ allocation instead of reclaiming.

config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
1.6.5

2010-04-20 21:03:51

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies page swap caches and allows
them to be migrated but ensures that no attempt is made to remap the pages
in a way that would potentially try to access an already freed anon_vma.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
mm/migrate.c | 53 ++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b114635..4afd6fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,7 +485,8 @@ static int fallback_migrate_page(struct address_space *mapping,
* < 0 - error code
* == 0 - success
*/
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+ int remap_swapcache)
{
struct address_space *mapping;
int rc;
@@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
else
rc = fallback_migrate_page(mapping, newpage, page);

- if (!rc)
- remove_migration_ptes(page, newpage);
- else
+ if (rc) {
newpage->mapping = NULL;
+ } else {
+ if (remap_swapcache)
+ remove_migration_ptes(page, newpage);
+ }

unlock_page(newpage);

@@ -540,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
+ int remap_swapcache = 1;
int rcu_locked = 0;
int charge = 0;
struct mem_cgroup *mem = NULL;
@@ -601,18 +605,33 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
rcu_read_lock();
rcu_locked = 1;

- /*
- * If the page has no mappings any more, just bail. An
- * unmapped anon page is likely to be freed soon but worse,
- * it's possible its anon_vma disappeared between when
- * the page was isolated and when we reached here while
- * the RCU lock was not held
- */
- if (!page_mapped(page))
- goto rcu_unlock;
+ /* Determine how to safely use anon_vma */
+ if (!page_mapped(page)) {
+ if (!PageSwapCache(page))
+ goto rcu_unlock;

- anon_vma = page_anon_vma(page);
- atomic_inc(&anon_vma->external_refcount);
+ /*
+ * We cannot be sure that the anon_vma of an unmapped
+ * swapcache page is safe to use because we don't
+ * know in advance if the VMA that this page belonged
+ * to still exists. If the VMA and others sharing the
+ * data have been freed, then the anon_vma could
+ * already be invalid.
+ *
+ * To avoid this possibility, swapcache pages get
+ * migrated but are not remapped when migration
+ * completes
+ */
+ remap_swapcache = 0;
+ } else {
+ /*
+ * Take a reference count on the anon_vma if the
+ * page is mapped so that it is guaranteed to
+ * exist when the page is remapped later
+ */
+ anon_vma = page_anon_vma(page);
+ atomic_inc(&anon_vma->external_refcount);
+ }
}

/*
@@ -647,9 +666,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,

skip_unmap:
if (!page_mapped(page))
- rc = move_to_new_page(newpage, page);
+ rc = move_to_new_page(newpage, page, remap_swapcache);

- if (rc)
+ if (rc && remap_swapcache)
remove_migration_ptes(page, page);
rcu_unlock:

--
1.6.5

2010-04-20 21:03:52

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 03/14] mm,migration: Do not try to migrate unmapped anonymous pages

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being
isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
page dropped to 0 and the anon_vma gets freed. This can happen during
memory compaction if pages being migrated belong to a process that exits
before migration completes. Hence, the use-after-free race looks like

1. Page isolated for migration
2. Process exits
3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
5. try_to_unmap() is called, looks up the anon_vma and "locks" it but the lock
is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/migrate.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 42a3d24..b114635 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -600,6 +600,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (PageAnon(page)) {
rcu_read_lock();
rcu_locked = 1;
+
+ /*
+ * If the page has no mappings any more, just bail. An
+ * unmapped anon page is likely to be freed soon but worse,
+ * it's possible its anon_vma disappeared between when
+ * the page was isolated and when we reached here while
+ * the RCU lock was not held
+ */
+ if (!page_mapped(page))
+ goto rcu_unlock;
+
anon_vma = page_anon_vma(page);
atomic_inc(&anon_vma->external_refcount);
}
--
1.6.5

2010-04-21 02:53:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating

On Tue, 20 Apr 2010 22:01:03 +0100
Mel Gorman <[email protected]> wrote:

> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
>
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Rik van Riel <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Hmm, a nitpick..


> static inline struct anon_vma *page_anon_vma(struct page *page)
> {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5938db5..b768a1d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -543,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> int rcu_locked = 0;
> int charge = 0;
> struct mem_cgroup *mem = NULL;
> + struct anon_vma *anon_vma = NULL;
>
> if (!newpage)
> return -ENOMEM;
> @@ -599,6 +600,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> if (PageAnon(page)) {
> rcu_read_lock();
> rcu_locked = 1;
> + anon_vma = page_anon_vma(page);
> + atomic_inc(&anon_vma->migrate_refcount);
> }
>
> /*
> @@ -638,6 +641,15 @@ skip_unmap:
> if (rc)
> remove_migration_ptes(page, page);
> rcu_unlock:
> +
> + /* Drop an anon_vma reference if we took one */
> + if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> + int empty = list_empty(&anon_vma->head);
> + spin_unlock(&anon_vma->lock);
> + if (empty)
> + anon_vma_free(anon_vma);
> + }
> +

Hmm. Can't we move this part to mm/rmap.c ?

as
anon_vma_drop_external_reference()
or something.

Thanks,
-Kame

2010-04-21 14:31:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Tue, 20 Apr 2010, Mel Gorman wrote:

> @@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> else
> rc = fallback_migrate_page(mapping, newpage, page);
>
> - if (!rc)
> - remove_migration_ptes(page, newpage);
> - else
> + if (rc) {
> newpage->mapping = NULL;
> + } else {
> + if (remap_swapcache)
> + remove_migration_ptes(page, newpage);
> + }

You are going to keep the migration ptes after the page has been unlocked?
Or is remap_swapcache true if it's not a swapcache page?

Maybe you meant

if (!remap_swapcache)

?

> unlock_page(newpage);
>

>
> skip_unmap:
> if (!page_mapped(page))
> - rc = move_to_new_page(newpage, page);
> + rc = move_to_new_page(newpage, page, remap_swapcache);
>
> - if (rc)
> + if (rc && remap_swapcache)
> remove_migration_ptes(page, page);
> rcu_unlock:
>

Looks like you meant !remap_swapcache

2010-04-21 15:00:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, Apr 21, 2010 at 09:30:20AM -0500, Christoph Lameter wrote:
> On Tue, 20 Apr 2010, Mel Gorman wrote:
>
> > @@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> > else
> > rc = fallback_migrate_page(mapping, newpage, page);
> >
> > - if (!rc)
> > - remove_migration_ptes(page, newpage);
> > - else
> > + if (rc) {
> > newpage->mapping = NULL;
> > + } else {
> > + if (remap_swapcache)
> > + remove_migration_ptes(page, newpage);
> > + }
>
> You are going to keep the migration ptes after the page has been unlocked?

Yes, because it's not known if the anon_vma for the unmapped swapcache page
still exists or not. Now, a bug has been reported where a migration PTE is
found where the page is not locked. I'm trying to determine if it's the same
page or not but the problem takes ages to reproduce.

> Or is remap_swapcache true if its not a swapcache page?
>
> Maybe you meant
>
> if (!remap_swapcache)
>
> ?
>

No, remap_swapcache could just be called "remap". If it's 0, it's
considered unsafe to remap the page.

> > unlock_page(newpage);
> >
>
> >
> > skip_unmap:
> > if (!page_mapped(page))
> > - rc = move_to_new_page(newpage, page);
> > + rc = move_to_new_page(newpage, page, remap_swapcache);
> >
> > - if (rc)
> > + if (rc && remap_swapcache)
> > remove_migration_ptes(page, page);
> > rcu_unlock:
> >
>
> Looks like you meant !remap_swapcache
>

If remap_swapcache is 1, the anon_vma is valid (or irrelevant because
it's a file) and it's safe to remap the page by removing the migration
PTEs.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-21 15:06:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, 21 Apr 2010, Mel Gorman wrote:

> No, remap_swapcache could just be called "remap". If it's 0, it's
> considered unsafe to remap the page.

Call this "can_remap"?

2010-04-21 15:14:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, Apr 21, 2010 at 10:05:21AM -0500, Christoph Lameter wrote:
> On Wed, 21 Apr 2010, Mel Gorman wrote:
>
> > No, remap_swapcache could just be called "remap". If it's 0, it's
> > considered unsafe to remap the page.
>
> Call this "can_remap"?
>

can_do - ba dum tisch.

While you are looking though, maybe you can confirm something for me.

1. Is leaving a migration PTE like this behind reasonable? (I think yes
particularly as the page was already unmapped so it's not a new fault
incurred)
2. Is the BUG_ON check in
include/linux/swapops.h#migration_entry_to_page() now wrong? (I
think yes, but I'm not sure and I'm having trouble verifying it)

Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-21 15:32:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, 21 Apr 2010, Mel Gorman wrote:

> On Wed, Apr 21, 2010 at 10:05:21AM -0500, Christoph Lameter wrote:
> > On Wed, 21 Apr 2010, Mel Gorman wrote:
> >
> > > No, remap_swapcache could just be called "remap". If it's 0, it's
> > > considered unsafe to remap the page.
> >
> > Call this "can_remap"?
> >
>
> can_do - ba dum tisch.
>
> While you are looking though, maybe you can confirm something for me.
>
> 1. Is leaving a migration PTE like this behind reasonable? (I think yes
> particularly as the page was already unmapped so it's not a new fault
> incurred)

The design of page migration only allows for the existence of these as
long as the page is locked. Not sure what would happen if you leave this
hanging around. Paths that are not prepared for a migration_pte may
encounter one.

> 2. Is the BUG_ON check in
> include/linux/swapops.h#migration_entry_to_page() now wrong? (I
> think yes, but I'm not sure and I'm having trouble verifying it)

The bug check ensures that migration entries only occur when the page
is locked. This patch changes that behavior. This is therefore going to oops
in unmap_and_move() when you try to remove the migration_ptes
from an unlocked page.

2010-04-21 15:34:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, Apr 21, 2010 at 10:31:29AM -0500, Christoph Lameter wrote:
> On Wed, 21 Apr 2010, Mel Gorman wrote:
>
> > On Wed, Apr 21, 2010 at 10:05:21AM -0500, Christoph Lameter wrote:
> > > On Wed, 21 Apr 2010, Mel Gorman wrote:
> > >
> > > > No, remap_swapcache could just be called "remap". If it's 0, it's
> > > > considered unsafe to remap the page.
> > >
> > > Call this "can_remap"?
> > >
> >
> > can_do - ba dum tisch.
> >
> > While you are looking though, maybe you can confirm something for me.
> >
> > 1. Is leaving a migration PTE like this behind reasonable? (I think yes
> > particularly as the page was already unmapped so it's not a new fault
> > incurred)
>
> The design of page migration only allows for the existence of these as
> long as the page is locked. Not sure what would happen if you leave this
> hanging around. Paths that are not prepared for a migration_pte may
> encounter one.
>

If there are other paths, then migration of unmapped PageSwapCache is
plain unsafe and this patch would have to go. It'd limit compaction
somewhat but without the series, it would appear that memory hot-remove
is unsafe (albeit almost impossible to trigger).

> > 2. Is the BUG_ON check in
> > include/linux/swapops.h#migration_entry_to_page() now wrong? (I
> > think yes, but I'm not sure and I'm having trouble verifying it)
>
> The bug check ensures that migration entries only occur when the page
> is locked. This patch changes that behavior. This is going too oops
> therefore in unmap_and_move() when you try to remove the migration_ptes
> from an unlocked page.
>

It's not unmap_and_move() that the problem is occurring on but during a
page fault - presumably in do_swap_page but I'm not 100% certain.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-21 15:47:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, 21 Apr 2010, Mel Gorman wrote:

> > > 2. Is the BUG_ON check in
> > > include/linux/swapops.h#migration_entry_to_page() now wrong? (I
> > > think yes, but I'm not sure and I'm having trouble verifying it)
> >
> > The bug check ensures that migration entries only occur when the page
> > is locked. This patch changes that behavior. This is going too oops
> > therefore in unmap_and_move() when you try to remove the migration_ptes
> > from an unlocked page.
> >
>
> It's not unmap_and_move() that the problem is occurring on but during a
> page fault - presumably in do_swap_page but I'm not 100% certain.

remove_migration_pte() calls migration_entry_to_page(). So it must do that
only if the page is still locked.

You need to ensure that the page is not unlocked in move_to_new_page() if
the migration ptes are kept.

move_to_new_page() only unlocks the new page not the original page. So that is safe.

And it seems that the old page is also unlocked in unmap_and_move() only
after the migration_ptes have been removed? So we are fine after all...?


2010-04-22 00:04:29

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, 21 Apr 2010 09:30:20 -0500 (CDT)
Christoph Lameter <[email protected]> wrote:

> On Tue, 20 Apr 2010, Mel Gorman wrote:
>
> > @@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> > else
> > rc = fallback_migrate_page(mapping, newpage, page);
> >
> > - if (!rc)
> > - remove_migration_ptes(page, newpage);
> > - else
> > + if (rc) {
> > newpage->mapping = NULL;
> > + } else {
> > + if (remap_swapcache)
> > + remove_migration_ptes(page, newpage);
> > + }
>
> You are going to keep the migration ptes after the page has been unlocked?
> Or is remap_swapcache true if its not a swapcache page?
>
> Maybe you meant
>
> if (!remap_swapcache)
>

Ah....Can I confirm my understanding ?

remap_swapcache == true only when
The old page was ANON && it is not mapped. && it is SwapCache.

We do the above check under lock_page(). So, this SwapCache is never mapped until
we release lock_page() on the old page. So, we don't use migration_pte in
this case because try_to_unmap() does nothing and we don't need to call
remove_migration_pte().

If a migration_pte is used somewhere...I think it's a bug.

-Kame



> ?
>
> > unlock_page(newpage);
> >
>
> >
> > skip_unmap:
> > if (!page_mapped(page))
> > - rc = move_to_new_page(newpage, page);
> > + rc = move_to_new_page(newpage, page, remap_swapcache);
> >
> > - if (rc)
> > + if (rc && remap_swapcache)
> > remove_migration_ptes(page, page);
> > rcu_unlock:
> >
>
> Looks like you meant !remap_swapcache
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2010-04-22 00:11:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 8:59 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 21 Apr 2010 09:30:20 -0500 (CDT)
> Christoph Lameter <[email protected]> wrote:
>
>> On Tue, 20 Apr 2010, Mel Gorman wrote:
>>
>> > @@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
>> >     else
>> >             rc = fallback_migrate_page(mapping, newpage, page);
>> >
>> > -   if (!rc)
>> > -           remove_migration_ptes(page, newpage);
>> > -   else
>> > +   if (rc) {
>> >             newpage->mapping = NULL;
>> > +   } else {
>> > +           if (remap_swapcache)
>> > +                   remove_migration_ptes(page, newpage);
>> > +   }
>>
>> You are going to keep the migration ptes after the page has been unlocked?
>> Or is remap_swapcache true if its not a swapcache page?
>>
>> Maybe you meant
>>
>> if (!remap_swapcache)
>>
>
> Ah....Can I confirm my understanding ?
>
> remap_swapcache == true only when
>  The old page was ANON && it is not mapped. && it is SwapCache.
>
> We do above check under lock_page(). So, this SwapCache is never mapped until
> we release lock_page() on the old page. So, we don't use migration_pte in
> this case because try_to_unmap() do nothing and don't need to call
> remove_migration_pte().

Yes. So I wondered what kind of race happened.
At first I suspected fork and migration, but it isn't that.
I can't understand how this bug happens.
Apparently, we have missed something.
I will look into this further.


--
Kind regards,
Minchan Kim

2010-04-22 09:28:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Wed, Apr 21, 2010 at 10:46:45AM -0500, Christoph Lameter wrote:
> On Wed, 21 Apr 2010, Mel Gorman wrote:
>
> > > > 2. Is the BUG_ON check in
> > > > include/linux/swapops.h#migration_entry_to_page() now wrong? (I
> > > > think yes, but I'm not sure and I'm having trouble verifying it)
> > >
> > > The bug check ensures that migration entries only occur when the page
> > > is locked. This patch changes that behavior. This is going too oops
> > > therefore in unmap_and_move() when you try to remove the migration_ptes
> > > from an unlocked page.
> > >
> >
> > It's not unmap_and_move() that the problem is occurring on but during a
> > page fault - presumably in do_swap_page but I'm not 100% certain.
>
> remove_migration_pte() calls migration_entry_to_page(). So it must do that
> only if the page is still locked.
>

Correct, but the other call path is

do_swap_page
-> migration_entry_wait
-> migration_entry_to_page

with migration_entry_wait expecting the page to be locked. There are dangling
migration PTEs coming from somewhere. I thought they came from unmapped swapcache
first, but that cannot be the case. There is a race somewhere.

> You need to ensure that the page is not unlocked in move_to_new_page() if
> the migration ptes are kept.
>
> move_to_new_page() only unlocks the new page not the original page. So that is safe.
>
> And it seems that the old page is also unlocked in unmap_and_move() only
> after the migration_ptes have been removed? So we are fine after all...?
>

You'd think so, but migration PTEs are being left behind in some circumstance. I
thought it was due to this series, but it's unlikely. It's more that
compaction heavily exercises migration.

We can clean up the old migration PTEs though when they are encountered
like in the following patch for example? I'll continue investigating why
this dangling migration pte exists as closing that race would be a
better fix.

==== CUT HERE ====
mm,migration: Remove dangling migration ptes pointing to unlocked pages

Due to some yet-to-be-identified race, it is possible for migration PTEs
to be left behind. When one is later paged in, a BUG is triggered by the
assumption that all migration PTEs point to a page currently being migrated
and so must be locked.

Rather than calling BUG, this patch notes the existence of dangling migration
PTEs in migration_entry_wait() and cleans them up.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/swapops.h | 11 ++++++-----
mm/memory.c | 2 +-
mm/migrate.c | 24 ++++++++++++++++++++----
3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cd42e30..01fba71 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -95,14 +95,15 @@ static inline int is_write_migration_entry(swp_entry_t entry)
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline struct page *migration_entry_to_page(swp_entry_t entry)
+static inline struct page *migration_entry_to_page(swp_entry_t entry,
+ bool lock_required)
{
struct page *p = pfn_to_page(swp_offset(entry));
/*
* Any use of migration entries may only occur while the
* corresponding page is locked
*/
- BUG_ON(!PageLocked(p));
+ BUG_ON(!PageLocked(p) && lock_required);
return p;
}

@@ -112,7 +113,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
}

extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address);
+ unsigned long address, bool lock_required);
#else

#define make_migration_entry(page, write) swp_entry(0, 0)
@@ -121,9 +122,9 @@ static inline int is_migration_entry(swp_entry_t swp)
return 0;
}
#define migration_entry_to_page(swp) NULL
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
+static inline void make_migration_entry_read(swp_entry_t *entryp, bool lock_required) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address) { }
+ unsigned long address, bool lock_required) { }
static inline int is_write_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 833952d..9719138 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2619,7 +2619,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ migration_entry_wait(mm, pmd, address, false);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 4afd6fe..308639b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -114,7 +114,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
entry = pte_to_swp_entry(pte);

if (!is_migration_entry(entry) ||
- migration_entry_to_page(entry) != old)
+ migration_entry_to_page(entry, true) != old)
goto unlock;

get_page(new);
@@ -148,19 +148,23 @@ static void remove_migration_ptes(struct page *old, struct page *new)

/*
* Something used the pte of a page under migration. We need to
- * get to the page and wait until migration is finished.
+ * get to the page and wait until migration is finished. Alternatively,
+ * the migration of an unmapped swapcache page could have left a
+ * dangling migration pte due to the lack of certainty about an
+ * anon_vma. In that case, the migration pte is removed and rechecked.
* When we return from this function the fault will be retried.
*
* This function is called from do_swap_page().
*/
void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
+ unsigned long address, bool lock_required)
{
pte_t *ptep, pte;
spinlock_t *ptl;
swp_entry_t entry;
struct page *page;

+recheck:
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
if (!is_swap_pte(pte))
@@ -170,7 +174,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
if (!is_migration_entry(entry))
goto out;

- page = migration_entry_to_page(entry);
+ page = migration_entry_to_page(entry, lock_required);

/*
* Once radix-tree replacement of page migration started, page_count
@@ -181,6 +185,18 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
*/
if (!get_page_unless_zero(page))
goto out;
+
+ /* If unlocked, this is a dangling migration pte that needs removal */
+ if (!PageLocked(page)) {
+ BUG_ON(lock_required);
+ pte_unmap_unlock(ptep, ptl);
+ lock_page(page);
+ remove_migration_pte(page, find_vma(mm, address), address, page);
+ unlock_page(page);
+ put_page(page);
+ goto recheck;
+ }
+
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);

2010-04-22 09:50:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010 10:28:20 +0100
Mel Gorman <[email protected]> wrote:

> On Wed, Apr 21, 2010 at 10:46:45AM -0500, Christoph Lameter wrote:
> > On Wed, 21 Apr 2010, Mel Gorman wrote:
> >
> > > > > 2. Is the BUG_ON check in
> > > > > include/linux/swapops.h#migration_entry_to_page() now wrong? (I
> > > > > think yes, but I'm not sure and I'm having trouble verifying it)
> > > >
> > > > The bug check ensures that migration entries only occur when the page
> > > > is locked. This patch changes that behavior. This is going too oops
> > > > therefore in unmap_and_move() when you try to remove the migration_ptes
> > > > from an unlocked page.
> > > >
> > >
> > > It's not unmap_and_move() that the problem is occurring on but during a
> > > page fault - presumably in do_swap_page but I'm not 100% certain.
> >
> > remove_migration_pte() calls migration_entry_to_page(). So it must do that
> > only if the page is still locked.
> >
>
> Correct, but the other call path is
>
> do_swap_page
> -> migration_entry_wait
> -> migration_entry_to_page
>
> with migration_entry_wait expecting the page to be locked. There is a dangling
> migration PTEs coming from somewhere. I thought it was from unmapped swapcache
> first, but that cannot be the case. There is a race somewhere.
>
> > You need to ensure that the page is not unlocked in move_to_new_page() if
> > the migration ptes are kept.
> >
> > move_to_new_page() only unlocks the new page not the original page. So that is safe.
> >
> > And it seems that the old page is also unlocked in unmap_and_move() only
> > after the migration_ptes have been removed? So we are fine after all...?
> >
>
> You'd think but migration PTEs are being left behind in some circumstance. I
> thought it was due to this series, but it's unlikely. It's more a case that
> compaction heavily exercises migration.
>
> We can clean up the old migration PTEs though when they are encountered
> like in the following patch for example? I'll continue investigating why
> this dangling migration pte exists as closing that race would be a
> better fix.
>
> ==== CUT HERE ====
> mm,migration: Remove dangling migration ptes pointing to unlocked pages
>
> Due to some yet-to-be-identified race, it is possible for migration PTEs
> to be left behind, When later paged-in, a BUG is triggered that assumes
> that all migration PTEs are point to a page currently being migrated and
> so must be locked.
>
> Rather than calling BUG, this patch notes the existance of dangling migration
> PTEs in migration_entry_wait() and cleans them up.
>

I use a similar patch for debugging. In my patch, when this function finds a
dangling migration entry, it returns an error code and do_swap_page() returns
VM_FAULT_SIGBUS.


Hmm..in my test, the case was.

Before try_to_unmap:
mapcount=1, SwapCache, remap_swapcache=1
After remap
mapcount=0, SwapCache, rc=0.

So, I think there may be some race in rmap_walk() and vma handling or
anon_vma handling. migration_entry isn't found by rmap_walk.

Hmm..it seems this kind of patch will be required for debugging.

-Kame


2010-04-22 10:13:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 22 Apr 2010 10:28:20 +0100
> Mel Gorman <[email protected]> wrote:
>
>> On Wed, Apr 21, 2010 at 10:46:45AM -0500, Christoph Lameter wrote:
>> > On Wed, 21 Apr 2010, Mel Gorman wrote:
>> >
>> > > > > 2. Is the BUG_ON check in
>> > > > >    include/linux/swapops.h#migration_entry_to_page() now wrong? (I
>> > > > >    think yes, but I'm not sure and I'm having trouble verifying it)
>> > > >
>> > > > The bug check ensures that migration entries only occur when the page
>> > > > is locked. This patch changes that behavior. This is going too oops
>> > > > therefore in unmap_and_move() when you try to remove the migration_ptes
>> > > > from an unlocked page.
>> > > >
>> > >
>> > > It's not unmap_and_move() that the problem is occurring on but during a
>> > > page fault - presumably in do_swap_page but I'm not 100% certain.
>> >
>> > remove_migration_pte() calls migration_entry_to_page(). So it must do that
>> > only if the page is still locked.
>> >
>>
>> Correct, but the other call path is
>>
>> do_swap_page
>>   -> migration_entry_wait
>>     -> migration_entry_to_page
>>
>> with migration_entry_wait expecting the page to be locked. There is a dangling
>> migration PTEs coming from somewhere. I thought it was from unmapped swapcache
>> first, but that cannot be the case. There is a race somewhere.
>>
>> > You need to ensure that the page is not unlocked in move_to_new_page() if
>> > the migration ptes are kept.
>> >
>> > move_to_new_page() only unlocks the new page not the original page. So that is safe.
>> >
>> > And it seems that the old page is also unlocked in unmap_and_move() only
>> > after the migration_ptes have been removed? So we are fine after all...?
>> >
>>
>> You'd think but migration PTEs are being left behind in some circumstance. I
>> thought it was due to this series, but it's unlikely. It's more a case that
>> compaction heavily exercises migration.
>>
>> We can clean up the old migration PTEs though when they are encountered
>> like in the following patch for example? I'll continue investigating why
>> this dangling migration pte exists as closing that race would be a
>> better fix.
>>
>> ==== CUT HERE ====
>> mm,migration: Remove dangling migration ptes pointing to unlocked pages
>>
>> Due to some yet-to-be-identified race, it is possible for migration PTEs
>> to be left behind, When later paged-in, a BUG is triggered that assumes
>> that all migration PTEs are point to a page currently being migrated and
>> so must be locked.
>>
>> Rather than calling BUG, this patch notes the existance of dangling migration
>> PTEs in migration_entry_wait() and cleans them up.
>>
>
> I use similar patch for debugging. In my patch, this when this function founds
> dangling migration entry, return error code and do_swap_page() returns
> VM_FAULT_SIGBUS.
>
>
> Hmm..in my test, the case was.
>
> Before try_to_unmap:
>        mapcount=1, SwapCache, remap_swapcache=1
> After remap
>        mapcount=0, SwapCache, rc=0.
>
> So, I think there may be some race in rmap_walk() and vma handling or
> anon_vma handling. migration_entry isn't found by rmap_walk.
>
> Hmm..it seems this kind patch will be required for debug.

I looked at do_swap_page again.
lock_page is called long after migration_entry_wait,
which means lock_page can't close the race.

So I think this BUG is possible.
What do you think?
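
To show what I mean, the ordering in do_swap_page() is roughly the
following (heavily trimmed sketch of the 2.6.34-era code, so treat the
details as approximate):

	entry = pte_to_swp_entry(orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			/* the BUG_ON in migration_entry_to_page() fires in here */
			migration_entry_wait(mm, pmd, address);
			goto out;
		}
	}

	page = lookup_swap_cache(entry);
	if (!page)
		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma, address);

	/* ... */

	lock_page(page);	/* the page lock is only taken this late */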

> -Kame
>
>
>
>



--
Kind regards,
Minchan Kim

2010-04-22 10:35:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010 19:13:12 +0900
Minchan Kim <[email protected]> wrote:

> On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:

> > Hmm..in my test, the case was.
> >
> > Before try_to_unmap:
> >        mapcount=1, SwapCache, remap_swapcache=1
> > After remap
> >        mapcount=0, SwapCache, rc=0.
> >
> > So, I think there may be some race in rmap_walk() and vma handling or
> > anon_vma handling. migration_entry isn't found by rmap_walk.
> >
> > Hmm..it seems this kind patch will be required for debug.
>
> I looked do_swap_page, again.
> lock_page is called long after migration_entry_wait.
> It means lock_page can't close the race.
>
> So I think this BUG is possible.
> What do you think?
>

I think it's not a problem.
When migration_entry_wait() is called, it does

pte_lock();
check migration_pte
check it's locked.

And after wait_on_page_locked(), it just returns to userspace and causes
a page fault again.

Thanks,
-Kame

2010-04-22 10:55:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010 19:31:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 22 Apr 2010 19:13:12 +0900
> Minchan Kim <[email protected]> wrote:
>
> > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
>
> > > Hmm..in my test, the case was.
> > >
> > > Before try_to_unmap:
> > >        mapcount=1, SwapCache, remap_swapcache=1
> > > After remap
> > >        mapcount=0, SwapCache, rc=0.
> > >
> > > So, I think there may be some race in rmap_walk() and vma handling or
> > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > >
> > > Hmm..it seems this kind patch will be required for debug.
> >

Ok, here is my patch for a _fix_. But still testing...
It has been running well for at least 30 minutes, where I could previously see the bug within 10 minutes.
But this patch is too naive. Please think about a better fix.

==
From: KAMEZAWA Hiroyuki <[email protected]>

At vma_adjust(), the vma's start address and pgoff are updated under the
write lock of mmap_sem. This means the vma's rmap information is updated
atomically only under the read lock of mmap_sem.

Even if it's not atomic, in the usual case try_to_unmap() etc. just fail
to decrease mapcount to 0. No problem.

But page migration's rmap_walk() needs to find all migration_entry
instances in the page tables and recover the mapcount.

So, this race on the vma's address is critical. When rmap_walk() meets
the race, it will mistakenly get -EFAULT and not call rmap_one().
This patch adds a lock for the vma's rmap information.
But, this is _very slow_.
We need a more sophisticated, light-weight update for this.


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mm_types.h | 1 +
kernel/fork.c | 1 +
mm/mmap.c | 11 ++++++++++-
mm/rmap.c | 3 +++
4 files changed, 15 insertions(+), 1 deletion(-)

Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
@@ -183,6 +183,7 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ spinlock_t adjust_lock;
};

struct core_thread {
Index: linux-2.6.34-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
+++ linux-2.6.34-rc4-mm1/mm/mmap.c
@@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
if (adjust_next)
vma_prio_tree_remove(next, root);
}
-
+ /*
+ * changing all params in atomic. If not, vma_address in rmap.c
+ * can see wrong result.
+ */
+ spin_lock(&vma->adjust_lock);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
+ spin_unlock(&vma->adjust_lock);
if (adjust_next) {
+ spin_lock(&next->adjust_lock);
next->vm_start += adjust_next << PAGE_SHIFT;
next->vm_pgoff += adjust_next;
+ spin_unlock(&next->adjust_lock);
}

if (root) {
@@ -1939,6 +1946,7 @@ static int __split_vma(struct mm_struct
*new = *vma;

INIT_LIST_HEAD(&new->anon_vma_chain);
+ spin_lock_init(&new->adjust_lock);

if (new_below)
new->vm_end = addr;
@@ -2338,6 +2346,7 @@ struct vm_area_struct *copy_vma(struct v
if (IS_ERR(pol))
goto out_free_vma;
INIT_LIST_HEAD(&new_vma->anon_vma_chain);
+ spin_lock_init(&new_vma->adjust_lock);
if (anon_vma_clone(new_vma, vma))
goto out_free_mempol;
vma_set_policy(new_vma, pol);
Index: linux-2.6.34-rc4-mm1/kernel/fork.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/kernel/fork.c
+++ linux-2.6.34-rc4-mm1/kernel/fork.c
@@ -350,6 +350,7 @@ static int dup_mmap(struct mm_struct *mm
goto fail_nomem;
*tmp = *mpnt;
INIT_LIST_HEAD(&tmp->anon_vma_chain);
+ spin_lock_init(&tmp->adjust_lock);
pol = mpol_dup(vma_policy(mpnt));
retval = PTR_ERR(pol);
if (IS_ERR(pol))
Index: linux-2.6.34-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/rmap.c
+++ linux-2.6.34-rc4-mm1/mm/rmap.c
@@ -332,11 +332,14 @@ vma_address(struct page *page, struct vm
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
unsigned long address;

+ spin_lock(&vma->adjust_lock);
address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ spin_unlock(&vma->adjust_lock);
/* page should be within @vma mapping range */
return -EFAULT;
}
+ spin_unlock(&vma->adjust_lock);
return address;
}




2010-04-22 14:14:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 19:31:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Thu, 22 Apr 2010 19:13:12 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > <[email protected]> wrote:
> >
> > > > Hmm..in my test, the case was.
> > > >
> > > > Before try_to_unmap:
> > > >        mapcount=1, SwapCache, remap_swapcache=1
> > > > After remap
> > > >        mapcount=0, SwapCache, rc=0.
> > > >
> > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > >
> > > > Hmm..it seems this kind patch will be required for debug.
> > >
>
> Ok, here is my patch for _fix_. But still testing...
> Running well at least for 30 minutes, where I can see bug in 10minutes.
> But this patch is too naive. please think about something better fix.
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> At adjust_vma(), vma's start address and pgoff is updated under
> write lock of mmap_sem. This means the vma's rmap information
> update is atoimic only under read lock of mmap_sem.
>
>
> Even if it's not atomic, in usual case, try_to_ummap() etc...
> just fails to decrease mapcount to be 0. no problem.
>
> But at page migration's rmap_walk(), it requires to know all
> migration_entry in page tables and recover mapcount.
>
> So, this race in vma's address is critical. When rmap_walk meet
> the race, rmap_walk will mistakenly get -EFAULT and don't call
> rmap_one(). This patch adds a lock for vma's rmap information.
> But, this is _very slow_.

Ok wow. That is exceptionally well-spotted. This looks like a proper bug
that compaction exposes as opposed to a bug that compaction introduces.

> We need something sophisitcated, light-weight update for this..
>

In the event the VMA is backed by a file, the mapping's i_mmap_lock is taken for
the duration of the update and is also taken where the VMA information
is read, such as in rmap_walk_file().

In the event the VMA is anon, vma_adjust currently takes no locks and your
patch introduces a new one, but why not use the anon_vma lock here? Am I
missing something that requires the new lock?

For example;

==== CUT HERE ====
mm: Take the anon_vma lock in vma_adjust()

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch takes the anon_vma->lock during vma_adjust to avoid such
races.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mmap.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);

2010-04-22 14:24:12

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 2010-04-22 at 19:51 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 19:31:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Thu, 22 Apr 2010 19:13:12 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > <[email protected]> wrote:
> >
> > > > Hmm..in my test, the case was.
> > > >
> > > > Before try_to_unmap:
> > > > mapcount=1, SwapCache, remap_swapcache=1
> > > > After remap
> > > > mapcount=0, SwapCache, rc=0.
> > > >
> > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > >
> > > > Hmm..it seems this kind patch will be required for debug.
> > >
>
> Ok, here is my patch for _fix_. But still testing...
> Running well at least for 30 minutes, where I can see bug in 10minutes.
> But this patch is too naive. please think about something better fix.
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> At adjust_vma(), vma's start address and pgoff is updated under
> write lock of mmap_sem. This means the vma's rmap information
> update is atoimic only under read lock of mmap_sem.
>
>
> Even if it's not atomic, in usual case, try_to_ummap() etc...
> just fails to decrease mapcount to be 0. no problem.
>
> But at page migration's rmap_walk(), it requires to know all
> migration_entry in page tables and recover mapcount.
>
> So, this race in vma's address is critical. When rmap_walk meet
> the race, rmap_walk will mistakenly get -EFAULT and don't call
> rmap_one(). This patch adds a lock for vma's rmap information.
> But, this is _very slow_.
> We need something sophisitcated, light-weight update for this..
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/mm_types.h | 1 +
> kernel/fork.c | 1 +
> mm/mmap.c | 11 ++++++++++-
> mm/rmap.c | 3 +++
> 4 files changed, 15 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
> +++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> @@ -183,6 +183,7 @@ struct vm_area_struct {
> #ifdef CONFIG_NUMA
> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> #endif
> + spinlock_t adjust_lock;
> };
>
> struct core_thread {
> Index: linux-2.6.34-rc4-mm1/mm/mmap.c
> ===================================================================
> --- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
> +++ linux-2.6.34-rc4-mm1/mm/mmap.c
> @@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
> if (adjust_next)
> vma_prio_tree_remove(next, root);
> }
> -
> + /*
> + * changing all params in atomic. If not, vma_address in rmap.c
> + * can see wrong result.
> + */
> + spin_lock(&vma->adjust_lock);
> vma->vm_start = start;
> vma->vm_end = end;
> vma->vm_pgoff = pgoff;
> + spin_unlock(&vma->adjust_lock);
> if (adjust_next) {
> + spin_lock(&next->adjust_lock);
> next->vm_start += adjust_next << PAGE_SHIFT;
> next->vm_pgoff += adjust_next;
> + spin_unlock(&next->adjust_lock);
> }
>
> if (root) {
> @@ -1939,6 +1946,7 @@ static int __split_vma(struct mm_struct
> *new = *vma;
>
> INIT_LIST_HEAD(&new->anon_vma_chain);
> + spin_lock_init(&new->adjust_lock);
>
> if (new_below)
> new->vm_end = addr;
> @@ -2338,6 +2346,7 @@ struct vm_area_struct *copy_vma(struct v
> if (IS_ERR(pol))
> goto out_free_vma;
> INIT_LIST_HEAD(&new_vma->anon_vma_chain);
> + spin_lock_init(&new_vma->adjust_lock);
> if (anon_vma_clone(new_vma, vma))
> goto out_free_mempol;
> vma_set_policy(new_vma, pol);
> Index: linux-2.6.34-rc4-mm1/kernel/fork.c
> ===================================================================
> --- linux-2.6.34-rc4-mm1.orig/kernel/fork.c
> +++ linux-2.6.34-rc4-mm1/kernel/fork.c
> @@ -350,6 +350,7 @@ static int dup_mmap(struct mm_struct *mm
> goto fail_nomem;
> *tmp = *mpnt;
> INIT_LIST_HEAD(&tmp->anon_vma_chain);
> + spin_lock_init(&tmp->adjust_lock);
> pol = mpol_dup(vma_policy(mpnt));
> retval = PTR_ERR(pol);
> if (IS_ERR(pol))
> Index: linux-2.6.34-rc4-mm1/mm/rmap.c
> ===================================================================
> --- linux-2.6.34-rc4-mm1.orig/mm/rmap.c
> +++ linux-2.6.34-rc4-mm1/mm/rmap.c
> @@ -332,11 +332,14 @@ vma_address(struct page *page, struct vm
> pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> unsigned long address;
>
> + spin_lock(&vma->adjust_lock);
> address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> + spin_unlock(&vma->adjust_lock);
> /* page should be within @vma mapping range */
> return -EFAULT;
> }
> + spin_unlock(&vma->adjust_lock);
> return address;
> }
>

Nice Catch, Kame. :)

For further optimization, we could hold vma->adjust_lock only if vma_address
returns -EFAULT. But I hope we redesign it without new locking.
I don't have a good idea right now, though. :(
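
A rough sketch of that optimization on top of Kame's adjust_lock, just to
show the idea - I am not sure the unlocked fast path is really safe
against a half-updated vma, so take it only as a sketch:

static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	unsigned long address;

	/* Fast path: compute without taking adjust_lock */
	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (likely(address >= vma->vm_start && address < vma->vm_end))
		return address;

	/* Looks out of range: possibly racing vma_adjust(), recheck locked */
	spin_lock(&vma->adjust_lock);
	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
		address = -EFAULT;
	spin_unlock(&vma->adjust_lock);

	return address;
}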

--
Kind regards,
Minchan Kim

2010-04-22 14:25:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 11:14 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Thu, 22 Apr 2010 19:31:06 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>> > On Thu, 22 Apr 2010 19:13:12 +0900
>> > Minchan Kim <[email protected]> wrote:
>> >
>> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
>> > > <[email protected]> wrote:
>> >
>> > > > Hmm..in my test, the case was.
>> > > >
>> > > > Before try_to_unmap:
>> > > >        mapcount=1, SwapCache, remap_swapcache=1
>> > > > After remap
>> > > >        mapcount=0, SwapCache, rc=0.
>> > > >
>> > > > So, I think there may be some race in rmap_walk() and vma handling or
>> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
>> > > >
>> > > > Hmm..it seems this kind patch will be required for debug.
>> > >
>>
>> Ok, here is my patch for _fix_. But still testing...
>> Running well at least for 30 minutes, where I can see bug in 10minutes.
>> But this patch is too naive. please think about something better fix.
>>
>> ==
>> From: KAMEZAWA Hiroyuki <[email protected]>
>>
>> At adjust_vma(), vma's start address and pgoff is updated under
>> write lock of mmap_sem. This means the vma's rmap information
>> update is atoimic only under read lock of mmap_sem.
>>
>>
>> Even if it's not atomic, in usual case, try_to_ummap() etc...
>> just fails to decrease mapcount to be 0. no problem.
>>
>> But at page migration's rmap_walk(), it requires to know all
>> migration_entry in page tables and recover mapcount.
>>
>> So, this race in vma's address is critical. When rmap_walk meet
>> the race, rmap_walk will mistakenly get -EFAULT and don't call
>> rmap_one(). This patch adds a lock for vma's rmap information.
>> But, this is _very slow_.
>
> Ok wow. That is exceptionally well-spotted. This looks like a proper bug
> that compaction exposes as opposed to a bug that compaction introduces.
>
>> We need something sophisitcated, light-weight update for this..
>>
>
> In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
> the duration of the update and is  taken elsewhere where the VMA information
> is read such as rmap_walk_file()
>
> In the event the VMA is anon, vma_adjust currently talks no locks and your
> patch introduces a new one but why not use the anon_vma lock here? Am I
> missing something that requires the new lock?

rmap_walk_anon doesn't hold vma's anon_vma->lock.
It holds page->anon_vma->lock.

--
Kind regards,
Minchan Kim

2010-04-22 14:40:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 2010-04-22 at 23:23 +0900, Minchan Kim wrote:
> On Thu, 2010-04-22 at 19:51 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 22 Apr 2010 19:31:06 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Thu, 22 Apr 2010 19:13:12 +0900
> > > Minchan Kim <[email protected]> wrote:
> > >
> > > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > > <[email protected]> wrote:
> > >
> > > > > Hmm..in my test, the case was.
> > > > >
> > > > > Before try_to_unmap:
> > > > > mapcount=1, SwapCache, remap_swapcache=1
> > > > > After remap
> > > > > mapcount=0, SwapCache, rc=0.
> > > > >
> > > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > > >
> > > > > Hmm..it seems this kind patch will be required for debug.
> > > >
> >
> > Ok, here is my patch for _fix_. But still testing...
> > Running well at least for 30 minutes, where I can see bug in 10minutes.
> > But this patch is too naive. please think about something better fix.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > At adjust_vma(), vma's start address and pgoff is updated under
> > write lock of mmap_sem. This means the vma's rmap information
> > update is atoimic only under read lock of mmap_sem.
> >
> >
> > Even if it's not atomic, in usual case, try_to_ummap() etc...
> > just fails to decrease mapcount to be 0. no problem.
> >
> > But at page migration's rmap_walk(), it requires to know all
> > migration_entry in page tables and recover mapcount.
> >
> > So, this race in vma's address is critical. When rmap_walk meet
> > the race, rmap_walk will mistakenly get -EFAULT and don't call
> > rmap_one(). This patch adds a lock for vma's rmap information.
> > But, this is _very slow_.
> > We need something sophisitcated, light-weight update for this..
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/mm_types.h | 1 +
> > kernel/fork.c | 1 +
> > mm/mmap.c | 11 ++++++++++-
> > mm/rmap.c | 3 +++
> > 4 files changed, 15 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > ===================================================================
> > --- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
> > +++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > @@ -183,6 +183,7 @@ struct vm_area_struct {
> > #ifdef CONFIG_NUMA
> > struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> > #endif
> > + spinlock_t adjust_lock;
> > };
> >
> > struct core_thread {
> > Index: linux-2.6.34-rc4-mm1/mm/mmap.c
> > ===================================================================
> > --- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
> > +++ linux-2.6.34-rc4-mm1/mm/mmap.c
> > @@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
> > if (adjust_next)
> > vma_prio_tree_remove(next, root);
> > }
> > -
> > + /*
> > + * changing all params in atomic. If not, vma_address in rmap.c
> > + * can see wrong result.
> > + */
> > + spin_lock(&vma->adjust_lock);
> > vma->vm_start = start;
> > vma->vm_end = end;
> > vma->vm_pgoff = pgoff;
> > + spin_unlock(&vma->adjust_lock);
> > if (adjust_next) {
> > + spin_lock(&next->adjust_lock);
> > next->vm_start += adjust_next << PAGE_SHIFT;
> > next->vm_pgoff += adjust_next;
> > + spin_unlock(&next->adjust_lock);
> > }
> >
> > if (root) {
> > @@ -1939,6 +1946,7 @@ static int __split_vma(struct mm_struct
> > *new = *vma;
> >
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > + spin_lock_init(&new->adjust_lock);
> >
> > if (new_below)
> > new->vm_end = addr;
> > @@ -2338,6 +2346,7 @@ struct vm_area_struct *copy_vma(struct v
> > if (IS_ERR(pol))
> > goto out_free_vma;
> > INIT_LIST_HEAD(&new_vma->anon_vma_chain);
> > + spin_lock_init(&new_vma->adjust_lock);
> > if (anon_vma_clone(new_vma, vma))
> > goto out_free_mempol;
> > vma_set_policy(new_vma, pol);
> > Index: linux-2.6.34-rc4-mm1/kernel/fork.c
> > ===================================================================
> > --- linux-2.6.34-rc4-mm1.orig/kernel/fork.c
> > +++ linux-2.6.34-rc4-mm1/kernel/fork.c
> > @@ -350,6 +350,7 @@ static int dup_mmap(struct mm_struct *mm
> > goto fail_nomem;
> > *tmp = *mpnt;
> > INIT_LIST_HEAD(&tmp->anon_vma_chain);
> > + spin_lock_init(&tmp->adjust_lock);
> > pol = mpol_dup(vma_policy(mpnt));
> > retval = PTR_ERR(pol);
> > if (IS_ERR(pol))
> > Index: linux-2.6.34-rc4-mm1/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.34-rc4-mm1.orig/mm/rmap.c
> > +++ linux-2.6.34-rc4-mm1/mm/rmap.c
> > @@ -332,11 +332,14 @@ vma_address(struct page *page, struct vm
> > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > unsigned long address;
> >
> > + spin_lock(&vma->adjust_lock);
> > address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > + spin_unlock(&vma->adjust_lock);
> > /* page should be within @vma mapping range */
> > return -EFAULT;
> > }
> > + spin_unlock(&vma->adjust_lock);
> > return address;
> > }
> >
>
> Nice Catch, Kame. :)
>
> For further optimization, we can hold vma->adjust_lock if vma_address
> returns -EFAULT. But I hope we redesigns it without new locking.
> But I don't have good idea, now. :(

How about this?
I just merged the ideas of Mel and Kame. :)

It just shows the concept, not a formal patch.


diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61ea742 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -619,7 +621,8 @@ again: remove_next = 1 + (end > next->vm_end);

if (mapping)
spin_unlock(&mapping->i_mmap_lock);
-
+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 3a53d9f..8075057 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1359,9 +1359,22 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
+ struct anon_vma *tmp_anon_vma = vma->anon_vma;
+ unsigned long address;
+ int tmp_vma_lock = 0;
+
+ if (tmp_anon_vma != anon_vma) {
+ spin_lock(&tmp_anon_vma->lock);
+ tmp_vma_lock = 1;
+ }
+ address = vma_address(page, vma);
+ if (address == -EFAULT) {
+ if (tmp_vma_lock)
+ spin_unlock(&tmp_anon_vma->lock);
continue;
+ }
+ if (tmp_vma_lock)
+ spin_unlock(&tmp_anon_vma->lock);
ret = rmap_one(page, vma, address, arg);
if (ret != SWAP_AGAIN)
break;
--
1.7.0.5



--
Kind regards,
Minchan Kim

by Christoph Lameter

Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010, Minchan Kim wrote:

> For further optimization, we can hold vma->adjust_lock if vma_address
> returns -EFAULT. But I hope we redesigns it without new locking.
> But I don't have good idea, now. :(

You could make it atomic through the use of RCU.

Create a new vma entry with the changed parameters and then atomically
switch to the new vma.

Problem is that you have some list_heads in there.
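
A purely hypothetical sketch of the idea - RCU-manage only the fields
that rmap cares about instead of the whole vma, which also sidesteps the
list_heads (struct vma_span, the vma->span field and these helpers do
not exist in the kernel):

struct vma_span {
	unsigned long	vm_start;
	unsigned long	vm_end;
	pgoff_t		vm_pgoff;
	struct rcu_head	rcu;
};

static void vma_span_free(struct rcu_head *head)
{
	kfree(container_of(head, struct vma_span, rcu));
}

/* Writer side, called from vma_adjust() with mmap_sem held for write */
static int vma_span_update(struct vm_area_struct *vma, unsigned long start,
			   unsigned long end, pgoff_t pgoff)
{
	struct vma_span *old = vma->span;	/* hypothetical field */
	struct vma_span *new = kmalloc(sizeof(*new), GFP_KERNEL);

	if (!new)
		return -ENOMEM;
	new->vm_start = start;
	new->vm_end = end;
	new->vm_pgoff = pgoff;
	rcu_assign_pointer(vma->span, new);
	call_rcu(&old->rcu, vma_span_free);
	return 0;
}

/* Reader side, what vma_address() would become */
static unsigned long vma_span_address(struct vm_area_struct *vma, pgoff_t pgoff)
{
	struct vma_span *span;
	unsigned long address;

	rcu_read_lock();
	span = rcu_dereference(vma->span);
	address = span->vm_start + ((pgoff - span->vm_pgoff) << PAGE_SHIFT);
	if (address < span->vm_start || address >= span->vm_end)
		address = -EFAULT;
	rcu_read_unlock();
	return address;
}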

2010-04-22 15:40:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 11:18:14PM +0900, Minchan Kim wrote:
> On Thu, Apr 22, 2010 at 11:14 PM, Mel Gorman <[email protected]> wrote:
> > On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 22 Apr 2010 19:31:06 +0900
> >> KAMEZAWA Hiroyuki <[email protected]> wrote:
> >>
> >> > On Thu, 22 Apr 2010 19:13:12 +0900
> >> > Minchan Kim <[email protected]> wrote:
> >> >
> >> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> >> > > <[email protected]> wrote:
> >> >
> >> > > > Hmm..in my test, the case was.
> >> > > >
> >> > > > Before try_to_unmap:
> >> > > >        mapcount=1, SwapCache, remap_swapcache=1
> >> > > > After remap
> >> > > >        mapcount=0, SwapCache, rc=0.
> >> > > >
> >> > > > So, I think there may be some race in rmap_walk() and vma handling or
> >> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> >> > > >
> >> > > > Hmm..it seems this kind patch will be required for debug.
> >> > >
> >>
> >> Ok, here is my patch for _fix_. But still testing...
> >> Running well at least for 30 minutes, where I can see bug in 10minutes.
> >> But this patch is too naive. please think about something better fix.
> >>
> >> ==
> >> From: KAMEZAWA Hiroyuki <[email protected]>
> >>
> >> At adjust_vma(), vma's start address and pgoff is updated under
> >> write lock of mmap_sem. This means the vma's rmap information
> >> update is atoimic only under read lock of mmap_sem.
> >>
> >>
> >> Even if it's not atomic, in usual case, try_to_ummap() etc...
> >> just fails to decrease mapcount to be 0. no problem.
> >>
> >> But at page migration's rmap_walk(), it requires to know all
> >> migration_entry in page tables and recover mapcount.
> >>
> >> So, this race in vma's address is critical. When rmap_walk meet
> >> the race, rmap_walk will mistakenly get -EFAULT and don't call
> >> rmap_one(). This patch adds a lock for vma's rmap information.
> >> But, this is _very slow_.
> >
> > Ok wow. That is exceptionally well-spotted. This looks like a proper bug
> > that compaction exposes as opposed to a bug that compaction introduces.
> >
> >> We need something sophisitcated, light-weight update for this..
> >>
> >
> > In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
> > the duration of the update and is taken elsewhere where the VMA information
> > is read such as rmap_walk_file()
> >
> > In the event the VMA is anon, vma_adjust currently talks no locks and your
> > patch introduces a new one but why not use the anon_vma lock here? Am I
> > missing something that requires the new lock?
>
> rmap_walk_anon doesn't hold vma's anon_vma->lock.
> It holds page->anon_vma->lock.
>

Of course, thank you for pointing out my error. With multiple
anon_vma's, the locking is a bit of a mess. We cannot hold spinlocks on
two vma's in the same list at the same time without potentially causing
a livelock. The problem becomes how we can safely drop one anon_vma and
acquire the other without them disappearing from under us.

See the XXX mark in the following incomplete patch for example. It's
incomplete because the list traversal is also not safe once the lock has
been dropped and -EFAULT is returned by vma_address.

==== CUT HERE ====
mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk

vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch takes the anon_vma->lock during vma_adjust to avoid such
races. During rmap_walk, the page anon_vma is locked but as it walks the
VMA list, it'll lock the VMA->anon_vma if they differ as well.

---
mm/mmap.c | 6 ++++++
mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..1ea0cae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1358,7 +1358,7 @@ int try_to_munlock(struct page *page)
static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *page_avma;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;

@@ -1368,20 +1368,52 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
- anon_vma = page_anon_vma(page);
- if (!anon_vma)
+ page_avma = page_anon_vma(page);
+ if (!page_avma)
return ret;
- spin_lock(&anon_vma->lock);
- list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+ spin_lock(&page_avma->lock);
+restart:
+ list_for_each_entry(avc, &page_avma->head, same_anon_vma) {
+ struct anon_vma *vma_avma;
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
- continue;
+ if (address == -EFAULT) {
+ /*
+ * If the pages anon_vma and the VMAs anon_vma differ,
+ * vma_address was called without the lock being held
+ * but we cannot hold more than one lock on the anon_vma
+ * list at a time without potentially causing a livelock.
+ * Drop the page anon_vma lock, acquire the vma one and
+ * then restart the whole operation
+ */
+ if (vma->anon_vma != page_avma) {
+ vma_avma = vma->anon_vma;
+ spin_unlock(&page_avma->lock);
+
+ /*
+ * XXX: rcu_read_lock will ensure that the
+ * anon_vma still exists but how can we be
+ * sure it has not been freed and reused?
+ */
+ spin_lock(&vma_avma->lock);
+ address = vma_address(page, vma);
+ spin_unlock(&vma_avma->lock);
+
+ /* page_avma with elevated external_refcount exists */
+ spin_lock(&page_avma->lock);
+ if (address == -EFAULT)
+ continue;
+ }
+ }
ret = rmap_one(page, vma, address, arg);
if (ret != SWAP_AGAIN)
break;
+
+ /* Restart the whole list walk if the lock was dropped */
+ if (vma_avma)
+ goto restart;
}
- spin_unlock(&anon_vma->lock);
+ spin_unlock(&page_avma->lock);
return ret;
}

2010-04-22 15:45:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 11:40:06PM +0900, Minchan Kim wrote:
> On Thu, 2010-04-22 at 23:23 +0900, Minchan Kim wrote:
> > On Thu, 2010-04-22 at 19:51 +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 22 Apr 2010 19:31:06 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > > > On Thu, 22 Apr 2010 19:13:12 +0900
> > > > Minchan Kim <[email protected]> wrote:
> > > >
> > > > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > > > > <[email protected]> wrote:
> > > >
> > > > > > Hmm..in my test, the case was.
> > > > > >
> > > > > > Before try_to_unmap:
> > > > > > mapcount=1, SwapCache, remap_swapcache=1
> > > > > > After remap
> > > > > > mapcount=0, SwapCache, rc=0.
> > > > > >
> > > > > > So, I think there may be some race in rmap_walk() and vma handling or
> > > > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > > > > >
> > > > > > Hmm..it seems this kind patch will be required for debug.
> > > > >
> > >
> > > Ok, here is my patch for _fix_. But still testing...
> > > Running well at least for 30 minutes, where I can see bug in 10minutes.
> > > But this patch is too naive. please think about something better fix.
> > >
> > > ==
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > At adjust_vma(), vma's start address and pgoff is updated under
> > > write lock of mmap_sem. This means the vma's rmap information
> > > update is atoimic only under read lock of mmap_sem.
> > >
> > >
> > > Even if it's not atomic, in usual case, try_to_ummap() etc...
> > > just fails to decrease mapcount to be 0. no problem.
> > >
> > > But at page migration's rmap_walk(), it requires to know all
> > > migration_entry in page tables and recover mapcount.
> > >
> > > So, this race in vma's address is critical. When rmap_walk meet
> > > the race, rmap_walk will mistakenly get -EFAULT and don't call
> > > rmap_one(). This patch adds a lock for vma's rmap information.
> > > But, this is _very slow_.
> > > We need something sophisitcated, light-weight update for this..
> > >
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > > ---
> > > include/linux/mm_types.h | 1 +
> > > kernel/fork.c | 1 +
> > > mm/mmap.c | 11 ++++++++++-
> > > mm/rmap.c | 3 +++
> > > 4 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
> > > +++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
> > > @@ -183,6 +183,7 @@ struct vm_area_struct {
> > > #ifdef CONFIG_NUMA
> > > struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> > > #endif
> > > + spinlock_t adjust_lock;
> > > };
> > >
> > > struct core_thread {
> > > Index: linux-2.6.34-rc4-mm1/mm/mmap.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
> > > +++ linux-2.6.34-rc4-mm1/mm/mmap.c
> > > @@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
> > > if (adjust_next)
> > > vma_prio_tree_remove(next, root);
> > > }
> > > -
> > > + /*
> > > + * changing all params in atomic. If not, vma_address in rmap.c
> > > + * can see wrong result.
> > > + */
> > > + spin_lock(&vma->adjust_lock);
> > > vma->vm_start = start;
> > > vma->vm_end = end;
> > > vma->vm_pgoff = pgoff;
> > > + spin_unlock(&vma->adjust_lock);
> > > if (adjust_next) {
> > > + spin_lock(&next->adjust_lock);
> > > next->vm_start += adjust_next << PAGE_SHIFT;
> > > next->vm_pgoff += adjust_next;
> > > + spin_unlock(&next->adjust_lock);
> > > }
> > >
> > > if (root) {
> > > @@ -1939,6 +1946,7 @@ static int __split_vma(struct mm_struct
> > > *new = *vma;
> > >
> > > INIT_LIST_HEAD(&new->anon_vma_chain);
> > > + spin_lock_init(&new->adjust_lock);
> > >
> > > if (new_below)
> > > new->vm_end = addr;
> > > @@ -2338,6 +2346,7 @@ struct vm_area_struct *copy_vma(struct v
> > > if (IS_ERR(pol))
> > > goto out_free_vma;
> > > INIT_LIST_HEAD(&new_vma->anon_vma_chain);
> > > + spin_lock_init(&new_vma->adjust_lock);
> > > if (anon_vma_clone(new_vma, vma))
> > > goto out_free_mempol;
> > > vma_set_policy(new_vma, pol);
> > > Index: linux-2.6.34-rc4-mm1/kernel/fork.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/kernel/fork.c
> > > +++ linux-2.6.34-rc4-mm1/kernel/fork.c
> > > @@ -350,6 +350,7 @@ static int dup_mmap(struct mm_struct *mm
> > > goto fail_nomem;
> > > *tmp = *mpnt;
> > > INIT_LIST_HEAD(&tmp->anon_vma_chain);
> > > + spin_lock_init(&tmp->adjust_lock);
> > > pol = mpol_dup(vma_policy(mpnt));
> > > retval = PTR_ERR(pol);
> > > if (IS_ERR(pol))
> > > Index: linux-2.6.34-rc4-mm1/mm/rmap.c
> > > ===================================================================
> > > --- linux-2.6.34-rc4-mm1.orig/mm/rmap.c
> > > +++ linux-2.6.34-rc4-mm1/mm/rmap.c
> > > @@ -332,11 +332,14 @@ vma_address(struct page *page, struct vm
> > > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > > unsigned long address;
> > >
> > > + spin_lock(&vma->adjust_lock);
> > > address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > > + spin_unlock(&vma->adjust_lock);
> > > /* page should be within @vma mapping range */
> > > return -EFAULT;
> > > }
> > > + spin_unlock(&vma->adjust_lock);
> > > return address;
> > > }
> > >
> >
> > Nice Catch, Kame. :)
> >
> > For further optimization, we can hold vma->adjust_lock if vma_address
> > returns -EFAULT. But I hope we redesigns it without new locking.
> > But I don't have good idea, now. :(
>
> How about this?
> I just merged ideas of Mel and Kame.:)
>
> It just shows the concept, not formal patch.
>
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..61ea742 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -578,6 +578,8 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + if (vma->anon_vma)
> + spin_lock(&vma->anon_vma->lock);
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -619,7 +621,8 @@ again: remove_next = 1 + (end > next->vm_end);
>
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
> -
> + if (vma->anon_vma)
> + spin_unlock(&vma->anon_vma->lock);
> if (remove_next) {
> if (file) {
> fput(file);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3a53d9f..8075057 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,9 +1359,22 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> spin_lock(&anon_vma->lock);
> list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> struct vm_area_struct *vma = avc->vma;
> - unsigned long address = vma_address(page, vma);
> - if (address == -EFAULT)
> + struct anon_vma *tmp_anon_vma = vma->anon_vma;
> + unsigned long address;
> + int tmp_vma_lock = 0;
> +
> + if (tmp_anon_vma != anon_vma) {
> + spin_lock(&tmp_anon_vma->lock);
> + tmp_vma_lock = 1;
> + }

heh, I thought of a similar approach at the same time as you but missed
this mail until later. However, with this approach I suspect there is a
possibility that two walkers of the same anon_vma list could livelock if
two locks on the list are held at the same time. I'm still thinking of
how it could be resolved without introducing new locking.

> + address = vma_address(page, vma);
> + if (address == -EFAULT) {
> + if (tmp_vma_lock)
> + spin_unlock(&tmp_anon_vma->lock);
> continue;
> + }
> + if (tmp_vma_lock)
> + spin_unlock(&tmp_anon_vma->lock);
> ret = rmap_one(page, vma, address, arg);
> if (ret != SWAP_AGAIN)
> break;
> --
> 1.7.0.5
>
>
>
> --
> Kind regards,
> Minchan Kim
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-22 16:14:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 04:40:03PM +0100, Mel Gorman wrote:
> On Thu, Apr 22, 2010 at 11:18:14PM +0900, Minchan Kim wrote:
> > On Thu, Apr 22, 2010 at 11:14 PM, Mel Gorman <[email protected]> wrote:
> > > On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
> > >> On Thu, 22 Apr 2010 19:31:06 +0900
> > >> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >>
> > >> > On Thu, 22 Apr 2010 19:13:12 +0900
> > >> > Minchan Kim <[email protected]> wrote:
> > >> >
> > >> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > >> > > <[email protected]> wrote:
> > >> >
> > >> > > > Hmm..in my test, the case was.
> > >> > > >
> > >> > > > Before try_to_unmap:
> > >> > > >        mapcount=1, SwapCache, remap_swapcache=1
> > >> > > > After remap
> > >> > > >        mapcount=0, SwapCache, rc=0.
> > >> > > >
> > >> > > > So, I think there may be some race in rmap_walk() and vma handling or
> > >> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > >> > > >
> > >> > > > Hmm..it seems this kind patch will be required for debug.
> > >> > >
> > >>
> > >> Ok, here is my patch for _fix_. But still testing...
> > >> Running well at least for 30 minutes, where I can see bug in 10minutes.
> > >> But this patch is too naive. please think about something better fix.
> > >>
> > >> ==
> > >> From: KAMEZAWA Hiroyuki <[email protected]>
> > >>
> > >> At adjust_vma(), vma's start address and pgoff is updated under
> > >> write lock of mmap_sem. This means the vma's rmap information
> > >> update is atoimic only under read lock of mmap_sem.
> > >>
> > >>
> > >> Even if it's not atomic, in usual case, try_to_ummap() etc...
> > >> just fails to decrease mapcount to be 0. no problem.
> > >>
> > >> But at page migration's rmap_walk(), it requires to know all
> > >> migration_entry in page tables and recover mapcount.
> > >>
> > >> So, this race in vma's address is critical. When rmap_walk meet
> > >> the race, rmap_walk will mistakenly get -EFAULT and don't call
> > >> rmap_one(). This patch adds a lock for vma's rmap information.
> > >> But, this is _very slow_.
> > >
> > > Ok wow. That is exceptionally well-spotted. This looks like a proper bug
> > > that compaction exposes as opposed to a bug that compaction introduces.
> > >
> > >> We need something sophisitcated, light-weight update for this..
> > >>
> > >
> > > In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
> > > the duration of the update and is taken elsewhere where the VMA information
> > > is read such as rmap_walk_file()
> > >
> > > In the event the VMA is anon, vma_adjust currently talks no locks and your
> > > patch introduces a new one but why not use the anon_vma lock here? Am I
> > > missing something that requires the new lock?
> >
> > rmap_walk_anon doesn't hold vma's anon_vma->lock.
> > It holds page->anon_vma->lock.
> >
>
> Of course, thank you for pointing out my error. With multiple
> anon_vma's, the locking is a bit of a mess. We cannot hold spinlocks on
> two vma's in the same list at the same time without potentially causing
> a livelock.

Incidentally, I now belatedly see why Kamezawa introduced a new lock. I
assume it was to get around this mess.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

by Christoph Lameter

Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010, Mel Gorman wrote:

> vma_adjust() is updating anon VMA information without any locks taken.
> In constract, file-backed mappings use the i_mmap_lock. This lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a
> BUG_ON to trigger when the page is faulted in.

Isn't this also a race with reclaim / swap?

2010-04-22 19:37:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 04:40:04PM +0100, Mel Gorman wrote:
> On Thu, Apr 22, 2010 at 11:18:14PM +0900, Minchan Kim wrote:
> > On Thu, Apr 22, 2010 at 11:14 PM, Mel Gorman <[email protected]> wrote:
> > > On Thu, Apr 22, 2010 at 07:51:53PM +0900, KAMEZAWA Hiroyuki wrote:
> > >> On Thu, 22 Apr 2010 19:31:06 +0900
> > >> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >>
> > >> > On Thu, 22 Apr 2010 19:13:12 +0900
> > >> > Minchan Kim <[email protected]> wrote:
> > >> >
> > >> > > On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki
> > >> > > <[email protected]> wrote:
> > >> >
> > >> > > > Hmm..in my test, the case was.
> > >> > > >
> > >> > > > Before try_to_unmap:
> > >> > > >        mapcount=1, SwapCache, remap_swapcache=1
> > >> > > > After remap
> > >> > > >        mapcount=0, SwapCache, rc=0.
> > >> > > >
> > >> > > > So, I think there may be some race in rmap_walk() and vma handling or
> > >> > > > anon_vma handling. migration_entry isn't found by rmap_walk.
> > >> > > >
> > >> > > > Hmm..it seems this kind patch will be required for debug.
> > >> > >
> > >>
> > >> Ok, here is my patch for _fix_. But still testing...
> > >> Running well at least for 30 minutes, where I can see bug in 10minutes.
> > >> But this patch is too naive. please think about something better fix.
> > >>
> > >> ==
> > >> From: KAMEZAWA Hiroyuki <[email protected]>
> > >>
> > >> At adjust_vma(), vma's start address and pgoff is updated under
> > >> write lock of mmap_sem. This means the vma's rmap information
> > >> update is atoimic only under read lock of mmap_sem.
> > >>
> > >>
> > >> Even if it's not atomic, in usual case, try_to_ummap() etc...
> > >> just fails to decrease mapcount to be 0. no problem.
> > >>
> > >> But at page migration's rmap_walk(), it requires to know all
> > >> migration_entry in page tables and recover mapcount.
> > >>
> > >> So, this race in vma's address is critical. When rmap_walk meet
> > >> the race, rmap_walk will mistakenly get -EFAULT and don't call
> > >> rmap_one(). This patch adds a lock for vma's rmap information.
> > >> But, this is _very slow_.
> > >
> > > Ok wow. That is exceptionally well-spotted. This looks like a proper bug
> > > that compaction exposes as opposed to a bug that compaction introduces.
> > >
> > >> We need something sophisitcated, light-weight update for this..
> > >>
> > >
> > > In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
> > > the duration of the update and is taken elsewhere where the VMA information
> > > is read such as rmap_walk_file()
> > >
> > > In the event the VMA is anon, vma_adjust currently talks no locks and your
> > > patch introduces a new one but why not use the anon_vma lock here? Am I
> > > missing something that requires the new lock?
> >
> > rmap_walk_anon doesn't hold vma's anon_vma->lock.
> > It holds page->anon_vma->lock.
> >
>
> Of course, thank you for pointing out my error. With multiple
> anon_vma's, the locking is a bit of a mess. We cannot hold spinlocks on
> two vma's in the same list at the same time without potentially causing
> a livelock. The problem becomes how we can safely drop one anon_vma and
> acquire the other without them disappearing from under us.
>
> See the XXX mark in the following incomplete patch for example. It's
> incomplete because the list traversal is also not safe once the lock has
> been dropped and -EFAULT is returned by vma_address.
>

There is a simpler alternative, I guess. When the vma->anon_vma is different,
try to lock it. If the lock is uncontended, continue. If not, release the
page's anon_vma lock and start from the beginning - repeat until the list is
walked uncontended. This should avoid livelocking against other walkers.
I have the test running now for 30 minutes with no problems but will
leave it overnight and see what happens.

I tried the approach of having vma_adjust and rmap_walk always see
the same anon_vma but I couldn't devise a method. It doesn't seem
possible but I'm still getting to grips with the anon_vma_chain stuff.
Maybe Rik can spot a better way of doing this.

==== CUT HERE ====
mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch has vma_adjust() take the vma->anon_vma lock. When
rmap_walk_anon() is walking the anon_vma_chain list, it can encounter a
VMA with a different anon_vma. It cannot just take that lock because,
depending on the order of traversal of anon_vma_chains in other
processes, it could cause a livelock. Instead, it releases the page's
anon_vma lock and starts again from scratch until it can traverse
the full list uncontended.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mmap.c | 6 ++++++
mm/rmap.c | 21 ++++++++++++++++++++-
2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);

+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..59d5553 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1368,13 +1368,32 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
* are holding mmap_sem. Users without mmap_sem are required to
* take a reference count to prevent the anon_vma disappearing
*/
+retry:
anon_vma = page_anon_vma(page);
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * Guard against livelocks by not spinning against
+ * vma->anon_vma->lock. If contention is found, release our lock and
+ * try again until VMA list can be traversed without worrying about
+ * the details of the VMA changing underneath us.
+ */
+ if (anon_vma != vma->anon_vma) {
+ if (!spin_trylock(&vma->anon_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto retry;
+ }
+ }
+
+ address = vma_address(page, vma);
+ if (anon_vma != vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (address == -EFAULT)
continue;
ret = rmap_one(page, vma, address, arg);

2010-04-22 23:56:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, 22 Apr 2010 14:40:46 -0500 (CDT)
Christoph Lameter <[email protected]> wrote:

> On Thu, 22 Apr 2010, Mel Gorman wrote:
>
> > vma_adjust() is updating anon VMA information without any locks taken.
> > In constract, file-backed mappings use the i_mmap_lock. This lack of
> > locking can result in races with page migration. During rmap_walk(),
> > vma_address() can return -EFAULT for an address that will soon be valid.
> > This leaves a dangling migration PTE behind which can later cause a
> > BUG_ON to trigger when the page is faulted in.
>
> Isnt this also a race with reclaim / swap?
>
Yes, it's also a race in reclaim/swap ...
page_referenced()
try_to_unmap()
rmap_walk() <==== we hit this case.

But the above 2 are not considered to be critical.

I'm not sure how this race affects KSM.

Thanks,
-Kame

2010-04-23 03:39:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Thu, Apr 22, 2010 at 10:14:04AM -0500, Christoph Lameter wrote:
> On Thu, 22 Apr 2010, Minchan Kim wrote:
>
> > For further optimization, we can hold vma->adjust_lock if vma_address
> > returns -EFAULT. But I hope we redesigns it without new locking.
> > But I don't have good idea, now. :(
>
> You could make it atomic through the use of RCU.
>
> Create a new vma entry with the changed parameters and then atomically
> switch to the new vma.
>
> Problem is that you have some list_heads in there.

Indeed, it would be necessary to update -all- pointers to the old
vma entry to point to the new vma entry. The question at that point
will be "is it OK to switch the pointers over one at a time?"

In many situations, the answer is "yes", but it is necessary to check
carefully.

Thanx, Paul

2010-04-23 04:55:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

Hi, Christoph.

On Fri, Apr 23, 2010 at 12:14 AM, Christoph Lameter <[email protected]> wrote:
> On Thu, 22 Apr 2010, Minchan Kim wrote:
>
>> For further optimization, we can hold vma->adjust_lock if vma_address
>> returns -EFAULT. But I hope we redesigns it without new locking.
>> But I don't have good idea, now. :(
>
> You could make it atomic through the use of RCU.
>
> Create a new vma entry with the changed parameters and then atomically
> switch to the new vma.
> Problem is that you have some list_heads in there.

That's a good idea if we can do it _simply_.
That's because there is a lot of confusion around anon_vma and vma handling nowadays.
(http://thread.gmane.org/gmane.linux.kernel/969907)
So I hope we can solve the problem without rather complicated RCU locking
if it isn't a critical path.

--
Kind regards,
Minchan Kim

2010-04-23 09:03:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 14:40:46 -0500 (CDT)
> Christoph Lameter <[email protected]> wrote:
>
> > On Thu, 22 Apr 2010, Mel Gorman wrote:
> >
> > > vma_adjust() is updating anon VMA information without any locks taken.
> > > In constract, file-backed mappings use the i_mmap_lock. This lack of
> > > locking can result in races with page migration. During rmap_walk(),
> > > vma_address() can return -EFAULT for an address that will soon be valid.
> > > This leaves a dangling migration PTE behind which can later cause a
> > > BUG_ON to trigger when the page is faulted in.
> >
> > Isnt this also a race with reclaim / swap?
> >
> Yes, it's also race in reclaim/swap ...
> page_referenced()
> try_to_unmap().
> rmap_walk() <==== we hit this case.
>
> But above 2 are not considered to be critical.
>
> I'm not sure how this race affect KSM.
>

I'm not that familiar with KSM but took a look through. Mostly,
accessing the VMA is protected by the mmap_sem with the exception of
rmap_walk_ksm. It needs similar protection for accessing the VMA as
rmap_walk_anon does.

Specifically, this part

list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;

needs to acquire the vma->anon_vma lock if it differs or in your case
call something similar to vma_address_safe.
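
Something like the following, perhaps - only a sketch of the same
trylock-and-retry idea applied to the rmap_walk_ksm() loop quoted above
(a fragment in its surrounding context, untested against KSM):

again:
	spin_lock(&anon_vma->lock);
	list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
		vma = vmac->vma;

		/* Guard the range check against a racing vma_adjust() */
		if (vma->anon_vma != anon_vma &&
		    !spin_trylock(&vma->anon_vma->lock)) {
			spin_unlock(&anon_vma->lock);
			goto again;
		}

		if (rmap_item->address < vma->vm_start ||
		    rmap_item->address >= vma->vm_end) {
			if (vma->anon_vma != anon_vma)
				spin_unlock(&vma->anon_vma->lock);
			continue;
		}
		if (vma->anon_vma != anon_vma)
			spin_unlock(&vma->anon_vma->lock);

		ret = rmap_one(page, vma, rmap_item->address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&anon_vma->lock);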

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-23 18:32:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

Hi Mel,

On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> heh, I thought of a similar approach at the same time as you but missed
> this mail until later. However, with this approach I suspect there is a
> possibility that two walkers of the same anon_vma list could livelock if
> two locks on the list are held at the same time. Am still thinking of
> how it could be resolved without introducing new locking.

I'm trying to understand this issue and I have some questions. These
vma_adjust and lock-inversion troubles with the anon-vma lock in
rmap_walk are a new issue introduced by the recent anon-vma changes,
right?

About swapcache, try_to_unmap just nukes the mappings, establishes the
swap entry in the pte (not a migration entry), and then there's no need
to call remove_migration_ptes. So it just needs to be skipped for
swapcache. page_mapped must return zero after try_to_unmap returns
before we're allowed to migrate (plus the page count must be just 1
and not 2 or more for gup users!).

I don't get what the problem is with swapcache and the races connected
to it; the moment I hear migration PTE in the context of swapcache
migration I'm confused, because there's no migration PTE for swapcache.

The new page will have no mappings either, it just needs to be part of
the swapcache with the same page->index = swapentry, indexed in the
radix tree with that page->index, and page->mapping pointing to
swapcache. Then new page faults will bring it into the pagetables. The
lookup_swap_cache has to be serialized against some lock, it should be
the radix tree lock? So the migration has to happen with that lock
held, no? We can't just migrate swapcache without stopping swapcache
radix tree lookups, no? I didn't digest the full migrate.c yet and I
don't see where it happens. Freeing the swapcache, while simpler and
safer, would be quite bad as it'd create I/O for potentially hot RAM.

About the refcounting of the anon-vma in migrate.c, I think it'd be much
simpler if zap_page_range and folks would just wait (like they do if
they find a pmd_trans_huge && pmd_trans_splitting pmd); there would be
no need to refcount the anon-vma that way.

I assume whatever is added to rmap_walk I also have to add to
split_huge_page later when switching to mainline anon-vma code (for
now I stick to 2.6.32 anon-vma code to avoid debugging anon-vma-chain,
memory compaction, swapcache migration and transparent hugepage at the
same time, which becomes a little beyond feasibility).

2010-04-23 19:23:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > heh, I thought of a similar approach at the same time as you but missed
> > this mail until later. However, with this approach I suspect there is a
> > possibility that two walkers of the same anon_vma list could livelock if
> > two locks on the list are held at the same time. Am still thinking of
> > how it could be resolved without introducing new locking.
>
> Trying to understand this issue and I've some questions. This
> vma_adjust and lock inversion troubles with the anon-vma lock in
> rmap_walk are a new issue introduced by the recent anon-vma changes,
> right?
>

In a manner of speaking. There was no locking going on but prior to the
anon_vma changes, there would have been only one anon_vma lock and the
fix would be easier - just take the lock on anon_vma->lock while the
VMAs are being updated.

> About swapcache, try_to_unmap just nuke the mappings, establish the
> swap entry in the pte (not migration entry), and then there's no need
> to call remove_migration_ptes.

That would be an alternative for swapcache but it's not necessarily
where the problem is.

> So it just need to skip it for
> swapcache. page_mapped must return zero after try_to_unmap returns
> before we're allowed to migrate (plus the page count must be just 1
> and not 2 or more for gup users!).
>
> I don't get what's the problem about swapcache and the races connected
> to it, the moment I hear migration PTE in context of swapcache
> migration I'm confused because there's no migration PTE for swapcache.
>

That was a mistake on my part. The race appears to be between vma_adjust
changing the details of the VMA while rmap_walk is going on. It mistakenly
believes the vma no longer spans the address, gets -EFAULT from vma_address
and doesn't clean up the migration PTE. This is later encountered but the
page lock is no longer held and it bugs. An alternative would be to clean
up the migration PTE of unlocked pages on the assumption it was due to this
race but it's a bit sloppier.
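
For context, the walk in question looks roughly like this in 2.6.34-rc
(simplified from mm/rmap.c, so treat it as a sketch rather than an exact
copy). The continue on -EFAULT is where a dangling migration PTE can be
left behind if vma_adjust has moved vm_start/vm_pgoff under the walker:

static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
		struct vm_area_struct *, unsigned long, void *), void *arg)
{
	struct anon_vma *anon_vma;
	struct anon_vma_chain *avc;
	int ret = SWAP_AGAIN;

	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	spin_lock(&anon_vma->lock);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		/* vma appears not to span the address: skip it */
		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	spin_unlock(&anon_vma->lock);
	return ret;
}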

> The new page will have no mappings either, it just needs to be part of
> the swapcache with the same page->index = swapentry, indexed in the
> radix tree with that page->index, and paga->mapping pointing to
> swapcache. Then new page faults will bring it in the pageatables. The
> lookup_swap_cache has to be serialized against some lock, it should be
> the radix tree lock? So the migration has to happen with that lock
> hold no?

Think migrate_page_move_mapping() is what you're looking for? It takes
the mapping tree lock.

>We can't just migrate swapcache without stopping swapcache
> radix tree lookups no? I didn't digest the full migrate.c yet and I
> don't see where it happens. Freeing the swapcache while simpler and
> safer, would be quite bad as it'd create I/O for potentially hot-ram.
>
> About the refcounting of anon-vma in migrate.c I think it'd be much
> simpler if zap_page_range and folks would just wait (like they do if
> they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
> no need of refcounting the anon-vma that way.
>

I'm not getting what you're suggesting here. The refcount is to make
sure the anon_vma doesn't go away after the page mapcount reaches zero.
What are we waiting for?

> I assume whatever is added to rmap_walk I also have to add to
> split_huge_page later when switching to mainline anon-vma code (for
> now I stick to 2.6.32 anon-vma code to avoid debugging anon-vma-chain,
> memory compaction, swapcache migration and transparent hugepage at the
> same time, which becomes a little beyond feasibility).
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-23 19:40:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> >
> > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > heh, I thought of a similar approach at the same time as you but missed
> > > this mail until later. However, with this approach I suspect there is a
> > > possibility that two walkers of the same anon_vma list could livelock if
> > > two locks on the list are held at the same time. Am still thinking of
> > > how it could be resolved without introducing new locking.
> >
> > Trying to understand this issue and I've some questions. This
> > vma_adjust and lock inversion troubles with the anon-vma lock in
> > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > right?
> >
>
> In a manner of speaking. There was no locking going on but prior to the
> anon_vma changes, there would have been only one anon_vma lock and the
> fix would be easier - just take the lock on anon_vma->lock while the
> VMAs are being updated.

So it was very much a bug before too and we could fail to find some
pte mapping the page if vm_start was adjusted?

Also note, expand_downwards also moves vm_start with only the
anon_vma->lock as it has to serialize against other expand_downwards
and the rmap_walk. But expand_downwards takes the lock and it was safe
before.

Also for swapping, even if things screw up it's no big deal, because it
will just skip, but migration has to find all ptes in
remove_migration_ptes and try_to_unmap also has to unmap everything.

In the split_huge_page case even the ordering at which newly allocated
vmas for the child are queued is critical, they've to be put at the
end of the list to be safe (otherwise do_wp_huge_page may not wait and
we may fail to mark the huge_pmd in the child as pmd_splitting).

> > About swapcache, try_to_unmap just nuke the mappings, establish the
> > swap entry in the pte (not migration entry), and then there's no need
> > to call remove_migration_ptes.
>
> That would be an alternative for swapcache but it's not necessarily
> where the problem is.

Hmm try_to_unmap already nukes all swap entries without creating any
migration pte for swapcache as far as I can tell.

> > So it just need to skip it for
> > swapcache. page_mapped must return zero after try_to_unmap returns
> > before we're allowed to migrate (plus the page count must be just 1
> > and not 2 or more for gup users!).
> >
> > I don't get what's the problem about swapcache and the races connected
> > to it, the moment I hear migration PTE in context of swapcache
> > migration I'm confused because there's no migration PTE for swapcache.
> >
>
> That was a mistake on my part. The race appears to be between vma_adjust
> changing the details of the VMA while rmap_walk is going on. It mistakenly
> believes the vma no longer spans the address, gets -EFAULT from vma_address
> and doesn't clean up the migration PTE. This is later encountered but the
> page lock is no longer held and it bugs. An alternative would be to clean
> up the migration PTE of unlocked pages on the assumption it was due to this
> race but it's a bit sloppier.

Agreed, it's sure better to close the race... the other may have even
more implications. It's good to retain the invariant that when a
migration PTE exists the page also still exists and it's locked
(locked really mostly matters for pagecache I guess, but it's ok).

> > The new page will have no mappings either, it just needs to be part of
> > the swapcache with the same page->index = swapentry, indexed in the
> > radix tree with that page->index, and paga->mapping pointing to
> > swapcache. Then new page faults will bring it in the pageatables. The
> > lookup_swap_cache has to be serialized against some lock, it should be
> > the radix tree lock? So the migration has to happen with that lock
> > hold no?
>
> Think migrate_page_move_mapping() is what you're looking for? It takes
> the mapping tree lock.

Yep exactly!

> >We can't just migrate swapcache without stopping swapcache
> > radix tree lookups no? I didn't digest the full migrate.c yet and I
> > don't see where it happens. Freeing the swapcache while simpler and
> > safer, would be quite bad as it'd create I/O for potentially hot-ram.
> >
> > About the refcounting of anon-vma in migrate.c I think it'd be much
> > simpler if zap_page_range and folks would just wait (like they do if
> > they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
> > no need of refcounting the anon-vma that way.
> >
>
> I'm not getting what you're suggesting here. The refcount is to make
> sure the anon_vma doesn't go away after the page mapcount reaches zero.
> What are we waiting for?

Causing zap_page_range to wait for the end of migration when it encounters
migration ptes, instead of skipping them altogether by only releasing
the rss and doing nothing about them. If the pte can't go away, neither
can the mm, and so neither can the vma nor the anon-vma. I'm not
suggesting to change that, but I guess that's the way I would have done
it if I had implemented it; it would avoid the refcounting. Just like in
split_huge_page: I count the number of pmds marked splitting, I compare
it to the number of pmds that are converted from huge to not-huge, and I
compare that again with the page_mapcount. If the three numbers aren't
equal I BUG. It simply can't go wrong unnoticed that way. I can only do
that because I stop the zapping.

2010-04-23 21:36:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 09:39:48PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> > On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> > >
> > > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > > heh, I thought of a similar approach at the same time as you but missed
> > > > this mail until later. However, with this approach I suspect there is a
> > > > possibility that two walkers of the same anon_vma list could livelock if
> > > > two locks on the list are held at the same time. Am still thinking of
> > > > how it could be resolved without introducing new locking.
> > >
> > > Trying to understand this issue and I've some questions. This
> > > vma_adjust and lock inversion troubles with the anon-vma lock in
> > > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > > right?
> > >
> >
> > In a manner of speaking. There was no locking going on but prior to the
> > anon_vma changes, there would have been only one anon_vma lock and the
> > fix would be easier - just take the lock on anon_vma->lock while the
> > VMAs are being updated.
>
> So it was very much a bug before too and we could miss to find some
> pte mapping the page if vm_start was adjusted?

Well I looked deeper into it myself as I wanted to have this bit (and
other bits) sorted out in aa.git, and definitely this is a bug
introduced by the newest anon-vma changes in 2.6.34-rc so aa.git
cannot be affected as it's using the 2.6.33 anon-vma (and prev) code.

vma_adjust already takes the anon_vma->lock and of course I also
further verified that trying to apply your snippet to vma_adjust
results in an immediate deadlock, as the very same lock is already taken
in my tree since it's the same anon-vma (simpler). So aa.git will be
immune from these bugs for now.

2010-04-24 10:50:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 09:39:48PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> > On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> > >
> > > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > > heh, I thought of a similar approach at the same time as you but missed
> > > > this mail until later. However, with this approach I suspect there is a
> > > > possibility that two walkers of the same anon_vma list could livelock if
> > > > two locks on the list are held at the same time. Am still thinking of
> > > > how it could be resolved without introducing new locking.
> > >
> > > Trying to understand this issue and I've some questions. This
> > > vma_adjust and lock inversion troubles with the anon-vma lock in
> > > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > > right?
> > >
> >
> > In a manner of speaking. There was no locking going on but prior to the
> > anon_vma changes, there would have been only one anon_vma lock and the
> > fix would be easier - just take the lock on anon_vma->lock while the
> > VMAs are being updated.
>
> So it was very much a bug before too and we could miss to find some
> pte mapping the page if vm_start was adjusted?
>

I thought it was, but I was looking at an rc kernel instead of 2.6.33.
That looked as if it was safe, but it's not any more with the new
anon_vma scheme.

> Also note, expand_downwards also moves vm_start with only the
> anon_vma->lock as it has to serialize against other expand_downwards
> and the rmap_walk. But expand_downwards takes the lock and it was safe
> before.
>
> Also for swapping even if things screwup it's no big deal, because it
> will just skip, but migration has to find all ptes in
> remove_migration_ptes and try_to_unmap also has to unmap everything.
>

It either has to find them all or it has to be capable of a
lazy cleanup. I had a lazy cleanup patch but it was dropped because we
felt it should have been possible to properly lock this. I'm beginning
to think it can't, because there appear to be a few cases where the VM
doesn't care if it doesn't find all the mappings.

> In the split_huge_page case even the ordering at which newly allocated
> vmas for the child are queued is critical, they've to be put at the
> end of the list to be safe (otherwise do_wp_huge_page may not wait and
> we may fail to mark the huge_pmd in the child as pmd_splitting).
>

There might be a locking snarl there as well, then. I confess that
the details of transparent hugepage support have fallen back out of my
head within the last two weeks.

> > > About swapcache, try_to_unmap just nuke the mappings, establish the
> > > swap entry in the pte (not migration entry), and then there's no need
> > > to call remove_migration_ptes.
> >
> > That would be an alternative for swapcache but it's not necessarily
> > where the problem is.
>
> Hmm try_to_unmap already nukes all swap entries without creating any
> migration pte for swapcache as far as I can tell.
>

When a mapped swapcache is unmapped, a migration PTE is put in place. If
that was not the case, we wouldn't be hitting the bug in the first
place.

> > > So it just need to skip it for
> > > swapcache. page_mapped must return zero after try_to_unmap returns
> > > before we're allowed to migrate (plus the page count must be just 1
> > > and not 2 or more for gup users!).
> > >
> > > I don't get what's the problem about swapcache and the races connected
> > > to it, the moment I hear migration PTE in context of swapcache
> > > migration I'm confused because there's no migration PTE for swapcache.
> > >
> >
> > That was a mistake on my part. The race appears to be between vma_adjust
> > changing the details of the VMA while rmap_walk is going on. It mistakenly
> > believes the vma no longer spans the address, gets -EFAULT from vma_address
> > and doesn't clean up the migration PTE. This is later encountered but the
> > page lock is no longer held and it bugs. An alternative would be to clean
> > up the migration PTE of unlocked pages on the assumption it was due to this
> > race but it's a bit sloppier.
>
> Agreed, it's sure better to close the race... the other may have even
> more implications.

True, which is why I'm not keen on lazy cleanup.

> It's good to retain the invariant that when a
> migration PTE exists the page also still exists and it's locked
> (locked really mostly matters for pagecache I guess, but it's ok).
>
> > > The new page will have no mappings either, it just needs to be part of
> > > the swapcache with the same page->index = swapentry, indexed in the
> > > radix tree with that page->index, and paga->mapping pointing to
> > > swapcache. Then new page faults will bring it in the pageatables. The
> > > lookup_swap_cache has to be serialized against some lock, it should be
> > > the radix tree lock? So the migration has to happen with that lock
> > > hold no?
> >
> > Think migrate_page_move_mapping() is what you're looking for? It takes
> > the mapping tree lock.
>
> Yep exactly!
>
> > >We can't just migrate swapcache without stopping swapcache
> > > radix tree lookups no? I didn't digest the full migrate.c yet and I
> > > don't see where it happens. Freeing the swapcache while simpler and
> > > safer, would be quite bad as it'd create I/O for potentially hot-ram.
> > >
> > > About the refcounting of anon-vma in migrate.c I think it'd be much
> > > simpler if zap_page_range and folks would just wait (like they do if
> > > they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
> > > no need of refcounting the anon-vma that way.
> > >
> >
> > I'm not getting what you're suggesting here. The refcount is to make
> > sure the anon_vma doesn't go away after the page mapcount reaches zero.
> > What are we waiting for?
>
> Causing zap_page-range to Wait the end of migration when it encounters
> migration ptes instead of skipping them all together by only releasing
> the rss and doing nothing about them. If the pte can't go away, so the
> mm so the vma and so the anon-vma. I'm not suggesting to change that,
> but I guess that's the way I would have done if I would have
> implemented it, it'd avoid refcounting.

Maybe.

> Just like I did in
> split_huge_page I count the number of pmd marked splitting, and I
> compare it to the number of pmds that are converted from huge to
> not-huge and I compared that again with the page_mapcount. If the
> three numbers aren't equal I bug. It simply can't go wrong unnoticed
> that way. I only can do that because I stop the zapping.
>

That would be another way of doing it all right. Count how many
migration ptes we created, pass that to rmap_walk. If they don't match,
assume a race and do it again before the page is unlocked.
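
As a rough sketch of that counting idea (the *_counting helpers below do
not exist; they stand in for the existing walks extended to return how
many migration ptes they installed or removed):

/*
 * Hypothetical sketch only. The page stays locked throughout, so
 * retrying the rmap walk until every installed migration pte has been
 * found again is safe; a shortfall means a walker raced with vma_adjust.
 */
int installed, removed = 0;

installed = try_to_unmap_counting(page, TTU_MIGRATION);	/* assumed helper */
if (!page_mapped(page)) {
	/* ... copy the page contents and the mapping ... */
	do {
		removed += remove_migration_ptes_counting(page, newpage); /* assumed */
	} while (removed < installed);
}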

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-24 10:52:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Fri, Apr 23, 2010 at 11:35:49PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 23, 2010 at 09:39:48PM +0200, Andrea Arcangeli wrote:
> > On Fri, Apr 23, 2010 at 08:23:12PM +0100, Mel Gorman wrote:
> > > On Fri, Apr 23, 2010 at 08:31:35PM +0200, Andrea Arcangeli wrote:
> > > > Hi Mel,
> > > >
> > > > On Thu, Apr 22, 2010 at 04:44:43PM +0100, Mel Gorman wrote:
> > > > > heh, I thought of a similar approach at the same time as you but missed
> > > > > this mail until later. However, with this approach I suspect there is a
> > > > > possibility that two walkers of the same anon_vma list could livelock if
> > > > > two locks on the list are held at the same time. Am still thinking of
> > > > > how it could be resolved without introducing new locking.
> > > >
> > > > Trying to understand this issue and I've some questions. This
> > > > vma_adjust and lock inversion troubles with the anon-vma lock in
> > > > rmap_walk are a new issue introduced by the recent anon-vma changes,
> > > > right?
> > > >
> > >
> > > In a manner of speaking. There was no locking going on but prior to the
> > > anon_vma changes, there would have been only one anon_vma lock and the
> > > fix would be easier - just take the lock on anon_vma->lock while the
> > > VMAs are being updated.
> >
> > So it was very much a bug before too and we could miss to find some
> > pte mapping the page if vm_start was adjusted?
>
> Well I looked deeper into it myself as I wanted to have this bit (and
> other bits) sorted out in aa.git, and definitely this is a bug
> introduced by the newest anon-vma changes in 2.6.34-rc so aa.git
> cannot be affected as it's using the 2.6.33 anon-vma (and prev) code.
>

I think you're right. This is a new bug introduced by the anon_vma changes. On
the plus side, it means we don't have to worry about -stable.

> vma_adjust already takes the anon_vma->lock and of course I also
> further verified that trying to apply your snippet to vma_adjust
> results in immediately deadlock as the very same lock is already taken
> in my tree as it's the same anon-vma (simpler).

Yes, I expected that. Previously, there was only one anon_vma so if you
double-take the lock, bad things happen.

> So aa.git will be
> immune from these bugs for now.
>

It should be. I expect that's why you have never seen the bugon in
swapops.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-24 11:14:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> I think you're right. This is a new bug introduced by the anon_vma changes. On
> the plus side, it means we don't have to worry about -stable.

Correct, no worry about -stable.

> > vma_adjust already takes the anon_vma->lock and of course I also
> > further verified that trying to apply your snippet to vma_adjust
> > results in immediately deadlock as the very same lock is already taken
> > in my tree as it's the same anon-vma (simpler).
>
> Yes, I expected that. Previously, there was only one anon_vma so if you
> double-take the lock, bad things happen.
>
> > So aa.git will be
> > immune from these bugs for now.
> >
>
> It should be. I expect that's why you have never seen the bugon in
> swapops.

Correct, I've never seen it, and I keep it under very heavy stress with
swap storms of hugepages, lots of I/O and khugepaged at 100% cpu.

Also keep in mind expand_downwards which also adjusts
vm_start/vm_pgoff the same way (and without mmap_sem write mode).

2010-04-24 11:59:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Sat, Apr 24, 2010 at 01:13:40PM +0200, Andrea Arcangeli wrote:
> On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> > I think you're right. This is a new bug introduced by the anon_vma changes. On
> > the plus side, it means we don't have to worry about -stable.
>
> Correct, no worry about -stable.
>
> > > vma_adjust already takes the anon_vma->lock and of course I also
> > > further verified that trying to apply your snippet to vma_adjust
> > > results in immediately deadlock as the very same lock is already taken
> > > in my tree as it's the same anon-vma (simpler).
> >
> > Yes, I expected that. Previously, there was only one anon_vma so if you
> > double-take the lock, bad things happen.
> >
> > > So aa.git will be
> > > immune from these bugs for now.
> > >
> >
> > It should be. I expect that's why you have never seen the bugon in
> > swapops.
>
> Correct, I never seen it, and I keep it under very great stress with
> swap storms of hugepages, lots of I/O and khugepaged at 100% cpu.
>

Well, to me this is also good because it shows it's not an existing bug in
migration or a new bug introduced by compaction either. Previously I hadn't
seen this bug either but until relatively recently, the bulk of the testing
was against 2.6.33.

> Also keep in mind expand_downwards which also adjusts
> vm_start/vm_pgoff the same way (and without mmap_sem write mode).
>

Will keep it in mind. It's taking the anon_vma lock but once again,
there might be more than one anon_vma to worry about and the proper
locking still isn't massively clear to me.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-24 14:31:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Sat, Apr 24, 2010 at 12:59:37PM +0100, Mel Gorman wrote:
> Well, to me this is also good because it shows it's not an existing bug in
> migration or a new bug introduced by compaction either. Previously I hadn't
> seen this bug either but until relatively recently, the bulk of the testing
> was against 2.6.33.

I suggest testing again with aa.git as soon as I make a new release
with your v8 code (probably today). I didn't merge the swapcache
locked debug patch that tries to recover the thing after the fact. No
problem here with swapcache apparently, and your v8 plus the latest
numa-aware khugepaged code has been under stress for the last 12 hours.

Other than full numa awareness in all hugepage allocations and your v8
code, I added a generic document that needs review, and I plan to add a
config tweak to select whether the default for transparent hugepage is
madvise or always (never makes no sense other than for debugging
purposes; madvise already provides the guarantee of zero risk of
unintentional, and not guaranteed beneficial, memory waste).

> Will keep it in mind. It's taking the anon_vma lock but once again,
> there might be more than one anon_vma to worry about and the proper
> locking still isn't massively clear to me.

Yes, that's my point, same issue there.

2010-04-25 14:42:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> It should be. I expect that's why you have never seen the bugon in
> swapops.

Oh I just got the very crash you're talking about with aa.git with
your v8 code. Weird that I never reproduced it before! I think it's
because I fixed gcc to be fully backed by hugepages always (without
khugepaged) and I was rebuilding a couple of packages, and that now
triggers memory compaction much more, but mixed with heavy
fork/execve. This is the only instability I managed to reproduce over
24 hours of stress testing and it's clearly not related to transparent
hugepage support but it's either a bug in migrate.c (more likely) or
memory compaction.

Note that I'm running with the 2.6.33 anon-vma code, so it will
relieve you to know it's not the anon-vma recent changes causing this
(well I can't rule out anon-vma bugs, but if it's anon-vma, it's a
longstanding one).

kernel BUG at include/linux/swapops.h:105!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
CPU 0
Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr

Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246
RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffff8a53000 CR3: 0000000004f80000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process basename (pid: 13351, threadinfo ffff88009ab6e000, task ffff88009ab96c70)
Stack:
ffff8800aaa95280 ffffffff810ce472 ffff8801134a7ce8 0000000000000000
<0> 00000000142d1a3e ffffffff810c2e35 79f085e9c08a4db7 62d38944fd014000
<0> 76b07a274b0c057a ffffea00025649f8 f8000000000a168d d19934e84d2a74f3
Call Trace:
[<ffffffff810ce472>] ? page_add_new_anon_rmap+0x72/0xc0
[<ffffffff810c2e35>] ? handle_pte_fault+0x7a5/0x7d0
[<ffffffff8150506d>] ? do_page_fault+0x13d/0x420
[<ffffffff8150215f>] ? page_fault+0x1f/0x30
[<ffffffff81273bfb>] ? strnlen_user+0x4b/0x80
[<ffffffff81131f4e>] ? load_elf_binary+0x12be/0x1c80
[<ffffffff810f426d>] ? search_binary_handler+0xad/0x2c0
[<ffffffff810f5ce7>] ? do_execve+0x247/0x320
[<ffffffff8100ab16>] ? sys_execve+0x36/0x60
[<ffffffff8100314a>] ? stub_execve+0x6a/0xc0
Code: 5e ff ff ff 8d 41 01 89 4c 24 08 89 44 24 04 8b 74 24 04 8b 44 24 08 f0 0f b1 32 89 44 24 0c 8b 44 24 0c 39 c8 74 a4 89 c1 eb d1 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 49 89 d4
RIP [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
RSP <ffff88009ab6fa58>
---[ end trace 840ce8bc6f6dc402 ]---


It doesn't look like a coincidence the page that had the migration PTE
set was the argv in the user stack during execve. The bug has to be
there. Or maybe it's a coincidence and it will mislead us. If you've
other stack traces please post them so I can have more info (I'll post
more stack traces if I get them again, it doesn't look easy to
reproduce, supposedly the bug has always been there since the first
time I used memory compaction, and this is the first time I reproduce
it).

2010-04-26 21:55:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On 04/24/2010 07:59 AM, Mel Gorman wrote:
> On Sat, Apr 24, 2010 at 01:13:40PM +0200, Andrea Arcangeli wrote:

>> Also keep in mind expand_downwards which also adjusts
>> vm_start/vm_pgoff the same way (and without mmap_sem write mode).
>
> Will keep it in mind. It's taking the anon_vma lock but once again,
> there might be more than one anon_vma to worry about and the proper
> locking still isn't massively clear to me.

The locking for the anon_vma_chain->same_vma list is
essentially the same as what was used before in mmap
and anon_vma_prepare.

Either the mmap_sem is held for write, or the mmap_sem
is held for reading and the page_table_lock is held.

What exactly is the problem that migration is seeing?

--
All rights reversed

2010-04-26 22:11:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Mon, Apr 26, 2010 at 05:54:08PM -0400, Rik van Riel wrote:
> On 04/24/2010 07:59 AM, Mel Gorman wrote:
>> On Sat, Apr 24, 2010 at 01:13:40PM +0200, Andrea Arcangeli wrote:
>
>>> Also keep in mind expand_downwards which also adjusts
>>> vm_start/vm_pgoff the same way (and without mmap_sem write mode).
>>
>> Will keep it in mind. It's taking the anon_vma lock but once again,
>> there might be more than one anon_vma to worry about and the proper
>> locking still isn't massively clear to me.
>
> The locking for the anon_vma_chain->same_vma list is
> essentially the same as what was used before in mmap
> and anon_vma_prepare.
>
> Either the mmap_sem is held for write, or the mmap_sem
> is held for reading and the page_table_lock is held.
>
> What exactly is the problem that migration is seeing?
>

There are two problems.

Migration isn't holding the mmap_sem for write, for read or the pagetable
lock. It locks the page, unmaps it, puts a migration PTE in place that looks
like a swap entry, copies it and remaps it under the pagetable lock. At no
point does it hold the mmap_sem, but it needs to be sure it finds all the
migration ptes it created. Because there are multiple anon_vmas, the locking
is tricky and unclear. I have one patch that locks the anon_vmas as it finds
them but is prepared to start over in the event of contention.
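
Roughly, the lock-as-you-go idea looks like the sketch below. It is not
the actual patch; unlock_held_anon_vmas is a hypothetical helper and the
bookkeeping of which locks are held is omitted:

retry:
	anon_vma = page_anon_vma(page);
	spin_lock(&anon_vma->lock);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct anon_vma *av = avc->vma->anon_vma;

		if (av == anon_vma)
			continue;		/* already held */
		if (!spin_trylock(&av->lock)) {
			/* contention: back off so two walkers cannot livelock */
			unlock_held_anon_vmas();	/* hypothetical helper */
			spin_unlock(&anon_vma->lock);
			goto retry;
		}
	}
	/* walk the vmas with every anon_vma lock held, then drop them all */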

The second appears to be migration ptes that get copied during fork().
This is easier to handle.

I'm testing two patches at the moment and after 8 hours have seen no problem
even though the races are being detected (and handled). If it survives the
night, I'll post them.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-04-26 22:28:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Mon, Apr 26, 2010 at 11:11:03PM +0100, Mel Gorman wrote:
> Migration isn't holding the mmap_sem for write, for read or the pagetable
> lock. It locks the page, unmaps it, puts a migration PTE in place that looks
> like a swap entry, copies it and remaps it under the pagetable lock. At no
> point does it hold the mmap_sem, but it needs to be sure it finds all the
> migration pte it created. Because there are multiple anon_vma's, the locking
> is tricky and unclear. I have one patch that locks the anon_vmas as it finds
> them but is prepared to start over in the event of contention.

split_huge_page has the exact same requirements, except it is more
strict and it will stop zap_page_range and count that the same number
of pmds it marked as splitting are found again later.


Also note migration has the same "ordering" requirements for
anon_vma_link during fork, new vmas have to be appended at the end or
migration will choke (not going into the details of why, but I can if
you want). This should be safe in new anon-vma code as I already
pointed out this requirement to Rik for split_huge_page to be safe too.

I never tested split_huge_page on the fixed new anon-vma code (only
before the latest fixes, so with rc4 or so; I only know that before the
latest fixes it was triggering BUG_ON in split_huge_page, and I've got
enough BUG_ONs in there to be sure that if split_huge_page doesn't
BUG_ON, it's safe). I need to retry with the new anon-vma code...
split_huge_page never showed anything wrong with the 2.6.33 code that
I'm running on to reduce the variables in the equation.

> The second appears to be migration ptes that get copied during fork().
> This is easier to handle.

And this is also where the requirement that new vmas are added to the
end of the anon-vma lists comes from.

> I'm testing two patches at the moment and after 8 hours have seen no problem
> even though the races are being detected (and handled). If it survives the
> night, I'll post them.

I ran the same kernel as before again and I reproduced the crash in
migration_entry_wait swapops.h (page not locked) just once, when I
posted the stack trace, and never again. I wanted to compare stack
traces and see if it happens again. But that bug in
migration_entry_wait can't be related to the new anon-vma code because
I've backed it out from aa.git. Still, you have to figure out whether
your patch is fixing a real bug.

I'm just pointing out that if there's a bug in anon-vma
vma_adjust/expand_downwards, it is unrelated to the crash in swapops.h
migration_entry_wait. And obviously it's not a bug in the
transparent hugepage code either, as you also reproduced the same crash
without using aa.git, only with v8.

We need to fix the swapops.h bug with maximum priority... (and of
course the anon-vma bug too if it exists).

Other than that swapops.h crash in migrate, which you can also reproduce
with only mainline + memory compaction v8, I had zero other problems with
current aa.git.

2010-04-27 09:40:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Sun, Apr 25, 2010 at 04:41:13PM +0200, Andrea Arcangeli wrote:
> On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> > It should be. I expect that's why you have never seen the bugon in
> > swapops.
>
> Oh I just got the very crash you're talking about with aa.git with
> your v8 code. Weird that I never reproduced it before! I think it's
> because I fixed gcc to be fully backed by hugepages always (without
> khugepaged) and I was rebuilding a couple of packages, and that now
> triggers memory compaction much more, but mixed with heavy
> fork/execve. This is the only instability I managed to reproduce over
> 24 hours of stress testing and it's clearly not related to transparent
> hugepage support but it's either a bug in migrate.c (more likely) or
> memory compaction.
>
> Note that I'm running with the 2.6.33 anon-vma code, so it will
> relieve you to know it's not the anon-vma recent changes causing this
> (well I can't rule out anon-vma bugs, but if it's anon-vma, it's a
> longstanding one).
>
> kernel BUG at include/linux/swapops.h:105!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
> CPU 0
> Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr
>
> Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
> RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246
> RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
> RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
> RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
> R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
> R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
> FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007ffff8a53000 CR3: 0000000004f80000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process basename (pid: 13351, threadinfo ffff88009ab6e000, task ffff88009ab96c70)
> Stack:
> ffff8800aaa95280 ffffffff810ce472 ffff8801134a7ce8 0000000000000000
> <0> 00000000142d1a3e ffffffff810c2e35 79f085e9c08a4db7 62d38944fd014000
> <0> 76b07a274b0c057a ffffea00025649f8 f8000000000a168d d19934e84d2a74f3
> Call Trace:
> [<ffffffff810ce472>] ? page_add_new_anon_rmap+0x72/0xc0
> [<ffffffff810c2e35>] ? handle_pte_fault+0x7a5/0x7d0
> [<ffffffff8150506d>] ? do_page_fault+0x13d/0x420
> [<ffffffff8150215f>] ? page_fault+0x1f/0x30
> [<ffffffff81273bfb>] ? strnlen_user+0x4b/0x80
> [<ffffffff81131f4e>] ? load_elf_binary+0x12be/0x1c80
> [<ffffffff810f426d>] ? search_binary_handler+0xad/0x2c0
> [<ffffffff810f5ce7>] ? do_execve+0x247/0x320
> [<ffffffff8100ab16>] ? sys_execve+0x36/0x60
> [<ffffffff8100314a>] ? stub_execve+0x6a/0xc0
> Code: 5e ff ff ff 8d 41 01 89 4c 24 08 89 44 24 04 8b 74 24 04 8b 44 24 08 f0 0f b1 32 89 44 24 0c 8b 44 24 0c 39 c8 74 a4 89 c1 eb d1 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 49 89 d4
> RIP [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> RSP <ffff88009ab6fa58>
> ---[ end trace 840ce8bc6f6dc402 ]---
>
> It doesn't look like a coincidence the page that had the migration PTE
> set was the argv in the user stack during execve. The bug has to be
> there. Or maybe it's a coincidence and it will mislead us. If you've
> other stack traces please post them so I can have more info (I'll post
> more stack traces if I get them again, it doesn't look easy to
> reproduce, supposedly the bug has always been there since the first
> time I used memory compaction, and this is the first time I reproduce
> it).
>

The oopses I am getting look very similar. The page is encountered in
the stack while copying the arguments in. I don't think it's a
coincidence.

[17831.496941] ------------[ cut here ]------------
[17831.532517] kernel BUG at include/linux/swapops.h:105!
[17831.532517] invalid opcode: 0000 [#1] PREEMPT SMP
[17831.532517] last sysfs file: /sys/block/sde/size
[17831.532517] CPU 0
[17831.532517] Modules linked in: kvm_amd kvm dm_crypt loop evdev tpm_tis i2c_piix4 shpchp tpm wmi tpm_bios serio_raw i2c_core pci_hotplug processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci ide_pci_generic libata r8169 mii ide_core ehci_hcd scsi_mod ohci_hcd floppy thermal fan thermal_sys
[17831.532517]
[17831.532517] Pid: 31028, comm: make Not tainted 2.6.34-rc4-mm1-fix-swapops #2 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[17831.532517] RIP: 0010:[<ffffffff811094fb>] [<ffffffff811094fb>] migration_entry_wait+0xc1/0x129
[17831.532517] RSP: 0018:ffff88007ebfd9d8 EFLAGS: 00010246
[17831.532517] RAX: ffffea0000000000 RBX: ffffea0000199368 RCX: 00000000000389f0
[17831.532517] RDX: 0000000000199368 RSI: ffffffff81826558 RDI: 0000000000e9d63e
[17831.532517] RBP: ffff88007ebfda08 R08: ffff88007e334ec0 R09: ffff88007ebfd9e8
[17831.532517] R10: 00000000b411a2ca R11: 0000000000000246 R12: 0000000036f44000
[17831.532517] R13: ffff88007d164088 R14: f8000000000074eb R15: 0000000000e9d63e
[17831.532517] FS: 00002abc5c083f90(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[17831.532517] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17831.532517] CR2: 00007fff3fb8cd1c CR3: 000000007d12d000 CR4: 00000000000006f0
[17831.532517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17831.532517] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17831.532517] Process make (pid: 31028, threadinfo ffff88007ebfc000, task ffff88007e334ec0)
[17831.532517] Stack:
[17831.532517] ffff88007ebfdb98 ffff88007ebfda18 ffff88007ebfdbe8 0000000000000600
[17831.532517] <0> ffff880036f44c60 00007fff3fb8cd1c ffff88007ebfdaa8 ffffffff810e951a
[17831.532517] <0> ffff88007ebfda88 0000000000000246 0000000000000000 ffffffff8130c542
[17831.532517] Call Trace:
[17831.532517] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[17831.532517] [<ffffffff8130c542>] ? do_page_fault+0x26a/0x46e
[17831.532517] [<ffffffff8130c722>] do_page_fault+0x44a/0x46e
[17831.532517] [<ffffffff813086fd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[17831.532517] [<ffffffff81309935>] page_fault+0x25/0x30
[17831.532517] [<ffffffff811c1dc7>] ? strnlen_user+0x3f/0x57
[17831.532517] [<ffffffff8114de0d>] load_elf_binary+0x1508/0x190b
[17831.532517] [<ffffffff81113297>] search_binary_handler+0x173/0x313
[17831.532517] [<ffffffff8114c905>] ? load_elf_binary+0x0/0x190b
[17831.532517] [<ffffffff81114892>] do_execve+0x219/0x30a
[17831.532517] [<ffffffff8111887b>] ? getname+0x14d/0x1b3
[17831.532517] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[17831.532517] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[17831.532517] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[17831.532517] RIP [<ffffffff811094fb>] migration_entry_wait+0xc1/0x129
[17831.532517] RSP <ffff88007ebfd9d8>
[17835.075942] ---[ end trace 6c659d3989ca12d3 ]---

2010-04-27 10:45:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Tue, 27 Apr 2010 10:40:02 +0100
Mel Gorman <[email protected]> wrote:

> On Sun, Apr 25, 2010 at 04:41:13PM +0200, Andrea Arcangeli wrote:
> > On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> > > It should be. I expect that's why you have never seen the bugon in
> > > swapops.
> >
> > Oh I just got the very crash you're talking about with aa.git with
> > your v8 code. Weird that I never reproduced it before! I think it's
> > because I fixed gcc to be fully backed by hugepages always (without
> > khugepaged) and I was rebuilding a couple of packages, and that now
> > triggers memory compaction much more, but mixed with heavy
> > fork/execve. This is the only instability I managed to reproduce over
> > 24 hours of stress testing and it's clearly not related to transparent
> > hugepage support but it's either a bug in migrate.c (more likely) or
> > memory compaction.
> >
> > Note that I'm running with the 2.6.33 anon-vma code, so it will
> > relieve you to know it's not the anon-vma recent changes causing this
> > (well I can't rule out anon-vma bugs, but if it's anon-vma, it's a
> > longstanding one).
> >
> > kernel BUG at include/linux/swapops.h:105!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
> > CPU 0
> > Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr
> >
> > Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
> > RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246
> > RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
> > RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
> > RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
> > R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
> > R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
> > FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00007ffff8a53000 CR3: 0000000004f80000 CR4: 00000000000006f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process basename (pid: 13351, threadinfo ffff88009ab6e000, task ffff88009ab96c70)
> > Stack:
> > ffff8800aaa95280 ffffffff810ce472 ffff8801134a7ce8 0000000000000000
> > <0> 00000000142d1a3e ffffffff810c2e35 79f085e9c08a4db7 62d38944fd014000
> > <0> 76b07a274b0c057a ffffea00025649f8 f8000000000a168d d19934e84d2a74f3
> > Call Trace:
> > [<ffffffff810ce472>] ? page_add_new_anon_rmap+0x72/0xc0
> > [<ffffffff810c2e35>] ? handle_pte_fault+0x7a5/0x7d0
> > [<ffffffff8150506d>] ? do_page_fault+0x13d/0x420
> > [<ffffffff8150215f>] ? page_fault+0x1f/0x30
> > [<ffffffff81273bfb>] ? strnlen_user+0x4b/0x80
> > [<ffffffff81131f4e>] ? load_elf_binary+0x12be/0x1c80
> > [<ffffffff810f426d>] ? search_binary_handler+0xad/0x2c0
> > [<ffffffff810f5ce7>] ? do_execve+0x247/0x320
> > [<ffffffff8100ab16>] ? sys_execve+0x36/0x60
> > [<ffffffff8100314a>] ? stub_execve+0x6a/0xc0
> > Code: 5e ff ff ff 8d 41 01 89 4c 24 08 89 44 24 04 8b 74 24 04 8b 44 24 08 f0 0f b1 32 89 44 24 0c 8b 44 24 0c 39 c8 74 a4 89 c1 eb d1 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 49 89 d4
> > RIP [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > RSP <ffff88009ab6fa58>
> > ---[ end trace 840ce8bc6f6dc402 ]---
> >
> > It doesn't look like a coincidence the page that had the migration PTE
> > set was the argv in the user stack during execve. The bug has to be
> > there. Or maybe it's a coincidence and it will mislead us. If you've
> > other stack traces please post them so I can have more info (I'll post
> > more stack traces if I get them again, it doesn't look easy to
> > reproduce, supposedly the bug has always been there since the first
> > time I used memory compaction, and this is the first time I reproduce
> > it).
> >
>
> The oopses I am getting look very similar. The page is encountered in
> the stack while copying the arguements in. I don't think it's a
> coincidence.
>

Hmm. Booby trap around here?
==
static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
{
....
	/*
	 * cover the whole range: [new_start, old_end)
	 */
	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
		return -ENOMEM;

	/*
	 * move the page tables downwards, on failure we rely on
	 * process cleanup to remove whatever mess we made.
	 */
	if (length != move_page_tables(vma, old_start,
				       vma, new_start, length))
		return -ENOMEM;
...
	/*
	 * Shrink the vma to just the new range. Always succeeds.
	 */
	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);


==

I think we have wrong vma_address() -> "pte"
==
=== (A) ===
vma_adjust(). ---- (*)
=== (B) ===
move_pte().
==

vma_address(page, vma)
=> address = vma->vm_start + ((page->index << shift) - vma->vm_pgoff) << PAGE_SHIFT);

So, vma_address() in zone (A) and vma_address in (B) will return different address.

When pte inludes migration_pte, this seems critical. Because an address pointed
by vma_address() in zone (B) will not contain migration_pte until
move_ptes() ends.
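
For reference, vma_address() is roughly the following (paraphrased from
mm/rmap.c of that era), which is why an in-flight change to
vm_start/vm_pgoff shifts the computed address:

static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	unsigned long address;

	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
		return -EFAULT;	/* page does not appear to be in this vma */
	return address;
}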

Thanks,
-Kame





2010-04-27 11:12:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Tue, Apr 27, 2010 at 07:41:39PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Apr 2010 10:40:02 +0100
> Mel Gorman <[email protected]> wrote:
>
> > On Sun, Apr 25, 2010 at 04:41:13PM +0200, Andrea Arcangeli wrote:
> > > On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> > > > It should be. I expect that's why you have never seen the bugon in
> > > > swapops.
> > >
> > > Oh I just got the very crash you're talking about with aa.git with
> > > your v8 code. Weird that I never reproduced it before! I think it's
> > > because I fixed gcc to be fully backed by hugepages always (without
> > > khugepaged) and I was rebuilding a couple of packages, and that now
> > > triggers memory compaction much more, but mixed with heavy
> > > fork/execve. This is the only instability I managed to reproduce over
> > > 24 hours of stress testing and it's clearly not related to transparent
> > > hugepage support but it's either a bug in migrate.c (more likely) or
> > > memory compaction.
> > >
> > > Note that I'm running with the 2.6.33 anon-vma code, so it will
> > > relieve you to know it's not the anon-vma recent changes causing this
> > > (well I can't rule out anon-vma bugs, but if it's anon-vma, it's a
> > > longstanding one).
> > >
> > > kernel BUG at include/linux/swapops.h:105!
> > > invalid opcode: 0000 [#1] SMP
> > > last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
> > > CPU 0
> > > Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr
> > >
> > > Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
> > > RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > > RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246
> > > RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
> > > RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
> > > RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
> > > R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
> > > R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
> > > FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 00007ffff8a53000 CR3: 0000000004f80000 CR4: 00000000000006f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process basename (pid: 13351, threadinfo ffff88009ab6e000, task ffff88009ab96c70)
> > > Stack:
> > > ffff8800aaa95280 ffffffff810ce472 ffff8801134a7ce8 0000000000000000
> > > <0> 00000000142d1a3e ffffffff810c2e35 79f085e9c08a4db7 62d38944fd014000
> > > <0> 76b07a274b0c057a ffffea00025649f8 f8000000000a168d d19934e84d2a74f3
> > > Call Trace:
> > > [<ffffffff810ce472>] ? page_add_new_anon_rmap+0x72/0xc0
> > > [<ffffffff810c2e35>] ? handle_pte_fault+0x7a5/0x7d0
> > > [<ffffffff8150506d>] ? do_page_fault+0x13d/0x420
> > > [<ffffffff8150215f>] ? page_fault+0x1f/0x30
> > > [<ffffffff81273bfb>] ? strnlen_user+0x4b/0x80
> > > [<ffffffff81131f4e>] ? load_elf_binary+0x12be/0x1c80
> > > [<ffffffff810f426d>] ? search_binary_handler+0xad/0x2c0
> > > [<ffffffff810f5ce7>] ? do_execve+0x247/0x320
> > > [<ffffffff8100ab16>] ? sys_execve+0x36/0x60
> > > [<ffffffff8100314a>] ? stub_execve+0x6a/0xc0
> > > Code: 5e ff ff ff 8d 41 01 89 4c 24 08 89 44 24 04 8b 74 24 04 8b 44 24 08 f0 0f b1 32 89 44 24 0c 8b 44 24 0c 39 c8 74 a4 89 c1 eb d1 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 49 89 d4
> > > RIP [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > > RSP <ffff88009ab6fa58>
> > > ---[ end trace 840ce8bc6f6dc402 ]---
> > >
> > > It doesn't look like a coincidence the page that had the migration PTE
> > > set was the argv in the user stack during execve. The bug has to be
> > > there. Or maybe it's a coincidence and it will mislead us. If you've
> > > other stack traces please post them so I can have more info (I'll post
> > > more stack traces if I get them again, it doesn't look easy to
> > > reproduce, supposedly the bug has always been there since the first
> > > time I used memory compaction, and this is the first time I reproduce
> > > it).
> > >
> >
> > The oopses I am getting look very similar. The page is encountered in
> > the stack while copying the arguements in. I don't think it's a
> > coincidence.
> >
>
> Hmm. booby trap aronude here ?

I think so. I have a debugging patch running at the moment that is
checking for migration ptes while the page tables are being moved.

> ==
> static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> {
> ....
> /*
> * cover the whole range: [new_start, old_end)
> */
> if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> return -ENOMEM;
>
> /*
> * move the page tables downwards, on failure we rely on
> * process cleanup to remove whatever mess we made.
> */
> if (length != move_page_tables(vma, old_start,
> vma, new_start, length))
> return -ENOMEM;
> ...

Specifically, I have it in move_ptes. If migration entries are being found
there, it would be reasonable for exec() to wait on migration to complete
but what you suggest below is more plausible.

> /*
> * Shrink the vma to just the new range. Always succeeds.
> */
> vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
>
>
> ==
>
> I think we have wrong vma_address() -> "pte"
> ==
> === (A) ===
> vma_adjust(). ---- (*)
> === (B) ===
> move_pte().
> ==
>
> vma_address(page, vma)
> => address = vma->vm_start + (((page->index << shift) - vma->vm_pgoff) << PAGE_SHIFT);
>
> So, vma_address() in zone (A) and vma_address() in zone (B) will return different addresses.
>

Yes. I was expecting that the anon_vma lock in vma_adjust would delay exec
until migration completed.
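
To spell out the arithmetic in your diagram: vma_address() is effectively
vm_start + ((page->index - vm_pgoff) << PAGE_SHIFT), and shift_arg_pages()
moves vm_start down by 'shift' while leaving vm_pgoff and page->index
untouched, so in window (B) an rmap walk computes an address 'shift' bytes
below where the migration pte still sits. A throwaway userspace
illustration, with entirely made-up numbers:

==
#include <stdio.h>

#define PAGE_SHIFT 12

/* same formula as the kernel's vma_address() helper */
static unsigned long vma_address(unsigned long vm_start,
				 unsigned long vm_pgoff,
				 unsigned long page_index)
{
	return vm_start + ((page_index - vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
	unsigned long old_start  = 0x7fff00200000UL;	/* made up */
	unsigned long vm_pgoff   = old_start >> PAGE_SHIFT;
	unsigned long page_index = vm_pgoff + 5;	/* argv page, 5 pages in */
	unsigned long shift      = 2UL << PAGE_SHIFT;	/* stack moved down 2 pages */

	/* (A) before vma_adjust(): the address matches where the pte really is */
	unsigned long addr_a = vma_address(old_start, vm_pgoff, page_index);

	/* (B) after vma_adjust() but before move_ptes(): vm_start has moved,
	 * the migration pte has not, so rmap looks in the wrong place */
	unsigned long addr_b = vma_address(old_start - shift, vm_pgoff, page_index);

	printf("(A) rmap looks at %#lx, (B) rmap looks at %#lx, pte still at %#lx\n",
	       addr_a, addr_b, addr_a);
	return 0;
}
==

If that reading is right, remove_migration_ptes() misses the entry in window
(B), move_ptes() then carries the stale migration pte to the new address,
and the later fault on the relocated stack trips the BUG_ON in
migration_entry_wait() exactly as in the oops above.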

> When the pte includes a migration_pte, this seems critical, because the
> address pointed to by vma_address() in zone (B) will not contain the
> migration_pte until move_ptes() ends.
>

This is plausible considering that, like vma_adjust(), move_ptes does
not appear to take the anon_vma->lock in the same fashion as i_mmap_lock
is taken for files.
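
For reference, the shape of move_ptes() (paraphrased and trimmed from memory
of mm/mremap.c around 2.6.34, so treat it as a sketch rather than the exact
code) shows the asymmetry: the file side is serialised via
mapping->i_mmap_lock, but nothing comparable is taken on anon_vma->lock, so
an anon rmap walk can run while the ptes are in flight:

==
/* Sketch of the move_ptes() locking, not a verbatim copy */
static void move_ptes_sketch(struct vm_area_struct *vma, pmd_t *old_pmd,
			     unsigned long old_addr, unsigned long old_end,
			     struct vm_area_struct *new_vma, pmd_t *new_pmd,
			     unsigned long new_addr)
{
	struct address_space *mapping = NULL;
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
	spinlock_t *old_ptl, *new_ptl;

	if (vma->vm_file) {
		/* file-backed: lock out truncation/rmap while ptes move */
		mapping = vma->vm_file->f_mapping;
		spin_lock(&mapping->i_mmap_lock);
	}
	/* anon: no spin_lock(&vma->anon_vma->lock) here, which is the gap */

	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	new_pte = pte_offset_map_nested(new_pmd, new_addr);
	new_ptl = pte_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		/* a migration pte is carried across like any other entry */
		pte = ptep_clear_flush(vma, old_addr, old_pte);
		set_pte_at(mm, new_addr, new_pte, pte);
	}

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	pte_unmap_nested(new_pte - 1);
	pte_unmap_unlock(old_pte - 1, old_ptl);
	if (mapping)
		spin_unlock(&mapping->i_mmap_lock);
}
==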

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2010-04-27 15:43:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/14] mm,migration: Allow the migration of PageSwapCache pages

On Tue, Apr 27, 2010 at 10:40:02AM +0100, Mel Gorman wrote:
> On Sun, Apr 25, 2010 at 04:41:13PM +0200, Andrea Arcangeli wrote:
> > On Sat, Apr 24, 2010 at 11:52:27AM +0100, Mel Gorman wrote:
> > > It should be. I expect that's why you have never seen the BUG_ON in
> > > swapops.h.
> >
> > Oh I just got the very crash you're talking about with aa.git with
> > your v8 code. Weird that I never reproduced it before! I think it's
> > because I fixed gcc to be fully backed by hugepages always (without
> > khugepaged) and I was rebuilding a couple of packages, and that now
> > triggers memory compaction much more, but mixed with heavy
> > fork/execve. This is the only instability I managed to reproduce over
> > 24 hours of stress testing and it's clearly not related to transparent
> > hugepage support but it's either a bug in migrate.c (more likely) or
> > memory compaction.
> >
> > Note that I'm running with the 2.6.33 anon-vma code, so it will
> > relieve you to know it's not the recent anon-vma changes causing this
> > (well, I can't rule out anon-vma bugs, but if it's anon-vma, it's a
> > longstanding one).
> >
> > kernel BUG at include/linux/swapops.h:105!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
> > CPU 0
> > Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr
> >
> > Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
> > RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246
> > RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
> > RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
> > RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
> > R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
> > R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
> > FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00007ffff8a53000 CR3: 0000000004f80000 CR4: 00000000000006f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process basename (pid: 13351, threadinfo ffff88009ab6e000, task ffff88009ab96c70)
> > Stack:
> > ffff8800aaa95280 ffffffff810ce472 ffff8801134a7ce8 0000000000000000
> > <0> 00000000142d1a3e ffffffff810c2e35 79f085e9c08a4db7 62d38944fd014000
> > <0> 76b07a274b0c057a ffffea00025649f8 f8000000000a168d d19934e84d2a74f3
> > Call Trace:
> > [<ffffffff810ce472>] ? page_add_new_anon_rmap+0x72/0xc0
> > [<ffffffff810c2e35>] ? handle_pte_fault+0x7a5/0x7d0
> > [<ffffffff8150506d>] ? do_page_fault+0x13d/0x420
> > [<ffffffff8150215f>] ? page_fault+0x1f/0x30
> > [<ffffffff81273bfb>] ? strnlen_user+0x4b/0x80
> > [<ffffffff81131f4e>] ? load_elf_binary+0x12be/0x1c80
> > [<ffffffff810f426d>] ? search_binary_handler+0xad/0x2c0
> > [<ffffffff810f5ce7>] ? do_execve+0x247/0x320
> > [<ffffffff8100ab16>] ? sys_execve+0x36/0x60
> > [<ffffffff8100314a>] ? stub_execve+0x6a/0xc0
> > Code: 5e ff ff ff 8d 41 01 89 4c 24 08 89 44 24 04 8b 74 24 04 8b 44 24 08 f0 0f b1 32 89 44 24 0c 8b 44 24 0c 39 c8 74 a4 89 c1 eb d1 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 49 89 d4
> > RIP [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
> > RSP <ffff88009ab6fa58>
> > ---[ end trace 840ce8bc6f6dc402 ]---
> >
> > It doesn't look like a coincidence that the page that had the migration
> > PTE set was the argv in the user stack during execve. The bug has to be
> > there. Or maybe it's a coincidence and it will mislead us. If you have
> > other stack traces, please post them so I can have more info (I'll post
> > more stack traces if I get them again; it doesn't look easy to
> > reproduce, supposedly the bug has always been there since the first
> > time I used memory compaction, and this is the first time I have
> > reproduced it).
> >
>
> The oopses I am getting look very similar. The page is encountered in
> the stack while copying the arguments in. I don't think it's a
> coincidence.

Awesome that this is happening on the same page and in the same place!
Thanks a lot for sharing your oops. (I'm running the same kernel again
and have not reproduced it since, but I didn't apply the "reproducer"
stress to it, only plenty of gcc hugepage loads. I think rebuilding
gcc with hugepage-gcc is what triggered it the first time, as gcc
takes more memory to build and has a lot more pathological cases, such as
translate.o taking lots of memory, than the kernel or glibc which I
have rebuilt a lot recently.)

Narrowing the race down to this will help tremendously. If I understand
correctly, patch 1 didn't fix it (and patch 2 can't be the issue for me
as I'm running the 2.6.33 anon-vma code). I'll check ASAP whether patch 1
is needed even if it doesn't fix this bug, or whether it's unnecessary,
and what else could be wrong...

Thanks a lot for the help!
Andrea