Changes V4->V5:
- Patchset against 2.6.15-rc2-mm1
- Update policy layer patch to use the generic check_range in 2.6.15-rc2-mm1.
- Remove the try_to_unmap patch: VM_RESERVED vanished under us, so there is
no longer any point in distinguishing between permanent and transitional
failures.
Changes V3->V4:
- Patchset against 2.6.15-rc1-mm2 + two swap migration fixes posted today.
- Remove what is already in 2.6.15-rc1-mm2 which results in a significant
cleanup of the code.
Changes V2->V3:
- Patchset against 2.6.14-mm2
- Fix single processor build and builds without CONFIG_MIGRATION
- export symbols for filesystems that are modules and for
modules using migrate_pages().
- Paul Jackson's cpuset migration support is in 2.6.14-mm2 so
this patchset can be easily applied to -mm2 to get from swap
based to direct page migration.
Changes V1->V2:
- Call node_remap with the right parameters in do_migrate_pages().
- Take radix tree lock while examining page count to avoid races with
find_get_page() and various *_get_pages based on it.
- Convert direct ptes to swap ptes before radix tree update to avoid
more races.
- Fix problem if CONFIG_MIGRATION is off for buffer_migrate_page
- Add documentation about page migration
- Change migrate_pages() api so that the caller can decide what
to do about the migrated pages (badmem handling and hotplug
have to remove those pages for good).
- Drop config patch (already in mm)
- Add try_to_unmap patch
- Patchset now against 2.6.14-mm1 without requiring additional patches.
Note that the page migration here is different from the one of the memory
hotplug project. Pages are migrated in order to improve performance.
A best effort is made to migrate all pages that are in use by user space
and that are swappable. If a couple of pages are not moved then the
performance of a process will not increase as much as wanted but the
application will continue to function properly.
Many of the ideas for this code were originally developed in the memory
hotplug project, and we hope that this code will also allow the hotplug
project to build on this patch in order to reach its goals. We also
would like to be able to move bad memory at SGI which is likely something
that will also be based on this patchset.
I am very thankful for the support of the hotplug developers for bringing
this patchset about. The migration of kernel pages, slab pages and
other unswappable pages that is also needed by the hotplug project
and for the remapping of bad memory is likely to require a significant
amount of additional changes to the Linux kernel beyond the scope of
this page migration endeavor.
Page migration can be triggered via:
A. Specifying MPOL_MF_MOVE(_ALL) when setting a new policy
for a range of addresses of a process.
B. Calling sys_migrate_pages() to control the location of the pages of
another process. Pages may migrate back through swapping if memory
policies, cpuset nodes and the node on which the process is executing
are not changed by other means.
sys_migrate_pages() may be particularly useful to move the pages of
a process if the scheduler has shifted the execution of a process
to a different node.
C. Changing the cpuset of a task (moving tasks to another cpuset or modifying
its set of allowed nodes) if a special option is set in the cpuset. The
cpuset code will call into the page migration layer in order to move the
process to its new environment. This is the preferred and easiest method
to use page migration. Thanks to Paul Jackson for realizing this
functionality.
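As a rough illustration of method B, a user-space caller might look like the
sketch below. It assumes the four-argument form (pid, maxnode, old_nodes,
new_nodes) described in the migrate_pages(2) manual page and goes through
syscall(2) directly. The helper names nodes_to_mask() and move_task_pages()
are made up for this example, and the single-word node mask is a
simplification.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: set one bit per node id in a single-word node mask. */
unsigned long nodes_to_mask(const int *nodes, size_t count)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < count; i++)
		if (nodes[i] >= 0 && (size_t)nodes[i] < BITS_PER_ULONG)
			mask |= 1UL << nodes[i];
	return mask;
}

/*
 * Hypothetical wrapper: ask the kernel to move the pages of 'pid' that sit
 * on the nodes in 'from' over to the nodes in 'to'. The exact meaning of
 * the maxnode argument is an assumption here; check the manual page.
 * Returns the raw syscall result, or -1 if the syscall number is unknown.
 */
long move_task_pages(int pid, unsigned long from, unsigned long to)
{
#ifdef SYS_migrate_pages
	return syscall(SYS_migrate_pages, pid, BITS_PER_ULONG, &from, &to);
#else
	(void)pid;
	(void)from;
	(void)to;
	return -1;
#endif
}
```

A caller that wants to move a task from nodes 0 and 2 to node 1 would build
the two masks with nodes_to_mask() and pass them to move_task_pages(). Pages
that cannot be moved are simply left behind, in line with the best-effort
approach described above.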
The patchset consists of seven patches (only the first four are necessary to
have basic direct migration support):
1. Fix double unlock page
The cleanup patches introduced a bug. Fix that.
2. Consolidate successful migration handling.
Code to handle successful migration was duplicated.
3. SwapCache patch
SwapCache pages may have changed their type after lock_page().
Check for this and retry lookup if the page is no longer a SwapCache
page.
4. migrate_pages()
Basic direct migration with fallback to swap if all other attempts
fail.
5. remove_from_swap()
Page migration installs swap ptes for anonymous pages in order to
preserve the information contained in the page tables. This patch
removes the swap ptes and replaces them with real ones after migration.
6. upgrade of MPOL_MF_MOVE and sys_migrate_pages()
Add logic to mm/mempolicy.c to allow the policy layer to control
direct page migration. Thanks to Paul Jackson for the iterative
logic to move between sets of nodes.
7. buffer_migrate_page() patch
Allow migration without writing back dirty pages. Add filesystem dependent
migration support for ext2/ext3 and xfs. Use swapper space to setup a
method to migrate anonymous pages without writeback.
Credits (also in mm/vmscan.c):
The idea for this scheme of page migration was first developed in the context
of the memory hotplug project. The main authors of the migration code from
the memory hotplug project are:
IWAMOTO Toshihiro <[email protected]>
Hirokazu Takahashi <[email protected]>
Dave Hansen <[email protected]>
Check for PageSwapCache after looking up and locking a swap page.
The page migration code may change a swap pte to point to a different page
under lock_page().
If that happens then the vm must retry the lookup operation in the swap
space to find the correct page number. There are a couple of locations
in the VM where a lock_page() is done on a swap page. In these locations
we need to check afterwards if the page was migrated. If the page was migrated
then the old page that was looked up before was freed and no longer has the
PageSwapCache bit set.
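The lookup, lock, recheck and retry pattern described above can be modelled
in user space as follows. This is only a toy sketch: struct toy_page,
struct toy_swap_cache and the "pending" field that simulates a migration
racing with the lock are invented for illustration and do not correspond to
kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; only models the PageSwapCache bit. */
struct toy_page {
	bool swap_cache;
};

/* Toy swap cache: one slot per swap entry. */
struct toy_swap_cache {
	struct toy_page *slot[4];
	struct toy_page *pending;	/* if set, a migration happens during the next lock */
};

/* Models lock_page(): a pending migration may complete while we wait. */
static void toy_lock_page(struct toy_swap_cache *cache, unsigned int entry)
{
	if (cache->pending) {
		cache->slot[entry]->swap_cache = false;	/* old page was freed */
		cache->slot[entry] = cache->pending;	/* entry now maps to the new page */
		cache->pending = NULL;
	}
}

/* Look up the page for 'entry', lock it, and retry if it was migrated. */
struct toy_page *toy_lookup_and_lock(struct toy_swap_cache *cache,
				     unsigned int entry, int *lookups)
{
	struct toy_page *page;

again:
	page = cache->slot[entry];
	(*lookups)++;
	if (!page)
		return NULL;
	toy_lock_page(cache, entry);
	if (!page->swap_cache)
		goto again;	/* stale page: migration has occurred, look up again */
	return page;
}
```

In the patch itself the retry path also has to unlock and release the stale
page before looking up again; the toy model leaves that out because it has
no reference counting.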
Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.15-rc1-mm2/mm/memory.c
===================================================================
--- linux-2.6.15-rc1-mm2.orig/mm/memory.c 2005-11-18 09:47:15.000000000 -0800
+++ linux-2.6.15-rc1-mm2/mm/memory.c 2005-11-18 09:47:19.000000000 -0800
@@ -1781,6 +1781,7 @@ static int do_swap_page(struct mm_struct
goto out;
entry = pte_to_swp_entry(orig_pte);
+again:
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1804,6 +1805,12 @@ static int do_swap_page(struct mm_struct
mark_page_accessed(page);
lock_page(page);
+ if (!PageSwapCache(page)) {
+ /* Page migration has occurred */
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
/*
* Back out if somebody else already faulted in this pte.
Index: linux-2.6.15-rc1-mm2/mm/shmem.c
===================================================================
--- linux-2.6.15-rc1-mm2.orig/mm/shmem.c 2005-11-18 09:47:15.000000000 -0800
+++ linux-2.6.15-rc1-mm2/mm/shmem.c 2005-11-18 09:47:19.000000000 -0800
@@ -1028,6 +1028,14 @@ repeat:
page_cache_release(swappage);
goto repeat;
}
+ if (!PageSwapCache(swappage)) {
+ /* Page migration has occurred */
+ shmem_swp_unmap(entry);
+ spin_unlock(&info->lock);
+ unlock_page(swappage);
+ page_cache_release(swappage);
+ goto repeat;
+ }
if (PageWriteback(swappage)) {
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
Index: linux-2.6.15-rc1-mm2/mm/swapfile.c
===================================================================
--- linux-2.6.15-rc1-mm2.orig/mm/swapfile.c 2005-11-11 17:43:36.000000000 -0800
+++ linux-2.6.15-rc1-mm2/mm/swapfile.c 2005-11-18 09:47:19.000000000 -0800
@@ -624,6 +624,7 @@ static int try_to_unuse(unsigned int typ
*/
swap_map = &si->swap_map[i];
entry = swp_entry(type, i);
+again:
page = read_swap_cache_async(entry, NULL, 0);
if (!page) {
/*
@@ -658,6 +659,12 @@ static int try_to_unuse(unsigned int typ
wait_on_page_locked(page);
wait_on_page_writeback(page);
lock_page(page);
+ if (!PageSwapCache(page)) {
+ /* Page migration has occurred */
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
wait_on_page_writeback(page);
/*
Migrate a page with buffers without requiring writeback
This introduces a new address space operation migratepage() that
may be used by a filesystem to implement its own version of page migration.
A version is provided that migrates buffers attached to pages. Some
filesystems (ext2, ext3, xfs) are modified to utilize this feature.
The swapper address space operations are modified so that a regular
migrate_page() will occur for anonymous pages without writeback
(migrate_pages forces every anonymous page to have a swap entry).
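The way a per-mapping migratepage() method short-circuits the generic path
can be sketched in user space like this. All toy_* names are hypothetical
and the "write back first" fallback is a loose model of what migrate_pages()
does when no method is set; it is not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct page. */
struct toy_page {
	int dirty;
	int has_buffers;
};

/* Toy stand-in for struct address_space_operations. */
struct toy_aops {
	int (*migratepage)(struct toy_page *newpage, struct toy_page *page);
};

/* Plain migration: carry the page state over to the new page. */
int toy_migrate_page(struct toy_page *newpage, struct toy_page *page)
{
	newpage->dirty = page->dirty;
	page->dirty = 0;
	return 0;
}

/* Buffer-aware migration: move the buffers too, no writeback needed. */
int toy_buffer_migrate_page(struct toy_page *newpage, struct toy_page *page)
{
	newpage->has_buffers = page->has_buffers;
	page->has_buffers = 0;
	return toy_migrate_page(newpage, page);
}

/* For mappings whose pages must never be migrated. */
int toy_fail_migrate_page(struct toy_page *newpage, struct toy_page *page)
{
	(void)newpage;
	(void)page;
	return -EIO;
}

/*
 * Dispatch: use the mapping's own method if it has one, otherwise fall
 * back to a path that writes a dirty page back before moving it.
 */
int toy_move(const struct toy_aops *ops, struct toy_page *newpage,
	     struct toy_page *page, int *wrote_back)
{
	*wrote_back = 0;
	if (ops->migratepage)
		return ops->migratepage(newpage, page);
	if (page->dirty) {
		*wrote_back = 1;	/* models the writeout of the dirty page */
		page->dirty = 0;
	}
	return toy_migrate_page(newpage, page);
}
```

With the buffer-aware method a dirty page keeps its dirty state and its
buffers on the new page, whereas the fallback has to clean the page first.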
V2->V3:
- export functions for filesystems that are modules and for modules that
perform migration by calling migrate_pages().
- Fix macro name clash. Fix build on UP and systems without CONFIG_MIGRATION
V1->V2:
- Fix CONFIG_MIGRATION handling
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h 2005-11-23 09:10:04.000000000 -0800
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h 2005-11-28 11:41:38.000000000 -0800
@@ -367,6 +367,8 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
+ /* migrate the contents of a page to the specified target */
+ int (*migratepage) (struct page *, struct page *);
};
struct backing_dev_info;
@@ -1722,6 +1724,12 @@ extern void simple_release_fs(struct vfs
extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
+#ifdef CONFIG_MIGRATION
+extern int buffer_migrate_page(struct page *, struct page *);
+#else
+#define buffer_migrate_page NULL
+#endif
+
extern int inode_change_ok(struct inode *, struct iattr *);
extern int __must_check inode_setattr(struct inode *, struct iattr *);
Index: linux-2.6.15-rc2-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/swap_state.c 2005-11-23 09:10:04.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/swap_state.c 2005-11-28 11:41:38.000000000 -0800
@@ -27,6 +27,7 @@ static struct address_space_operations s
.writepage = swap_writepage,
.sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
+ .migratepage = migrate_page,
};
static struct backing_dev_info swap_backing_dev_info = {
Index: linux-2.6.15-rc2-mm1/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/fs/xfs/linux-2.6/xfs_aops.c 2005-11-19 19:25:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/fs/xfs/linux-2.6/xfs_aops.c 2005-11-28 11:41:38.000000000 -0800
@@ -1348,4 +1348,5 @@ struct address_space_operations linvfs_a
.commit_write = generic_commit_write,
.bmap = linvfs_bmap,
.direct_IO = linvfs_direct_IO,
+ .migratepage = buffer_migrate_page,
};
Index: linux-2.6.15-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/fs/buffer.c 2005-11-23 09:10:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/fs/buffer.c 2005-11-28 11:41:38.000000000 -0800
@@ -3051,6 +3051,71 @@ asmlinkage long sys_bdflush(int func, lo
}
/*
+ * Migration function for pages with buffers. This function can only be used
+ * if the underlying filesystem guarantees that no other references to "page"
+ * exist.
+ */
+#ifdef CONFIG_MIGRATION
+int buffer_migrate_page(struct page *newpage, struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct buffer_head *bh, *head;
+
+ if (!mapping)
+ return -EAGAIN;
+
+ if (!page_has_buffers(page))
+ return migrate_page(newpage, page);
+
+ head = page_buffers(page);
+
+ if (migrate_page_remove_references(newpage, page, 3))
+ return -EAGAIN;
+
+ spin_lock(&mapping->private_lock);
+
+ bh = head;
+ do {
+ get_bh(bh);
+ lock_buffer(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ ClearPagePrivate(page);
+ set_page_private(newpage, page_private(page));
+ set_page_private(page, 0);
+ put_page(page);
+ get_page(newpage);
+
+ bh = head;
+ do {
+ set_bh_page(bh, newpage, bh_offset(bh));
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ SetPagePrivate(newpage);
+ spin_unlock(&mapping->private_lock);
+
+ migrate_page_copy(newpage, page);
+
+ spin_lock(&mapping->private_lock);
+ bh = head;
+ do {
+ unlock_buffer(bh);
+ put_bh(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+ spin_unlock(&mapping->private_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL(buffer_migrate_page);
+#endif
+
+/*
* Buffer-head allocation
*/
static kmem_cache_t *bh_cachep;
Index: linux-2.6.15-rc2-mm1/fs/ext3/inode.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/fs/ext3/inode.c 2005-11-23 09:10:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/fs/ext3/inode.c 2005-11-28 11:41:38.000000000 -0800
@@ -1564,6 +1564,7 @@ static struct address_space_operations e
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
};
static struct address_space_operations ext3_writeback_aops = {
@@ -1577,6 +1578,7 @@ static struct address_space_operations e
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
};
static struct address_space_operations ext3_journalled_aops = {
Index: linux-2.6.15-rc2-mm1/fs/ext2/inode.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/fs/ext2/inode.c 2005-11-19 19:25:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/fs/ext2/inode.c 2005-11-28 11:41:38.000000000 -0800
@@ -706,6 +706,7 @@ struct address_space_operations ext2_aop
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
+ .migratepage = buffer_migrate_page,
};
struct address_space_operations ext2_aops_xip = {
@@ -723,6 +724,7 @@ struct address_space_operations ext2_nob
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
+ .migratepage = buffer_migrate_page,
};
/*
Index: linux-2.6.15-rc2-mm1/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/fs/xfs/linux-2.6/xfs_buf.c 2005-11-19 19:25:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/fs/xfs/linux-2.6/xfs_buf.c 2005-11-28 11:41:38.000000000 -0800
@@ -1568,6 +1568,7 @@ xfs_mapping_buftarg(
struct address_space *mapping;
static struct address_space_operations mapping_aops = {
.sync_page = block_sync_page,
+ .migratepage = fail_migrate_page,
};
inode = new_inode(bdev->bd_inode->i_sb);
Index: linux-2.6.15-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c 2005-11-28 11:41:05.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c 2005-11-28 11:41:38.000000000 -0800
@@ -618,6 +618,15 @@ int putback_lru_pages(struct list_head *
}
/*
+ * Non migratable page
+ */
+int fail_migrate_page(struct page *newpage, struct page *page)
+{
+ return -EIO;
+}
+EXPORT_SYMBOL(fail_migrate_page);
+
+/*
* swapout a single page
* page is locked upon entry, unlocked on exit
*/
@@ -773,6 +782,8 @@ int migrate_page_remove_references(struc
return 0;
}
+EXPORT_SYMBOL(swap_page);
+EXPORT_SYMBOL(migrate_page_remove_references);
/*
* Copy the page to its new location
@@ -822,6 +833,7 @@ void migrate_page_copy(struct page *newp
if (PageWriteback(newpage))
end_page_writeback(newpage);
}
+EXPORT_SYMBOL(migrate_page_copy);
/*
* Common logic to directly migrate a single page suitable for
@@ -840,6 +852,7 @@ int migrate_page(struct page *newpage, s
return 0;
}
+EXPORT_SYMBOL(migrate_page);
/*
* migrate_pages
@@ -939,6 +952,11 @@ redo:
if (!mapping)
goto unlock_both;
+ if (mapping->a_ops->migratepage) {
+ rc = mapping->a_ops->migratepage(newpage, page);
+ goto unlock_both;
+ }
+
/*
* Trigger writeout if page is dirty
*/
@@ -1052,6 +1070,7 @@ redo:
}
return rc;
}
+EXPORT_SYMBOL(migrate_pages);
#endif
/*
Index: linux-2.6.15-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc2-mm1.orig/include/linux/swap.h 2005-11-28 11:39:43.000000000 -0800
+++ linux-2.6.15-rc2-mm1/include/linux/swap.h 2005-11-28 11:41:38.000000000 -0800
@@ -184,6 +184,11 @@ extern int migrate_page_remove_reference
extern void migrate_page_copy(struct page *, struct page *);
extern int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed);
+extern int fail_migrate_page(struct page *, struct page *);
+#else
+/* Possible settings for the migrate_page() method in address_operations */
+#define migrate_page NULL
+#define fail_migrate_page NULL
#endif
#ifdef CONFIG_MMU
Modify policy layer to support direct page migration
- Add migrate_pages_to() allowing the migration of a list of pages to a
specified node or to a vma with a specific allocation policy in sets
of MIGRATE_CHUNK_SIZE pages
- Modify do_migrate_pages() to do a staged move of pages from the
source nodes to the target nodes.
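The staged move picks one <source, dest> node pair at a time, preferring a
destination that is not itself still waiting to be emptied. The following
user-space sketch models that selection with plain bitmasks. The toy_*
functions are hypothetical, and toy_node_remap() is only a simplified
stand-in for node_remap() that assumes the k-th node of the source set maps
to the k-th node of the target set.

```c
#include <assert.h>

#define TOY_MAX_NODES 32

/* Number of set bits in 'mask' strictly below position 'bit'. */
static int toy_bits_below(unsigned int mask, int bit)
{
	int i, n = 0;

	for (i = 0; i < bit; i++)
		if (mask & (1u << i))
			n++;
	return n;
}

/* Position of the n-th (0-based) set bit in 'mask', or -1 if there is none. */
static int toy_nth_set_bit(unsigned int mask, int n)
{
	int bit;

	for (bit = 0; bit < TOY_MAX_NODES; bit++)
		if ((mask & (1u << bit)) && n-- == 0)
			return bit;
	return -1;
}

/* Simplified stand-in for node_remap(): nodes outside 'from' map to themselves. */
int toy_node_remap(int node, unsigned int from, unsigned int to)
{
	int w = toy_bits_below(to, TOY_MAX_NODES);

	if (!(from & (1u << node)) || w == 0)
		return node;
	return toy_nth_set_bit(to, toy_bits_below(from, node) % w);
}

/*
 * One scan over '*tmp': remember the most recent pair that actually moves
 * (s != d) and stop early if its destination is not in '*tmp' any more.
 * Clears the chosen source from '*tmp'. Returns 0 if nothing is left.
 */
int toy_next_pair(unsigned int *tmp, unsigned int from, unsigned int to,
		  int *source, int *dest)
{
	int s, d, found = 0;

	for (s = 0; s < TOY_MAX_NODES; s++) {
		if (!(*tmp & (1u << s)))
			continue;
		d = toy_node_remap(s, from, to);
		if (s == d)
			continue;
		*source = s;
		*dest = d;
		found = 1;
		if (!(*tmp & (1u << d)))
			break;		/* destination is not a pending source */
	}
	if (found)
		*tmp &= ~(1u << *source);
	return found;
}
```

For a move from nodes {0,1} to nodes {1,2}, this selects 1 -> 2 first and
0 -> 1 second, so node 1 is emptied before pages from node 0 arrive on it.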
V3->V4: Fixed up to be based on the swap migration code in 2.6.15-rc1-mm2.
V1->V2:
- Migrate processes in chunks of MIGRATE_CHUNK_SIZE
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/mempolicy.c 2005-11-28 12:30:08.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/mempolicy.c 2005-11-28 12:36:57.000000000 -0800
@@ -95,6 +95,9 @@
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
+/* The number of pages to migrate per call to migrate_pages() */
+#define MIGRATE_CHUNK_SIZE 256
+
static kmem_cache_t *policy_cache;
static kmem_cache_t *sn_cache;
@@ -573,24 +576,96 @@ static void migrate_page_add(struct vm_a
}
}
-static int swap_pages(struct list_head *pagelist)
+/*
+ * Migrate the list 'pagelist' of pages to a certain destination.
+ *
+ * Specify destination with either non-NULL vma or dest_node >= 0
+ * Return the number of pages not migrated or error code
+ */
+static int migrate_pages_to(struct list_head *pagelist,
+ struct vm_area_struct *vma, int dest)
{
+ LIST_HEAD(newlist);
LIST_HEAD(moved);
LIST_HEAD(failed);
- int n;
+ int err = 0;
+ int nr_pages;
+ struct page *page;
+ struct list_head *p;
- n = migrate_pages(pagelist, NULL, &moved, &failed);
- putback_lru_pages(&failed);
- putback_lru_pages(&moved);
+redo:
+ nr_pages = 0;
+ list_for_each(p, pagelist) {
+ if (vma)
+ page = alloc_page_vma(GFP_HIGHUSER, vma,
+ vma->vm_start);
+ else
+ page = alloc_pages_node(dest, GFP_HIGHUSER, 0);
- return n;
+ if (!page) {
+ err = -ENOMEM;
+ goto out;
+ }
+ list_add(&page->lru, &newlist);
+ nr_pages++;
+ if (nr_pages > MIGRATE_CHUNK_SIZE)
+ break;
+ }
+ err = migrate_pages(pagelist, &newlist, &moved, &failed);
+
+ putback_lru_pages(&moved); /* Call release pages instead ?? */
+
+ if (err >= 0 && list_empty(&newlist) && !list_empty(pagelist))
+ goto redo;
+out:
+ /* Return leftover allocated pages */
+ while (!list_empty(&newlist)) {
+ page = list_entry(newlist.next, struct page, lru);
+ list_del(&page->lru);
+ __free_page(page);
+ }
+ list_splice(&failed, pagelist);
+ if (err < 0)
+ return err;
+
+ /* Calculate number of leftover pages */
+ nr_pages = 0;
+ list_for_each(p, pagelist)
+ nr_pages++;
+ return nr_pages;
+}
+
+/*
+ * Migrate pages from one node to a target node.
+ * Returns error or the number of pages not migrated.
+ */
+int migrate_to_node(struct mm_struct *mm, int source, int dest, int flags)
+{
+ nodemask_t nmask;
+ LIST_HEAD(pagelist);
+ int err = 0;
+
+ nodes_clear(nmask);
+ node_set(source, nmask);
+
+ check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nmask,
+ flags | MPOL_MF_DISCONTIG_OK,
+ &pagelist);
+
+ if (!list_empty(&pagelist)) {
+
+ err = migrate_pages_to(&pagelist, NULL, dest);
+
+ if (!list_empty(&pagelist))
+ putback_lru_pages(&pagelist);
+
+ }
+ return err;
}
/*
- * For now migrate_pages simply swaps out the pages from nodes that are in
- * the source set but not in the target set. In the future, we would
- * want a function that moves pages between the two nodesets in such
- * a way as to preserve the physical layout as much as possible.
+ * Move pages between the two nodesets so as to preserve the physical
+ * layout as much as possible.
*
* Returns the number of page that could not be moved.
*/
@@ -598,22 +673,76 @@ int do_migrate_pages(struct mm_struct *m
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags)
{
LIST_HEAD(pagelist);
- int count = 0;
- nodemask_t nodes;
+ int busy = 0;
+ int err = 0;
+ nodemask_t tmp;
- nodes_andnot(nodes, *from_nodes, *to_nodes);
+ down_read(&mm->mmap_sem);
- down_read(&mm->mmap_sem);
- check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
- flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+/* Find a 'source' bit set in 'tmp' whose corresponding 'dest'
+ * bit in 'to' is not also set in 'tmp'. Clear the found 'source'
+ * bit in 'tmp', and return that <source, dest> pair for migration.
+ * The pair of nodemasks 'to' and 'from' define the map.
+ *
+ * If no pair of bits is found that way, fallback to picking some
+ * pair of 'source' and 'dest' bits that are not the same. If the
+ * 'source' and 'dest' bits are the same, this represents a node
+ * that will be migrating to itself, so no pages need move.
+ *
+ * If no bits are left in 'tmp', or if all remaining bits left
+ * in 'tmp' correspond to the same bit in 'to', return false
+ * (nothing left to migrate).
+ *
+ * This lets us pick a pair of nodes to migrate between, such that
+ * if possible the dest node is not already occupied by some other
+ * source node, minimizing the risk of overloading the memory on a
+ * node that would happen if we migrated incoming memory to a node
+ * before migrating outgoing memory from that same node.
+ *
+ * A single scan of tmp is sufficient. As we go, we remember the
+ * most recent <s, d> pair that moved (s != d). If we find a pair
+ * that not only moved, but what's better, moved to an empty slot
+ * (d is not set in tmp), then we break out then, with that pair.
+ * Otherwise when we finish scanning from_tmp, we at least have the
+ * most recent <s, d> pair that moved. If we get all the way through
+ * the scan of tmp without finding any node that moved, much less
+ * moved to an empty node, then there is nothing left worth migrating.
+ */
- if (!list_empty(&pagelist)) {
- count = swap_pages(&pagelist);
- putback_lru_pages(&pagelist);
+ tmp = *from_nodes;
+ while (!nodes_empty(tmp)) {
+ int s,d;
+ int source = -1;
+ int dest = 0;
+
+ for_each_node_mask(s, tmp) {
+
+ d = node_remap(s, *from_nodes, *to_nodes);
+ if (s == d)
+ continue;
+
+ source = s; /* Node moved. Memorize */
+ dest = d;
+
+ /* dest not in remaining from nodes? */
+ if (!node_isset(dest, tmp))
+ break;
+ }
+ if (source == -1)
+ break;
+
+ node_clear(source, tmp);
+ err = migrate_to_node(mm, source, dest, flags);
+ if (err > 0)
+ busy += err;
+ if (err < 0)
+ break;
}
up_read(&mm->mmap_sem);
- return count;
+ if (err < 0)
+ return err;
+ return busy;
}
long do_mbind(unsigned long start, unsigned long len,
@@ -673,8 +802,9 @@ long do_mbind(unsigned long start, unsig
int nr_failed = 0;
err = mbind_range(vma, start, end, new);
+
if (!list_empty(&pagelist))
- nr_failed = swap_pages(&pagelist);
+ nr_failed = migrate_pages_to(&pagelist, vma, -1);
if (!err && nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
SwapMig: fix double page unlock
The cleanup patches screwed things up slightly. In case of an unsuccessful
migration a page may be unlocked twice.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c 2005-11-28 11:06:15.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c 2005-11-28 11:23:45.000000000 -0800
@@ -751,6 +751,7 @@ redo:
list_move(&page->lru, moved);
continue;
}
+ goto next;
unlock_page:
unlock_page(page);
Add remove_from_swap
remove_from_swap() allows the restoration of the pte entries that existed
before page migration occurred for anonymous pages by walking the reverse
maps. This reduces swap use and establishes regular pte's without the need
for page faults.
V3->V4:
- Add new function remove_vma_swap in swapfile.c to encapsulate
the functionality needed instead of exporting unuse_vma.
- Add #ifdef CONFIG_MIGRATION
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc2-mm1.orig/include/linux/swap.h 2005-11-28 11:29:59.000000000 -0800
+++ linux-2.6.15-rc2-mm1/include/linux/swap.h 2005-11-28 11:39:43.000000000 -0800
@@ -264,6 +264,9 @@ extern int remove_exclusive_swap_page(st
struct backing_dev_info;
extern spinlock_t swap_lock;
+#ifdef CONFIG_MIGRATION
+extern int remove_vma_swap(struct vm_area_struct *vma, struct page *page);
+#endif
/* linux/mm/thrash.c */
extern struct mm_struct * swap_token_mm;
Index: linux-2.6.15-rc2-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/swapfile.c 2005-11-28 11:29:07.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/swapfile.c 2005-11-28 11:39:43.000000000 -0800
@@ -532,6 +532,16 @@ static int unuse_mm(struct mm_struct *mm
return 0;
}
+#ifdef CONFIG_MIGRATION
+int remove_vma_swap(struct vm_area_struct *vma, struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page) };
+
+ return unuse_vma(vma, entry, page);
+}
+#endif
+
+
/*
* Scan swap_map from current position to next entry still in use.
* Recycle to start on reaching the end, returning 0 when empty.
Index: linux-2.6.15-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/rmap.c 2005-11-23 09:10:04.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/rmap.c 2005-11-28 11:39:43.000000000 -0800
@@ -205,6 +205,28 @@ out:
return anon_vma;
}
+#ifdef CONFIG_MIGRATION
+/*
+ * Remove an anonymous page from swap, replacing the swap ptes
+ * with real ptes pointing to valid pages.
+ */
+void remove_from_swap(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+
+ if (!PageAnon(page))
+ return;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+ remove_vma_swap(vma, page);
+ spin_unlock(&anon_vma->lock);
+}
+#endif
+
/*
* At what user virtual address is page expected in vma?
*/
Index: linux-2.6.15-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.15-rc2-mm1.orig/include/linux/rmap.h 2005-11-19 19:25:03.000000000 -0800
+++ linux-2.6.15-rc2-mm1/include/linux/rmap.h 2005-11-28 11:39:43.000000000 -0800
@@ -91,6 +91,9 @@ static inline void page_dup_rmap(struct
*/
int page_referenced(struct page *, int is_locked, int ignore_token);
int try_to_unmap(struct page *);
+#ifdef CONFIG_MIGRATION
+void remove_from_swap(struct page *page);
+#endif
/*
* Called from mm/filemap_xip.c to unmap empty zero page
Index: linux-2.6.15-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c 2005-11-28 11:39:34.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c 2005-11-28 11:48:48.000000000 -0800
@@ -994,10 +994,11 @@ next:
list_move(&page->lru, failed);
nr_failed++;
} else {
- if (newpage)
+ if (newpage) {
/* Successful migration. Return new page to LRU */
+ remove_from_swap(page);
move_to_lru(newpage);
-
+ }
list_move(&page->lru, moved);
}
}
SwapMig: consolidate successful migration handling
Handle the moving of migrated pages to the "moved" list in a common branch
for rc == 0.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c 2005-11-28 11:23:45.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c 2005-11-28 11:25:14.000000000 -0800
@@ -702,11 +702,11 @@ redo:
list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();
- if (page_count(page) == 1) {
+ rc = 0;
+ if (page_count(page) == 1)
/* page was freed from under us. So we are done. */
- list_move(&page->lru, moved);
- continue;
- }
+ goto next;
+
/*
* Skip locked pages during the first two passes to give the
* functions holding the lock time to release the page. Later we
@@ -747,10 +747,7 @@ redo:
* Page is properly locked and writeback is complete.
* Try to migrate the page.
*/
- if (!swap_page(page)) {
- list_move(&page->lru, moved);
- continue;
- }
+ rc = swap_page(page);
goto next;
unlock_page:
@@ -761,9 +758,12 @@ next:
retry++;
else if (rc) {
- /* Permanent failure to migrate the page */
+ /* Permanent failure */
list_move(&page->lru, failed);
nr_failed++;
+ } else {
+ /* Success */
+ list_move(&page->lru, moved);
}
}
if (retry && pass++ < 10)
Add direct migration support with fall back to swap.
Direct migration support on top of the swap based page migration facility.
This allows the direct migration of anonymous pages and the migration of
file backed pages by dropping the associated buffers (requires writeout).
Fall back to swap out if necessary.
The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.
Note that an additional patch that defines the migrate_page() method
for filesystems is necessary in order to avoid writeback for anonymous
and file backed pages.
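The multi-pass behaviour of migrate_pages(), where busy pages are retried
and permanent failures are set aside, can be modelled roughly as below. This
is a user-space toy under stated assumptions: the toy_* names, the
"busy_passes" counter and the per-page state field (standing in for the
moved and failed lists) are all invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_state { TOY_PENDING, TOY_MOVED, TOY_FAILED };

struct toy_page {
	int busy_passes;	/* passes during which extra references remain */
	int unmovable;		/* models a permanent failure */
	enum toy_state state;	/* stands in for the from/moved/failed lists */
};

/* Try to move one page: 0 on success, -EAGAIN if busy, -EIO if hopeless. */
static int toy_try_move(struct toy_page *page)
{
	if (page->unmovable)
		return -EIO;
	if (page->busy_passes > 0) {
		page->busy_passes--;
		return -EAGAIN;
	}
	return 0;
}

/*
 * Make up to 'max_passes' passes over the pages. Returns the number of
 * pages that were not migrated (permanent failures plus still-busy pages).
 */
int toy_migrate_pages(struct toy_page *pages, size_t n, int max_passes)
{
	int pass, rc, retry = 0, nr_failed = 0;
	size_t i;

	for (pass = 0; pass < max_passes; pass++) {
		retry = 0;
		for (i = 0; i < n; i++) {
			if (pages[i].state != TOY_PENDING)
				continue;
			rc = toy_try_move(&pages[i]);
			if (rc == -EAGAIN) {
				retry++;		/* leave it for the next pass */
			} else if (rc) {
				pages[i].state = TOY_FAILED;
				nr_failed++;
			} else {
				pages[i].state = TOY_MOVED;
			}
		}
		if (!retry)
			break;
	}
	return nr_failed + retry;
}
```

Pages still marked TOY_PENDING afterwards correspond to the pages left on
the first list, which the caller may retry or put back on the LRU.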
V4->V5:
- Patch against 2.6.15-rc2-mm1 + double unlock fix + consolidation patch
V3->V4:
- Remove components already in the swap migration patch
V1->V2:
- Change migrate_pages() so that it can return pagelist for failed and
moved pages. No longer free the old pages but allow caller to dispose
of them.
- Unmap pages before changing reverse map under tree lock. Take
a write_lock instead of a read_lock.
- Add documentation
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc2-mm1.orig/include/linux/swap.h 2005-11-23 09:10:04.000000000 -0800
+++ linux-2.6.15-rc2-mm1/include/linux/swap.h 2005-11-28 11:29:59.000000000 -0800
@@ -179,6 +179,9 @@ extern int vm_swappiness;
#ifdef CONFIG_MIGRATION
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
+extern int migrate_page(struct page *, struct page *);
+extern int migrate_page_remove_references(struct page *, struct page *, int);
+extern void migrate_page_copy(struct page *, struct page *);
extern int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed);
#endif
Index: linux-2.6.15-rc2-mm1/Documentation/vm/page_migration
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-mm1/Documentation/vm/page_migration 2005-11-28 11:29:59.000000000 -0800
@@ -0,0 +1,106 @@
+Page migration
+--------------
+
+Page migration occurs in several steps. First a high-level
+description for those trying to use migrate_pages(), and then
+a low-level description of how the details work.
+
+
+A. Use of migrate_pages()
+-------------------------
+
+1. Remove pages from the LRU.
+
+ Lists of pages to be migrated are generated by scanning over
+ pages and moving them into lists. This is done by
+ calling isolate_lru_page() or __isolate_lru_page().
+ Calling isolate_lru_page increases the references to the page
+ so that it cannot vanish under us.
+
+2. Generate a list of newly allocated pages to move the contents
+ of the first list to.
+
+3. The migrate_pages() function is called which attempts
+ to do the migration. It returns the moved pages in the
+ list specified as the third parameter and the failed
+ migrations in the fourth parameter. The first parameter
+ will contain the pages that could still be retried.
+
+4. The leftover pages of various types are returned
+ to the LRU using putback_lru_pages() or otherwise
+ disposed of. The pages will still have the refcount as
+ increased by isolate_lru_page()!
+
+B. Operation of migrate_pages()
+--------------------------------
+
+migrate_pages does several passes over its list of pages. A page is moved
+if all references to a page are removable at the time.
+
+Steps:
+
+1. Lock the page to be migrated
+
+2. Ensure that writeback is complete.
+
+3. Make sure that the page has an assigned swap cache entry if
+ it is an anonymous page. The swap cache reference is necessary
+ to preserve the information contained in the page table maps.
+
+4. Prep the new page that we want to move to. It is locked
+ and set to not being uptodate so that all accesses to the new
+ page immediately block on the page lock while we are moving references.
+
+5. All the page table references to the page are either dropped (file backed)
+ or converted to swap references (anonymous pages). This should decrease the
+ reference count.
+
+6. The radix tree lock is taken
+
+7. The refcount of the page is examined and we back out if references remain
+
+8. The radix tree is checked and if it does not contain the pointer to this
+ page then we back out.
+
+9. The mapping is checked. If the mapping is gone then a truncate action may
+ be in progress and we back out.
+
+10. The new page is prepped with some settings from the old page so that accesses
+ to the new page will find the correct settings.
+
+11. The radix tree is changed to point to the new page.
+
+12. The reference count of the old page is dropped because the reference has now
+ been removed.
+
+13. The radix tree lock is dropped.
+
+14. The page contents are copied to the new page.
+
+15. The remaining page flags are copied to the new page.
+
+16. The old page flags are cleared to indicate that the page does
+ not use any information anymore.
+
+17. Queued up writeback on the new page is triggered.
+
+18. The locks are dropped from the old and new page.
+
+19. The swapcache reference is removed from the new page.
+
+20. The new page is moved to the LRU.
+
+This system is not without its problems. The check for the number of
+references while holding the radix tree lock may race with another function
+on another processor incrementing the reference counter for a page. In that
+case we will be in a situation where the page will linger until the reference
+count is dropped by that processor. There are no other references to the page
+though. The kernel functions would have taken a lock on the page if the page
+would have to be written to.
+
+The page is therefore likely just lingering for read purposes for a short while.
+The copy page code contains a couple of printks to detect the situation and help
+if there are any issues with the lingering pages.
+
+Christoph Lameter, November 7, 2005.
+
Index: linux-2.6.15-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c 2005-11-28 11:25:14.000000000 -0800
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c 2005-11-28 11:38:08.000000000 -0800
@@ -663,6 +663,185 @@ retry:
return -EAGAIN;
}
/*
+ * Page migration was first developed in the context of the memory hotplug
+ * project. The main authors of the migration code are:
+ *
+ * IWAMOTO Toshihiro <[email protected]>
+ * Hirokazu Takahashi <[email protected]>
+ * Dave Hansen <[email protected]>
+ * Christoph Lameter <[email protected]>
+ */
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+int migrate_page_remove_references(struct page *newpage, struct page *page, int nr_refs)
+{
+ struct address_space *mapping = page_mapping(page);
+ struct page **radix_pointer;
+ int i;
+
+ /*
+ * Avoid doing any of the following work if the page count
+ * indicates that the page is in use or truncate has removed
+ * the page.
+ */
+ if (!mapping || page_mapcount(page) + nr_refs != page_count(page))
+ return 1;
+
+ /*
+ * Establish swap ptes for anonymous pages or destroy pte
+ * maps for files.
+ *
+ * In order to reestablish file backed mappings the fault handlers
+ * will take the radix tree_lock which may then be used to stop
+ * processes from accessing this page until the new page is ready.
+ *
+ * A process accessing via a swap pte (an anonymous page) will take a
+ * page_lock on the old page which will block the process until the
+ * migration attempt is complete. At that time the PageSwapCache bit
+ * will be examined. If the page was migrated then the PageSwapCache
+ * bit will be clear and the operation to retrieve the page will be
+ * retried which will find the new page in the radix tree. Then a new
+ * direct mapping may be generated based on the radix tree contents.
+ *
+ * If the page was not migrated then the PageSwapCache bit
+ * is still set and the operation may continue.
+ */
+ for(i = 0; i < 10 && page_mapped(page); i++) {
+ int rc = try_to_unmap(page);
+
+ if (rc == SWAP_SUCCESS)
+ break;
+ /*
+ * If there are other runnable processes then running
+ * them may make it possible to unmap the page
+ */
+ schedule();
+ }
+
+ /*
+ * Give up if we were unable to remove all mappings.
+ */
+ if (page_mapcount(page))
+ return 1;
+
+ write_lock_irq(&mapping->tree_lock);
+
+ radix_pointer = (struct page **)radix_tree_lookup_slot(
+ &mapping->page_tree,
+ page_index(page));
+
+ if (!page->mapping ||
+ page_count(page) != nr_refs ||
+ *radix_pointer != page) {
+ write_unlock_irq(&mapping->tree_lock);
+ return 1;
+ }
+
+ /*
+ * The page count for the old page may be raised by other kernel
+ * components at this point since no lock exists to prevent another
+ * processor from increasing the page_count. If that happens then
+ * the page will continue to exist as long as the kernel component
+ * keeps the page count high. The page has no other references left
+ * and it is not being written to, otherwise the page lock would have
+ * been taken and we would not have attempted to migrate the page.
+ *
+ * Filesystems increase the page count while holding the tree_lock
+ * which provides synchronization with this code.
+ */
+
+ /*
+ * Certain minimal information about a page must be available
+ * in order for other subsystems to properly handle the page if they
+ * find it through the radix tree update before we are finished
+ * copying the page.
+ */
+ get_page(newpage);
+ newpage->index = page_index(page);
+ if (PageSwapCache(page)) {
+ SetPageSwapCache(newpage);
+ set_page_private(newpage, page_private(page));
+ } else
+ newpage->mapping = page->mapping;
+
+ *radix_pointer = newpage;
+ __put_page(page);
+ write_unlock_irq(&mapping->tree_lock);
+
+ return 0;
+}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+
+ /* Debug potential trouble with concurrent increases of page_count */
+ if (page_count(page) != 1)
+ printk(KERN_WARNING "precheck: copying %p->%p page count=%d\n",
+ page, newpage, page_count(page));
+
+ copy_highpage(newpage, page);
+
+ if (PageError(page))
+ SetPageError(newpage);
+ if (PageReferenced(page))
+ SetPageReferenced(newpage);
+ if (PageUptodate(page))
+ SetPageUptodate(newpage);
+ if (PageActive(page))
+ SetPageActive(newpage);
+ if (PageChecked(page))
+ SetPageChecked(newpage);
+ if (PageMappedToDisk(page))
+ SetPageMappedToDisk(newpage);
+
+ if (PageDirty(page)) {
+ clear_page_dirty_for_io(page);
+ set_page_dirty(newpage);
+ }
+
+ ClearPageSwapCache(page);
+ ClearPageActive(page);
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ page->mapping = NULL;
+
+ if (page_count(page) != 1)
+ printk(KERN_WARNING "postcheck: copying %p->%p page count=%d\n",
+ page, newpage, page_count(page));
+
+ /*
+ * If any waiters have accumulated on the new page then
+ * wake them up.
+ */
+ if (PageWriteback(newpage))
+ end_page_writeback(newpage);
+}
+
+/*
+ * Common logic to directly migrate a single page suitable for
+ * pages that do not use PagePrivate.
+ *
+ * Pages are locked upon entry and exit.
+ */
+int migrate_page(struct page *newpage, struct page *page)
+{
+ BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+
+ if (migrate_page_remove_references(newpage, page, 2))
+ return -EAGAIN;
+
+ migrate_page_copy(newpage, page);
+
+ return 0;
+}
+
+/*
* migrate_pages
*
* Two lists are passed to this function. The first list
@@ -675,11 +854,6 @@ retry:
* are movable anymore because t has become empty
* or no retryable pages exist anymore.
*
- * SIMPLIFIED VERSION: This implementation of migrate_pages
- * is only swapping out pages and never touches the second
- * list. The direct migration patchset
- * extends this function to avoid the use of swap.
- *
* Return: Number of pages not migrated when "to" ran empty.
*/
int migrate_pages(struct list_head *from, struct list_head *to,
@@ -700,6 +874,9 @@ redo:
retry = 0;
list_for_each_entry_safe(page, page2, from, lru) {
+ struct page *newpage = NULL;
+ struct address_space *mapping;
+
cond_resched();
rc = 0;
@@ -707,6 +884,9 @@ redo:
/* page was freed from under us. So we are done. */
goto next;
+ if (to && list_empty(to))
+ break;
+
/*
* Skip locked pages during the first two passes to give the
* functions holding the lock time to release the page. Later we
@@ -743,12 +923,64 @@ redo:
}
}
+ if (!to) {
+ rc = swap_page(page);
+ goto next;
+ }
+
+ newpage = lru_to_page(to);
+ lock_page(newpage);
+
/*
- * Page is properly locked and writeback is complete.
+ * Pages are properly locked and writeback is complete.
* Try to migrate the page.
*/
- rc = swap_page(page);
- goto next;
+ mapping = page_mapping(page);
+ if (!mapping)
+ goto unlock_both;
+
+ /*
+ * Trigger writeout if page is dirty
+ */
+ if (PageDirty(page)) {
+ switch (pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto unlock_both;
+
+ case PAGE_SUCCESS:
+ unlock_page(newpage);
+ goto next;
+
+ case PAGE_CLEAN:
+ ; /* try to migrate the page below */
+ }
+ }
+ /*
+ * If we have no buffer or can release the buffer
+ * then do a simple migration.
+ */
+ if (!page_has_buffers(page) ||
+ try_to_release_page(page, GFP_KERNEL)) {
+ rc = migrate_page(newpage, page);
+ goto unlock_both;
+ }
+
+ /*
+ * On early passes with mapped pages simply
+ * retry. There may be a lock held for some
+ * buffers that may go away. Later
+ * swap them out.
+ */
+ if (pass > 4) {
+ unlock_page(newpage);
+ newpage = NULL;
+ rc = swap_page(page);
+ goto next;
+ }
+
+unlock_both:
+ unlock_page(newpage);
unlock_page:
unlock_page(page);
@@ -762,7 +994,10 @@ next:
list_move(&page->lru, failed);
nr_failed++;
} else {
- /* Success */
+ if (newpage)
+ /* Successful migration. Return new page to LRU */
+ move_to_lru(newpage);
+
list_move(&page->lru, moved);
}
}
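For readers following the reference counting in migrate_page_remove_references() above, here is a small user-space model of the two checks. It is only a sketch and not kernel code: struct ref_page, precheck_ok() and locked_check_ok() are hypothetical names, and the plain integer fields stand in for page_count(), page_mapcount() and the radix tree slot. It assumes that every pte removed by try_to_unmap() also drops one page reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the fields the checks look at. */
struct ref_page {
	void *mapping;	/* non-NULL while the page belongs to a mapping */
	int count;	/* models page_count() */
	int mapcount;	/* models page_mapcount() */
};

/*
 * Models the first test: every reference must be either a pte mapping
 * or one of the nr_refs references the caller expects to exist.
 */
static inline bool precheck_ok(const struct ref_page *page, int nr_refs)
{
	return page->mapping != NULL &&
	       page->mapcount + nr_refs == page->count;
}

/*
 * Models the test made under the tree lock after unmapping: only the
 * expected references may remain and the radix tree slot must still
 * point at this page.
 */
static inline bool locked_check_ok(const struct ref_page *page, int nr_refs,
				   const struct ref_page *slot)
{
	assert(page->mapcount == 0);	/* the real code gives up earlier otherwise */
	return page->mapping != NULL &&
	       page->count == nr_refs &&
	       slot == page;
}
```

With nr_refs = 2, as migrate_page() passes it, a page mapped by two ptes passes the first test only when its count is 4, and passes the second test only once both ptes are gone and the count is back to 2.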
Hi,
Christoph Lameter wrote:
> +int migrate_page_remove_references(struct page *newpage, struct page *page, int nr_refs)
> +{
> + write_lock_irq(&mapping->tree_lock);
> +
> + radix_pointer = (struct page **)radix_tree_lookup_slot(
> + &mapping->page_tree,
> + page_index(page));
> +
> + if (!page->mapping ||
> + page_count(page) != nr_refs ||
> + *radix_pointer != page) {
> + write_unlock_irq(&mapping->tree_lock);
> + return 1;
> + }
I'm testing a memory hot-remove patch based on your patch.
I found a problem around the shmem,
but I'm not sure whether it is also a problem for migration or not.
Problem is:
1. A page of a shmem (tmpfs) generic file is in the page cache. Assume the page is dirty.
2. When it is passed to migrate_pages(), it reaches pageout() in the middle of migrate_pages().
3. pageout() calls shmem_writepage(), and the page turns into a swap-cache page.
At this point, page->mapping becomes NULL (see move_to_swap_cache())
4. pageout() returns PAGE_SUCCESS.
5. Finally, migrate_pages() goes to redo.
6. retry
7. Because swapper_space's a_ops->migratepage is not NULL,
the "Avoid write back hook" in patch 7/7 is used.
+ if (mapping->a_ops->migratepage) {
+ rc = mapping->a_ops->migratepage(newpage, page);
+ goto unlock_both;
+ }
a_ops->migratepage points to migrate_page() in mm/vmscan.c
8. migrate_page() tries to replace the radix tree entry in swapper_space.
9. Because page->mapping is NULL (because of 3), migrate_page_remove_references() fails.
For now I avoid the above situation with the following code in migrate_page_remove_references().
But I'm not sure whether this is a sane fix or not.
> + if ((!PageSwapCache(page) && !page->mapping) ||
> + page_count(page) != nr_refs ||
> + *radix_pointer != page) {
> + write_unlock_irq(&mapping->tree_lock);
> + return 1;
> + }
-- Kame
Christoph Lameter wrote:
> Add remove_from_swap
>
> remove_from_swap() allows the restoration of the pte entries that existed
> before page migration occurred for anonymous pages by walking the reverse
> maps. This reduces swap use and establishes regular pte's without the need
> for page faults.
>
in migrate_page_copy()
==
ClearPageSwapCache(page);
ClearPageActive(page);
ClearPagePrivate(page);
set_page_private(page, 0);
page->mapping = NULL;
==
page->mapping becomes NULL when migration succeeds.
> + if (newpage) {
> /* Successful migration. Return new page to LRU */
> + remove_from_swap(page);
> move_to_lru(newpage);
> -
On success, remove_from_swap(page) is called.
> +#ifdef CONFIG_MIGRATION
> +/*
> + * Remove an anonymous page from swap replacing the swap pte's
> + * through real pte's pointing to valid pages.
> + */
> +void remove_from_swap(struct page *page)
> +{
> + struct anon_vma *anon_vma;
> + struct vm_area_struct *vma;
> +
> + if (!PageAnon(page))
> + return;
> +
PageAnon(page) is always 0 here, because page->mapping was already cleared.
Would remove_from_swap(newpage) be the sane call?
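The effect can be reproduced with a small user-space model. This is a sketch under assumptions, not the kernel code: struct anon_page and the model_* helpers are hypothetical names, and MODEL_MAPPING_ANON is assumed to mirror PAGE_MAPPING_ANON, the flag kept in the low bit of page->mapping that PageAnon() tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON 1UL	/* assumed to mirror PAGE_MAPPING_ANON */

/* Hypothetical stand-in for struct page, reduced to the mapping word. */
struct anon_page {
	uintptr_t mapping;	/* pointer value with the anon flag in bit 0 */
};

/* Models storing an anon_vma pointer with the flag bit set. */
static inline void model_set_anon(struct anon_page *page, const void *anon_vma)
{
	uintptr_t v = (uintptr_t)anon_vma;

	assert((v & MODEL_MAPPING_ANON) == 0);	/* pointer must be aligned */
	page->mapping = v | MODEL_MAPPING_ANON;
}

/* Models PageAnon(): true if the low bit of the mapping word is set. */
static inline bool model_page_anon(const struct anon_page *page)
{
	return (page->mapping & MODEL_MAPPING_ANON) != 0;
}

/* Models the tail of migrate_page_copy(), which clears page->mapping. */
static inline void model_clear_old_page(struct anon_page *page)
{
	page->mapping = 0;
}
```

Once model_clear_old_page() has run, model_page_anon() returns false for the old page, so a function that starts with this test returns without doing any work.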
-- Kame
On Wed, 30 Nov 2005, KAMEZAWA Hiroyuki wrote:
> Would remove_from_swap(newpage) be the sane call?
Yes, I will change that. Thank you for discovering the problem.
On Wed, 30 Nov 2005, KAMEZAWA Hiroyuki wrote:
> I found a problem around the shmem,
The current page migration functions in mempolicy.c do not migrate shmem
vmas to be safe. In the future we surely would like to support migration
of shmem. I'd be glad if you could make sure that this works.
> Problem is:
> 1. A page of a shmem (tmpfs) generic file is in the page cache. Assume the
> page is dirty.
> 2. When it is passed to migrate_pages(), it reaches pageout() in the middle of
> migrate_pages().
> 3. pageout() calls shmem_writepage(), and the page turns into a swap-cache page.
> At this point, page->mapping becomes NULL (see move_to_swap_cache())
A swapcache page would have page->mapping pointing to swapper space.
move_to_swap_cache does not set page->mapping == NULL.
> 7. Because swapper_space's a_ops->migratepage is not NULL,
> the "Avoid write back hook" in patch 7/7 is used.
> + if (mapping->a_ops->migratepage) {
> + rc = mapping->a_ops->migratepage(newpage, page);
> + goto unlock_both;
> + }
> a_ops->migratepage points to migrate_page() in mm/vmscan.c
> 8. migrate_page() tries to replace the radix tree entry in swapper_space.
> 9. Because page->mapping is NULL (because of 3),
> migrate_page_remove_references() fails.
If page->mapping were NULL then migrate_page() could not
have been called. The mapping is used to obtain the address of the
function to call.
Christoph Lameter wrote:
> The current page migration functions in mempolicy.c do not migrate shmem
> vmas to be safe. In the future we surely would like to support migration
> of shmem. I'd be glad if you could make sure that this works.
>
Okay, shmem is not a problem now.
>
>>Problem is:
>>1. A page of a shmem (tmpfs) generic file is in the page cache. Assume the
>>page is dirty.
>>2. When it is passed to migrate_pages(), it reaches pageout() in the middle of
>>migrate_pages().
>>3. pageout() calls shmem_writepage(), and the page turns into a swap-cache page.
>> At this point, page->mapping becomes NULL (see move_to_swap_cache())
>
>
> A swapcache page would have page->mapping pointing to swapper space.
> move_to_swap_cache does not set page->mapping == NULL.
>
int move_to_swap_cache(struct page *page, swp_entry_t entry)
{
int err = __add_to_swap_cache(page, entry, GFP_ATOMIC);
if (!err) {
remove_from_page_cache(page);<------------------------this
page_cache_release(page); /* pagecache ref */
if (!swap_duplicate(entry))
BUG();
SetPageDirty(page);
INC_CACHE_INFO(add_total);
} else if (err == -EEXIST)
INC_CACHE_INFO(exist_race);
return err;
}
remove_from_page_cache(page) sets page->mapping == NULL.
>
> If page->mapping were NULL then migrate_page() could not
> have been called. The mapping is used to obtain the address of the
> function to call.
The check you are referring to is here.
==
/*
* Pages are properly locked and writeback is complete.
* Try to migrate the page.
*/
mapping = page_mapping(page);
if (!mapping) <-------------------------------------------this check.
goto unlock_both;
if (mapping->a_ops->migratepage) {
rc = mapping->a_ops->migratepage(newpage, page);
goto unlock_both;
}
==
But, see page_mapping() .....
==
static inline struct address_space *page_mapping(struct page *page)
{
struct address_space *mapping = page->mapping;
if (unlikely(PageSwapCache(page)))
mapping = &swapper_space;
else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
mapping = NULL;
return mapping;
}
==
Even if page->mapping == NULL, page_mapping() can return &swapper_space if PageSwapCache()
is true. (Note: a shmem page here is not page-cache, not anon, but swap-cache)
I'm now considering whether adding a_ops->migratepage() to shmem is the sane way...
But migration doesn't manage shmem, so this is just memory hot-remove's problem.
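To make the state concrete, here is a small user-space model of it. It is a sketch, not kernel code: struct shm_page, struct model_space and the model_* helpers are hypothetical names, and the swapcache field stands in for the PageSwapCache flag. It assumes, as discussed above, that __add_to_swap_cache() sets the flag and remove_from_page_cache() clears page->mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON 1UL	/* assumed to mirror PAGE_MAPPING_ANON */

/* Hypothetical stand-ins for struct address_space and swapper_space. */
struct model_space { int unused; };
static struct model_space model_swapper_space;

/* Hypothetical stand-in for struct page. */
struct shm_page {
	struct model_space *mapping;
	bool swapcache;		/* models the PageSwapCache flag */
};

/* Same decision order as the page_mapping() quoted above. */
static inline struct model_space *model_page_mapping(const struct shm_page *page)
{
	if (page->swapcache)
		return &model_swapper_space;
	if ((uintptr_t)page->mapping & MODEL_MAPPING_ANON)
		return NULL;
	return page->mapping;
}

/* Models what move_to_swap_cache() does to a shmem page on success. */
static inline void model_move_to_swap_cache(struct shm_page *page)
{
	assert(!page->swapcache);
	page->swapcache = true;		/* effect of __add_to_swap_cache() */
	page->mapping = NULL;		/* effect of remove_from_page_cache() */
}
```

After model_move_to_swap_cache(), the raw mapping field is NULL while model_page_mapping() returns the swapper space, so a test on page->mapping and a test on page_mapping(page) disagree for such a page.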
-- Kame
On Thu, 1 Dec 2005, Kamezawa Hiroyuki wrote:
> Christoph Lameter wrote:
> > The current page migration functions in mempolicy.c do not migrate shmem
> > vmas to be safe. In the future we surely would like to support migration of
> > shmem. I'd be glad if you could make sure that this works.
> >
> Okay, shmem is not a problem now.
So I could allow VM_SHM migration? Patch attached.
> remove_from_page_cache(page) sets page->mapping == NULL.
Correct. So shmem_writepage actually removes a page. Hmm...
> Even if page->mapping == NULL, page_mapping() can return &swapper_space if
> PageSwapCache()
> is true. (Note: a shmem page here is not page-cache, not anon, but swap-cache)
>
> I'm now considering whether adding a_ops->migratepage() to shmem is the sane way...
> But migration doesn't manage shmem, so this is just memory hot-remove's
> problem.
Do you think this patch would work? It allows migration of VM_SHM vmas and
switches from checking page->mapping to page_mapping() in
migrate_page_remove_references.
Index: linux-2.6.15-rc3-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/mempolicy.c 2005-11-30 09:53:31.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/mempolicy.c 2005-11-30 09:53:32.000000000 -0800
@@ -294,7 +294,7 @@ static inline int check_pgd_range(struct
static inline int vma_migratable(struct vm_area_struct *vma)
{
if (vma->vm_flags & (
- VM_LOCKED|VM_IO|VM_RESERVED|VM_PFNMAP|VM_DENYWRITE|VM_SHM))
+ VM_LOCKED|VM_IO|VM_RESERVED|VM_PFNMAP|VM_DENYWRITE))
return 0;
return 1;
}
Index: linux-2.6.15-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/vmscan.c 2005-11-30 09:53:31.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/vmscan.c 2005-11-30 09:53:41.000000000 -0800
@@ -742,7 +742,7 @@ int migrate_page_remove_references(struc
&mapping->page_tree,
page_index(page));
- if (!page->mapping ||
+ if (!page_mapping(page) ||
page_count(page) != nr_refs ||
*radix_pointer != page) {
write_unlock_irq(&mapping->tree_lock);
Christoph Lameter wrote:
>
> Do you think this patch would work? It allows migration of VM_SHM vmas and
> switches from checking page->mapping to page_mapping() in
> migrate_page_remove_references.
>
I think this would work.
I'll try your new set and this patch.
Thanks,
-- Kame