Direct page migration avoids the use of swap space for migration. This
patchset does not add any new APIs; it only improves the way page
migration works.
Benefits over swap migration:
1. It is faster because the page does not need to be written to swap space.
2. It does not use swap space and therefore there is no danger of running
out of swap space.
3. The need to write back a dirty page before migration is avoided through
a file system specific method making page migration even faster. We
fall back to writeout if the filesystem does not have such a method
defined.
4. Direct migration allows the preservation of the relative location of a page
within a set of nodes. This means that special placement of pages
for a performance critical application can be preserved when migrating.
Swap migration will rearrange the pages as they are swapped in which
may destroy the prior arrangement.
Many of the ideas for this code were originally developed in the memory
hotplug project and we hope that this code also will allow the hotplug
project to build on this patch in order to get to their goals. We also
would like to be able to move bad memory at SGI. IA64 arch specific code
to handle bad memory exists in 2.6.15 but that code is currently not able
to migrate pages.
The patchset consists of five patches (only the first two are necessary to
have basic direct migration support):
1. SwapCache patch
SwapCache pages may have changed their type after lock_page() if the
page was migrated. Check for this and retry lookup if the page is no
longer a SwapCache page.
2. migrate_pages()
Basic direct migration with fallback to swap if all other attempts
fail.
3. remove_from_swap()
Page migration installs swap ptes for anonymous pages in order to
preserve the information contained in the page tables. This patch
removes the swap ptes after migration and replaces them with regular
ptes.
4. upgrade of MPOL_MF_MOVE and sys_migrate_pages()
Add logic to mm/mempolicy.c to allow the policy layer to control
direct page migration. Thanks to Paul Jackson for the iterative
logic to move between sets of nodes.
5. buffer_migrate_pages() patch
Allow migration without writing back dirty pages. Add filesystem dependent
migration support for ext2/ext3 and xfs. Use swapper space to set up a
method to migrate anonymous pages without writeback.
Credits (also in mm/vmscan.c):
The idea for this scheme of page migration was first developed in the context
of the memory hotplug project. The main authors of the migration code from
the memory hotplug project are:
IWAMOTO Toshihiro <[email protected]>
Hirokazu Takahashi <[email protected]>
Dave Hansen <[email protected]>
Changes V5->V6:
- Patchset against 2.6.15-rc3-mm1
- Remove checks for page count increases while migrating after Andrew assured
me that this cannot happen. Revise documentation to reflect that. If this is
the case then we will have no need to include the unwind code from the
hotplug project in the future.
- Fix remove_from_swap() being called with page instead of newpage.
Changes V4->V5:
- Patchset against 2.6.15-rc2-mm1
- Update policy layer patch to use the generic check_range in 2.6.15-rc2-mm1.
- Remove try_to_unmap patch since VM_RESERVED vanished under us and therefore
there is no longer any point in distinguishing between permanent and
transitional failures.
Changes V3->V4:
- Patchset against 2.6.15-rc1-mm2 + two swap migration fixes posted today.
- Remove what is already in 2.6.14-rc1-mm2 which results in a significant
cleanup of the code.
Changes V2->V3:
- Patchset against 2.6.14-mm2
- Fix single processor build and builds without CONFIG_MIGRATION
- export symbols for filesystems that are modules and for
modules using migrate_pages().
- Paul Jackson's cpuset migration support is in 2.6.14-mm2 so
this patchset can be easily applied to -mm2 to get from swap
based to direct page migration.
Changes V1->V2:
- Call node_remap with the right parameters in do_migrate_pages().
- Take radix tree lock while examining page count to avoid races with
find_get_page() and various *_get_pages based on it.
- Convert direct ptes to swap ptes before radix tree update to avoid
more races.
- Fix problem if CONFIG_MIGRATION is off for buffer_migrate_page
- Add documentation about page migration
- Change migrate_pages() api so that the caller can decide what
to do about the migrated pages (badmem handling and hotplug
have to remove those pages for good).
- Drop config patch (already in mm)
- Add try_to_unmap patch
- Patchset now against 2.6.14-mm1 without requiring additional patches.
Add remove_from_swap
remove_from_swap() allows the restoration of the pte entries that existed
before page migration occurred for anonymous pages by walking the reverse
maps. This reduces swap use and establishes regular pte's without the need
for page faults.
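As a rough illustration of the reverse-map walk described above, the
following user-space sketch restores swap ptes to regular ptes. It is only
a model: the toy_* types and functions are hypothetical and are not kernel
interfaces. The real code goes through unuse_vma() and operates on page
tables under the anon_vma lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pte model: either a present mapping or a swap entry. */
struct toy_pte {
	bool present;
	unsigned long val;	/* page frame if present, else swap entry */
};

/* Hypothetical vma model: one pte of interest, chained like anon_vma->head. */
struct toy_vma {
	struct toy_pte pte;
	struct toy_vma *next;
};

/*
 * Walk every vma on the list and turn swap ptes that refer to 'entry'
 * back into regular ptes pointing at 'new_pfn'.
 * Returns the number of ptes restored.
 */
static int toy_remove_from_swap(struct toy_vma *head, unsigned long entry,
				unsigned long new_pfn)
{
	struct toy_vma *vma;
	int restored = 0;

	assert(head != NULL);
	for (vma = head; vma; vma = vma->next) {
		if (vma->pte.present || vma->pte.val != entry)
			continue;	/* not a swap pte for this page */
		vma->pte.present = true;
		vma->pte.val = new_pfn;
		restored++;
	}
	return restored;
}
```

In this model a vma whose pte holds a different swap entry is left alone,
so only the mappings of the migrated page are re-established without a
page fault.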
V5->V6:
- Somehow V5 did a remove_from_swap for the old page. Changed to new
page
V3->V4:
- Add new function remove_vma_swap in swapfile.c to encapsulate
the functionality needed instead of exporting unuse_vma.
- Add #ifdef CONFIG_MIGRATION
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc3-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc3-mm1.orig/include/linux/swap.h 2005-11-30 08:46:48.000000000 -0800
+++ linux-2.6.15-rc3-mm1/include/linux/swap.h 2005-11-30 08:46:52.000000000 -0800
@@ -263,6 +263,9 @@ extern int remove_exclusive_swap_page(st
struct backing_dev_info;
extern spinlock_t swap_lock;
+#ifdef CONFIG_MIGRATION
+extern int remove_vma_swap(struct vm_area_struct *vma, struct page *page);
+#endif
/* linux/mm/thrash.c */
extern struct mm_struct * swap_token_mm;
Index: linux-2.6.15-rc3-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/swapfile.c 2005-11-30 08:46:44.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/swapfile.c 2005-11-30 08:46:52.000000000 -0800
@@ -532,6 +532,16 @@ static int unuse_mm(struct mm_struct *mm
return 0;
}
+#ifdef CONFIG_MIGRATION
+int remove_vma_swap(struct vm_area_struct *vma, struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page) };
+
+ return unuse_vma(vma, entry, page);
+}
+#endif
+
+
/*
* Scan swap_map from current position to next entry still in use.
* Recycle to start on reaching the end, returning 0 when empty.
Index: linux-2.6.15-rc3-mm1/mm/rmap.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/rmap.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/rmap.c 2005-11-30 08:46:52.000000000 -0800
@@ -205,6 +205,30 @@ out:
return anon_vma;
}
+#ifdef CONFIG_MIGRATION
+/*
+ * Remove an anonymous page from swap by replacing the swap ptes
+ * with real ptes pointing to valid pages.
+ */
+void remove_from_swap(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+
+ if (!PageAnon(page))
+ return;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+ remove_vma_swap(vma, page);
+
+ spin_unlock(&anon_vma->lock); /* taken by page_lock_anon_vma() */
+}
+#endif
+
/*
* At what user virtual address is page expected in vma?
*/
Index: linux-2.6.15-rc3-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.15-rc3-mm1.orig/include/linux/rmap.h 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3-mm1/include/linux/rmap.h 2005-11-30 08:46:52.000000000 -0800
@@ -91,6 +91,9 @@ static inline void page_dup_rmap(struct
*/
int page_referenced(struct page *, int is_locked);
int try_to_unmap(struct page *);
+#ifdef CONFIG_MIGRATION
+void remove_from_swap(struct page *page);
+#endif
/*
* Called from mm/filemap_xip.c to unmap empty zero page
Index: linux-2.6.15-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/vmscan.c 2005-11-30 08:46:48.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/vmscan.c 2005-11-30 08:46:52.000000000 -0800
@@ -972,10 +972,11 @@ next:
list_move(&page->lru, failed);
nr_failed++;
} else {
- if (newpage)
+ if (newpage) {
/* Successful migration. Return new page to LRU */
+ remove_from_swap(newpage);
move_to_lru(newpage);
-
+ }
list_move(&page->lru, moved);
}
}
Migrate a page with buffers without requiring writeback
This introduces a new address space operation migratepage() that
may be used by a filesystem to implement its own version of page migration.
A version is provided that migrates buffers attached to pages. Some
filesystems (ext2, ext3, xfs) are modified to utilize this feature.
The swapper address space operations are modified so that a regular
migrate_page() will occur for anonymous pages without writeback
(migrate_pages forces every anonymous page to have a swap entry).
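The dispatch described above can be sketched in user space as follows.
This is a minimal model under stated assumptions, not the kernel code: the
toy_* names are hypothetical, the "page" is just a small buffer with a
dirty flag, and the fallback path is reduced to "write out if dirty, then
copy".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical stand-in for struct page. */
struct toy_page {
	bool dirty;
	char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for address_space_operations. */
struct toy_aops {
	int (*writepage)(struct toy_page *page);
	int (*migratepage)(struct toy_page *newpage, struct toy_page *page);
};

static int toy_writeouts;	/* counts simulated writebacks */

static int toy_writepage(struct toy_page *page)
{
	toy_writeouts++;
	page->dirty = false;
	return 0;
}

/* Generic method: copy contents and carry the dirty state over. */
static int toy_migrate_page(struct toy_page *newpage, struct toy_page *page)
{
	memcpy(newpage->data, page->data, TOY_PAGE_SIZE);
	newpage->dirty = page->dirty;
	page->dirty = false;
	return 0;
}

/* Use the filesystem method if there is one, else fall back to writeout. */
static int toy_migrate_one(const struct toy_aops *aops,
			   struct toy_page *newpage, struct toy_page *page)
{
	assert(aops != NULL);
	if (aops->migratepage)
		return aops->migratepage(newpage, page);

	if (page->dirty) {
		int rc = aops->writepage(page);

		if (rc)
			return rc;
	}
	return toy_migrate_page(newpage, page);
}
```

With a migratepage method installed, a dirty page moves without any
writeout and stays dirty at its new location. Without one, the model
writes the page back first, which is the slower path this patch tries to
avoid.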
V2->V3:
- export functions for filesystems that are modules and for modules that
perform migration by calling migrate_pages().
- Fix macro name clash. Fix build on UP and systems without CONFIG_MIGRATION
V1->V2:
- Fix CONFIG_MIGRATION handling
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc3-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.15-rc3-mm1.orig/include/linux/fs.h 2005-11-30 08:46:39.000000000 -0800
+++ linux-2.6.15-rc3-mm1/include/linux/fs.h 2005-11-30 08:47:00.000000000 -0800
@@ -367,6 +367,8 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
+ /* migrate the contents of a page to the specified target */
+ int (*migratepage) (struct page *, struct page *);
};
struct backing_dev_info;
@@ -1722,6 +1724,12 @@ extern void simple_release_fs(struct vfs
extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
+#ifdef CONFIG_MIGRATION
+extern int buffer_migrate_page(struct page *, struct page *);
+#else
+#define buffer_migrate_page NULL
+#endif
+
extern int inode_change_ok(struct inode *, struct iattr *);
extern int __must_check inode_setattr(struct inode *, struct iattr *);
Index: linux-2.6.15-rc3-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/swap_state.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/swap_state.c 2005-11-30 08:47:00.000000000 -0800
@@ -27,6 +27,7 @@ static struct address_space_operations s
.writepage = swap_writepage,
.sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
+ .migratepage = migrate_page,
};
static struct backing_dev_info swap_backing_dev_info = {
Index: linux-2.6.15-rc3-mm1/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/fs/xfs/linux-2.6/xfs_aops.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3-mm1/fs/xfs/linux-2.6/xfs_aops.c 2005-11-30 08:47:00.000000000 -0800
@@ -1347,4 +1347,5 @@ struct address_space_operations linvfs_a
.commit_write = generic_commit_write,
.bmap = linvfs_bmap,
.direct_IO = linvfs_direct_IO,
+ .migratepage = buffer_migrate_page,
};
Index: linux-2.6.15-rc3-mm1/fs/buffer.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/fs/buffer.c 2005-11-30 08:46:38.000000000 -0800
+++ linux-2.6.15-rc3-mm1/fs/buffer.c 2005-11-30 08:47:00.000000000 -0800
@@ -3051,6 +3051,71 @@ asmlinkage long sys_bdflush(int func, lo
}
/*
+ * Migration function for pages with buffers. This function can only be used
+ * if the underlying filesystem guarantees that no other references to "page"
+ * exist.
+ */
+#ifdef CONFIG_MIGRATION
+int buffer_migrate_page(struct page *newpage, struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct buffer_head *bh, *head;
+
+ if (!mapping)
+ return -EAGAIN;
+
+ if (!page_has_buffers(page))
+ return migrate_page(newpage, page);
+
+ head = page_buffers(page);
+
+ if (migrate_page_remove_references(newpage, page, 3))
+ return -EAGAIN;
+
+ spin_lock(&mapping->private_lock);
+
+ bh = head;
+ do {
+ get_bh(bh);
+ lock_buffer(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ ClearPagePrivate(page);
+ set_page_private(newpage, page_private(page));
+ set_page_private(page, 0);
+ put_page(page);
+ get_page(newpage);
+
+ bh = head;
+ do {
+ set_bh_page(bh, newpage, bh_offset(bh));
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ SetPagePrivate(newpage);
+ spin_unlock(&mapping->private_lock);
+
+ migrate_page_copy(newpage, page);
+
+ spin_lock(&mapping->private_lock);
+ bh = head;
+ do {
+ unlock_buffer(bh);
+ put_bh(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+ spin_unlock(&mapping->private_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL(buffer_migrate_page);
+#endif
+
+/*
* Buffer-head allocation
*/
static kmem_cache_t *bh_cachep;
Index: linux-2.6.15-rc3-mm1/fs/ext3/inode.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/fs/ext3/inode.c 2005-11-30 08:46:38.000000000 -0800
+++ linux-2.6.15-rc3-mm1/fs/ext3/inode.c 2005-11-30 08:47:00.000000000 -0800
@@ -1564,6 +1564,7 @@ static struct address_space_operations e
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
};
static struct address_space_operations ext3_writeback_aops = {
@@ -1577,6 +1578,7 @@ static struct address_space_operations e
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
};
static struct address_space_operations ext3_journalled_aops = {
Index: linux-2.6.15-rc3-mm1/fs/ext2/inode.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/fs/ext2/inode.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3-mm1/fs/ext2/inode.c 2005-11-30 08:47:00.000000000 -0800
@@ -706,6 +706,7 @@ struct address_space_operations ext2_aop
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
+ .migratepage = buffer_migrate_page,
};
struct address_space_operations ext2_aops_xip = {
@@ -723,6 +724,7 @@ struct address_space_operations ext2_nob
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
+ .migratepage = buffer_migrate_page,
};
/*
Index: linux-2.6.15-rc3-mm1/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/fs/xfs/linux-2.6/xfs_buf.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3-mm1/fs/xfs/linux-2.6/xfs_buf.c 2005-11-30 08:47:00.000000000 -0800
@@ -1568,6 +1568,7 @@ xfs_mapping_buftarg(
struct address_space *mapping;
static struct address_space_operations mapping_aops = {
.sync_page = block_sync_page,
+ .migratepage = fail_migrate_page,
};
inode = new_inode(bdev->bd_inode->i_sb);
Index: linux-2.6.15-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/vmscan.c 2005-11-30 08:46:52.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/vmscan.c 2005-11-30 08:47:00.000000000 -0800
@@ -618,6 +618,15 @@ int putback_lru_pages(struct list_head *
}
/*
+ * Non migratable page
+ */
+int fail_migrate_page(struct page *newpage, struct page *page)
+{
+ return -EIO;
+}
+EXPORT_SYMBOL(fail_migrate_page);
+
+/*
* swapout a single page
* page is locked upon entry, unlocked on exit
*/
@@ -762,6 +771,8 @@ int migrate_page_remove_references(struc
return 0;
}
+EXPORT_SYMBOL(swap_page);
+EXPORT_SYMBOL(migrate_page_remove_references);
/*
* Copy the page to its new location
@@ -801,6 +812,7 @@ void migrate_page_copy(struct page *newp
if (PageWriteback(newpage))
end_page_writeback(newpage);
}
+EXPORT_SYMBOL(migrate_page_copy);
/*
* Common logic to directly migrate a single page suitable for
@@ -819,6 +831,7 @@ int migrate_page(struct page *newpage, s
return 0;
}
+EXPORT_SYMBOL(migrate_page);
/*
* migrate_pages
@@ -918,6 +931,11 @@ redo:
if (!mapping)
goto unlock_both;
+ if (mapping->a_ops->migratepage) {
+ rc = mapping->a_ops->migratepage(newpage, page);
+ goto unlock_both;
+ }
+
/*
* Trigger writeout if page is dirty
*/
@@ -1030,6 +1048,7 @@ redo:
}
return rc;
}
+EXPORT_SYMBOL(migrate_pages);
#endif
/*
Index: linux-2.6.15-rc3-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc3-mm1.orig/include/linux/swap.h 2005-11-30 08:46:52.000000000 -0800
+++ linux-2.6.15-rc3-mm1/include/linux/swap.h 2005-11-30 08:47:00.000000000 -0800
@@ -183,6 +183,11 @@ extern int migrate_page_remove_reference
extern void migrate_page_copy(struct page *, struct page *);
extern int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed);
+extern int fail_migrate_page(struct page *, struct page *);
+#else
+/* Possible settings for the migratepage() method in address_space_operations */
+#define migrate_page NULL
+#define fail_migrate_page NULL
#endif
#ifdef CONFIG_MMU
Check for PageSwapCache after looking up and locking a swap page.
The page migration code may change a swap pte to point to a different page
under lock_page().
If that happens then the vm must retry the lookup operation in the swap
space to find the correct page number. There are a couple of locations
in the VM where a lock_page() is done on a swap page. In these locations
we need to check afterwards if the page was migrated. If the page was migrated
then the old page that was looked up before was freed and no longer has the
PageSwapCache bit set.
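The lookup, lock, recheck and retry pattern can be modelled in user space
as shown below. This is only a sketch: the toy_* names are hypothetical,
the page lock is a plain flag, and the racing migration is injected
explicitly between lookup and lock instead of running concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct toy_page {
	bool swap_cache;	/* models PageSwapCache() */
	bool locked;		/* models the page lock */
};

/* One swap cache slot: the page currently backing a swap entry. */
struct toy_slot {
	struct toy_page *page;
};

/* Models a migration that completes while the caller waits for the lock. */
static void toy_migrate(struct toy_slot *slot, struct toy_page *newpage)
{
	slot->page->swap_cache = false;	/* old page leaves the swap cache */
	newpage->swap_cache = true;
	slot->page = newpage;
}

/*
 * Look up and lock the page behind a slot. If 'racer' is non-NULL it is
 * migrated in once, between lookup and lock, to model the race.
 * Returns the locked page; *retries counts how often the lookup was redone.
 */
static struct toy_page *toy_lookup_locked(struct toy_slot *slot,
					  struct toy_page *racer, int *retries)
{
	struct toy_page *page;

	*retries = 0;
again:
	page = slot->page;		/* models lookup_swap_cache() */
	if (racer) {
		toy_migrate(slot, racer);
		racer = NULL;
	}
	assert(!page->locked);
	page->locked = true;		/* models lock_page() */
	if (!page->swap_cache) {
		/* Page migration has occurred: drop the stale page, retry. */
		page->locked = false;	/* models unlock_page() */
		(*retries)++;
		goto again;
	}
	return page;
}
```

When a migration is injected, the first pass locks the stale page, sees
that it has left the swap cache, and the second pass finds the new page.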
Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.15-rc3-mm1/mm/memory.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/memory.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/memory.c 2005-11-30 08:46:44.000000000 -0800
@@ -1882,6 +1882,7 @@ static int do_swap_page(struct mm_struct
goto out;
entry = pte_to_swp_entry(orig_pte);
+again:
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1905,6 +1906,12 @@ static int do_swap_page(struct mm_struct
mark_page_accessed(page);
lock_page(page);
+ if (!PageSwapCache(page)) {
+ /* Page migration has occurred */
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
/*
* Back out if somebody else already faulted in this pte.
Index: linux-2.6.15-rc3-mm1/mm/shmem.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/shmem.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/shmem.c 2005-11-30 08:46:44.000000000 -0800
@@ -1028,6 +1028,14 @@ repeat:
page_cache_release(swappage);
goto repeat;
}
+ if (!PageSwapCache(swappage)) {
+ /* Page migration has occurred */
+ shmem_swp_unmap(entry);
+ spin_unlock(&info->lock);
+ unlock_page(swappage);
+ page_cache_release(swappage);
+ goto repeat;
+ }
if (PageWriteback(swappage)) {
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
Index: linux-2.6.15-rc3-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/swapfile.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/swapfile.c 2005-11-30 08:46:44.000000000 -0800
@@ -624,6 +624,7 @@ static int try_to_unuse(unsigned int typ
*/
swap_map = &si->swap_map[i];
entry = swp_entry(type, i);
+again:
page = read_swap_cache_async(entry, NULL, 0);
if (!page) {
/*
@@ -658,6 +659,12 @@ static int try_to_unuse(unsigned int typ
wait_on_page_locked(page);
wait_on_page_writeback(page);
lock_page(page);
+ if (!PageSwapCache(page)) {
+ /* Page migration has occurred */
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
wait_on_page_writeback(page);
/*
Modify policy layer to support direct page migration
- Add migrate_pages_to() allowing the migration of a list of pages to
a specified node or to a vma with a specific allocation policy in sets
of MIGRATE_CHUNK_SIZE pages
- Modify do_migrate_pages() to do a staged move of pages from the
source nodes to the target nodes.
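The staged move can be illustrated with a small user-space model of the
pair selection. This is a sketch under assumptions: nodemasks are plain
unsigned ints, the toy_* names are hypothetical, and toy_node_remap()
assumes that node_remap() maps the k-th node of the source set to the k-th
node of the target set, wrapping around if the target set is smaller.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 32

struct toy_move {
	int source;
	int dest;
};

/* Number of set bits in mask. */
static int toy_weight(unsigned int mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Position of the n-th (0-based) set bit in mask, or -1 if there is none. */
static int toy_nth_bit(unsigned int mask, int n)
{
	for (int bit = 0; bit < TOY_MAX_NODES; bit++) {
		if ((mask & (1u << bit)) && n-- == 0)
			return bit;
	}
	return -1;
}

/* Assumed remap rule: k-th node of 'from' goes to k-th node of 'to'. */
static int toy_node_remap(int node, unsigned int from, unsigned int to)
{
	int w = toy_weight(to);

	if (!(from & (1u << node)) || w == 0)
		return node;
	return toy_nth_bit(to, toy_weight(from & ((1u << node) - 1)) % w);
}

/*
 * Record the <source, dest> pairs in the order they would be migrated.
 * A destination that is not itself a pending source is preferred.
 * Returns the number of pairs written to moves[].
 */
static int toy_plan_moves(unsigned int from, unsigned int to,
			  struct toy_move *moves, int max)
{
	unsigned int tmp = from;
	int n = 0;

	assert(moves != NULL);
	while (tmp && n < max) {
		int source = -1;
		int dest = 0;

		for (int s = 0; s < TOY_MAX_NODES; s++) {
			int d;

			if (!(tmp & (1u << s)))
				continue;
			d = toy_node_remap(s, from, to);
			if (s == d)
				continue;
			source = s;	/* node moved, remember it */
			dest = d;
			if (!(tmp & (1u << dest)))
				break;	/* dest is not a pending source */
		}
		if (source == -1)
			break;
		tmp &= ~(1u << source);
		moves[n].source = source;
		moves[n].dest = dest;
		n++;
	}
	return n;
}
```

For a move from nodes {0,1} to nodes {1,2}, the model first empties node 1
into node 2 and only then moves node 0 into node 1, so node 1 never has to
hold both its old and its incoming pages.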
V3->V4: Fixed up to be based on the swap migration code in 2.6.15-rc1-mm2.
V1->V2:
- Migrate processes in chunks of MIGRATE_CHUNK_SIZE
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc3-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/mempolicy.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/mempolicy.c 2005-11-30 08:46:55.000000000 -0800
@@ -95,6 +95,9 @@
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
+/* The number of pages to migrate per call to migrate_pages() */
+#define MIGRATE_CHUNK_SIZE 256
+
static kmem_cache_t *policy_cache;
static kmem_cache_t *sn_cache;
@@ -566,24 +569,96 @@ static void migrate_page_add(struct vm_a
}
}
-static int swap_pages(struct list_head *pagelist)
+/*
+ * Migrate the list 'pagelist' of pages to a certain destination.
+ *
+ * Specify destination with either non-NULL vma or dest_node >= 0
+ * Return the number of pages not migrated or error code
+ */
+static int migrate_pages_to(struct list_head *pagelist,
+ struct vm_area_struct *vma, int dest)
{
+ LIST_HEAD(newlist);
LIST_HEAD(moved);
LIST_HEAD(failed);
- int n;
+ int err = 0;
+ int nr_pages;
+ struct page *page;
+ struct list_head *p;
- n = migrate_pages(pagelist, NULL, &moved, &failed);
- putback_lru_pages(&failed);
- putback_lru_pages(&moved);
+redo:
+ nr_pages = 0;
+ list_for_each(p, pagelist) {
+ if (vma)
+ page = alloc_page_vma(GFP_HIGHUSER, vma,
+ vma->vm_start);
+ else
+ page = alloc_pages_node(dest, GFP_HIGHUSER, 0);
- return n;
+ if (!page) {
+ err = -ENOMEM;
+ goto out;
+ }
+ list_add(&page->lru, &newlist);
+ nr_pages++;
+ if (nr_pages > MIGRATE_CHUNK_SIZE)
+ break;
+ }
+ err = migrate_pages(pagelist, &newlist, &moved, &failed);
+
+ putback_lru_pages(&moved); /* Call release pages instead ?? */
+
+ if (err >= 0 && list_empty(&newlist) && !list_empty(pagelist))
+ goto redo;
+out:
+ /* Return leftover allocated pages */
+ while (!list_empty(&newlist)) {
+ page = list_entry(newlist.next, struct page, lru);
+ list_del(&page->lru);
+ __free_page(page);
+ }
+ list_splice(&failed, pagelist);
+ if (err < 0)
+ return err;
+
+ /* Calculate number of leftover pages */
+ nr_pages = 0;
+ list_for_each(p, pagelist)
+ nr_pages++;
+ return nr_pages;
+}
+
+/*
+ * Migrate pages from one node to a target node.
+ * Returns error or the number of pages not migrated.
+ */
+int migrate_to_node(struct mm_struct *mm, int source, int dest, int flags)
+{
+ nodemask_t nmask;
+ LIST_HEAD(pagelist);
+ int err = 0;
+
+ nodes_clear(nmask);
+ node_set(source, nmask);
+
+ check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nmask,
+ flags | MPOL_MF_DISCONTIG_OK,
+ &pagelist);
+
+ if (!list_empty(&pagelist)) {
+
+ err = migrate_pages_to(&pagelist, NULL, dest);
+
+ if (!list_empty(&pagelist))
+ putback_lru_pages(&pagelist);
+
+ }
+ return err;
}
/*
- * For now migrate_pages simply swaps out the pages from nodes that are in
- * the source set but not in the target set. In the future, we would
- * want a function that moves pages between the two nodesets in such
- * a way as to preserve the physical layout as much as possible.
+ * Move pages between the two nodesets so as to preserve the physical
+ * layout as much as possible.
*
* Returns the number of page that could not be moved.
*/
@@ -591,22 +666,76 @@ int do_migrate_pages(struct mm_struct *m
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags)
{
LIST_HEAD(pagelist);
- int count = 0;
- nodemask_t nodes;
+ int busy = 0;
+ int err = 0;
+ nodemask_t tmp;
- nodes_andnot(nodes, *from_nodes, *to_nodes);
+ down_read(&mm->mmap_sem);
- down_read(&mm->mmap_sem);
- check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
- flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+/* Find a 'source' bit set in 'tmp' whose corresponding 'dest'
+ * bit in 'to' is not also set in 'tmp'. Clear the found 'source'
+ * bit in 'tmp', and return that <source, dest> pair for migration.
+ * The pair of nodemasks 'to' and 'from' define the map.
+ *
+ * If no pair of bits is found that way, fallback to picking some
+ * pair of 'source' and 'dest' bits that are not the same. If the
+ * 'source' and 'dest' bits are the same, this represents a node
+ * that will be migrating to itself, so no pages need move.
+ *
+ * If no bits are left in 'tmp', or if all remaining bits left
+ * in 'tmp' correspond to the same bit in 'to', return false
+ * (nothing left to migrate).
+ *
+ * This lets us pick a pair of nodes to migrate between, such that
+ * if possible the dest node is not already occupied by some other
+ * source node, minimizing the risk of overloading the memory on a
+ * node that would happen if we migrated incoming memory to a node
+ * before migrating outgoing memory from that same node.
+ *
+ * A single scan of tmp is sufficient. As we go, we remember the
+ * most recent <s, d> pair that moved (s != d). If we find a pair
+ * that not only moved, but what's better, moved to an empty slot
+ * (d is not set in tmp), then we break out then, with that pair.
+ * Otherwise when we finish scanning tmp, we at least have the
+ * most recent <s, d> pair that moved. If we get all the way through
+ * the scan of tmp without finding any node that moved, much less
+ * moved to an empty node, then there is nothing left worth migrating.
+ */
- if (!list_empty(&pagelist)) {
- count = swap_pages(&pagelist);
- putback_lru_pages(&pagelist);
+ tmp = *from_nodes;
+ while (!nodes_empty(tmp)) {
+ int s,d;
+ int source = -1;
+ int dest = 0;
+
+ for_each_node_mask(s, tmp) {
+
+ d = node_remap(s, *from_nodes, *to_nodes);
+ if (s == d)
+ continue;
+
+ source = s; /* Node moved. Memorize */
+ dest = d;
+
+ /* dest not in remaining from nodes? */
+ if (!node_isset(dest, tmp))
+ break;
+ }
+ if (source == -1)
+ break;
+
+ node_clear(source, tmp);
+ err = migrate_to_node(mm, source, dest, flags);
+ if (err > 0)
+ busy += err;
+ if (err < 0)
+ break;
}
up_read(&mm->mmap_sem);
- return count;
+ if (err < 0)
+ return err;
+ return busy;
}
long do_mbind(unsigned long start, unsigned long len,
@@ -666,8 +795,9 @@ long do_mbind(unsigned long start, unsig
int nr_failed = 0;
err = mbind_range(vma, start, end, new);
+
if (!list_empty(&pagelist))
- nr_failed = swap_pages(&pagelist);
+ nr_failed = migrate_pages_to(&pagelist, vma, -1);
if (!err && nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
Add direct migration support with fall back to swap.
Direct migration support on top of the swap based page migration facility.
This allows the direct migration of anonymous pages and the migration of
file backed pages by dropping the associated buffers (requires writeout).
Fall back to swap out if necessary.
The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.
Note that an additional patch that defines the migratepage() method
for filesystems is necessary in order to avoid writeback for anonymous
and file backed pages.
V4->V5:
- Patch against 2.6.15-rc2-mm1 + double unlock fix + consolidation patch
V3->V4:
- Remove components already in the swap migration patch
V1->V2:
- Change migrate_pages() so that it can return pagelist for failed and
moved pages. No longer free the old pages but allow caller to dispose
of them.
- Unmap pages before changing reverse map under tree lock. Take
a write_lock instead of a read_lock.
- Add documentation
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.15-rc3-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc3-mm1.orig/include/linux/swap.h 2005-11-30 08:46:39.000000000 -0800
+++ linux-2.6.15-rc3-mm1/include/linux/swap.h 2005-11-30 08:46:48.000000000 -0800
@@ -178,6 +178,9 @@ extern int vm_swappiness;
#ifdef CONFIG_MIGRATION
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
+extern int migrate_page(struct page *, struct page *);
+extern int migrate_page_remove_references(struct page *, struct page *, int);
+extern void migrate_page_copy(struct page *, struct page *);
extern int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed);
#endif
Index: linux-2.6.15-rc3-mm1/Documentation/vm/page_migration
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc3-mm1/Documentation/vm/page_migration 2005-11-30 08:46:48.000000000 -0800
@@ -0,0 +1,95 @@
+Page migration
+--------------
+
+Page migration occurs in several steps. This document first gives a
+high level description for those trying to use migrate_pages() and
+then a low level description of how migrate_pages() works internally.
+
+
+A. Use of migrate_pages()
+-------------------------
+
+1. Remove pages from the LRU.
+
+ Lists of pages to be migrated are generated by scanning over
+ pages and moving them into lists. This is done by
+ calling isolate_lru_page() or __isolate_lru_page().
+ Calling isolate_lru_page increases the references to the page
+ so that it cannot vanish under us.
+
+2. Generate a list of newly allocated pages to move the contents
+ of the first list to.
+
+3. The migrate_pages() function is called which attempts
+ to do the migration. It returns the moved pages in the
+ list specified as the third parameter and the failed
+ migrations in the fourth parameter. The first parameter
+ will contain the pages that could still be retried.
+
+4. The leftover pages of various types are returned
+ to the LRU using putback_lru_pages() or otherwise
+ disposed of. The pages will still have the refcount as
+ increased by isolate_lru_page()!
+
+B. Operation of migrate_pages()
+--------------------------------
+
+migrate_pages does several passes over its list of pages. A page is moved
+if all references to a page are removable at the time.
+
+Steps:
+
+1. Lock the page to be migrated
+
+2. Ensure that writeback is complete.
+
+3. Make sure that the page has an assigned swap cache entry if
+ it is an anonymous page. The swap cache reference is necessary
+ to preserve the information contained in the page table maps.
+
+4. Prep the new page that we want to move to. It is locked
+ and set to not being uptodate so that all accesses to the new
+ page immediately lock while we are moving references.
+
+5. All the page table references to the page are either dropped (file backed)
+ or converted to swap references (anonymous pages). This should decrease the
+ reference count.
+
+6. The radix tree lock is taken
+
+7. The refcount of the page is examined and we back out if references remain;
+ otherwise we know that we are the only one referencing this page.
+
+8. The radix tree is checked and if it does not contain the pointer to this
+ page then we back out.
+
+9. The mapping is checked. If the mapping is gone then a truncate action may
+ be in progress and we back out.
+
+10. The new page is prepped with some settings from the old page so that accesses
+ to the new page will be discovered to have the correct settings.
+
+11. The radix tree is changed to point to the new page.
+
+12. The reference count of the old page is dropped because the reference has now
+ been removed.
+
+13. The radix tree lock is dropped.
+
+14. The page contents are copied to the new page.
+
+15. The remaining page flags are copied to the new page.
+
+16. The old page flags are cleared to indicate that the page no
+ longer holds any information.
+
+17. Queued up writeback on the new page is triggered.
+
+18. The locks are dropped from the old and new page.
+
+19. The swapcache reference is removed from the new page.
+
+20. The new page is moved to the LRU.
+
+Christoph Lameter, November 29, 2005.
+
Index: linux-2.6.15-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc3-mm1.orig/mm/vmscan.c 2005-11-30 08:46:40.000000000 -0800
+++ linux-2.6.15-rc3-mm1/mm/vmscan.c 2005-11-30 08:46:48.000000000 -0800
@@ -663,6 +663,164 @@ retry:
return -EAGAIN;
}
/*
+ * Page migration was first developed in the context of the memory hotplug
+ * project. The main authors of the migration code are:
+ *
+ * IWAMOTO Toshihiro <[email protected]>
+ * Hirokazu Takahashi <[email protected]>
+ * Dave Hansen <[email protected]>
+ * Christoph Lameter <[email protected]>
+ */
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+int migrate_page_remove_references(struct page *newpage, struct page *page, int nr_refs)
+{
+ struct address_space *mapping = page_mapping(page);
+ struct page **radix_pointer;
+ int i;
+
+ /*
+ * Avoid doing any of the following work if the page count
+ * indicates that the page is in use or truncate has removed
+ * the page.
+ */
+ if (!mapping || page_mapcount(page) + nr_refs != page_count(page))
+ return 1;
+
+ /*
+ * Establish swap ptes for anonymous pages or destroy pte
+ * maps for files.
+ *
+ * In order to reestablish file backed mappings the fault handlers
+ * will take the radix tree_lock which may then be used to stop
+ * processes from accessing this page until the new page is ready.
+ *
+ * A process accessing via a swap pte (an anonymous page) will take a
+ * page_lock on the old page which will block the process until the
+ * migration attempt is complete. At that time the PageSwapCache bit
+ * will be examined. If the page was migrated then the PageSwapCache
+ * bit will be clear and the operation to retrieve the page will be
+ * retried which will find the new page in the radix tree. Then a new
+ * direct mapping may be generated based on the radix tree contents.
+ *
+ * If the page was not migrated then the PageSwapCache bit
+ * is still set and the operation may continue.
+ */
+ for (i = 0; i < 10 && page_mapped(page); i++) {
+ int rc = try_to_unmap(page);
+
+ if (rc == SWAP_SUCCESS)
+ break;
+ /*
+ * If there are other runnable processes then running
+ * them may make it possible to unmap the page
+ */
+ schedule();
+ }
+
+ /*
+ * Give up if we were unable to remove all mappings.
+ */
+ if (page_mapcount(page))
+ return 1;
+
+ write_lock_irq(&mapping->tree_lock);
+
+ radix_pointer = (struct page **)radix_tree_lookup_slot(
+ &mapping->page_tree,
+ page_index(page));
+
+ if (!page->mapping ||
+ page_count(page) != nr_refs ||
+ *radix_pointer != page) {
+ write_unlock_irq(&mapping->tree_lock);
+ return 1;
+ }
+
+ /*
+ * Now we know that no one else is looking at the page.
+ *
+ * Certain minimal information about a page must be available
+ * in order for other subsystems to properly handle the page if they
+ * find it through the radix tree update before we are finished
+ * copying the page.
+ */
+ get_page(newpage);
+ newpage->index = page_index(page);
+ if (PageSwapCache(page)) {
+ SetPageSwapCache(newpage);
+ set_page_private(newpage, page_private(page));
+ } else
+ newpage->mapping = page->mapping;
+
+ *radix_pointer = newpage;
+ __put_page(page);
+ write_unlock_irq(&mapping->tree_lock);
+
+ return 0;
+}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+ copy_highpage(newpage, page);
+
+ if (PageError(page))
+ SetPageError(newpage);
+ if (PageReferenced(page))
+ SetPageReferenced(newpage);
+ if (PageUptodate(page))
+ SetPageUptodate(newpage);
+ if (PageActive(page))
+ SetPageActive(newpage);
+ if (PageChecked(page))
+ SetPageChecked(newpage);
+ if (PageMappedToDisk(page))
+ SetPageMappedToDisk(newpage);
+
+ if (PageDirty(page)) {
+ clear_page_dirty_for_io(page);
+ set_page_dirty(newpage);
+ }
+
+ ClearPageSwapCache(page);
+ ClearPageActive(page);
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ page->mapping = NULL;
+
+ /*
+ * If any waiters have accumulated on the new page then
+ * wake them up.
+ */
+ if (PageWriteback(newpage))
+ end_page_writeback(newpage);
+}
+
+/*
+ * Common logic to directly migrate a single page suitable for
+ * pages that do not use PagePrivate.
+ *
+ * Pages are locked upon entry and exit.
+ */
+int migrate_page(struct page *newpage, struct page *page)
+{
+ BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+
+ if (migrate_page_remove_references(newpage, page, 2))
+ return -EAGAIN;
+
+ migrate_page_copy(newpage, page);
+
+ return 0;
+}
+
+/*
* migrate_pages
*
* Two lists are passed to this function. The first list
@@ -675,11 +833,6 @@ retry:
* are movable anymore because t has become empty
* or no retryable pages exist anymore.
*
- * SIMPLIFIED VERSION: This implementation of migrate_pages
- * is only swapping out pages and never touches the second
- * list. The direct migration patchset
- * extends this function to avoid the use of swap.
- *
* Return: Number of pages not migrated when "to" ran empty.
*/
int migrate_pages(struct list_head *from, struct list_head *to,
@@ -700,6 +853,9 @@ redo:
retry = 0;
list_for_each_entry_safe(page, page2, from, lru) {
+ struct page *newpage = NULL;
+ struct address_space *mapping;
+
cond_resched();
rc = 0;
@@ -707,6 +863,9 @@ redo:
/* page was freed from under us. So we are done. */
goto next;
+ if (to && list_empty(to))
+ break;
+
/*
* Skip locked pages during the first two passes to give the
* functions holding the lock time to release the page. Later we
@@ -743,12 +902,64 @@ redo:
}
}
+ if (!to) {
+ rc = swap_page(page);
+ goto next;
+ }
+
+ newpage = lru_to_page(to);
+ lock_page(newpage);
+
/*
- * Page is properly locked and writeback is complete.
+ * Pages are properly locked and writeback is complete.
* Try to migrate the page.
*/
- rc = swap_page(page);
- goto next;
+ mapping = page_mapping(page);
+ if (!mapping)
+ goto unlock_both;
+
+ /*
+ * Trigger writeout if page is dirty
+ */
+ if (PageDirty(page)) {
+ switch (pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto unlock_both;
+
+ case PAGE_SUCCESS:
+ unlock_page(newpage);
+ goto next;
+
+ case PAGE_CLEAN:
+ ; /* try to migrate the page below */
+ }
+ }
+ /*
+ * If we have no buffer or can release the buffer
+ * then do a simple migration.
+ */
+ if (!page_has_buffers(page) ||
+ try_to_release_page(page, GFP_KERNEL)) {
+ rc = migrate_page(newpage, page);
+ goto unlock_both;
+ }
+
+ /*
+ * On early passes with mapped pages simply
+ * retry. There may be a lock held for some
+ * buffers that may go away. Later
+ * swap them out.
+ */
+ if (pass > 4) {
+ unlock_page(newpage);
+ newpage = NULL;
+ rc = swap_page(page);
+ goto next;
+ }
+
+unlock_both:
+ unlock_page(newpage);
unlock_page:
unlock_page(page);
@@ -761,7 +972,10 @@ next:
list_move(&page->lru, failed);
nr_failed++;
} else {
- /* Success */
+ if (newpage)
+ /* Successful migration. Return new page to LRU */
+ move_to_lru(newpage);
+
list_move(&page->lru, moved);
}
}
Christoph Lameter wrote:
> Changes V5->V6:
> - Patchset against 2.6.15-rc3-mm1
> - Remove checks for page count increases while migrating after Andrew assured
Could you point out where this is changed in the code?
> me that this cannot happen. Revise documentation to reflect that. If this is
> the case then we will have no need to include the unwind code from the
> hotplug project in the future.
-- Kame
On Thu, 1 Dec 2005, KAMEZAWA Hiroyuki wrote:
> Christoph Lameter wrote:
>
> > Changes V5->V6:
> > - Patchset against 2.6.15-rc3-mm1
> > - Remove checks for page count increases while migrating after Andrew
> > assured
>
> Could you point out where this is changed in the code?
The two checks for page_count > 1 were removed from migrate_page_copy().
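
For readers following the thread: the reference count tests that remain in
the patch are the ones in migrate_page_remove_references(). The following
is a minimal userspace sketch of that arithmetic, not kernel code. The
struct fake_page type and both helper names are hypothetical and exist only
for illustration; plain integers stand in for page_count() and
page_mapcount(), and the radix tree slot comparison done under tree_lock is
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page (illustration only). */
struct fake_page {
	int count;        /* models page_count(): all references held */
	int mapcount;     /* models page_mapcount(): page table mappings */
	bool has_mapping; /* models page_mapping(page) != NULL */
};

/*
 * Models the early test in migrate_page_remove_references(): the page is
 * a candidate only if every reference is either a page table mapping or
 * one of the nr_refs references the caller expects to exist.
 */
static bool can_start_migration(const struct fake_page *page, int nr_refs)
{
	assert(nr_refs >= 0);
	if (!page->has_mapping)
		return false;
	return page->mapcount + nr_refs == page->count;
}

/*
 * Models the re-check done after unmapping, while tree_lock is held:
 * no mappings may remain and only the nr_refs expected references may
 * be left. The radix tree pointer comparison is omitted here.
 */
static bool can_replace_in_tree(const struct fake_page *page, int nr_refs)
{
	assert(nr_refs >= 0);
	return page->has_mapping &&
	       page->mapcount == 0 &&
	       page->count == nr_refs;
}
```

migrate_page() in the patch passes nr_refs = 2, so in this model a page
mapped by two processes would need a count of exactly 4 to pass the first
test, and a count of exactly 2 with no mappings to pass the second.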