2005-10-25 19:31:41

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 0/5] Swap Migration V4: Overview

This is a patchset intended to introduce page migration into the kernel
through a simple implementation of swap-based page migration.
The aim is to be minimally intrusive in order to have some hope of inclusion
into 2.6.15. A separate direct page migration patch is being developed that
applies on top of this patch. The direct migration patch is being discussed on
<[email protected]>.

Much of the code is based on code that the memory hotplug project and Ray Bryant
have been working on for a long time. See http://sourceforge.net/projects/lhms/

Changes from V3 to V4:
- patch against 2.6.14-rc5-mm1.
- Correctly gather pages in migrate_page_add()
- Restructure swapout code for easy later application of the direct migration
patches. Rename swapout() to migrate_pages().
- Add PF_SWAPWRITE support to allow a process to write to swap. Save
and restore the earlier state to allow nested use of PF_SWAPWRITE.
- Fix sys_migrate_pages permission check (thanks Ray).

Changes from V2 to V3:
- Break out common code for page eviction (Thanks to a patch by Magnus Damm)
- Add check to avoid MPOL_MF_MOVE moving pages that are also accessed from
another address space. Add support for MPOL_MF_MOVE_ALL to override this
(requires superuser privileges).
- Update the overview regarding the direct page migration patchset to follow soon and
cut long-winded explanations.
- Add sys_migrate_pages patch
- Check cpuset restrictions on sys_migrate_pages.

Changes from V1 to V2:
- Patch against 2.6.14-rc4-mm1
- Remove move_pages() function
- Code cleanup to make it less invasive.
- Fix missing lru_add_drain() invocation from isolate_lru_page()

In a NUMA system it is often beneficial to be able to move the memory
in use by a process to different nodes in order to enhance performance.
Currently Linux simply does not support this facility. This patchset
implements page migration via a new syscall sys_migrate_pages and via
the memory policy layer with the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL
flags.
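
For illustration (this is not part of the patchset), the memory policy side can
be exercised from userspace roughly as in the sketch below. The constants are
copied from include/linux/mempolicy.h, the syscall number 237 for mbind is the
x86_64 value and only an assumption for other architectures, and node 1 is an
arbitrary choice:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MPOL_BIND	2		/* from include/linux/mempolicy.h */
#define MPOL_MF_STRICT	(1<<0)
#define MPOL_MF_MOVE	(1<<1)		/* added by this patchset */

#ifndef __NR_mbind
#define __NR_mbind	237		/* x86_64; other architectures differ */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	unsigned long nodemask = 1UL << 1;	/* bind to node 1 */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return 1;
	memset(addr, 0, len);	/* fault the pages in wherever they land */

	/* Rebind the range to node 1 and ask the kernel to evict pages that
	 * do not conform; they are reallocated on node 1 when faulted back. */
	if (syscall(__NR_mbind, addr, len, MPOL_BIND, &nodemask,
		    8 * sizeof(nodemask) + 1, MPOL_MF_MOVE | MPOL_MF_STRICT))
		perror("mbind");
	return 0;
}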

Page migration is also useful for other purposes:

1. Memory hotplug. Migrating processes off a memory node that is going
to be disconnected.

2. Remapping of bad pages. These could be detected through soft ECC errors
and other mechanisms.

migrate_pages() can only migrate pages under certain conditions. These other
uses may require additional measures to ensure that pages are migratable. The
hotplug project, for example, restricts allocations to removable memory.


The patchset consists of five patches:

1. LRU operations

Add basic operations to remove pages from the LRU lists and return
them back to it.

2. PF_SWAPWRITE

Allow a process to set PF_SWAPWRITE in its flags in order to be allowed
to write pages to swap space.

3. migrate_pages() implementation

Adds a function to mm/vmscan.c called migrate_pages(). The functionality
of that function is restricted to swapping out pages. An additional patch
is necessary for direct page migration.

4. MPOL_MF_MOVE flag for memory policies.

This implements MPOL_MF_MOVE in addition to MPOL_MF_STRICT. MPOL_MF_STRICT
allows checking whether all pages in a memory area obey the memory policy.
MPOL_MF_MOVE will migrate all pages that do not conform to the memory policy.
If pages are evicted, the system will allocate pages conforming to the
policy when they are swapped back in.

5. sys_migrate_pages system call and cpuset API

Adds a new system call

sys_migrate_pages(pid, maxnode, from_nodes, to_nodes)

to migrate the pages of a process to a different set of nodes, and also a function
that lets cpusets use the migration mechanism:

do_migrate_pages(struct mm_struct *, from_nodes, to_nodes, move_flags).
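
For illustration only, a minimal userspace invocation of the new system call
could look like the sketch below. The syscall number matches the x86_64 table
added in patch 5/5 and is an assumption for other architectures; nodes 0 and 1
are arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_migrate_pages
#define __NR_migrate_pages 256		/* x86_64 number added by patch 5/5 */
#endif

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? atoi(argv[1]) : 0;	/* 0 means the current process */
	unsigned long from = 1UL << 0;			/* source: node 0 */
	unsigned long to   = 1UL << 1;			/* target: node 1 */
	long not_moved;

	/* maxnode tells the kernel how many bits of the masks to look at */
	not_moved = syscall(__NR_migrate_pages, pid,
			    8 * sizeof(unsigned long) + 1, &from, &to);
	if (not_moved < 0)
		perror("migrate_pages");
	else
		printf("%ld pages could not be moved\n", not_moved);
	return 0;
}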

=====

URLs referring to the discussion regarding the initial version of these
patches.

Page eviction: http://marc.theaimsgroup.com/?l=linux-mm&m=112922756730989&w=2
Numa policy : http://marc.theaimsgroup.com/?l=linux-mm&m=112922756724715&w=2

Discussion of V2 of the patchset:
http://marc.theaimsgroup.com/?t=112959680300007&r=1&w=2

Discussion of V3:
http://marc.theaimsgroup.com/?t=112984939600003&r=1&w=2


2005-10-25 19:31:01

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 4/5] Swap Migration V4: MPOL_MF_MOVE interface

Add page migration support via swap to the NUMA policy layer

This patch adds page migration support to the NUMA policy layer. An additional
flag, MPOL_MF_MOVE, is introduced for mbind. If MPOL_MF_MOVE is specified, then
pages that do not conform to the memory policy will be evicted from memory.
When they are paged back in, new pages will be allocated following the NUMA policy.
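
In outline, the new flow in do_mbind() is as in the following condensed sketch
(error handling and the policy checks are trimmed; the diff below is the
authoritative version):

	LIST_HEAD(pagelist);

	down_write(&mm->mmap_sem);
	/* check_range() walks the range; pages violating the policy are
	 * isolated from the LRU by migrate_page_add() and collected here. */
	vma = check_range(mm, start, end, nmask, flags,
		(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ? &pagelist : NULL);
	if (!IS_ERR(vma)) {
		err = mbind_range(vma, start, end, new);
		if (!list_empty(&pagelist))
			migrate_pages(&pagelist, NULL);	/* evict to swap */
	}
	if (!list_empty(&pagelist))
		putback_lru_pages(&pagelist);	/* return leftovers to the LRU */
	up_write(&mm->mmap_sem);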

Changes V3->V4
- migrate_page_add: Do pagelist processing directly instead
of doing it via isolate_lru_page().
- Use the migrate_pages() to evict the pages.

Changes V2->V3
- Add check to not migrate pages shared with other processes (but allow
migration of memory shared between threads having a common mm_struct)
- MPOL_MF_MOVE_ALL to override and move even pages shared with other
processes. This only works if the process issuing this call has
CAP_SYS_RESOURCE because this enables the moving of pages owned
by other processes.
- MPOL_MF_DISCONTIG_OK (internal use only) to not check for continuous VMAs.
Enable MPOL_MF_DISCONTIG_OK if policy to be set is NULL (default policy).

Changes V1->V2
- Add vma_migratable() function for future enhancements.
- No side effects on WARN_ON
- Remove move_pages for now
- Make patch fit 2.6.14-rc4-mm1

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c 2005-10-24 10:27:13.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c 2005-10-25 09:09:54.000000000 -0700
@@ -83,9 +83,13 @@
#include <linux/init.h>
#include <linux/compat.h>
#include <linux/mempolicy.h>
+#include <linux/swap.h>
#include <asm/tlbflush.h>
#include <asm/uaccess.h>

+/* Internal MPOL_MF_xxx flags */
+#define MPOL_MF_DISCONTIG_OK (1<<20) /* Skip checks for continuous vmas */
+
static kmem_cache_t *policy_cache;
static kmem_cache_t *sn_cache;

@@ -179,9 +183,62 @@ static struct mempolicy *mpol_new(int mo
return policy;
}

+/* Check if we are the only process mapping the page in question */
+static inline int single_mm_mapping(struct mm_struct *mm,
+ struct address_space *mapping)
+{
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int rc = 1;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+out:
+ spin_unlock(&mapping->i_mmap_lock);
+ return rc;
+}
+
+/*
+ * Add a page to be migrated to the pagelist
+ */
+static void migrate_page_add(struct vm_area_struct *vma,
+ struct page *page, struct list_head *pagelist, unsigned long flags)
+{
+ /*
+ * Avoid migrating a page that is shared by others and not writable.
+ */
+ if ((flags & MPOL_MF_MOVE_ALL) ||
+ PageAnon(page) ||
+ mapping_writably_mapped(page->mapping) ||
+ single_mm_mapping(vma->vm_mm, page->mapping)
+ ) {
+ int rc = isolate_lru_page(page);
+
+ if (rc == 1)
+ list_add(&page->lru, pagelist);
+ /*
+ * If the isolate attempt was not successful
+ * then we just encountered an unswappable
+ * page. Something must be wrong.
+ */
+ WARN_ON(rc == 0);
+ }
+}
+
/* Ensure all existing pages follow the policy. */
static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pte_t *orig_pte;
pte_t *pte;
@@ -200,15 +257,23 @@ static int check_pte_range(struct vm_are
continue;
}
nid = pfn_to_nid(pfn);
- if (!node_isset(nid, *nodes))
- break;
+ if (!node_isset(nid, *nodes)) {
+ if (pagelist) {
+ struct page *page = pfn_to_page(pfn);
+
+ migrate_page_add(vma, page, pagelist, flags);
+ } else
+ break;
+ }
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
return addr != end;
}

static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pmd_t *pmd;
unsigned long next;
@@ -218,14 +283,17 @@ static inline int check_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
- if (check_pte_range(vma, pmd, addr, next, nodes))
+ if (check_pte_range(vma, pmd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pmd++, addr = next, addr != end);
return 0;
}

static inline int check_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pud_t *pud;
unsigned long next;
@@ -235,14 +303,17 @@ static inline int check_pud_range(struct
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- if (check_pmd_range(vma, pud, addr, next, nodes))
+ if (check_pmd_range(vma, pud, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pud++, addr = next, addr != end);
return 0;
}

static inline int check_pgd_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pgd_t *pgd;
unsigned long next;
@@ -252,16 +323,35 @@ static inline int check_pgd_range(struct
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- if (check_pud_range(vma, pgd, addr, next, nodes))
+ if (check_pud_range(vma, pgd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pgd++, addr = next, addr != end);
return 0;
}

-/* Step 1: check the range */
+/* Check if a vma is migratable */
+static inline int vma_migratable(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & (
+ VM_LOCKED |
+ VM_IO |
+ VM_RESERVED |
+ VM_DENYWRITE |
+ VM_SHM
+ ))
+ return 0;
+ return 1;
+}
+
+/*
+ * Check if all pages in a range are on a set of nodes.
+ * If pagelist != NULL then isolate pages from the LRU and
+ * put them on the pagelist.
+ */
static struct vm_area_struct *
check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
- nodemask_t *nodes, unsigned long flags)
+ nodemask_t *nodes, unsigned long flags, struct list_head *pagelist)
{
int err;
struct vm_area_struct *first, *vma, *prev;
@@ -273,17 +363,24 @@ check_range(struct mm_struct *mm, unsign
return ERR_PTR(-EACCES);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
- if (!vma->vm_next && vma->vm_end < end)
- return ERR_PTR(-EFAULT);
- if (prev && prev->vm_end < vma->vm_start)
- return ERR_PTR(-EFAULT);
- if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
+ if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+ if (!vma->vm_next && vma->vm_end < end)
+ return ERR_PTR(-EFAULT);
+ if (prev && prev->vm_end < vma->vm_start)
+ return ERR_PTR(-EFAULT);
+ }
+ if (!is_vm_hugetlb_page(vma) &&
+ ((flags & MPOL_MF_STRICT) ||
+ ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+ vma_migratable(vma)
+ ))) {
unsigned long endvma = vma->vm_end;
if (endvma > end)
endvma = end;
if (vma->vm_start > start)
start = vma->vm_start;
- err = check_pgd_range(vma, start, endvma, nodes);
+ err = check_pgd_range(vma, start, endvma, nodes,
+ flags, pagelist);
if (err) {
first = ERR_PTR(err);
break;
@@ -357,33 +454,59 @@ long do_mbind(unsigned long start, unsig
struct mempolicy *new;
unsigned long end;
int err;
+ LIST_HEAD(pagelist);

- if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+ if ((flags & ~(unsigned long)(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ || mode > MPOL_MAX)
return -EINVAL;
+ if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
if (start & ~PAGE_MASK)
return -EINVAL;
+
if (mode == MPOL_DEFAULT)
flags &= ~MPOL_MF_STRICT;
+
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
end = start + len;
+
if (end < start)
return -EINVAL;
if (end == start)
return 0;
+
if (mpol_check_policy(mode, nmask))
return -EINVAL;
+
new = mpol_new(mode, nmask);
if (IS_ERR(new))
return PTR_ERR(new);

+ /*
+ * If we are using the default policy then operation
+ * on discontinuous address spaces is okay after all
+ */
+ if (!new)
+ flags |= MPOL_MF_DISCONTIG_OK;
+
PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
mode,nodes_addr(nodes)[0]);

down_write(&mm->mmap_sem);
- vma = check_range(mm, start, end, nmask, flags);
+ vma = check_range(mm, start, end, nmask, flags,
+ (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ? &pagelist : NULL);
err = PTR_ERR(vma);
- if (!IS_ERR(vma))
+ if (!IS_ERR(vma)) {
err = mbind_range(vma, start, end, new);
+ if (!list_empty(&pagelist))
+ migrate_pages(&pagelist, NULL);
+ if (!err && !list_empty(&pagelist) && (flags & MPOL_MF_STRICT))
+ err = -EIO;
+ }
+ if (!list_empty(&pagelist))
+ putback_lru_pages(&pagelist);
+
up_write(&mm->mmap_sem);
mpol_free(new);
return err;
Index: linux-2.6.14-rc5-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mempolicy.h 2005-10-24 10:27:12.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/mempolicy.h 2005-10-25 09:09:34.000000000 -0700
@@ -22,6 +22,8 @@

/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */

#ifdef __KERNEL__

2005-10-25 19:31:23

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 5/5] Swap Migration V4: sys_migrate_pages interface

sys_migrate_pages implementation using swap-based page migration

This is the original API proposed by Ray Bryant in his posts during the
first half of 2005 on [email protected] and [email protected].

The intent of sys_migrate_pages is to migrate the memory of a process. A process
may have migrated to another node. Memory was allocated optimally for the prior
context. sys_migrate_pages allows shifting the memory to the new node.

sys_migrate_pages is also useful for manually moving a process's memory if its
available memory nodes have changed through cpuset operations. Paul Jackson is
working on an automated mechanism that will migrate memory automatically when
the cpuset of a process is changed. However, a user may decide to control the
migration manually.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides a
do_migrate_pages function that may be useful for cpusets to automatically move
memory. In contrast to Ray's implementation, sys_migrate_pages does not modify
policies.

The current code here is based on the swap-based page migration capability and is
thus not able to preserve the physical layout relative to its containing nodeset
(which may be a cpuset). When direct page migration becomes available, the
implementation will need to be changed to do an isomorphic move of pages between
the nodesets. The current implementation simply evicts all pages on nodes that are
in the source nodeset but not in the target nodeset.
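
To make the mask arithmetic in do_migrate_pages() below concrete, here is a
userspace-style illustration that uses plain bitmasks in place of the kernel's
nodemask_t (so it assumes at most BITS_PER_LONG nodes); the node numbers are
arbitrary:

#include <stdio.h>

int main(void)
{
	unsigned long from = (1UL << 0) | (1UL << 1);	/* source nodes {0,1} */
	unsigned long to   = (1UL << 1) | (1UL << 2);	/* target nodes {1,2} */

	unsigned long drain   = from & ~to;	/* nodes_andnot():     {0}  */
	unsigned long allowed = ~drain;		/* nodes_complement(): ~{0} */

	/* check_range() isolates every page whose node is NOT in 'allowed',
	 * i.e. every page on node 0 here, and the pagelist is swapped out. */
	printf("drain mask  : %#lx\n", drain);
	printf("allowed mask: %#lx\n", allowed);
	return 0;
}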

The patch supports ia64, i386, x86_64 and ppc64. It has not been tested on ppc64.

Changes V3->V4:
- Add Ray's permissions check based on check_kill_permission().

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c 2005-10-25 09:09:54.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c 2005-10-25 09:29:13.000000000 -0700
@@ -631,6 +631,36 @@ long do_get_mempolicy(int *policy, nodem
}

/*
+ * For now migrate_pages simply swaps out the pages from nodes that are in
+ * the source set but not in the target set. In the future, we would
+ * want a function that moves pages between the two nodesets in such
+ * a way as to preserve the physical layout as much as possible.
+ *
+ * Returns the number of pages that could not be moved.
+ */
+int do_migrate_pages(struct mm_struct *mm,
+ nodemask_t *from_nodes, nodemask_t *to_nodes, int flags)
+{
+ LIST_HEAD(pagelist);
+ int count = 0;
+ nodemask_t nodes;
+
+ nodes_andnot(nodes, *from_nodes, *to_nodes);
+ nodes_complement(nodes, nodes);
+
+ down_read(&mm->mmap_sem);
+ check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
+ flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+ if (!list_empty(&pagelist)) {
+ swapout_pages(&pagelist);
+ if (!list_empty(&pagelist))
+ count = putback_lru_pages(&pagelist);
+ }
+ up_read(&mm->mmap_sem);
+ return count;
+}
+
+/*
* User space interface with variable sized bitmaps for nodelists.
*/

@@ -724,6 +754,65 @@ asmlinkage long sys_set_mempolicy(int mo
return do_set_mempolicy(mode, &nodes);
}

+/* Macro needed until Paul implements this function in kernel/cpusets.c */
+#define cpuset_mems_allowed(task) node_online_map
+
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ unsigned long __user *old_nodes,
+ unsigned long __user *new_nodes)
+{
+ struct mm_struct *mm;
+ struct task_struct *task;
+ nodemask_t old;
+ nodemask_t new;
+ int err;
+
+ err = get_nodes(&old, old_nodes, maxnode);
+ if (err)
+ return err;
+
+ err = get_nodes(&new, new_nodes, maxnode);
+ if (err)
+ return err;
+
+ /* Find the mm_struct */
+ read_lock(&tasklist_lock);
+ task = pid ? find_task_by_pid(pid) : current;
+ if (!task) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ mm = get_task_mm(task);
+ read_unlock(&tasklist_lock);
+
+ if (!mm)
+ return -EINVAL;
+
+ /*
+ * Permissions check like for signals.
+ * See check_kill_permission()
+ */
+ if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
+ (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
+ !capable(CAP_SYS_ADMIN)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ /* Is the user allowed to access the target nodes? */
+ if (!nodes_subset(new, cpuset_mems_allowed(task)) &&
+ !capable(CAP_SYS_ADMIN)) {
+ err= -EPERM;
+ goto out;
+ }
+
+ err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
+out:
+ mmput(mm);
+ return err;
+}
+
+
/* Retrieve NUMA policy */
asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
Index: linux-2.6.14-rc5-mm1/kernel/sys_ni.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/kernel/sys_ni.c 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/kernel/sys_ni.c 2005-10-25 09:29:13.000000000 -0700
@@ -82,6 +82,7 @@ cond_syscall(compat_sys_socketcall);
cond_syscall(sys_inotify_init);
cond_syscall(sys_inotify_add_watch);
cond_syscall(sys_inotify_rm_watch);
+cond_syscall(sys_migrate_pages);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
Index: linux-2.6.14-rc5-mm1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/ia64/kernel/entry.S 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/arch/ia64/kernel/entry.S 2005-10-25 09:29:13.000000000 -0700
@@ -1600,5 +1600,6 @@ sys_call_table:
data8 sys_inotify_init
data8 sys_inotify_add_watch
data8 sys_inotify_rm_watch
+ data8 sys_migrate_pages

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.14-rc5-mm1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/asm-ia64/unistd.h 2005-10-24 10:27:21.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/asm-ia64/unistd.h 2005-10-25 09:29:13.000000000 -0700
@@ -269,12 +269,12 @@
#define __NR_inotify_init 1277
#define __NR_inotify_add_watch 1278
#define __NR_inotify_rm_watch 1279
-
+#define __NR_migrate_pages 1280
#ifdef __KERNEL__

#include <linux/config.h>

-#define NR_syscalls 256 /* length of syscall table */
+#define NR_syscalls 257 /* length of syscall table */

#define __ARCH_WANT_SYS_RT_SIGACTION

Index: linux-2.6.14-rc5-mm1/arch/ppc64/kernel/misc.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/ppc64/kernel/misc.S 2005-10-24 10:27:15.000000000 -0700
+++ linux-2.6.14-rc5-mm1/arch/ppc64/kernel/misc.S 2005-10-25 09:29:13.000000000 -0700
@@ -1581,3 +1581,4 @@ _GLOBAL(sys_call_table)
.llong .sys_inotify_init /* 275 */
.llong .sys_inotify_add_watch
.llong .sys_inotify_rm_watch
+ .llong .sys_migrate_pages
Index: linux-2.6.14-rc5-mm1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.14-rc5-mm1.orig/arch/i386/kernel/syscall_table.S 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/arch/i386/kernel/syscall_table.S 2005-10-25 09:29:13.000000000 -0700
@@ -294,3 +294,5 @@ ENTRY(sys_call_table)
.long sys_inotify_init
.long sys_inotify_add_watch
.long sys_inotify_rm_watch
+ .long sys_migrate_pages
+
Index: linux-2.6.14-rc5-mm1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/asm-x86_64/unistd.h 2005-10-24 10:27:21.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/asm-x86_64/unistd.h 2005-10-25 09:29:13.000000000 -0700
@@ -571,8 +571,10 @@ __SYSCALL(__NR_inotify_init, sys_inotify
__SYSCALL(__NR_inotify_add_watch, sys_inotify_add_watch)
#define __NR_inotify_rm_watch 255
__SYSCALL(__NR_inotify_rm_watch, sys_inotify_rm_watch)
+#define __NR_migrate_pages 256
+__SYSCALL(__NR_migrate_pages, sys_migrate_pages)

-#define __NR_syscall_max __NR_inotify_rm_watch
+#define __NR_syscall_max __NR_migrate_pages
#ifndef __NO_STUBS

/* user-visible error numbers are in the range -1 - -4095 */
Index: linux-2.6.14-rc5-mm1/include/linux/syscalls.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/syscalls.h 2005-10-24 10:27:21.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/syscalls.h 2005-10-25 09:29:13.000000000 -0700
@@ -511,5 +511,7 @@ asmlinkage long sys_ioprio_set(int which
asmlinkage long sys_ioprio_get(int which, int who);
asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
unsigned long maxnode);
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ unsigned long __user *from, unsigned long __user *to);

#endif
Index: linux-2.6.14-rc5-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mempolicy.h 2005-10-25 09:09:34.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/mempolicy.h 2005-10-25 09:29:13.000000000 -0700
@@ -158,6 +158,9 @@ extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern struct mempolicy default_policy;

+int do_migrate_pages(struct mm_struct *mm,
+ nodemask_t *from_nodes, nodemask_t *to_nodes, int flags);
+
#else

struct mempolicy {};

2005-10-25 19:30:58

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 2/5] Swap Migration V4: PF_SWAPWRITE to allow writing to swap

Add PF_SWAPWRITE to control a process's permission to write to swap.

- Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
and pdflush

- Set PF_SWAPWRITE flag for kswapd and pdflush
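
The overview mentions saving and restoring the earlier state so that uses of
PF_SWAPWRITE can nest. The pattern, as used by migrate_pages() in patch 3/5, is
roughly the following sketch:

	int swapwrite = current->flags & PF_SWAPWRITE;

	if (!swapwrite)
		current->flags |= PF_SWAPWRITE;	/* allow writing to swap */

	/* ... work that may have to write pages out to swap ... */

	if (!swapwrite)
		current->flags &= ~PF_SWAPWRITE;	/* restore the previous state */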

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc5-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/sched.h 2005-10-24 10:27:29.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/sched.h 2005-10-25 11:07:11.000000000 -0700
@@ -914,6 +914,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
+#define PF_SWAPWRITE 0x01000000 /* the process is allowed to write to swap */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-25 08:09:52.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-25 11:07:11.000000000 -0700
@@ -263,9 +263,7 @@ static inline int is_page_cache_freeable

static int may_write_to_queue(struct backing_dev_info *bdi)
{
- if (current_is_kswapd())
- return 1;
- if (current_is_pdflush()) /* This is unlikely, but why not... */
+ if (current->flags & PF_SWAPWRITE)
return 1;
if (!bdi_write_congested(bdi))
return 1;
@@ -1279,7 +1277,7 @@ static int kswapd(void *p)
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

order = 0;
for ( ; ; ) {
Index: linux-2.6.14-rc5-mm1/mm/pdflush.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/pdflush.c 2005-10-24 10:27:21.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/pdflush.c 2005-10-25 11:07:11.000000000 -0700
@@ -90,7 +90,7 @@ struct pdflush_work {

static int __pdflush(struct pdflush_work *my_work)
{
- current->flags |= PF_FLUSHER;
+ current->flags |= PF_FLUSHER | PF_SWAPWRITE;
my_work->fn = NULL;
my_work->who = current;
INIT_LIST_HEAD(&my_work->list);

2005-10-25 19:31:41

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 1/5] Swap Migration V4: LRU operations

isolation of pages from LRU

Implement functions to isolate pages from the LRU and put them back later.

An earlier implementation was provided by
Hirokazu Takahashi <[email protected]> and
IWAMOTO Toshihiro <[email protected]> for the memory
hotplug project.

From Magnus:

This patch for 2.6.14-rc4-mm1 breaks out isolate_lru_page() and
putback_lru_page() and makes them inline. I'd like to build my code on
top of this patch, and I think your page eviction code could be built
on top of this patch too - without introducing too much duplicated
code.
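
A sketch of how a caller is expected to use the new helpers (compare
isolate_lru_pages() in the diff below; the list name is illustrative):

	LIST_HEAD(private_list);
	struct zone *zone = page_zone(page);

	spin_lock_irq(&zone->lru_lock);
	switch (__isolate_lru_page(zone, page)) {
	case 1:		/* removed from the LRU, we hold an extra reference */
		list_add(&page->lru, &private_list);
		break;
	case -1:	/* page is being freed elsewhere, leave it alone */
		break;
	case 0:		/* page was not on an LRU list at all */
		break;
	}
	spin_unlock_irq(&zone->lru_lock);

	/* Later, when done: re-add the pages and drop the extra references. */
	putback_lru_pages(&private_list);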

Changes V3-V4:

- Remove obsolete second parameter from isolate_lru_page
- Mention the original authors

Signed-off-by: Magnus Damm <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc5-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/mm_inline.h 2005-10-19 23:23:05.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/mm_inline.h 2005-10-25 08:09:52.000000000 -0700
@@ -38,3 +38,55 @@ del_page_from_lru(struct zone *zone, str
zone->nr_inactive--;
}
}
+
+/*
+ * Isolate one page from the LRU lists.
+ *
+ * - zone->lru_lock must be held
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list
+ * -1 = page is being freed elsewhere.
+ */
+static inline int
+__isolate_lru_page(struct zone *zone, struct page *page)
+{
+ if (TestClearPageLRU(page)) {
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ return -1;
+ } else {
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Add isolated page back on the LRU lists
+ *
+ * - zone->lru_lock must be held
+ * - page must already be removed from other list
+ * - additional call to put_page() is needed
+ */
+static inline void
+__putback_lru_page(struct zone *zone, struct page *page)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+}
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-24 10:27:30.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-25 08:09:52.000000000 -0700
@@ -578,43 +578,75 @@ keep:
*
* Appropriate locks must be held before calling this function.
*
+ * @zone: The zone where lru_lock is held.
* @nr_to_scan: The number of pages to look through on the list.
* @src: The LRU list to pull pages off.
* @dst: The temp list to put pages on to.
- * @scanned: The number of pages that were scanned.
*
- * returns how many pages were moved onto *@dst.
+ * returns the number of pages that were scanned.
*/
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+static int isolate_lru_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *src, struct list_head *dst)
{
- int nr_taken = 0;
struct page *page;
- int scan = 0;
+ int scanned = 0;
+ int rc;

- while (scan++ < nr_to_scan && !list_empty(src)) {
+ while (scanned++ < nr_to_scan && !list_empty(src)) {
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, src);
- continue;
- } else {
+ rc = __isolate_lru_page(zone, page);
+
+ BUG_ON(rc == 0); /* PageLRU(page) must be true */
+
+ if (rc == 1) /* Succeeded to isolate page */
list_add(&page->lru, dst);
- nr_taken++;
+
+ if (rc == -1) { /* Not possible to isolate */
+ list_del(&page->lru);
+ list_add(&page->lru, src);
}
}

- *scanned = scan;
- return nr_taken;
+ return scanned;
+}
+
+static void lru_add_drain_per_cpu(void *dummy)
+{
+ lru_add_drain();
+}
+
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list. Do necessary cache draining if the
+ * page is not on the LRU lists yet.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ * -1 = page is being freed elsewhere.
+ */
+int isolate_lru_page(struct page *page)
+{
+ int rc = 0;
+ struct zone *zone = page_zone(page);
+
+redo:
+ spin_lock_irq(&zone->lru_lock);
+ rc = __isolate_lru_page(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ if (rc == 0) {
+ /*
+ * Maybe this page is still waiting for a cpu to drain it
+ * from one of the lru lists?
+ */
+ smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
+ lru_add_drain();
+ if (PageLRU(page))
+ goto redo;
+ }
+ return rc;
}

/*
@@ -632,18 +664,15 @@ static void shrink_cache(struct zone *zo
spin_lock_irq(&zone->lru_lock);
while (max_scan > 0) {
struct page *page;
- int nr_taken;
int nr_scan;
int nr_freed;

- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
- &page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
+ nr_scan = isolate_lru_pages(zone, sc->swap_cluster_max,
+ &zone->inactive_list, &page_list);
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);

- if (nr_taken == 0)
+ if (list_empty(&page_list))
goto done;

max_scan -= nr_scan;
@@ -663,13 +692,9 @@ static void shrink_cache(struct zone *zo
*/
while (!list_empty(&page_list)) {
page = lru_to_page(&page_list);
- if (TestSetPageLRU(page))
- BUG();
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ __putback_lru_page(zone, page);
+
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -683,6 +708,33 @@ done:
}

/*
+ * Add isolated pages on the list back to the LRU
+ * Determines the zone for each pages and takes
+ * the necessary lru lock for each page.
+ *
+ * returns the number of pages put back.
+ */
+int putback_lru_pages(struct list_head *l)
+{
+ struct page * page;
+ struct page * page2;
+ int count = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ struct zone *zone = page_zone(page);
+
+ list_del(&page->lru);
+ spin_lock_irq(&zone->lru_lock);
+ __putback_lru_page(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ count++;
+ /* Undo the get from isolate_lru_page */
+ put_page(page);
+ }
+ return count;
+}
+
+/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -718,10 +770,9 @@ refill_inactive_zone(struct zone *zone,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
- &l_hold, &pgscanned);
+ pgscanned = isolate_lru_pages(zone, nr_pages,
+ &zone->active_list, &l_hold);
zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);

/*
Index: linux-2.6.14-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/swap.h 2005-10-24 10:27:13.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/swap.h 2005-10-25 08:09:52.000000000 -0700
@@ -176,6 +176,9 @@ extern int zone_reclaim(struct zone *, u
extern int shrink_all_memory(int);
extern int vm_swappiness;

+extern int isolate_lru_page(struct page *p);
+extern int putback_lru_pages(struct list_head *l);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);

2005-10-25 19:32:26

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH 3/5] Swap Migration V4: migrate_pages() function

Page migration support in vmscan.c

This patch adds the basic page migration function with a minimal implementation
that only allows the eviction of pages to swap space.

Page eviction and migration may be useful for moving memory between nodes, for
suspending programs, or for remapping single pages (useful for faulty pages or
pages with soft ECC failures).

The process is as follows:

The function wanting to migrate pages must first build a list of pages to be
migrated or evicted and take them off the LRU lists via isolate_lru_page().
isolate_lru_page() determines whether a page can be isolated based on whether
its LRU bit is set.

Then the actual migration or swapout can happen by calling migrate_pages().

migrate_pages() does its best to migrate or swap out the pages and makes multiple
passes over the list. Some pages can only be swapped out once they are no longer
dirty, so migrate_pages() may start writing out dirty pages in the initial passes
over the list. However, migrate_pages() may not be able to migrate or evict all
pages for a variety of reasons.

The remaining pages may be returned to the LRU lists using putback_lru_pages().
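
A condensed sketch of that calling sequence (the helper names below are
illustrative; isolate_lru_page(), migrate_pages() and putback_lru_pages() are
the interfaces added by patches 1/5 and 3/5):

static void collect_page(struct list_head *pagelist, struct page *page)
{
	/* 1. Take the page off the LRU; a return of 1 means we now own it. */
	if (isolate_lru_page(page) == 1)
		list_add(&page->lru, pagelist);
}

static void evict_pages(struct list_head *pagelist)
{
	/* 2. Swap the collected pages out.  The second list is unused in this
	 *    swap-only version, so NULL is passed. */
	migrate_pages(pagelist, NULL);

	/* 3. Return whatever could not be evicted to the LRU lists. */
	if (!list_empty(pagelist))
		putback_lru_pages(pagelist);
}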

Changelog V3->V4:
- Restructure code so that applying the patches to support full migration
requires only minimal changes. Rename swapout_pages() to migrate_pages().

Changelog V2->V3:
- Extract common code from shrink_list() and swapout_pages()

Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc5-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc5-mm1.orig/include/linux/swap.h 2005-10-25 08:09:52.000000000 -0700
+++ linux-2.6.14-rc5-mm1/include/linux/swap.h 2005-10-25 11:04:33.000000000 -0700
@@ -179,6 +179,8 @@ extern int vm_swappiness;
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);

+extern int migrate_pages(struct list_head *l, struct list_head *t);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-25 11:04:27.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-25 11:05:59.000000000 -0700
@@ -368,6 +368,47 @@ static pageout_t pageout(struct page *pa
return PAGE_CLEAN;
}

+static inline int remove_mapping(struct address_space *mapping,
+ struct page *page)
+{
+ if (!mapping)
+ return 0; /* truncate got there first */
+
+ write_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The non-racy check for busy page. It is critical to check
+ * PageDirty _after_ making sure that the page is freeable and
+ * not in use by anybody. (pagecache + us == 2)
+ */
+ if (unlikely(page_count(page) != 2))
+ goto cannot_free;
+ smp_rmb();
+ if (unlikely(PageDirty(page)))
+ goto cannot_free;
+
+#ifdef CONFIG_SWAP
+ if (PageSwapCache(page)) {
+ swp_entry_t swap = { .val = page_private(page) };
+ add_to_swapped_list(swap.val);
+ __delete_from_swap_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ swap_free(swap);
+ __put_page(page); /* The pagecache ref */
+ return 1;
+ }
+#endif /* CONFIG_SWAP */
+
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ __put_page(page);
+ return 1;
+
+cannot_free:
+ write_unlock_irq(&mapping->tree_lock);
+ return 0;
+}
+
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
@@ -506,37 +547,8 @@ static int shrink_list(struct list_head
goto free_it;
}

- if (!mapping)
- goto keep_locked; /* truncate got there first */
-
- write_lock_irq(&mapping->tree_lock);
-
- /*
- * The non-racy check for busy page. It is critical to check
- * PageDirty _after_ making sure that the page is freeable and
- * not in use by anybody. (pagecache + us == 2)
- */
- if (unlikely(page_count(page) != 2))
- goto cannot_free;
- smp_rmb();
- if (unlikely(PageDirty(page)))
- goto cannot_free;
-
-#ifdef CONFIG_SWAP
- if (PageSwapCache(page)) {
- swp_entry_t swap = { .val = page_private(page) };
- add_to_swapped_list(swap.val);
- __delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- swap_free(swap);
- __put_page(page); /* The pagecache ref */
- goto free_it;
- }
-#endif /* CONFIG_SWAP */
-
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);
+ if (!remove_mapping(mapping, page))
+ goto keep_locked;

free_it:
unlock_page(page);
@@ -545,10 +557,6 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;

-cannot_free:
- write_unlock_irq(&mapping->tree_lock);
- goto keep_locked;
-
activate_locked:
SetPageActive(page);
pgactivate++;
@@ -567,6 +575,156 @@ keep:
}

/*
+ * swapout a single page
+ * page is locked upon entry, unlocked on exit
+ *
+ * return codes:
+ * 0 = complete
+ * 1 = retry
+ */
+static int swap_page(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (page_mapped(page) && mapping)
+ if (try_to_unmap(page) != SWAP_SUCCESS)
+ goto unlock_retry;
+
+ if (PageDirty(page)) {
+ /* Page is dirty, try to write it out here */
+ switch(pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto unlock_retry;
+ case PAGE_SUCCESS:
+ goto retry;
+ case PAGE_CLEAN:
+ ; /* try to free the page below */
+ }
+ }
+
+ if (PagePrivate(page)) {
+ if (!try_to_release_page(page, GFP_KERNEL))
+ goto unlock_retry;
+ if (!mapping && page_count(page) == 1)
+ goto free_it;
+ }
+
+ if (!remove_mapping(mapping, page))
+ goto unlock_retry; /* truncate got there first */
+
+free_it:
+ /*
+ * We may free pages that were taken off the active list
+ * by isolate_lru_page. However, free_hot_cold_page will check
+ * if the active bit is set. So clear it.
+ */
+ ClearPageActive(page);
+
+ list_del(&page->lru);
+ unlock_page(page);
+ put_page(page);
+ return 0;
+
+unlock_retry:
+ unlock_page(page);
+
+retry:
+ return 1;
+}
+/*
+ * migrate_pages
+ *
+ * Two lists are passed to this function. The first list
+ * contains the pages isolated from the LRU to be migrated.
+ * The second list contains new pages that the pages isolated
+ * can be moved to. If the second list is NULL then all
+ * pages are swapped out.
+ *
+ * The function returns after 10 attempts or if no pages
+ * are movable anymore because t has become empty
+ * or no retryable pages exist anymore.
+ *
+ * return value (lists contain remaining pages!)
+ * -1 list of new pages has become exhausted.
+ * 0 All page migrated
+ * n Number of pages not migrated
+ *
+ * SIMPLIFIED VERSION: This implementation of migrate_pages
+ * is only swapping out pages and never touches the second
+ * list. The direct migration patchset
+ * extends this function to avoid the use of swap.
+ */
+int migrate_pages(struct list_head *l, struct list_head *t)
+{
+ int retry;
+ int failed;
+ int pass = 0;
+ struct page *page;
+ struct page *page2;
+ int swapwrite = current->flags & PF_SWAPWRITE;
+
+ if (!swapwrite)
+ current->flags |= PF_SWAPWRITE;
+
+redo:
+ retry = 0;
+ failed = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ cond_resched();
+
+ /*
+ * Skip locked pages during the first two passes to give the
+ * functions holding the lock time to release the page. Later we use
+ * lock_page to have a higher chance of acquiring the lock.
+ */
+ if (pass > 2)
+ lock_page(page);
+ else
+ if (TestSetPageLocked(page))
+ goto retry_later;
+
+ /*
+ * Only wait on writeback if we have already done a pass where
+ * we may have triggered writeouts for lots of pages.
+ */
+ if (pass > 0)
+ wait_on_page_writeback(page);
+ else
+ if (PageWriteback(page)) {
+ unlock_page(page);
+ goto retry_later;
+ }
+
+#ifdef CONFIG_SWAP
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ if (!add_to_swap(page)) {
+ unlock_page(page);
+ failed++;
+ continue;
+ }
+ }
+#endif /* CONFIG_SWAP */
+
+ /*
+ * Page is properly locked and writeback is complete.
+ * Try to migrate the page.
+ */
+ if (swap_page(page)) {
+retry_later:
+ retry++;
+ }
+ }
+ if (retry && pass++ < 10)
+ goto redo;
+
+ if (!swapwrite)
+ current->flags &= ~PF_SWAPWRITE;
+ return failed + retry;
+}
+
+/*
* zone->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.

2005-10-26 07:16:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 3/5] Swap Migration V4: migrate_pages() function

On Tue, 2005-10-25 at 12:30 -0700, Christoph Lameter wrote:
>
> +#ifdef CONFIG_SWAP
> + if (PageSwapCache(page)) {
> + swp_entry_t swap = { .val = page_private(page) };
> + add_to_swapped_list(swap.val);
> + __delete_from_swap_cache(page);
> + write_unlock_irq(&mapping->tree_lock);
> + swap_free(swap);
> + __put_page(page); /* The pagecache ref */
> + return 1;
> + }
> +#endif /* CONFIG_SWAP */

Why is this #ifdef needed? PageSwapCache() is #defined to 0 when
!CONFIG_SWAP.

-- Dave

2005-10-26 09:31:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/5] Swap Migration V4: LRU operations

On Tue, 2005-10-25 at 12:30 -0700, Christoph Lameter wrote:

> + if (rc == -1) { /* Not possible to isolate */
> + list_del(&page->lru);
> + list_add(&page->lru, src);
> }

Would the usage of list_move() not be simpler?

Peter Zijlstra

2005-10-26 16:45:42

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/5] Swap Migration V4: LRU operations

On Wed, 26 Oct 2005, Peter Zijlstra wrote:

> On Tue, 2005-10-25 at 12:30 -0700, Christoph Lameter wrote:
>
> > + if (rc == -1) { /* Not possible to isolate */
> > + list_del(&page->lru);
> > + list_add(&page->lru, src);
> > }
>
> Would the usage of list_move() not be simpler?

Hmmm. yes the whole section is a bit weird there (sorry Magnus). How
about this additional patch:

Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-25 08:09:52.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-26 09:42:34.000000000 -0700
@@ -590,22 +590,22 @@ static int isolate_lru_pages(struct zone
{
struct page *page;
int scanned = 0;
- int rc;

while (scanned++ < nr_to_scan && !list_empty(src)) {
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- rc = __isolate_lru_page(zone, page);
-
- BUG_ON(rc == 0); /* PageLRU(page) must be true */
-
- if (rc == 1) /* Succeeded to isolate page */
+ switch (__isolate_lru_page(zone, page)) {
+ case 1:
+ /* Succeeded to isolate page */
list_add(&page->lru, dst);
-
- if (rc == -1) { /* Not possible to isolate */
- list_del(&page->lru);
- list_add(&page->lru, src);
+ break;
+ case -1:
+ /* Not possible to isolate */
+ list_move(&page->lru, src);
+ break;
+ default:
+ BUG();
}
}

2005-10-26 16:49:06

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/5] Swap Migration V4: migrate_pages() function

On Wed, 26 Oct 2005, Dave Hansen wrote:

> Why is this #ifdef needed? PageSwapCache() is #defined to 0 when
> !CONFIG_SWAP.

Right.

Index: linux-2.6.14-rc5-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c 2005-10-26 09:46:20.000000000 -0700
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c 2005-10-26 09:47:33.000000000 -0700
@@ -387,7 +387,6 @@ static inline int remove_mapping(struct
if (unlikely(PageDirty(page)))
goto cannot_free;

-#ifdef CONFIG_SWAP
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
add_to_swapped_list(swap.val);
@@ -397,7 +396,6 @@ static inline int remove_mapping(struct
__put_page(page); /* The pagecache ref */
return 1;
}
-#endif /* CONFIG_SWAP */

__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);