2005-10-20 22:59:52

by Christoph Lameter

Subject: [PATCH 0/4] Swap migration V3: Overview

Changes from V2 to V3:
- Break out common code for page eviction (Thanks to a patch by Magnus Damm)
- Add check to avoid MPOL_MF_MOVE moving pages that are also accessed from
another address space. Add support for MPOL_MF_MOVE_ALL to override this
(requires superuser privileges).
- Update overview to note that the direct page migration patchset will follow soon;
cut long-winded explanations.
- Add sys_migrate_pages patch
- Check cpuset restrictions in sys_migrate_pages.

Changes from V1 to V2:
- Patch against 2.6.14-rc4-mm1
- Remove move_pages() function
- Code cleanup to make it less invasive.
- Fix missing lru_add_drain() invocation from isolate_lru_page()

In a NUMA system it is often beneficial to be able to move the memory
in use by a process to different nodes in order to enhance performance.
Currently Linux simply does not support this facility. This patchset
implements page migration via a new syscall sys_migrate_pages and via
the memory policy layer with the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL
flags.

Page migration is also useful for other purposes:

1. Memory hotplug. Migrating processes off a memory node that is going
to be disconnected.

2. Remapping of bad pages. These could be detected through soft ECC errors
and other mechanisms.

This patchset realizes swap based page migration. Another patchset will
follow soon (done by Mike Kravetz and me based on the hotplug direct page
migration code, draft exists) that implements direct page migration on top
of the framework established by the swap based page migration patchset.

The advantage of swap based page migration is that the necessary changes to the kernel
are minimal. With a fully functional but minimal page migration capability we
will be able to enhance low level code and higher level APIs at the same time.
This will hopefully decrease the time needed to get the code for direct page
migration working and into the kernel trees. We hope that the swap based
page migration will be included in 2.6.15.

The patchset consists of four patches:

1. LRU operations

Add basic operations to remove pages from the LRU lists and later return
them to the LRU.

2. Page eviction

Adds a function to mm/vmscan.c called swapout_pages that forces pages
out to swap.

3. MPOL_MF_MOVE flag for memory policies.

This implements MPOL_MF_MOVE in addition to MPOL_MF_STRICT. MPOL_MF_STRICT
allows checking whether all pages in a memory area obey the memory policies.
MPOL_MF_MOVE will evict all pages that do not conform to the memory policy.
The system will allocate pages conforming to the policy on swap in.

4. sys_migrate_pages system call and cpuset API

Adds a new system call

sys_migrate_pages(pid, maxnode, from_nodes, to_nodes)

to migrate pages of a process to a different node.
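
A minimal user-space sketch of how the new call might be invoked (for
illustration only; the raw syscall() invocation and the fallback definition of
__NR_migrate_pages are assumptions, since no library wrapper exists yet):

#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef __NR_migrate_pages
#define __NR_migrate_pages 256	/* x86_64 value from this patchset; adjust per arch */
#endif

int main(void)
{
	unsigned long from = 1UL << 0;	/* move pages away from node 0 ... */
	unsigned long to   = 1UL << 1;	/* ... onto node 1 */
	long ret;

	/* pid 0 means the calling process; pass the full word width as
	   maxnode so both node bits are covered. */
	ret = syscall(__NR_migrate_pages, 0, 8 * sizeof(unsigned long),
		      &from, &to);
	if (ret < 0)
		perror("migrate_pages");
	else
		printf("%ld pages could not be moved\n", ret);
	return 0;
}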

URLs referring to the discussion regarding the initial version of these
patches.

Page eviction: http://marc.theaimsgroup.com/?l=linux-mm&m=112922756730989&w=2
Numa policy : http://marc.theaimsgroup.com/?l=linux-mm&m=112922756724715&w=2

Discussion of V2 of the patchset:
http://marc.theaimsgroup.com/?t=112959680300007&r=1&w=2


2005-10-20 23:00:21

by Christoph Lameter

Subject: [PATCH 2/4] Swap migration V3: Page Eviction

Page eviction support in vmscan.c

This patch adds functions that allow the eviction of pages to swap space.
Page eviction may be useful to migrate pages, to suspend programs or for
unmapping single pages (useful for faulty pages or pages with soft ECC
failures).

The process is as follows:

The function wanting to evict pages must first build a list of pages to be evicted
and take them off the LRU lists. This is done using the isolate_lru_page function.
isolate_lru_page determines whether a page is freeable based on its LRU bit and,
if it is, adds the page to the specified list.
isolate_lru_page returns 0 for a page that is not freeable.

Then the actual swapout can happen by calling swapout_pages().

swapout_pages does its best to swap the pages out, making multiple passes over the list.
However, swapout_pages may not be able to evict all pages for a variety of reasons.

The remaining pages may be returned to the LRU lists using putback_lru_pages().
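
Put together, a kernel-side caller would use the interface roughly as follows
(a sketch only; the caller, its arguments and the way candidate pages are
found are hypothetical and not part of this patch):

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/list.h>

/* Evict an array of candidate pages; returns the number of pages that
   could not be evicted. Error handling of isolate_lru_page() is omitted. */
static int evict_candidate_pages(struct page **pages, int nr)
{
	LIST_HEAD(pagelist);
	int failed;
	int i;

	/* Step 1: take the candidates off the LRU lists. */
	for (i = 0; i < nr; i++)
		isolate_lru_page(pages[i], &pagelist);

	/* Step 2: push the collected pages out to swap. Pages that were
	   evicted successfully are removed from the list. */
	failed = swapout_pages(&pagelist);

	/* Step 3: put whatever could not be evicted back on the LRU. */
	if (!list_empty(&pagelist))
		putback_lru_pages(&pagelist);

	return failed;
}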

V3:
- Extract common code from shrink_list() and swapout_pages()

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/swap.h 2005-10-20 13:13:24.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/swap.h 2005-10-20 13:20:53.000000000 -0700
@@ -179,6 +179,8 @@ extern int vm_swappiness;
extern int isolate_lru_page(struct page *p, struct list_head *l);
extern int putback_lru_pages(struct list_head *l);

+extern int swapout_pages(struct list_head *l);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
Index: linux-2.6.14-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/mm/vmscan.c 2005-10-20 13:18:05.000000000 -0700
+++ linux-2.6.14-rc4-mm1/mm/vmscan.c 2005-10-20 13:22:33.000000000 -0700
@@ -370,6 +370,42 @@ static pageout_t pageout(struct page *pa
return PAGE_CLEAN;
}

+static inline int remove_mapping(struct address_space *mapping,
+ struct page *page)
+{
+ if (!mapping)
+ return 0; /* truncate got there first */
+
+ write_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The non-racy check for busy page. It is critical to check
+ * PageDirty _after_ making sure that the page is freeable and
+ * not in use by anybody. (pagecache + us == 2)
+ */
+ if (page_count(page) != 2 || PageDirty(page)) {
+ write_unlock_irq(&mapping->tree_lock);
+ return 0;
+ }
+
+#ifdef CONFIG_SWAP
+ if (PageSwapCache(page)) {
+ swp_entry_t swap = { .val = page->private };
+ add_to_swapped_list(swap.val);
+ __delete_from_swap_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ swap_free(swap);
+ __put_page(page); /* The pagecache ref */
+ return 1;
+ }
+#endif /* CONFIG_SWAP */
+
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ __put_page(page);
+ return 1;
+}
+
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
@@ -508,36 +544,8 @@ static int shrink_list(struct list_head
goto free_it;
}

- if (!mapping)
- goto keep_locked; /* truncate got there first */
-
- write_lock_irq(&mapping->tree_lock);
-
- /*
- * The non-racy check for busy page. It is critical to check
- * PageDirty _after_ making sure that the page is freeable and
- * not in use by anybody. (pagecache + us == 2)
- */
- if (page_count(page) != 2 || PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
+ if (!remove_mapping(mapping, page))
goto keep_locked;
- }
-
-#ifdef CONFIG_SWAP
- if (PageSwapCache(page)) {
- swp_entry_t swap = { .val = page->private };
- add_to_swapped_list(swap.val);
- __delete_from_swap_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- swap_free(swap);
- __put_page(page); /* The pagecache ref */
- goto free_it;
- }
-#endif /* CONFIG_SWAP */
-
- __remove_from_page_cache(page);
- write_unlock_irq(&mapping->tree_lock);
- __put_page(page);

free_it:
unlock_page(page);
@@ -564,6 +572,122 @@ keep:
}

/*
+ * Swapout evicts the pages on the list to swap space.
+ * This is essentially a dumbed down version of shrink_list
+ *
+ * returns the number of pages that were not evictable
+ *
+ * Multiple passes are performed over the list. The first
+ * pass avoids waiting on locks and triggers writeout
+ * actions. Later passes begin to wait on locks in order
+ * to have a better chance of acquiring the lock.
+ */
+int swapout_pages(struct list_head *l)
+{
+ int retry;
+ int failed;
+ int pass = 0;
+ struct page *page;
+ struct page *page2;
+
+ current->flags |= PF_KSWAPD;
+
+redo:
+ retry = 0;
+ failed = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ struct address_space *mapping;
+
+ cond_resched();
+
+ /*
+ * Skip locked pages during the first two passes to give the
+ * functions holding the lock time to release the page. Later we use
+ * lock_page to have a higher chance of acquiring the lock.
+ */
+ if (pass > 2)
+ lock_page(page);
+ else
+ if (TestSetPageLocked(page))
+ goto retry_later;
+
+ /*
+ * Only wait on writeback if we have already done a pass where
+ * we may have triggered writeouts for lots of pages.
+ */
+ if (pass > 0)
+ wait_on_page_writeback(page);
+ else
+ if (PageWriteback(page))
+ goto retry_later_locked;
+
+#ifdef CONFIG_SWAP
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ if (!add_to_swap(page))
+ goto failed;
+ }
+#endif /* CONFIG_SWAP */
+
+ mapping = page_mapping(page);
+ if (page_mapped(page) && mapping)
+ if (try_to_unmap(page) != SWAP_SUCCESS)
+ goto retry_later_locked;
+
+ if (PageDirty(page)) {
+ /* Page is dirty, try to write it out here */
+ switch(pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto retry_later_locked;
+ case PAGE_SUCCESS:
+ goto retry_later;
+ case PAGE_CLEAN:
+ ; /* try to free the page below */
+ }
+ }
+
+ if (PagePrivate(page)) {
+ if (!try_to_release_page(page, GFP_KERNEL))
+ goto retry_later_locked;
+ if (!mapping && page_count(page) == 1)
+ goto free_it;
+ }
+
+ if (!remove_mapping(mapping, page))
+ goto retry_later_locked; /* truncate got there first */
+
+free_it:
+ /*
+ * We may free pages that were taken off the active list
+ * by isolate_lru_page. However, free_hot_cold_page will check
+ * if the active bit is set. So clear it.
+ */
+ ClearPageActive(page);
+
+ list_del(&page->lru);
+ unlock_page(page);
+ put_page(page);
+ continue;
+
+failed:
+ failed++;
+ unlock_page(page);
+ continue;
+
+retry_later_locked:
+ unlock_page(page);
+retry_later:
+ retry++;
+ }
+ if (retry && pass++ < 10)
+ goto redo;
+
+ current->flags &= ~PF_KSWAPD;
+ return failed + retry;
+}
+
+/*
* zone->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.

2005-10-20 23:00:00

by Christoph Lameter

Subject: [PATCH 3/4] Swap migration V3: MPOL_MF_MOVE interface

Add page migration support via swap to the NUMA policy layer

This patch adds page migration support to the NUMA policy layer. An additional
flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is specified then
pages that do not conform to the memory policy will be evicted from memory.
When the evicted pages are faulted back in, new pages will be allocated following the numa policy.
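
For illustration, a call using the new flag might look like the following
sketch (mbind() and MPOL_BIND are the existing NUMA API; the fallback
definition of MPOL_MF_MOVE mirrors the value added by this patch):

#include <numaif.h>		/* mbind(), MPOL_BIND */

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* flag value introduced by this patchset */
#endif

long bind_and_move_to_node1(void *addr, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* node 1 only */

	/* Existing pages in [addr, addr + len) that are not on node 1 are
	   evicted to swap and will be faulted back in on node 1. */
	return mbind(addr, len, MPOL_BIND, &nodemask,
		     8 * sizeof(nodemask), MPOL_MF_MOVE);
}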

Version 2
- Add vma_migratable() function for future enhancements.
- No side effects on WARN_ON
- Remove move_pages for now
- Make patch fit 2.6.14-rc4-mm1

Version 3
- Add check to not migrate pages shared with other processes (but allow
migration of memory shared between threads having a common mm_struct)
- MPOL_MF_MOVE_ALL to override and move even pages shared with other
processes. This only works if the process issuing this call has
CAP_SYS_RESOURCE because this enables the moving of pages owned
by other processes.
- MPOL_MF_DISCONTIG_OK (internal use only) to skip the check for contiguous VMAs.
Enable MPOL_MF_DISCONTIG_OK if policy to be set is NULL (default policy).

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/mm/mempolicy.c 2005-10-17 10:24:16.000000000 -0700
+++ linux-2.6.14-rc4-mm1/mm/mempolicy.c 2005-10-20 13:33:12.000000000 -0700
@@ -83,9 +83,13 @@
#include <linux/init.h>
#include <linux/compat.h>
#include <linux/mempolicy.h>
+#include <linux/swap.h>
#include <asm/tlbflush.h>
#include <asm/uaccess.h>

+/* Internal MPOL_MF_xxx flags */
+#define MPOL_MF_DISCONTIG_OK (1<<20) /* Skip checks for continuous vmas */
+
static kmem_cache_t *policy_cache;
static kmem_cache_t *sn_cache;

@@ -179,9 +183,62 @@ static struct mempolicy *mpol_new(int mo
return policy;
}

+/* Check if we are the only process mapping the page in question */
+static inline int single_mm_mapping(struct mm_struct *mm,
+ struct address_space *mapping)
+{
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int rc = 1;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+ if (mm != vma->vm_mm) {
+ rc = 0;
+ goto out;
+ }
+out:
+ spin_unlock(&mapping->i_mmap_lock);
+ return rc;
+}
+
+/*
+ * Add a page to be migrated to the pagelist
+ */
+static void migrate_page_add(struct vm_area_struct *vma,
+ struct page *page, struct list_head *pagelist, unsigned long flags)
+{
+ int rc;
+
+ /*
+ * Avoid migrating a page that is shared by others and not writable.
+ */
+ if ((flags & MPOL_MF_MOVE_ALL) ||
+ PageAnon(page) ||
+ mapping_writably_mapped(page->mapping) ||
+ single_mm_mapping(vma->vm_mm, page->mapping)
+ ) {
+
+ rc = isolate_lru_page(page, pagelist);
+ /*
+ * If the isolate attempt was not successful
+ * then we just encountered an unswappable
+ * page. Something must be wrong.
+ */
+ WARN_ON(rc == 0);
+ }
+}
+
/* Ensure all existing pages follow the policy. */
static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pte_t *orig_pte;
pte_t *pte;
@@ -200,15 +257,23 @@ static int check_pte_range(struct vm_are
continue;
}
nid = pfn_to_nid(pfn);
- if (!node_isset(nid, *nodes))
- break;
+ if (!node_isset(nid, *nodes)) {
+ if (pagelist) {
+ struct page *page = pfn_to_page(pfn);
+
+ migrate_page_add(vma, page, pagelist, flags);
+ } else
+ break;
+ }
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
return addr != end;
}

static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pmd_t *pmd;
unsigned long next;
@@ -218,14 +283,17 @@ static inline int check_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
- if (check_pte_range(vma, pmd, addr, next, nodes))
+ if (check_pte_range(vma, pmd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pmd++, addr = next, addr != end);
return 0;
}

static inline int check_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pud_t *pud;
unsigned long next;
@@ -235,14 +303,17 @@ static inline int check_pud_range(struct
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- if (check_pmd_range(vma, pud, addr, next, nodes))
+ if (check_pmd_range(vma, pud, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pud++, addr = next, addr != end);
return 0;
}

static inline int check_pgd_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, nodemask_t *nodes)
+ unsigned long addr, unsigned long end,
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
pgd_t *pgd;
unsigned long next;
@@ -252,16 +323,31 @@ static inline int check_pgd_range(struct
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- if (check_pud_range(vma, pgd, addr, next, nodes))
+ if (check_pud_range(vma, pgd, addr, next, nodes,
+ flags, pagelist))
return -EIO;
} while (pgd++, addr = next, addr != end);
return 0;
}

+/* Check if a vma is migratable */
+static inline int vma_migratable(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & (
+ VM_LOCKED |
+ VM_IO |
+ VM_RESERVED |
+ VM_DENYWRITE |
+ VM_SHM
+ ))
+ return 0;
+ return 1;
+}
+
/* Step 1: check the range */
static struct vm_area_struct *
check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
- nodemask_t *nodes, unsigned long flags)
+ nodemask_t *nodes, unsigned long flags, struct list_head *pagelist)
{
int err;
struct vm_area_struct *first, *vma, *prev;
@@ -273,17 +359,24 @@ check_range(struct mm_struct *mm, unsign
return ERR_PTR(-EACCES);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
- if (!vma->vm_next && vma->vm_end < end)
- return ERR_PTR(-EFAULT);
- if (prev && prev->vm_end < vma->vm_start)
- return ERR_PTR(-EFAULT);
- if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
+ if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+ if (!vma->vm_next && vma->vm_end < end)
+ return ERR_PTR(-EFAULT);
+ if (prev && prev->vm_end < vma->vm_start)
+ return ERR_PTR(-EFAULT);
+ }
+ if (!is_vm_hugetlb_page(vma) &&
+ ((flags & MPOL_MF_STRICT) ||
+ ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+ vma_migratable(vma)
+ ))) {
unsigned long endvma = vma->vm_end;
if (endvma > end)
endvma = end;
if (vma->vm_start > start)
start = vma->vm_start;
- err = check_pgd_range(vma, start, endvma, nodes);
+ err = check_pgd_range(vma, start, endvma, nodes,
+ flags, pagelist);
if (err) {
first = ERR_PTR(err);
break;
@@ -357,33 +450,59 @@ long do_mbind(unsigned long start, unsig
struct mempolicy *new;
unsigned long end;
int err;
+ LIST_HEAD(pagelist);

- if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+ if ((flags & ~(unsigned long)(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ || mode > MPOL_MAX)
return -EINVAL;
+ if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
if (start & ~PAGE_MASK)
return -EINVAL;
+
if (mode == MPOL_DEFAULT)
flags &= ~MPOL_MF_STRICT;
+
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
end = start + len;
+
if (end < start)
return -EINVAL;
if (end == start)
return 0;
+
if (mpol_check_policy(mode, nmask))
return -EINVAL;
+
new = mpol_new(mode, nmask);
if (IS_ERR(new))
return PTR_ERR(new);

+ /*
+ * If we are using the default policy then operation
+ * on discontinuous address spaces is okay after all
+ */
+ if (!new)
+ flags |= MPOL_MF_DISCONTIG_OK;
+
PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
mode,nodes_addr(nodes)[0]);

down_write(&mm->mmap_sem);
- vma = check_range(mm, start, end, nmask, flags);
+ vma = check_range(mm, start, end, nmask, flags,
+ (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ? &pagelist : NULL);
err = PTR_ERR(vma);
- if (!IS_ERR(vma))
+ if (!IS_ERR(vma)) {
err = mbind_range(vma, start, end, new);
+ if (!list_empty(&pagelist))
+ swapout_pages(&pagelist);
+ if (!err && !list_empty(&pagelist) && (flags & MPOL_MF_STRICT))
+ err = -EIO;
+ }
+ if (!list_empty(&pagelist))
+ putback_lru_pages(&pagelist);
+
up_write(&mm->mmap_sem);
mpol_free(new);
return err;
Index: linux-2.6.14-rc4-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/mempolicy.h 2005-10-17 10:24:13.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/mempolicy.h 2005-10-20 13:26:28.000000000 -0700
@@ -22,6 +22,8 @@

/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */

#ifdef __KERNEL__

2005-10-20 23:00:37

by Christoph Lameter

Subject: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the
first half of 2005 on [email protected] and [email protected].

The intent of sys_migrate_pages is to migrate the memory of a process. A process may have
migrated to another node. Memory was allocated optimally for the prior context.
sys_migrate_pages allows shifting the memory to the new node.

sys_migrate_pages is also useful for manually moving a process's memory if the
process's available memory nodes have changed through cpuset operations. Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed. However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically move
memory. In contrast to Ray's implementation, sys_migrate_pages does not modify policies.

The current code here is based on the swap based page migration capability and is thus
not able to preserve the physical layout relative to its containing nodeset (which
may be a cpuset). When direct page migration becomes available, the
implementation needs to be changed to do an isomorphic move of pages between different
nodesets. The current implementation simply evicts all pages on nodes that are in
the source nodeset but not in the target nodeset.

Patch supports ia64, i386, x86_64 and ppc64. Patch not tested on ppc64.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/mm/mempolicy.c 2005-10-20 13:33:12.000000000 -0700
+++ linux-2.6.14-rc4-mm1/mm/mempolicy.c 2005-10-20 14:45:45.000000000 -0700
@@ -627,6 +627,36 @@ long do_get_mempolicy(int *policy, nodem
}

/*
+ * For now migrate_pages simply swaps out the pages from nodes that are in
+ * the source set but not in the target set. In the future, we would
+ * want a function that moves pages between the two nodesets in such
+ * a way as to preserve the physical layout as much as possible.
+ *
+ * Returns the number of pages that could not be moved.
+ */
+int do_migrate_pages(struct mm_struct *mm,
+ nodemask_t *from_nodes, nodemask_t *to_nodes, int flags)
+{
+ LIST_HEAD(pagelist);
+ int count = 0;
+ nodemask_t nodes;
+
+ nodes_andnot(nodes, *from_nodes, *to_nodes);
+ nodes_complement(nodes, nodes);
+
+ down_read(&mm->mmap_sem);
+ check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nodes,
+ flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+ if (!list_empty(&pagelist)) {
+ swapout_pages(&pagelist);
+ if (!list_empty(&pagelist))
+ count = putback_lru_pages(&pagelist);
+ }
+ up_read(&mm->mmap_sem);
+ return count;
+}
+
+/*
* User space interface with variable sized bitmaps for nodelists.
*/

@@ -720,6 +750,51 @@ asmlinkage long sys_set_mempolicy(int mo
return do_set_mempolicy(mode, &nodes);
}

+/* Macro needed until Paul implements this function in kernel/cpusets.c */
+#define cpuset_mems_allowed(task) node_online_map
+
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ unsigned long __user *old_nodes,
+ unsigned long __user *new_nodes)
+{
+ struct mm_struct *mm;
+ struct task_struct *task;
+ nodemask_t old;
+ nodemask_t new;
+ int err;
+
+ err = get_nodes(&old, old_nodes, maxnode);
+ if (err)
+ return err;
+
+ err = get_nodes(&new, new_nodes, maxnode);
+ if (err)
+ return err;
+
+ /* Find the mm_struct */
+ read_lock(&tasklist_lock);
+ task = pid ? find_task_by_pid(pid) : current;
+ if (!task) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ mm = get_task_mm(task);
+ read_unlock(&tasklist_lock);
+
+ if (!mm)
+ return -EINVAL;
+
+ /* Is the user allowed to access the target nodes? */
+ if (!nodes_subset(new, cpuset_mems_allowed(task)))
+ return -EPERM;
+
+ err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
+
+ mmput(mm);
+ return err;
+}
+
+
/* Retrieve NUMA policy */
asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
Index: linux-2.6.14-rc4-mm1/kernel/sys_ni.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/kernel/sys_ni.c 2005-10-10 18:19:19.000000000 -0700
+++ linux-2.6.14-rc4-mm1/kernel/sys_ni.c 2005-10-20 13:34:43.000000000 -0700
@@ -82,6 +82,7 @@ cond_syscall(compat_sys_socketcall);
cond_syscall(sys_inotify_init);
cond_syscall(sys_inotify_add_watch);
cond_syscall(sys_inotify_rm_watch);
+cond_syscall(sys_migrate_pages);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
Index: linux-2.6.14-rc4-mm1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.14-rc4-mm1.orig/arch/ia64/kernel/entry.S 2005-10-10 18:19:19.000000000 -0700
+++ linux-2.6.14-rc4-mm1/arch/ia64/kernel/entry.S 2005-10-20 13:34:43.000000000 -0700
@@ -1600,5 +1600,6 @@ sys_call_table:
data8 sys_inotify_init
data8 sys_inotify_add_watch
data8 sys_inotify_rm_watch
+ data8 sys_migrate_pages

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.14-rc4-mm1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/asm-ia64/unistd.h 2005-10-17 10:24:22.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/asm-ia64/unistd.h 2005-10-20 13:34:43.000000000 -0700
@@ -269,12 +269,12 @@
#define __NR_inotify_init 1277
#define __NR_inotify_add_watch 1278
#define __NR_inotify_rm_watch 1279
-
+#define __NR_migrate_pages 1280
#ifdef __KERNEL__

#include <linux/config.h>

-#define NR_syscalls 256 /* length of syscall table */
+#define NR_syscalls 257 /* length of syscall table */

#define __ARCH_WANT_SYS_RT_SIGACTION

Index: linux-2.6.14-rc4-mm1/arch/ppc64/kernel/misc.S
===================================================================
--- linux-2.6.14-rc4-mm1.orig/arch/ppc64/kernel/misc.S 2005-10-17 10:24:18.000000000 -0700
+++ linux-2.6.14-rc4-mm1/arch/ppc64/kernel/misc.S 2005-10-20 13:34:43.000000000 -0700
@@ -1581,3 +1581,4 @@ _GLOBAL(sys_call_table)
.llong .sys_inotify_init /* 275 */
.llong .sys_inotify_add_watch
.llong .sys_inotify_rm_watch
+ .llong .sys_migrate_pages
Index: linux-2.6.14-rc4-mm1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.14-rc4-mm1.orig/arch/i386/kernel/syscall_table.S 2005-10-10 18:19:19.000000000 -0700
+++ linux-2.6.14-rc4-mm1/arch/i386/kernel/syscall_table.S 2005-10-20 13:34:44.000000000 -0700
@@ -294,3 +294,5 @@ ENTRY(sys_call_table)
.long sys_inotify_init
.long sys_inotify_add_watch
.long sys_inotify_rm_watch
+ .long sys_migrate_pages
+
Index: linux-2.6.14-rc4-mm1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/asm-x86_64/unistd.h 2005-10-17 10:24:22.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/asm-x86_64/unistd.h 2005-10-20 13:34:44.000000000 -0700
@@ -571,8 +571,10 @@ __SYSCALL(__NR_inotify_init, sys_inotify
__SYSCALL(__NR_inotify_add_watch, sys_inotify_add_watch)
#define __NR_inotify_rm_watch 255
__SYSCALL(__NR_inotify_rm_watch, sys_inotify_rm_watch)
+#define __NR_migrate_pages 256
+__SYSCALL(__NR_migrate_pages, sys_migrate_pages)

-#define __NR_syscall_max __NR_inotify_rm_watch
+#define __NR_syscall_max __NR_migrate_pages
#ifndef __NO_STUBS

/* user-visible error numbers are in the range -1 - -4095 */
Index: linux-2.6.14-rc4-mm1/include/linux/syscalls.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/syscalls.h 2005-10-17 10:24:22.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/syscalls.h 2005-10-20 13:34:44.000000000 -0700
@@ -511,5 +511,7 @@ asmlinkage long sys_ioprio_set(int which
asmlinkage long sys_ioprio_get(int which, int who);
asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
unsigned long maxnode);
+asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
+ unsigned long __user *from, unsigned long __user *to);

#endif
Index: linux-2.6.14-rc4-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/mempolicy.h 2005-10-20 13:26:28.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/mempolicy.h 2005-10-20 13:50:00.000000000 -0700
@@ -159,6 +159,9 @@ extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern struct mempolicy default_policy;

+int do_migrate_pages(struct mm_struct *mm,
+ nodemask_t *from_nodes, nodemask_t *to_nodes, int flags);
+
#else

struct mempolicy {};

2005-10-20 23:01:07

by Christoph Lameter

Subject: [PATCH 1/4] Swap migration V3: LRU operations

Implement functions to isolate pages from the LRU and put them back later.

From Magnus:

This patch for 2.6.14-rc4-mm1 breaks out isolate_lru_page() and
putback_lru_page() and makes them inline. I'd like to build my code on
top of this patch, and I think your page eviction code could be built
on top of this patch too - without introducing too much duplicated
code.
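
For illustration, a hypothetical caller of the two inline helpers might look
like this (a sketch of the locking contract only; the function itself is not
part of the patch):

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/list.h>

/* Both helpers require zone->lru_lock to be held by the caller. */
static int grab_page_for_migration(struct page *page, struct list_head *l)
{
	struct zone *zone = page_zone(page);
	int rc;

	spin_lock_irq(&zone->lru_lock);
	rc = __isolate_lru_page(zone, page);
	if (rc == 1)			/* removed from the LRU lists */
		list_add(&page->lru, l);
	/* rc == 0: page was not on the LRU; rc == -1: the page is being
	   freed elsewhere and __isolate_lru_page restored its LRU bit. */
	spin_unlock_irq(&zone->lru_lock);
	return rc;
}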

Signed-off-by: Magnus Damm <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-rc4-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/mm_inline.h 2005-10-10 18:19:19.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/mm_inline.h 2005-10-20 10:45:40.000000000 -0700
@@ -38,3 +38,55 @@ del_page_from_lru(struct zone *zone, str
zone->nr_inactive--;
}
}
+
+/*
+ * Isolate one page from the LRU lists.
+ *
+ * - zone->lru_lock must be held
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list
+ * -1 = page is being freed elsewhere.
+ */
+static inline int
+__isolate_lru_page(struct zone *zone, struct page *page)
+{
+ if (TestClearPageLRU(page)) {
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ return -1;
+ } else {
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Add isolated page back on the LRU lists
+ *
+ * - zone->lru_lock must be held
+ * - page must already be removed from other list
+ * - additional call to put_page() is needed
+ */
+static inline void
+__putback_lru_page(struct zone *zone, struct page *page)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+}
Index: linux-2.6.14-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/mm/vmscan.c 2005-10-17 10:24:30.000000000 -0700
+++ linux-2.6.14-rc4-mm1/mm/vmscan.c 2005-10-20 13:18:05.000000000 -0700
@@ -573,43 +573,75 @@ keep:
*
* Appropriate locks must be held before calling this function.
*
+ * @zone: The zone where lru_lock is held.
* @nr_to_scan: The number of pages to look through on the list.
* @src: The LRU list to pull pages off.
* @dst: The temp list to put pages on to.
- * @scanned: The number of pages that were scanned.
*
- * returns how many pages were moved onto *@dst.
+ * returns the number of pages that were scanned.
*/
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+static int isolate_lru_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *src, struct list_head *dst)
{
- int nr_taken = 0;
struct page *page;
- int scan = 0;
+ int scanned = 0;
+ int rc;

- while (scan++ < nr_to_scan && !list_empty(src)) {
+ while (scanned++ < nr_to_scan && !list_empty(src)) {
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, src);
- continue;
- } else {
+ rc = __isolate_lru_page(zone, page);
+
+ BUG_ON(rc == 0); /* PageLRU(page) must be true */
+
+ if (rc == 1) /* Succeeded to isolate page */
list_add(&page->lru, dst);
- nr_taken++;
+
+ if (rc == -1) { /* Not possible to isolate */
+ list_del(&page->lru);
+ list_add(&page->lru, src);
}
}

- *scanned = scan;
- return nr_taken;
+ return scanned;
+}
+
+static void lru_add_drain_per_cpu(void *dummy)
+{
+ lru_add_drain();
+}
+
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list. Do necessary cache draining if the
+ * page is not on the LRU lists yet.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ * -1 = page is being freed elsewhere.
+ */
+int isolate_lru_page(struct page *page, struct list_head *l)
+{
+ int rc = 0;
+ struct zone *zone = page_zone(page);
+
+redo:
+ spin_lock_irq(&zone->lru_lock);
+ rc = __isolate_lru_page(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ if (rc == 0) {
+ /*
+ * Maybe this page is still waiting for a cpu to drain it
+ * from one of the lru lists?
+ */
+ smp_call_function(&lru_add_drain_per_cpu, NULL, 0 , 1);
+ lru_add_drain();
+ if (PageLRU(page))
+ goto redo;
+ }
+ return rc;
}

/*
@@ -627,18 +659,15 @@ static void shrink_cache(struct zone *zo
spin_lock_irq(&zone->lru_lock);
while (max_scan > 0) {
struct page *page;
- int nr_taken;
int nr_scan;
int nr_freed;

- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
- &page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
+ nr_scan = isolate_lru_pages(zone, sc->swap_cluster_max,
+ &zone->inactive_list, &page_list);
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);

- if (nr_taken == 0)
+ if (list_empty(&page_list))
goto done;

max_scan -= nr_scan;
@@ -658,13 +687,9 @@ static void shrink_cache(struct zone *zo
*/
while (!list_empty(&page_list)) {
page = lru_to_page(&page_list);
- if (TestSetPageLRU(page))
- BUG();
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ __putback_lru_page(zone, page);
+
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -678,6 +703,33 @@ done:
}

/*
+ * Add isolated pages on the list back to the LRU
+ * Determines the zone for each page and takes
+ * the necessary lru lock for each page.
+ *
+ * returns the number of pages put back.
+ */
+int putback_lru_pages(struct list_head *l)
+{
+ struct page * page;
+ struct page * page2;
+ int count = 0;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ struct zone *zone = page_zone(page);
+
+ list_del(&page->lru);
+ spin_lock_irq(&zone->lru_lock);
+ __putback_lru_page(zone, page);
+ spin_unlock_irq(&zone->lru_lock);
+ count++;
+ /* Undo the get from isolate_lru_page */
+ put_page(page);
+ }
+ return count;
+}
+
+/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -713,10 +765,9 @@ refill_inactive_zone(struct zone *zone,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
- &l_hold, &pgscanned);
+ pgscanned = isolate_lru_pages(zone, nr_pages,
+ &zone->active_list, &l_hold);
zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);

/*
Index: linux-2.6.14-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/swap.h 2005-10-17 10:24:16.000000000 -0700
+++ linux-2.6.14-rc4-mm1/include/linux/swap.h 2005-10-20 13:13:24.000000000 -0700
@@ -176,6 +176,9 @@ extern int zone_reclaim(struct zone *, u
extern int shrink_all_memory(int);
extern int vm_swappiness;

+extern int isolate_lru_page(struct page *p, struct list_head *l);
+extern int putback_lru_pages(struct list_head *l);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);

2005-10-20 23:07:26

by Andrew Morton

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Christoph Lameter <[email protected]> wrote:
>
> Page migration is also useful for other purposes:
>
> 1. Memory hotplug. Migrating processes off a memory node that is going
> to be disconnected.
>
> 2. Remapping of bad pages. These could be detected through soft ECC errors
> and other mechanisms.

It's only useful for these things if it works with close-to-100% reliability.

And there are all sorts of things which will prevent that - mlock,
ongoing direct-io, hugepages, whatever.

So before we can commit ourselves to the initial parts of this path we'd
need some reassurance that the overall scheme addresses these things and
that the end result has a high probability of supporting hot unplug and
remapping sufficiently well.

2005-10-20 23:46:30

by Mike Kravetz

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Thu, Oct 20, 2005 at 04:06:38PM -0700, Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
> >
> > Page migration is also useful for other purposes:
> >
> > 1. Memory hotplug. Migrating processes off a memory node that is going
> > to be disconnected.
> >
> > 2. Remapping of bad pages. These could be detected through soft ECC errors
> > and other mechanisms.
>
> It's only useful for these things if it works with close-to-100% reliability.
>
> And there are are all sorts of things which will prevent that - mlock,
> ongoing direct-io, hugepages, whatever.

Since soft errors could happen almost anywhere, you are not going to get
close to 100% there. 'General purpose' memory hotplug is going to need
some type of page/memory grouping like Mel Gorman's fragmentation avoidance
patches. Using such groupings, you can almost always find 'some' section
that can be offlined. It is not close to 100%, but without it your chances
of finding a section are closer to 0%. For applications of hotplug where
the only requirement is to remove a quantity of memory (and we are not
concerned about specific physical sections of memory) this appears to be
a viable approach. Once you start talking about removing specific pieces
of memory, I think the only granularity that makes sense at this time is
an entire NUMA node. Here, I 'think' you could limit the type of allocations
made on the node to something like highmem. But, I haven't been looking
into the offlining of specific sections. Only the offlining of any section.

Just to be clear, there are at least two distinct requirements for hotplug.
One only wants to remove a quantity of memory (location unimportant). The
other wants to remove a specific section of memory (location specific). I
think the first is easier to address.

--
Mike

2005-10-21 01:57:04

by Magnus Damm

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On 10/21/05, Christoph Lameter <[email protected]> wrote:
> Page migration is also useful for other purposes:
>
> 1. Memory hotplug. Migrating processes off a memory node that is going
> to be disconnected.
>
> 2. Remapping of bad pages. These could be detected through soft ECC errors
> and other mechanisms.

3. Migrating between zones.

The current per-zone LRU design might have some drawbacks. I would
prefer a per-node LRU to avoid certain zones needing to shrink more
often than others. But maybe that is not the case; please let me know
if I'm wrong.

If you think about it, say that a certain user space page happens to
be allocated from the DMA zone, and for some reason this DMA zone is
very popular because you have crappy hardware, then it might be more
probable that this page is paged out before some other much older/less
used page in another (larger) zone. And I guess the same applies to
small HIGHMEM zones.

This could very well be related to the "1 GB Memory is bad for you"
problem described briefly here: http://kerneltrap.org/node/2450

Maybe it is possible to have a per-node LRU and always page out the
least recently used page in the entire node, and then migrate pages to
solve specific "within N bits of address space" requirements.

But I'm probably underestimating the cost of page migration...

/ magnus

2005-10-21 02:56:30

by Kamezawa Hiroyuki

Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface


Christoph Lameter wrote:
> + /* Is the user allowed to access the target nodes? */
> + if (!nodes_subset(new, cpuset_mems_allowed(task)))
> + return -EPERM;
> +
How about this ?
+cpuset_update_task_mems_allowed(task, new); (this isn't implemented now)

> + err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
> +

or is it the user's responsibility to update his mempolicy before
calling sys_migrate_pages()?

-- Kame


2005-10-21 03:22:37

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

mike kravetz wrote:
> On Thu, Oct 20, 2005 at 04:06:38PM -0700, Andrew Morton wrote:

> Just to be clear, there are at least two distinct requirements for hotplug.
> One only wants to remove a quantity of memory (location unimportant). The
> other wants to remove a specific section of memory (location specific). I
> think the first is easier to address.
>

The only difficulty in removing a quantity of memory is finding
where it is easy to remove. If this is fixed, I think it is
easier to address.

My own target is NUMA node hotplug.
I want to make the possibility of hotplug *removing a specific section* close to 100%.
Considering NUMA node hotplug,
if a process is memory-location sensitive, it should be migrated before the node is removed.
So, process migration by hand before system's memory hotplug looks attractive to me.

If we can implement memory migration before memory hotplug in good way,
I think it's good.

-- Kame

2005-10-21 03:32:30

by Mike Kravetz

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Fri, Oct 21, 2005 at 12:22:06PM +0900, KAMEZAWA Hiroyuki wrote:
> mike kravetz wrote:
> >On Thu, Oct 20, 2005 at 04:06:38PM -0700, Andrew Morton wrote:
>
> >Just to be clear, there are at least two distinct requirements for hotplug.
> >One only wants to remove a quantity of memory (location unimportant). The
> >other wants to remove a specific section of memory (location specific). I
> >think the first is easier to address.
> >
>
> The only difficulty to remove a quantity of memory is how to find
> where is easy to be removed. If this is fixed, I think it is
> easier to address.

We have been using Mel's fragmentation patches. One of the data structures
created by these patches is a 'usemap' that tracks how 'blocks' of memory
are used. I exposed the usemaps via sysfs along with other hotplug memory
section attributes. So, you can then have a user space program scan the
usemaps looking for sections that can be easily offlined.

--
Mike

2005-10-21 03:57:17

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

mike kravetz wrote:
>>>Just to be clear, there are at least two distinct requirements for hotplug.
>>>One only wants to remove a quantity of memory (location unimportant). The
>>>other wants to remove a specific section of memory (location specific). I
>>>think the first is easier to address.
>>>
>>
>>The only difficulty to remove a quantity of memory is how to find
>>where is easy to be removed. If this is fixed, I think it is
>>easier to address.
>
>
> We have been using Mel's fragmentation patches. One of the data structures
> created by these patches is a 'usemap' thats tracks how 'blocks' of memory
> are used. I exposed the usemaps via sysfs along with other hotplug memory
> section attributes. So, you can then have a user space program scan the
> usemaps looking for sections that can be easily offlined.
>
yea, looks nice :)
But such pages are already shown as hotpluggable, I think.
ACPI/SRAT will define the range, in ia64.

The difficulty is how to find hard-to-migrate pages, as Andrew pointed out.



Thanks,
-- Kame

2005-10-21 04:22:18

by Mike Kravetz

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Fri, Oct 21, 2005 at 12:56:16PM +0900, KAMEZAWA Hiroyuki wrote:
> mike kravetz wrote:
> >We have been using Mel's fragmentation patches. One of the data structures
> >created by these patches is a 'usemap' thats tracks how 'blocks' of memory
> >are used. I exposed the usemaps via sysfs along with other hotplug memory
> >section attributes. So, you can then have a user space program scan the
> >usemaps looking for sections that can be easily offlined.
> >
> yea, looks nice :)
> But such pages are already shown as hotpluggable, I think.
> ACPI/SRAT will define the range, in ia64.

I haven't taken a close look at that code, but don't those just give
you physical ranges that can 'possibly' be removed? So, isn't it
possible for hotpluggable ranges to contain pages allocated for kernel
data structures which would be almost impossible to offline?

> The difficulty is how to find hard-to-migrate pages, as Andrew pointed out.

By examining the fragmentation usemaps, we have a pretty good idea about
how the blocks are being used. If a block is flagged as 'User Pages' then
there is a good chance that it can be offlined. Of course, depending on
exactly how those 'user pages' are being used will determine if they can
be offlined. If the offline of a section marked user is unsuccessful, you
can retry in the hope that the situation was transient. Or, you can move
on to the next user block. By concentrating your efforts on blocks only
containing user pages, your chances of success are greatly increased.
For blocks that are marked 'Kernel' we know an offline will not be successful
and don't even try.

--
Mike

2005-10-21 05:14:03

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

mike kravetz wrote:
>>yea, looks nice :)
>>But such pages are already shown as hotpluggable, I think.
>>ACPI/SRAT will define the range, in ia64.
>
>
> I haven't taken a close look at that code, but don't those just give
> you physical ranges that can 'possibly' be removed?

It just represents whether pages are physically hotpluggable or not.

> So, isn't it
> possible for hotpluggable ranges to contain pages allocated for kernel
> data structures which would be almost impossible to offline?
>
The range which contains kernel data isn't hot-pluggable.
So such a range shouldn't contain kernel pages.

As you say, it's very helpful to show how a section is used.
But I think showing hotpluggable-or-not is enough. (KERN mem is not hotpluggable)
Once a section has become a KERN section, it will never be a USER section, I think.

>>The difficulty is how to find hard-to-migrate pages, as Andrew pointed out.
>
>
> By examining the fragmentation usemaps, we have a pretty good idea about
> how the blocks are being used. If a block is flagged as 'User Pages' then
> there is a good chance that it can be offlined.
yes.

But a 'search and find on demand' approach is not good for a system admin
who makes a system resizing plan.
Do you consider some guarantee to keep the quantity or location of non-hotpluggable
memory sections?

-- Kame


2005-10-21 06:06:42

by Dave Hansen

Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

On Thu, 2005-10-20 at 15:59 -0700, Christoph Lameter wrote:

> +/*
> + * Isolate one page from the LRU lists.
> + *
> + * - zone->lru_lock must be held
> + *
> + * Result:
> + * 0 = page not on LRU list
> + * 1 = page removed from LRU list
> + * -1 = page is being freed elsewhere.
> + */

Can these return values please get some real names? I just hate when
things have more than just fail and success as return codes.

It makes much more sense to have something like:

if (ret == ISOLATION_IMPOSSIBLE) {
list_del(&page->lru);
list_add(&page->lru, src);
}

than

+ if (rc == -1) { /* Not possible to isolate */
+ list_del(&page->lru);
+ list_add(&page->lru, src);
+ } if

The comment just makes the code harder to read.

> +static inline int
> +__isolate_lru_page(struct zone *zone, struct page *page)
> +{
> + if (TestClearPageLRU(page)) {
> + if (get_page_testone(page)) {
> + /*
> + * It is being freed elsewhere
> + */
> + __put_page(page);
> + SetPageLRU(page);
> + return -1;
> + } else {
> + if (PageActive(page))
> + del_page_from_active_list(zone, page);
> + else
> + del_page_from_inactive_list(zone, page);
> + return 1;
> + }
> + }
> +
> + return 0;
> +}

How about

+static inline int
> +__isolate_lru_page(struct zone *zone, struct page *page)
> +{
int ret = 0;

if (!TestClearPageLRU(page))
return ret;


Then, the rest of the thing doesn't need to be indented.

> +static inline void
> +__putback_lru_page(struct zone *zone, struct page *page)
> +{

__put_back_lru_page?

BTW, it would probably be nice to say where these patches came from
before Magnus. :)

-- Dave

2005-10-21 06:00:30

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Andrew Morton wrote:
> Christoph Lameter <[email protected]> wrote:
>
>>Page migration is also useful for other purposes:
>>
>> 1. Memory hotplug. Migrating processes off a memory node that is going
>> to be disconnected.
>>
>> 2. Remapping of bad pages. These could be detected through soft ECC errors
>> and other mechanisms.
>
>
> It's only useful for these things if it works with close-to-100% reliability.
>
> And there are are all sorts of things which will prevent that - mlock,
> ongoing direct-io, hugepages, whatever.
>
In lhms tree, current status is below: (If I'm wrong, plz fix)
==
For mlock, direct page migration will work fine. try_to_unmap_one()
in the -mhp tree has an argument *force* that ignores VM_LOCKED; it's for this.

For direct-io, we have to wait for completion.
The end of I/O is not notified and memory_migrate() is just polling pages.

For hugepages, we'll need hugepage demand paging and more work, I think.
==

When a process is migrated to other nodes by hand, it can cooperate with the migration
subsystem. So we don't have to be afraid of special uses of memory, in many cases.
I think Christoph's approach will work fine.

When it comes to memory-hotplug, arbitrary processes are affected.
It's more difficult.

We should focus on 'process migration on demand' in this thread.

-- Kame

2005-10-21 06:27:44

by Magnus Damm

Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

On 10/21/05, Dave Hansen <[email protected]> wrote:
> On Thu, 2005-10-20 at 15:59 -0700, Christoph Lameter wrote:
>
> > +/*
> > + * Isolate one page from the LRU lists.
> > + *
> > + * - zone->lru_lock must be held
> > + *
> > + * Result:
> > + * 0 = page not on LRU list
> > + * 1 = page removed from LRU list
> > + * -1 = page is being freed elsewhere.
> > + */
>
> Can these return values please get some real names? I just hate when
> things have more than just fail and success as return codes.
>
> It makes much more sense to have something like:
>
> if (ret == ISOLATION_IMPOSSIBLE) {

Absolutely. But this involves figuring out nice names that everyone
likes and that do not pollute the namespace too much. Any
suggestions?

> How about
>
> +static inline int
> > +__isolate_lru_page(struct zone *zone, struct page *page)
> > +{
> int ret = 0;
>
> if (!TestClearPageLRU(page))
> return ret;
>
> Then, the rest of the thing doesn't need to be indented.

Good idea.

> > +static inline void
> > +__putback_lru_page(struct zone *zone, struct page *page)
> > +{
>
> __put_back_lru_page?
>
> BTW, it would probably be nice to say where these patches came from
> before Magnus. :)

Uh? Yesterday I broke out code from isolate_lru_pages() and
shrink_cache() and emailed Christoph privately. Do you have similar
code in your tree?

/ magnus

2005-10-21 06:58:35

by Dave Hansen

Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

On Fri, 2005-10-21 at 15:27 +0900, Magnus Damm wrote:
> On 10/21/05, Dave Hansen <[email protected]> wrote:
> > On Thu, 2005-10-20 at 15:59 -0700, Christoph Lameter wrote:
> > > + * 0 = page not on LRU list
> > > + * 1 = page removed from LRU list
> > > + * -1 = page is being freed elsewhere.
> > > + */
> >
> > Can these return values please get some real names? I just hate when
> > things have more than just fail and success as return codes.
> >
> > It makes much more sense to have something like:
> >
> > if (ret == ISOLATION_IMPOSSIBLE) {
>
> Absolutely. But this involves figuring out nice names that everyone
> likes and that does not pollute the name space too much.

So, your excuse for bad code is that you want to avoid a discussion?
Are you new here? ;)

> Any suggestions?

I'd start with the comment, and work from there.

ISOLATE_PAGE_NOT_LRU
ISOLATE_PAGE_REMOVED_FROM_LRU
ISOLATE_PAGE_FREEING_ELSEWHERE

Not my best names in history, but probably a place to start. It keeps
the author from having to add bad comments explaining what the code
does.
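
Concretely, the suggestion boils down to something like the following sketch
(only an illustration of the naming idea, not code from any posted patch):

/* Named return codes for __isolate_lru_page() / isolate_lru_page(). */
enum {
	ISOLATE_PAGE_NOT_LRU		= 0,	/* page not on an LRU list */
	ISOLATE_PAGE_REMOVED_FROM_LRU	= 1,	/* page taken off the LRU */
	ISOLATE_PAGE_FREEING_ELSEWHERE	= -1,	/* page is being freed elsewhere */
};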

> > BTW, it would probably be nice to say where these patches came from
> > before Magnus. :)
>
> Uh? Yesterday I broke out code from isolate_lru_pages() and
> shrink_cache() and emailed Christoph privately. Do you have similar
> code in your tree?

Hirokazu's page migration patches have some functions called the exact
same things: __putback_page_to_lru, etc... although they are simpler.
Not my code, but it would be nice to acknowledge if ideas were coming
from there.

-- Dave

2005-10-21 07:07:34

by Simon Derr

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, KAMEZAWA Hiroyuki wrote:

>
> Christoph Lameter wrote:
> > + /* Is the user allowed to access the target nodes? */
> > + if (!nodes_subset(new, cpuset_mems_allowed(task)))
> > + return -EPERM;
> > +
> How about this ?
> +cpuset_update_task_mems_allowed(task, new); (this isn't implemented now)
>
> > + err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
> > +
>
> or it's user's responsibility to updates his mempolicy before
> calling sys_migrage_pages() ?
>

The user cannot always add a memory node to a cpuset, for example if this
cpuset is inside another cpuset that is owned by another user. (i.e the
case where the administrator wants to dedicate a part of the machine to a
user).

The kernel checks for these permission issues, conflicts with other
mem_exclusive cpusets, etc... when you write in the 'mems' file.

Automatically updating the ->mems_allowed field as you suggest would
require that the kernel do the same checks in sys_migrate_pages(). That does
not sound like a very good idea to me.

Simon.

2005-10-21 07:22:10

by Kamezawa Hiroyuki

Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface



> Christoph Lameter wrote:
>
>>> > + /* Is the user allowed to access the target nodes? */
>>> > + if (!nodes_subset(new, cpuset_mems_allowed(task)))
>>> > + return -EPERM;
>>> > +
>
>> How about this ?
>> +cpuset_update_task_mems_allowed(task, new); (this isn't implemented now

*new* is already guaranteed to be a subset of the current mem_allowed.
Does this violate the permission?

Simon Derr wrote:
> Automatically updating the ->mems_allowed field as you suggest would
> require that the kernel do the same checks in sys_migrage_pages(). Sounds
> not as a very good idea to me.

Hmm, it means a user or admin should modify mem_allowed
before the first page fault after calling sys_migrate_pages().

-- Kame


2005-10-21 07:26:01

by Magnus Damm

Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

On 10/21/05, Dave Hansen <[email protected]> wrote:
> On Fri, 2005-10-21 at 15:27 +0900, Magnus Damm wrote:
> > On 10/21/05, Dave Hansen <[email protected]> wrote:
> > > On Thu, 2005-10-20 at 15:59 -0700, Christoph Lameter wrote:
> > > > + * 0 = page not on LRU list
> > > > + * 1 = page removed from LRU list
> > > > + * -1 = page is being freed elsewhere.
> > > > + */
> > >
> > > Can these return values please get some real names? I just hate when
> > > things have more than just fail and success as return codes.
> > >
> > > It makes much more sense to have something like:
> > >
> > > if (ret == ISOLATION_IMPOSSIBLE) {
> >
> > Absolutely. But this involves figuring out nice names that everyone
> > likes and that does not pollute the name space too much.
>
> So, your excuse for bad code is that you want to avoid a discussion?
> Are you new here? ;)

No and yes. =) To me, broken code is bad code. Whether code looks good or
not is another issue.

Anyway, I fully agree that using constants are better than hard coded
values. I just prefer to stay out of naming discussions. They tend to
go on forever and I find them pointless.

> > Any suggestions?
>
> I'd start with the comment, and work from there.
>
> ISOLATE_PAGE_NOT_LRU
> ISOLATE_PAGE_REMOVED_FROM_LRU
> ISOLATE_PAGE_FREEING_ELSEWHERE
>
> Not my best names in history, but probably a place to start. It keeps
> the author from having to add bad comments explaining what the code
> does.

Thank you for that suggestion.

> > > BTW, it would probably be nice to say where these patches came from
> > > before Magnus. :)
> >
> > Uh? Yesterday I broke out code from isolate_lru_pages() and
> > shrink_cache() and emailed Christoph privately. Do you have similar
> > code in your tree?
>
> Hirokazu's page migration patches have some functions called the exact
> same things: __putback_page_to_lru, etc... although they are simpler.

I saw that akpm commented regarding duplicated code and I figured it
would be better to break out these functions. And if someone has
written similar code before, then it is probably a good sign that
something similar is needed.

> Not my code, but it would be nice to acknowledge if ideas were coming
> from there.

Yeah, thanks for stating the obvious.

/ magnus

2005-10-21 07:39:37

by Simon Derr

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, KAMEZAWA Hiroyuki wrote:

>
>
> > Christoph Lameter wrote:
> >
> > > > > + /* Is the user allowed to access the target nodes? */
> > > > > + if (!nodes_subset(new, cpuset_mems_allowed(task)))
> > > > > + return -EPERM;
> > > > > +
> >
> > > How about this ?
> > > +cpuset_update_task_mems_allowed(task, new); (this isn't implemented
> > > now
>
> *new* is already guaranteed to be the subset of current mem_allowed.
> Is this violate the permission ?

Oh, I misunderstood your mail.
I thought you wanted to automatically add extra nodes to the cpuset,
but you actually want to do just the opposite, i.e restrict the nodemask
for this task to the one passed to sys_migrate_pages(). Is that right ?

(If not, ignore the rest of this message)

Maybe sometimes the user would be interested in migrating all the
existing pages of a process, but not change the policy for the future ?

Simon.

2005-10-21 07:47:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Simon Derr wrote:
>>>Christoph Lameter wrote:
>>>>>>+ /* Is the user allowed to access the target nodes? */
>>>>>>+ if (!nodes_subset(new, cpuset_mems_allowed(task)))
>>>>>>+ return -EPERM;
>>>>>>+
>>>
>>>>How about this ?
>>>>+cpuset_update_task_mems_allowed(task, new); (this isn't implemented
>>>>now
>>
>>*new* is already guaranteed to be the subset of current mem_allowed.
>>Is this violate the permission ?
>
>
> Oh, I misunderstood your mail.
> I thought you wanted to automatically add extra nodes to the cpuset,
> but you actually want to do just the opposite, i.e restrict the nodemask
> for this task to the one passed to sys_migrate_pages(). Is that right ?
>
yes.
Anyway, we should modify the task's mems_allowed before the first page fault.

-- Kame


2005-10-21 15:16:12

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Kame wrote:
>> How about this ?
>> +cpuset_update_task_mems_allowed(task, new); (this isn't implemented now

One task cannot directly update another's mems_allowed. The locking on
a task's mems_allowed only allows modifying it from the context of the
task itself. This avoids taking a lock to read out the (possibly
multiple word) mems_allowed value when checking it in __alloc_pages().

Grep in kernel/cpuset.c for "mems_generation" to see the mechanism
used to ensure that the task mems_allowed is updated, before any
allocation of memory, to match its cpuset.
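
Very roughly, the pattern is something like this -- illustrative only, the
field and lock names below are made up and not the exact kernel/cpuset.c
code:

        /*
         * Refresh the calling task's cached copy of its cpuset's
         * mems_allowed.  Only ever run by the task itself (e.g. on entry
         * to the page allocator), so the task side needs no lock.
         */
        static void example_refresh_mems(void)
        {
                struct cpuset *cs = current->cpuset;

                if (current->cpuset_mems_generation != cs->mems_generation) {
                        down(&cs->sem);         /* cpuset side is locked */
                        current->mems_allowed = cs->mems_allowed;
                        current->cpuset_mems_generation = cs->mems_generation;
                        up(&cs->sem);
                }
        }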

> *new* is already guaranteed to be the subset of current mem_allowed.
> Is this violate the permission ?

I think that sys_migrate_pages() allows one task to migrate the
pages of another.

* If task A is going to migrate task B's memory, then it should do
so within task B's cpuset constraints (or as close as it can, in
the case of say avoiding ECC soft errors, where the task context
of the affected pages is not easily available). If A doesn't like
B's cpuset constraints, then A should change them first, using the
appropriate cpuset APIs, if A has permission to do so. In the
ECC soft correction case, just move the page to the nearest place
one can find free memory - from what I can tell, ensuring that cpuset
constraints are honored is too expensive (walking the task list for
each page to find which task references that mm struct).

So this check ensures that A is not moving B's memory outside of
the nodes in B's cpuset.

* Christoph - what is the permissions check on sys_migrate_pages()?
It would seem inappropriate for 'guest' to be able to move the
memory of 'root'.

> Simon Derr wrote:
> > Automatically updating the ->mems_allowed field as you suggest would
> > require that the kernel do the same checks in sys_migrage_pages(). Sounds
> > not as a very good idea to me.

If I am understanding this correctly, this sys_migrate_pages() call
seems most useful in the situation that the pages are being moved
within the nodes already allowed to the target task (perhaps because
the kernel is configured w/o cpusets). Otherwise, you should first
change the mems_allowed of the target task to allow these nodes, and in
that case, you can just use the new cpuset 'memory_migrate' flag (in a
patch in my outgoing queue, that I need to send real soon now) to ask
that existing pages be migrated when that cpuset moves, and when tasks
are moved into that cpuset.

I agree with Simon that sys_migrate_pages() does not want to get in
the business of replicating the checks on updating mems_allowed that
are in the cpuset code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 15:18:46

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Kame wrote:
> *new* is already guaranteed to be the subset of current mem_allowed.
> Is this violate the permission ?

The question is not so much whether the current task's mems_allowed
is violated, but whether the mems_allowed of the cpuset of the
task that owns the pages is violated.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 15:23:13

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Simon wrote:
> Maybe sometimes the user would be interested in migrating all the
> existing pages of a process, but not change the policy for the future ?

So long as the user has some reasonable right to change the affected
task's memory layout, and so long as they are moving memory within the
cpuset constraints (if any) of the affected task, or as close to that
as practical (such as with ECC soft error avoidance), then yes, it would
seem that this sys_migrate_pages() lets existing pages be moved without
changing the cpuset policy for the future.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 15:22:59

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Paul Jackson wrote:
> I agree with Simon that sys_migrate_pages() does not want to get in
> the business of replicating the checks on updating mems_allowed that
> are in the cpuset code.
>
Hm.. okay.
I'm just afraid that swapped-out pages will go back to their original nodes.

-- Kame

2005-10-21 15:29:32

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Mike wrote:
> Just to be clear, there are at least two distinct requirements for hotplug.
> One only wants to remove a quantity of memory (location unimportant).

Could you describe this case a little more? I wasn't aware
of this hotplug requirement until I saw your comment just now.

The three reasons I knew of for wanting to move memory pages were:
- offload some physical ram or node (avoid or unplug bad hardware)
- task migration to another cpuset or moving an existing cpuset
- various testing and performance motivations to optimize page location

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 15:42:55

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

On Fri, 21 Oct 2005, Dave Hansen wrote:

> Hirokazu's page migration patches have some functions called the exact
> same things: __putback_page_to_lru, etc... although they are simpler.
> Not my code, but it would be nice to acknowledge if ideas were coming
> from there.

Ok, I will add a note to that effect. The basic idea is
already inherent in the shrink_list logic, so I thought it would be okay.

2005-10-21 15:47:53

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, Paul Jackson wrote:

> * Christoph - what is the permissions check on sys_migrate_pages()?
> It would seem inappropriate for 'guest' to be able to move the
> memory of 'root'.

The check is missing.

Maybe we could add:

if (!capable(CAP_SYS_RESOURCE))
return -EPERM;

Then we may also decide that root can move any process anywhere and drop
the retrieval of the mems_allowed from the other task.

2005-10-21 15:49:59

by Nikita Danilov

[permalink] [raw]
Subject: Re: [PATCH 1/4] Swap migration V3: LRU operations

Dave Hansen writes:

[...]

>
> It makes much more sense to have something like:
>
> if (ret == ISOLATION_IMPOSSIBLE) {
> list_del(&page->lru);
> list_add(&page->lru, src);
> }
>
> than
>
> + if (rc == -1) { /* Not possible to isolate */
> + list_del(&page->lru);
> + list_add(&page->lru, src);
> + } if

And
if (ret == ISOLATION_IMPOSSIBLE)
list_move(&page->lru, src);

is even better.
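
(For reference, list_move() in include/linux/list.h is -- if I remember it
correctly -- just the del+add pair folded into one helper:

        static inline void list_move(struct list_head *list, struct list_head *head)
        {
                __list_del(list->prev, list->next);
                list_add(list, head);
        }

so it is a drop-in behavioural replacement for the open-coded pair above.)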

Nikita.

2005-10-21 15:55:00

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Thu, 20 Oct 2005, Andrew Morton wrote:

> Christoph Lameter <[email protected]> wrote:
> >
> > Page migration is also useful for other purposes:
> >
> > 1. Memory hotplug. Migrating processes off a memory node that is going
> > to be disconnected.
> >
> > 2. Remapping of bad pages. These could be detected through soft ECC errors
> > and other mechanisms.
>
> It's only useful for these things if it works with close-to-100% reliability.

I think we need to gradually get there. There are other measures
implemented by the hotplug project that can work in conjunction with these
patches to increase the likelihood of successful migration.

Pages that are not on the LRU are very difficult to move, and the hotplug
project addresses that by not allowing allocation in areas that may be
removed, etc.

> And there are are all sorts of things which will prevent that - mlock,
> ongoing direct-io, hugepages, whatever.

Right. But these are not a problem for the page migration of processes in
order to optimize performance. The hotplug and the remapping of bad pages
will require additional effort to get done right. Nevertheless, the
material presented here can be used as a basis.

> So before we can commit ourselves to the initial parts of this path we'd
> need some reassurance that the overall scheme addresses these things and
> that the end result has a high probability of supporting hot unplug and
> remapping sufficiently well.

I think we have that assurance. The hotplug project has worked on these
patches for a long time and what we need is a way to gradually put these
things into the kernel. We are trying to facilitate that with these
patches.

2005-10-21 16:01:10

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Fri, Oct 21, 2005 at 08:28:49AM -0700, Paul Jackson wrote:
> Mike wrote:
> > Just to be clear, there are at least two distinct requirements for hotplug.
> > One only wants to remove a quantity of memory (location unimportant).
>
> Could you describe this case a little more? I wasn't aware
> of this hotplug requirement, until I saw you comment just now.

Think of a system running multiple OS's on top of a hypervisor, where
each OS is given some memory for exclusive use. For multiple reasons
(one being workload management) it is desirable to move resources from
one OS to another. For example, take memory away from an underutilized
OS and give it to an overutilized OS.

This describes the environment on IBM's mid to upper level POWER systems.
Currently, there is OS support to dynamically move/reassign CPUs and
adapters between different OSs on these systems.

My knowledge of Xen is limited, but this might apply to that
environment also. An interesting question comes up if Xen or some
other hypervisor starts virtualizing memory. In such cases, would
it make more sense to let the hypervisor do all resizing, or do
we also need hotplug support in the OS for optimal performance?

--
Mike

2005-10-21 16:10:22

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Friday 21 October 2005 10:47, Christoph Lameter wrote:
> On Fri, 21 Oct 2005, Paul Jackson wrote:
> > * Christoph - what is the permissions check on sys_migrate_pages()?
> > It would seem inappropriate for 'guest' to be able to move the
> > memory of 'root'.
>
> The check is missing.
>

That code used to be there. Basically the check was that if you could
legally send a signal to the process, you could migrate its memory.
Go back and look at my patches for this.

Why was this dropped arbitrarily?

> Maybe we could add:
>
> if (!capable(CAP_SYS_RESOURCE))
> return -EPERM;
>
> Then we may also decide that root can move any process anywhere and drop
> the retrieval of the mems_allowed from the other task.
>

--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

2005-10-21 16:27:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, KAMEZAWA Hiroyuki wrote:

> > > How about this ?
> > > +cpuset_update_task_mems_allowed(task, new); (this isn't implemented
> > > now
>
> *new* is already guaranteed to be the subset of current mem_allowed.
> Is this violate the permission ?

Could the cpuset_mems_allowed(task) function update the mems_allowed if
needed?

2005-10-21 16:34:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, Ray Bryant wrote:

> That code used to be there. Basically the check was that if you could
> legally send a signal to the process, you could migrate its memory.
> Go back and look and my patches for this.
>
> Why was this dropped, arbitrarily?

Sorry, it was separated out from the sys_migrate patch.

Here is the fix:

Index: linux-2.6.14-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/mm/mempolicy.c 2005-10-20 14:45:45.000000000 -0700
+++ linux-2.6.14-rc4-mm1/mm/mempolicy.c 2005-10-21 09:32:19.000000000 -0700
@@ -784,12 +784,26 @@ asmlinkage long sys_migrate_pages(pid_t
         if (!mm)
                 return -EINVAL;

+        /*
+         * Permissions check like for signals.
+         * See check_kill_permission()
+         */
+        if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
+            (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
+            !capable(CAP_SYS_ADMIN)) {
+                err = -EPERM;
+                goto out;
+        }
+
         /* Is the user allowed to access the target nodes? */
-        if (!nodes_subset(new, cpuset_mems_allowed(task)))
-                return -EPERM;
+        if (!nodes_subset(new, cpuset_mems_allowed(task)) &&
+            !capable(CAP_SYS_ADMIN)) {
+                err = -EPERM;
+                goto out;
+        }

         err = do_migrate_pages(mm, &old, &new, MPOL_MF_MOVE);
-
+out:
         mmput(mm);
         return err;
 }
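
(For reference: each euid/uid XOR term above is zero exactly when the two
ids are equal, so the whole condition is true -- and -EPERM is returned --
only when none of the four id pairs match and the caller also lacks
CAP_SYS_ADMIN. That is the same idiom check_kill_permission() uses.)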

2005-10-21 17:01:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Christoph Lameter wrote:
> On Fri, 21 Oct 2005, KAMEZAWA Hiroyuki wrote:
>
>
>>>>How about this ?
>>>>+cpuset_update_task_mems_allowed(task, new); (this isn't implemented
>>>>now
>>
>>*new* is already guaranteed to be the subset of current mem_allowed.
>>Is this violate the permission ?
>
>
> Could the cpuset_mems_allowed(task) function update the mems_allowed if
> needed?
It looks like I was wrong :(
See Paul's e-mail; he describes the problem with my suggestion in detail.

-- Kame

2005-10-21 17:04:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Christoph wrote:
> Could the cpuset_mems_allowed(task) function update the mems_allowed if
> needed?

I'm not sure what you're thinking here. Instead of my asking a dozen
stupid questions, I guess I should just ask you to explain what you
have in mind more.

The function call you show above has no 'mask' argument, so I don't
know what you intend to update mems_allowed to. Currently, a task's
mems_allowed is only updated in task context, from its cpuset's
mems_allowed. The task's mems_allowed is updated automatically coming
into the page allocation code, if the task's mems_generation doesn't
match its cpuset's mems_generation.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 17:07:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, Paul Jackson wrote:

> know what you intend to update mems_allowed to. Currently, a task
> mems_allowed is only updated in task context, from its cpusets
> mems_allowed. The task mems_allowed is updated automatically coming
> into the page allocation code, if the tasks mems_generation doesn't
> match its cpusets mems_generation.

Therefore if mems_allowed is accessed from outside of the
task then it may not be up to date, right?


2005-10-21 18:10:34

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Kame wrote:
> I'm just afraid of swapped-out pages will goes back to original nodes

The pages could end up there, yes, if that's where they are faulted
back into.

In general, the swap-based migration method does not guarantee
where the pages will end up. The more difficult direct node-to-node
migration method will be needed to guarantee that.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 18:17:22

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Christoph wrote:
> Therefore if mems_allowed is accessed from outside of the
> task then it may not be up to date, right?

Yup - exactly.

The up to date allowed memory container for a task is in its cpuset,
which does have the locking mechanisms needed for safe access from
other tasks.

The task mems_allowed is just a private cache of the mems_allowed of
its cpuset, used for quick access from within the task context by the
page allocation code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-21 18:26:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

On Fri, 21 Oct 2005, Paul Jackson wrote:

> Kame wrote:
> > I'm just afraid of swapped-out pages will goes back to original nodes
>
> The pages could end up there, yes, if that's where they are faulted
> back into.

Right. But the cpuset code will change the mems_allowed. The pages will
then be allocated in that context.

> In general, the swap-based migration method does not guarantee
> where the pages will end up. The more difficult direct node-to-node
> migration method will be needed to guarantee that.

Correct, it does not guarantee that without cpuset assistance.

2005-10-21 18:57:48

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/4] Swap migration V3: sys_migrate_pages interface

Christoph wrote:
> Right. But the cpuset code will change the mems_allowed. The pages will
> then be allocated in that context.

If the migration is being done as part of moving a cpuset, or moving
a task to a different cpuset, then yes the cpuset code will change
the mems_allowed.

However I thought we were discussing the sys_migrate_pages() call
here. Naked sys_migrate_pages() calls do not involve the cpuset code,
nor change the target task's mems_allowed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-22 05:32:27

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Fri, Oct 21, 2005 at 10:57:02AM +0900, Magnus Damm wrote:
> On 10/21/05, Christoph Lameter <[email protected]> wrote:
> > Page migration is also useful for other purposes:
> >
> > 1. Memory hotplug. Migrating processes off a memory node that is going
> > to be disconnected.
> >
> > 2. Remapping of bad pages. These could be detected through soft ECC errors
> > and other mechanisms.
>
> 3. Migrating between zones.
>
> The current per-zone LRU design might have some drawbacks. I would
> prefer a per-node LRU to avoid that certain zones needs to shrink more
> often than others. But maybe that is not the case, please let me know
> if I'm wrong.
>
> If you think about it, say that a certain user space page happens to
> be allocated from the DMA zone, and for some reason this DMA zone is
> very popular because you have crappy hardware, then it might be more
> probable that this page is paged out before some other much older/less
> used page in another (larger) zone. And I guess the same applies to
> small HIGHMEM zones.

User pages (accessed through their virtual pte mapping) can be moved
around zones freely - user pages do not suffer from zone requirements.
So you can just migrate a user page in DMA zone to another node's
highmem zone.

Pages with zone requirements (DMA pages for driver buffers or user mmap()
on crappy hardware, lowmem-restricted kernel pages (SLAB caches), etc.)
can't be migrated easily (and no one has attempted to do that yet AFAIK).

> This could very well be related to the "1 GB Memory is bad for you"
> problem described briefly here: http://kerneltrap.org/node/2450
>
> Maybe it is possible to have a per-node LRU and always page out the
> least recently used page in the entire node, and then migrate pages to
> solve specific "within N bits of address space" requirements.

Pages with "N bits of address space" requirement pages can't be migrated
at the moment (on the hardware requirement it would be necessary to have
synchronization with driver operation, shutdown it down, and restartup
it up...)

For SLAB there is no solution as far as I know (except an indirection
level in memory access to these pages, as discussed in this years
memory hotplug presentation by Dave Hansen).

> But I'm probably underestimating the cost of page migration...

The zone balancing issue you describe might become an issue once said
pages can be migrated :)

2005-10-22 05:47:51

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 2/4] Swap migration V3: Page Eviction


Hi Christoph,

On Thu, Oct 20, 2005 at 03:59:45PM -0700, Christoph Lameter wrote:
> Page eviction support in vmscan.c
>
> This patch adds functions that allow the eviction of pages to swap space.
> Page eviction may be useful to migrate pages, to suspend programs or for
> ummapping single pages (useful for faulty pages or pages with soft ECC
> failures)

<snip>

You might want to add some throttling in swapout_pages() instead of
relying on the block layer to do it for you.

There have been problems before with very large disk queues (IIRC it was
CFQ) in which all available memory became pinned by dirty data causing
OOM.

See throttle_vm_writeout() in mm/vmscan.c.
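
Something along these lines, perhaps -- an untested sketch only, assuming
the current no-argument throttle_vm_writeout(), just to show where the
throttling could go (once per pass over the list, after the writeouts have
been triggered):

        if (retry && pass++ < 10) {
                /* don't let a huge migration pin all of memory in writeback */
                throttle_vm_writeout();
                goto redo;
        }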

> + * Swapout evicts the pages on the list to swap space.
> + * This is essentially a dumbed down version of shrink_list
> + *
> + * returns the number of pages that were not evictable
> + *
> + * Multiple passes are performed over the list. The first
> + * pass avoids waiting on locks and triggers writeout
> + * actions. Later passes begin to wait on locks in order
> + * to have a better chance of acquiring the lock.
> + */
> +int swapout_pages(struct list_head *l)
> +{
> +        int retry;
> +        int failed;
> +        int pass = 0;
> +        struct page *page;
> +        struct page *page2;
> +
> +        current->flags |= PF_KSWAPD;
> +
> +redo:
> +        retry = 0;
> +        failed = 0;
> +
> +        list_for_each_entry_safe(page, page2, l, lru) {
> +                struct address_space *mapping;
> +
> +                cond_resched();
> +
> +                /*
> +                 * Skip locked pages during the first two passes to give the
> +                 * functions holding the lock time to release the page. Later we use
> +                 * lock_page to have a higher chance of acquiring the lock.
> +                 */
> +                if (pass > 2)
> +                        lock_page(page);
> +                else
> +                        if (TestSetPageLocked(page))
> +                                goto retry_later;
> +
> +                /*
> +                 * Only wait on writeback if we have already done a pass where
> +                 * we may have triggered writeouts for lots of pages.
> +                 */
> +                if (pass > 0)
> +                        wait_on_page_writeback(page);
> +                else
> +                        if (PageWriteback(page))
> +                                goto retry_later_locked;
> +
> +#ifdef CONFIG_SWAP
> +                if (PageAnon(page) && !PageSwapCache(page)) {
> +                        if (!add_to_swap(page))
> +                                goto failed;
> +                }
> +#endif /* CONFIG_SWAP */
> +
> +                mapping = page_mapping(page);
> +                if (page_mapped(page) && mapping)
> +                        if (try_to_unmap(page) != SWAP_SUCCESS)
> +                                goto retry_later_locked;
> +
> +                if (PageDirty(page)) {
> +                        /* Page is dirty, try to write it out here */
> +                        switch(pageout(page, mapping)) {
> +                        case PAGE_KEEP:
> +                        case PAGE_ACTIVATE:
> +                                goto retry_later_locked;
> +                        case PAGE_SUCCESS:
> +                                goto retry_later;
> +                        case PAGE_CLEAN:
> +                                ; /* try to free the page below */
> +                        }
> +                }
> +
> +                if (PagePrivate(page)) {
> +                        if (!try_to_release_page(page, GFP_KERNEL))
> +                                goto retry_later_locked;
> +                        if (!mapping && page_count(page) == 1)
> +                                goto free_it;
> +                }
> +
> +                if (!remove_mapping(mapping, page))
> +                        goto retry_later_locked; /* truncate got there first */
> +
> +free_it:
> +                /*
> +                 * We may free pages that were taken off the active list
> +                 * by isolate_lru_page. However, free_hot_cold_page will check
> +                 * if the active bit is set. So clear it.
> +                 */
> +                ClearPageActive(page);
> +
> +                list_del(&page->lru);
> +                unlock_page(page);
> +                put_page(page);
> +                continue;
> +
> +failed:
> +                failed++;
> +                unlock_page(page);
> +                continue;
> +
> +retry_later_locked:
> +                unlock_page(page);
> +retry_later:
> +                retry++;
> +        }
> +        if (retry && pass++ < 10)
> +                goto redo;
> +
> +        current->flags &= ~PF_KSWAPD;
> +        return failed + retry;
> +}
> +
> +/*
> * zone->lru_lock is heavily contended. Some of the functions that
> * shrink the lists perform better by taking out a batch of pages
> * and working on them outside the LRU lock.

2005-10-22 05:58:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Hi Kame,

On Fri, Oct 21, 2005 at 02:59:14PM +0900, Hiroyuki KAMEZAWA wrote:
> Andrew Morton wrote:
> >Christoph Lameter <[email protected]> wrote:
> >
> >>Page migration is also useful for other purposes:
> >>
> >>1. Memory hotplug. Migrating processes off a memory node that is going
> >> to be disconnected.
> >>
> >>2. Remapping of bad pages. These could be detected through soft ECC errors
> >> and other mechanisms.
> >
> >
> >It's only useful for these things if it works with close-to-100%
> >reliability.
> >
> >And there are are all sorts of things which will prevent that - mlock,
> >ongoing direct-io, hugepages, whatever.
> >
> In lhms tree, current status is below: (If I'm wrong, plz fix)
> ==
> For mlock, direct page migration will work fine. try_to_unmap_one()
> in -mhp tree has an argument *force* and ignore VM_LOCKED, it's for this.
>
> For direct-io, we have to wait for completion.
> The end of I/O is not notified and memory_migrate() is just polling pages.
>
> For hugepages, we'll need hugepage demand paging and more work, I think.

Hugepage pagefaulting is being worked on by Hugh and Adam Litke.

Another major problem that comes to mind is availability of largepages
on the target zone. Those allocations can be made reliable with the
fragmentation avoidance patches plus memory defragmentation using memory
migration.

So all bits should be around for hugepage migration by now?

2005-10-23 12:50:20

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On 10/22/05, Marcelo Tosatti <[email protected]> wrote:
> On Fri, Oct 21, 2005 at 10:57:02AM +0900, Magnus Damm wrote:
> > On 10/21/05, Christoph Lameter <[email protected]> wrote:
> > > Page migration is also useful for other purposes:
> > >
> > > 1. Memory hotplug. Migrating processes off a memory node that is going
> > > to be disconnected.
> > >
> > > 2. Remapping of bad pages. These could be detected through soft ECC errors
> > > and other mechanisms.
> >
> > 3. Migrating between zones.
> >
> > The current per-zone LRU design might have some drawbacks. I would
> > prefer a per-node LRU to avoid that certain zones needs to shrink more
> > often than others. But maybe that is not the case, please let me know
> > if I'm wrong.
> >
> > If you think about it, say that a certain user space page happens to
> > be allocated from the DMA zone, and for some reason this DMA zone is
> > very popular because you have crappy hardware, then it might be more
> > probable that this page is paged out before some other much older/less
> > used page in another (larger) zone. And I guess the same applies to
> > small HIGHMEM zones.
>
> User pages (accessed through their virtual pte mapping) can be moved
> around zones freely - user pages do not suffer from zone requirements.
> So you can just migrate a user page in DMA zone to another node's
> highmem zone.

Exactly. If I'm not mistaken only anonymous pages and page cache are
present on the LRU lists. And like you say, these pages do not really
suffer from zone requirements. So to me, the only reason to have one
LRU per zone is to be able to shrink the amount of LRU pages per zone
if pages are allocated with specific zone requirements and the
watermarks are reached.

> Pages with zone requirements (DMA pages for driver buffers or user mmap()
> on crappy hardware, lowmem restricted kernel pages (SLAB caches), etc.
> can't be migrated easily (and no one attempted to do that yet AFAIK).

I suspected so. But such pages are never included on the LRU lists, right?

> > This could very well be related to the "1 GB Memory is bad for you"
> > problem described briefly here: http://kerneltrap.org/node/2450
> >
> > Maybe it is possible to have a per-node LRU and always page out the
> > least recently used page in the entire node, and then migrate pages to
> > solve specific "within N bits of address space" requirements.
>
> Pages with "N bits of address space" requirement pages can't be migrated
> at the moment (on the hardware requirement it would be necessary to have
> synchronization with driver operation, shutdown it down, and restartup
> it up...)

That's what I thought. But the point I was trying to make was probably
not very clear... Let me clarify a bit.

Today there is a small chance that a user space page might be
allocated from a zone that has very few pages compared to other zones,
and this might lead to that page getting paged out earlier than if it
had been allocated from another, larger zone.

I propose to have one LRU per node instead of one per zone. When the
kernel then needs to allocate a page with certain requirements
("within N bits of address space") and that zone has too few free
pages, instead of shrinking the per-zone LRU we use page migration.

So, first we check if the requested amount of pages is available in
any zone in the node. If not we shrink the per-node LRU to free up
some pages. Then we somehow locate any unlocked pages in the zone that
is low on pages (no LRU here). These pages are then migrated to free
pages from any other zone. And this migration gives us free pages in
the requested zone.

> For SLAB there is no solution as far as I know (except an indirection
> level in memory access to these pages, as discussed in this years
> memory hotplug presentation by Dave Hansen).

Maybe SLAB defragmentation code is suitable for page migration too?

> > But I'm probably underestimating the cost of page migration...
>
> The zone balancing issue you describe might be an issue once zone
> said pages can be migrated :)

My main concern is that we use one LRU per zone, and I suspect that
this design might be suboptimal if the sizes of the zones differ
much. But I have no numbers.

There are probably not that many drivers using the DMA zone on a
modern PC, so instead of bringing a performance penalty on the entire
system I think it would be nicer to punish the evil hardware instead.

Thanks!

/ magnus

2005-10-24 12:44:09

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On Sun, Oct 23, 2005 at 09:50:18PM +0900, Magnus Damm wrote:
> On 10/22/05, Marcelo Tosatti <[email protected]> wrote:
> > On Fri, Oct 21, 2005 at 10:57:02AM +0900, Magnus Damm wrote:
> > > On 10/21/05, Christoph Lameter <[email protected]> wrote:
> > > > Page migration is also useful for other purposes:
> > > >
> > > > 1. Memory hotplug. Migrating processes off a memory node that is going
> > > > to be disconnected.
> > > >
> > > > 2. Remapping of bad pages. These could be detected through soft ECC errors
> > > > and other mechanisms.
> > >
> > > 3. Migrating between zones.
> > >
> > > The current per-zone LRU design might have some drawbacks. I would
> > > prefer a per-node LRU to avoid that certain zones needs to shrink more
> > > often than others. But maybe that is not the case, please let me know
> > > if I'm wrong.
> > >
> > > If you think about it, say that a certain user space page happens to
> > > be allocated from the DMA zone, and for some reason this DMA zone is
> > > very popular because you have crappy hardware, then it might be more
> > > probable that this page is paged out before some other much older/less
> > > used page in another (larger) zone. And I guess the same applies to
> > > small HIGHMEM zones.
> >
> > User pages (accessed through their virtual pte mapping) can be moved
> > around zones freely - user pages do not suffer from zone requirements.
> > So you can just migrate a user page in DMA zone to another node's
> > highmem zone.
>
> Exactly. If I'm not mistaken only anonymous pages and page cache are
> present on the LRU lists. And like you say, these pages do not really
> suffer from zone requirements. So to me, the only reason to have one
> LRU per zone is to be able to shrink the amount of LRU pages per zone
> if pages are allocated with specific zone requirements and the
> watermarks are reached.
>
> > Pages with zone requirements (DMA pages for driver buffers or user mmap()
> > on crappy hardware, lowmem restricted kernel pages (SLAB caches), etc.
> > can't be migrated easily (and no one attempted to do that yet AFAIK).
>
> I suspected so. But such pages are never included on the LRU lists, right?
>
> > > This could very well be related to the "1 GB Memory is bad for you"
> > > problem described briefly here: http://kerneltrap.org/node/2450
> > >
> > > Maybe it is possible to have a per-node LRU and always page out the
> > > least recently used page in the entire node, and then migrate pages to
> > > solve specific "within N bits of address space" requirements.
> >
> > Pages with "N bits of address space" requirement pages can't be migrated
> > at the moment (on the hardware requirement it would be necessary to have
> > synchronization with driver operation, shutdown it down, and restartup
> > it up...)
>
> That's what I thought. But the point I was trying to make was probably
> not very clear... Let me clarify a bit.
>
> Today there is a small chance that a user space page might be
> allocated from a zone that has very few pages compared to other zones,
> and this might lead to that page gets paged out earlier than if the
> page would have been allocated from another larger zone.
>
> I propose to have one LRU per node instead of one per zone. When the
> kernel then needs to allocate a page with certain requirements
> ("within N bits of address space") and that zone has too few free
> pages, instead of shrinking the per-zone LRU we use page migration.
>
> So, first we check if the requested amount of pages is available in
> any zone in the node. If not we shrink the per-node LRU to free up
> some pages. Then we somehow locate any unlocked pages in the zone that
> is low on pages (no LRU here). These pages are then migrated to free
> pages from any other zone. And this migration gives us free pages in
> the requested zone.

Ah OK, I see what you mean.

Its a possibility indeed.

> > For SLAB there is no solution as far as I know (except an indirection
> > level in memory access to these pages, as discussed in this years
> > memory hotplug presentation by Dave Hansen).
>
> Maybe SLAB defragmentation code is suitable for page migration too?

Free dentries are possible to migrate, but not referenced ones.

How are you going to inform users that the address of a dentry has
changed?

> > > But I'm probably underestimating the cost of page migration...
> >
> > The zone balancing issue you describe might be an issue once zone
> > said pages can be migrated :)
>
> My main concern is that we use one LRU per zone, and I suspect that
> this design might be suboptimal if the sizes of the zones differs
> much. But I have no numbers.

Migrating user pages from lowmem to highmem under situations with
intense low memory pressure (due to certain important allocations
which are restricted to lowmem) might be very useful.

> There are probably not that many drivers using the DMA zone on a
> modern PC, so instead of bringing performance penalty on the entire
> system I think it would be nicer to punish the evil hardware instead.

Agreed - the 16MB DMA zone is silly. Would love to see it go away...


2005-10-25 11:37:54

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

On 10/24/05, Marcelo Tosatti <[email protected]> wrote:
> On Sun, Oct 23, 2005 at 09:50:18PM +0900, Magnus Damm wrote:
> > Maybe SLAB defragmentation code is suitable for page migration too?
>
> Free dentries are possible to migrate, but not referenced ones.
>
> How are you going to inform users that the address of a dentry has
> changed?

Um, not sure, but the idea of defragmenting SLAB entries might be
similar to moving them, i.e. migration. But how to solve the per-SLAB
referencing is another story... =)

> > > > But I'm probably underestimating the cost of page migration...
> > >
> > > The zone balancing issue you describe might be an issue once zone
> > > said pages can be migrated :)
> >
> > My main concern is that we use one LRU per zone, and I suspect that
> > this design might be suboptimal if the sizes of the zones differs
> > much. But I have no numbers.
>
> Migrating user pages from lowmem to highmem under situations with
> intense low memory pressure (due to certain important allocations
> which are restricted to lowmem) might be very useful.

I patched the kernel on my desktop machine to provide some numbers.
The zoneinfo file and a small patch is attached.

$ uname -r
2.6.14-rc5-git3

$ uptime
20:27:47 up 1 day, 6:27, 18 users, load average: 0.01, 0.13, 0.15

$ cat /proc/zoneinfo | grep present
present 4096
present 225280
present 30342

$ cat /proc/zoneinfo | grep tscanned
tscanned 151352
tscanned 3480599
tscanned 541466

"tscanned" counts how many pages that has been scanned in each zone
since power on. Executive summary assuming that only LRU pages exist
in the zone:

DMA: each page has been scanned ~37 times
Normal: each page has been scanned ~15 times
HighMem: each page has been scanned ~18 times
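
(That is simply tscanned divided by present pages: 151352 / 4096 ~= 37,
3480599 / 225280 ~= 15 and 541466 / 30342 ~= 18.)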

So if your user space page happens to be allocated from the DMA zone,
it looks like it is more probable that it will be paged out sooner
than if it were allocated from another zone. And this is on a half-year-old
P4 system.

> > There are probably not that many drivers using the DMA zone on a
> > modern PC, so instead of bringing performance penalty on the entire
> > system I think it would be nicer to punish the evil hardware instead.
>
> Agreed - the 16MB DMA zone is silly. Would love to see it go away...

But is the DMA zone itself evil, or is it just that we have one LRU per zone...?

/ magnus


Attachments:
zoneinfo (1.80 kB)
lru_total_scanned.patch (1.63 kB)

2005-10-25 19:40:12

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Hi Magnus,

On Tue, Oct 25, 2005 at 08:37:52PM +0900, Magnus Damm wrote:
> On 10/24/05, Marcelo Tosatti <[email protected]> wrote:
> > On Sun, Oct 23, 2005 at 09:50:18PM +0900, Magnus Damm wrote:
> > > Maybe SLAB defragmentation code is suitable for page migration too?
> >
> > Free dentries are possible to migrate, but not referenced ones.
> >
> > How are you going to inform users that the address of a dentry has
> > changed?
>
> Um, not sure, but the idea of defragmenting SLAB entries might be
> similar to moving them, ie migration. But how to solve the per-SLAB
> referencing is another story... =)
>
> > > > > But I'm probably underestimating the cost of page migration...
> > > >
> > > > The zone balancing issue you describe might be an issue once zone
> > > > said pages can be migrated :)
> > >
> > > My main concern is that we use one LRU per zone, and I suspect that
> > > this design might be suboptimal if the sizes of the zones differs
> > > much. But I have no numbers.
> >
> > Migrating user pages from lowmem to highmem under situations with
> > intense low memory pressure (due to certain important allocations
> > which are restricted to lowmem) might be very useful.
>
> I patched the kernel on my desktop machine to provide some numbers.
> The zoneinfo file and a small patch is attached.
>
> $ uname -r
> 2.6.14-rc5-git3
>
> $ uptime
> 20:27:47 up 1 day, 6:27, 18 users, load average: 0.01, 0.13, 0.15
>
> $ cat /proc/zoneinfo | grep present
> present 4096
> present 225280
> present 30342
>
> $ cat /proc/zoneinfo | grep tscanned
> tscanned 151352
> tscanned 3480599
> tscanned 541466
>
> "tscanned" counts how many pages that has been scanned in each zone
> since power on. Executive summary assuming that only LRU pages exist
> in the zone:
>
> DMA: each page has been scanned ~37 times
> Normal: each page has been scanned ~15 times
> HighMem: each page has been scanned ~18 times
>
> So if your user space page happens to be allocated from the DMA zone,
> it looks like it is more probable that it will be paged out sooner
> than if it was allocated from another zone. And this is on a half year
> old P4 system.

Well the higher relative pressure on a specific zone is a fact you have
to live with.

Even with a global LRU you're going to suffer from the same issue once
you've got different relative pressure on different zones.

That's the reason for the mechanisms which attempt to avoid allocating
from the lower precious zones (lowmem_reserve and the allocation
fallback logic).

> > > There are probably not that many drivers using the DMA zone on a
> > > modern PC, so instead of bringing performance penalty on the entire
> > > system I think it would be nicer to punish the evil hardware instead.
> >
> > Agreed - the 16MB DMA zone is silly. Would love to see it go away...
>
> But is the DMA zone itself evil, or just that we have one LRU per zone...?

I agree that per-zone LRU complicates global page aging (you simply don't have
global aging).

But how to deal with restricted allocation requirements otherwise?
Scanning several GB's worth of pages looking for pages in a specific
small range can't be very promising.

Hope these comments are useful.

> --- from-0002/include/linux/mmzone.h
> +++ to-work/include/linux/mmzone.h 2005-10-24 10:43:13.000000000 +0900
> @@ -151,6 +151,7 @@ struct zone {
>         unsigned long nr_active;
>         unsigned long nr_inactive;
>         unsigned long pages_scanned; /* since last reclaim */
> +       unsigned long pages_scanned_total;
>         int all_unreclaimable; /* All pages pinned */
> 
>         /*
> --- from-0002/mm/page_alloc.c
> +++ to-work/mm/page_alloc.c 2005-10-24 10:51:05.000000000 +0900
> @@ -2101,6 +2101,7 @@ static int zoneinfo_show(struct seq_file
>                 "\n active %lu"
>                 "\n inactive %lu"
>                 "\n scanned %lu (a: %lu i: %lu)"
> +               "\n tscanned %lu"
>                 "\n spanned %lu"
>                 "\n present %lu",
>                 zone->free_pages,
> @@ -2111,6 +2112,7 @@ static int zoneinfo_show(struct seq_file
>                 zone->nr_inactive,
>                 zone->pages_scanned,
>                 zone->nr_scan_active, zone->nr_scan_inactive,
> +               zone->pages_scanned_total,
>                 zone->spanned_pages,
>                 zone->present_pages);
>         seq_printf(m,
> --- from-0002/mm/vmscan.c
> +++ to-work/mm/vmscan.c 2005-10-24 10:44:09.000000000 +0900
> @@ -633,6 +633,7 @@ static void shrink_cache(struct zone *zo
>                         &page_list, &nr_scan);
>                 zone->nr_inactive -= nr_taken;
>                 zone->pages_scanned += nr_scan;
> +               zone->pages_scanned_total += nr_scan;
>                 spin_unlock_irq(&zone->lru_lock);
> 
>                 if (nr_taken == 0)
> @@ -713,6 +714,7 @@ refill_inactive_zone(struct zone *zone,
>                 pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
>                                 &l_hold, &pgscanned);
>                 zone->pages_scanned += pgscanned;
> +               zone->pages_scanned_total += pgscanned;
>                 zone->nr_active -= pgmoved;
>                 spin_unlock_irq(&zone->lru_lock);
> 

2005-10-26 07:04:20

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Hi again Marcelo,

On 10/25/05, Marcelo Tosatti <[email protected]> wrote:
> On Tue, Oct 25, 2005 at 08:37:52PM +0900, Magnus Damm wrote:
> > DMA: each page has been scanned ~37 times
> > Normal: each page has been scanned ~15 times
> > HighMem: each page has been scanned ~18 times
> >
> > So if your user space page happens to be allocated from the DMA zone,
> > it looks like it is more probable that it will be paged out sooner
> > than if it was allocated from another zone. And this is on a half year
> > old P4 system.
>
> Well the higher relative pressure on a specific zone is a fact you have
> to live with.

Yes, and even if the DMA zone was removed we still would have the same
issue with highmem vs lowmem.

> Even with a global LRU you're going to suffer from the same issue once
> you've got different relative pressure on different zones.

Yep, the per-node LRU will not even out the pressure. But my main
concern is rather the side effect of the pressure difference than the
pressure difference itself.

The side effect is that the "wrong" pages may be paged out in a
per-zone LRU compared to a per-node LRU. This may or may not be a big
deal for performance.

> Thats the reason for the mechanisms which attempt to avoid allocating
> from the lower precious zones (lowmem_reserve and the allocation
> fallback logic).

Exactly. But will this logic always work well? With some memory
configurations the normal zone might be smaller than the DMA zone. And
the same applies for highmem vs normal zone. I'm not sure, but doesn't
the size of the zones somehow relate to the memory pressure?

> > > > There are probably not that many drivers using the DMA zone on a
> > > > modern PC, so instead of bringing performance penalty on the entire
> > > > system I think it would be nicer to punish the evil hardware instead.
> > >
> > > Agreed - the 16MB DMA zone is silly. Would love to see it go away...
> >
> > But is the DMA zone itself evil, or just that we have one LRU per zone...?
>
> I agree that per-zone LRU complicates global page aging (you simply don't have
> global aging).
>
> But how to deal with restricted allocation requirements otherwise?
> Scanning several GB's worth of pages looking for pages in a specific
> small range can't be very promising.

I'm not sure exactly how much of the buddy allocator design is
currently used by the kernel, but I suspect that 99.9% of all
allocations are 0-order. So it probably makes sense to optimize for
such a case.

Maybe it is possible to scrap the zones and instead use:

0-order pages without restrictions (common case):
Free pages in the node are chained together and either kept on one
list (64 bit system or 32 bit system without highmem) or on two lists;
one for lowmem and one for highmem. Maybe per cpu lists should be used
on top of this too.

Other pages (>0-order, special requirements):
Each node has a bitmap where pages belonging to the node are
represented by one bit each. Each bit describes the per-page status:
a value of 0 means that the page is used/reserved, and a 1 means that
the page is either free or allocated in a way that still allows the
data to be migrated or paged out.

So a page marked as 1 may be on the 0-order list, in use on some LRU,
or maybe even migratable SLAB.

The functions in linux/bitmap.h or asm/bitops.h are then used to scan
through the bitmap to find contiguous pages within a certain range of
pages. This allows us to fulfill all sorts of funky requirements such
as alignment or "within N address bits".

The allocator should of course prefer free pages over "used but
migratable", but if no free pages exist to fulfill the requirement,
page migration is used to empty the contiguous range.

The drawback of the idea above is of course the overhead (both memory
and cpu) introduced by the bitmap. But the allocator above may be more
successful for N-order allocations than the buddy allocator since the
pages don't have to be aligned. The allocator will probably be even
more successful if page migration is used too.

And then you have a per-node LRU on top of the above. =)
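
Roughly, the bitmap scan I have in mind would look something like the
sketch below. Completely untested, and the helper itself is made up (only
find_next_bit()/test_bit() are real); it is just meant to illustrate
finding 'nr' contiguous free-or-migratable pages inside a restricted
range:

        /* return the start of a run of 'nr' set bits in [start, end), or 'end' */
        static unsigned long scan_movable_range(unsigned long *map,
                                                unsigned long start,
                                                unsigned long end,
                                                unsigned long nr)
        {
                unsigned long base = start;

                while (base + nr <= end) {
                        unsigned long i;

                        /* jump to the next free-or-migratable page */
                        base = find_next_bit(map, end, base);
                        if (base + nr > end)
                                break;

                        /* check that the following nr - 1 pages qualify too */
                        for (i = 1; i < nr; i++)
                                if (!test_bit(base + i, map))
                                        break;
                        if (i == nr)
                                return base;    /* usable range found */

                        base += i + 1;          /* restart after the hole */
                }
                return end;                     /* nothing found */
        }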

> Hope to be useful comments.

Yes, very useful. Many thanks!

/ magnus

2005-10-27 20:08:11

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Hi Magnus!

On Wed, Oct 26, 2005 at 04:04:18PM +0900, Magnus Damm wrote:
> Hi again Marcelo,
>
> On 10/25/05, Marcelo Tosatti <[email protected]> wrote:
> > On Tue, Oct 25, 2005 at 08:37:52PM +0900, Magnus Damm wrote:
> > > DMA: each page has been scanned ~37 times
> > > Normal: each page has been scanned ~15 times
> > > HighMem: each page has been scanned ~18 times

Can you verify if there are GFP_DMA allocations happening on this box?

A search for GFP_DMA in the current v2.6 tree shows mostly older network
drivers, older sound drivers, and SCSI (which needs to cope with older
ISA cards).

Ah, now I remember an interesting fact. balance_pgdat(), while
doing reclaim zone iteration in the DMA->highmem direction, uses
SWAP_CLUSTER_MAX (32) as the number of pages to reclaim for every zone,
independently of zone size.

That's clearly unfair when you think that the ratio between normal/dma
zone sizes is much higher than the ratio of normal/dma _freed pages_
(needs to be confirmed with real numbers). Only the amount of pages to
scan (sc->nr_to_scan) is relative to zone size.

While playing with ARC (http://www.linux-mm.org/AdvancedPageReplacement)
ideas earlier this year I noticed that the reclaim work on the DMA
zone was excessive (which, I thought at the time, was entirely due to
modified shrink_zone logic). The fair approach would be to have the
number of pages to reclaim also relative to zone size.

sc->nr_to_reclaim = (zone->present_pages * sc->swap_cluster_max) /
total_memory;

Which worked very well under said situation, bringing the reclaim work
back to apparent fairness. Maybe you can try something similar with a
stock kernel?

Full patch can be found at
http://marc.theaimsgroup.com/?l=linux-mm&m=112387857203221&w=2
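
Just to put numbers on that formula, using the zone sizes from your box
above and assuming total_memory is roughly the sum of present pages
(4096 + 225280 + 30342 = 259718): the Normal zone would get about
225280 * 32 / 259718 ~= 28 pages per pass, while the DMA zone would get
4096 * 32 / 259718, which rounds down to zero -- so presumably the value
would have to be clamped to at least one page.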

However I did not notice the issue with vanilla at the time, a quick
look at its numbers seemed to exhibit fair behaviour (your numbers
disagree though).

The benchmark I was using: dbench. Not very much of a meaningful test.

Ah, I had split up the "pgfree" counter exported in /proc/vmstat to be
per-zone so I could see the differences (you could enhance your tscanned
patch to also count freed pages).

Another useful number in this game which is not available at the moment
AFAIK is relative pressure (number of allocations divided by zone size).

> > > So if your user space page happens to be allocated from the DMA zone,
> > > it looks like it is more probable that it will be paged out sooner
> > > than if it was allocated from another zone. And this is on a half year
> > > old P4 system.
> >
> > Well the higher relative pressure on a specific zone is a fact you have
> > to live with.
>
> Yes, and even if the DMA zone was removed we still would have the same
> issue with highmem vs lowmem.

Yep.

> > Even with a global LRU you're going to suffer from the same issue once
> > you've got different relative pressure on different zones.
>
> Yep, the per-node LRU will not even out the pressure. But my main
> concern is rather the side effect of the pressure difference than the
> pressure difference itself.
>
> The side effect is that the "wrong" pages may be paged out in a
> per-zone LRU compared to a per-node LRU. This may or may not be a big
> deal for performance.
>
> > Thats the reason for the mechanisms which attempt to avoid allocating
> > from the lower precious zones (lowmem_reserve and the allocation
> > fallback logic).
>
> Exactly. But will this logic always work well? With some memory
> configurations the normal zone might be smaller than the DMA zone. And
> the same applies for highmem vs normal zone.

It should - but I'm not sure really.

Andrea, Andrew, Nick et al. have been doing most of the tuning in 2.6.

> I'm not sure, but doesn't the size of the zones somehow relate to the
> memory pressure?

Yep. The ratio between the number of allocations to a given zone and the
zone size can be thought of as "relative pressure".

The smaller the zone, the higher the pressure.

> > > > > There are probably not that many drivers using the DMA zone on a
> > > > > modern PC, so instead of bringing performance penalty on the entire
> > > > > system I think it would be nicer to punish the evil hardware instead.
> > > >
> > > > Agreed - the 16MB DMA zone is silly. Would love to see it go away...
> > >
> > > But is the DMA zone itself evil, or just that we have one LRU per zone...?
> >
> > I agree that per-zone LRU complicates global page aging (you simply don't have
> > global aging).
> >
> > But how to deal with restricted allocation requirements otherwise?
> > Scanning several GB's worth of pages looking for pages in a specific
> > small range can't be very promising.
>
> I'm not sure exactly how much of the buddy allocator design that
> currently is used by the kernel, but I suspect that 99.9% of all
> allocations are 0-order. So it probably makes sense to optimize for
> such a case.
>
> Maybe it is possible to scrap the zones and instead use:
>
> 0-order pages without restrictions (common case):
> Free pages in the node are chained together and either kept on one
> list (64 bit system or 32 bit system without highmem) or on two lists;
> one for lowmem and one for highmem. Maybe per cpu lists should be used
> on top of this too.
>
> Other pages (>0-order, special requirements):
> Each node has a bitmap where pages belonging to the node are
> represented by one bit each. Each bit is used to determine if the
> per-page status. A value of 0 means that the page is used/reserved,
> and a 1 means that the page is either free or allocated somehow but it
> is possible migrate or page out the data.
>
> So a page marked as 1 may be on the 0-order list, in use on some LRU,
> or maybe even migratable SLAB.
>
> The functions in linux/bitmap.h or asm/bitops.h are then used to scan
> through the bitmap to find contiguous pages within a certain range of
> pages. This allows us to fulfill all sorts of funky requirements such
> as alignment or "within N address bits".
>
> The allocator should of course prefer free pages over "used but
> migratable", but if no free pages exist to fulfill the requirement,
> page migration is used to empty the contiguous range.
>
> The drawback of the idea above is of course the overhead (both memory
> and cpu) introduced by the bitmap. But the allocator above may be more
> successful for N-order allocations than the buddy allocator since the
> pages doesn't have to be aligned. The allocator will probably be even
> more successful if page migration is used too.
>
> And then you have a per-node LRU on top of the above. =)

Yep, sounds feasible.

An interesting test on x86 would be to have all ZONE_NORMAL pages in
ZONE_DMA (which is what arches with no accessibility limitation do).
That way we could see the impact of managing the 16MB ZONE_DMA.

I like the idea of penalizing the 16MB-limited users in favour of
increasing global aging efficiency, as you suggest.

After all, such hardware will only become more ancient and rare as time
goes on.

2005-10-27 20:44:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Marcelo Tosatti <[email protected]> wrote:
>
> The fair approach would be to have the
> number of pages to reclaim also relative to zone size.
>
> sc->nr_to_reclaim = (zone->present_pages * sc->swap_cluster_max) /
> total_memory;

You can try it, but that shouldn't matter. SWAP_CLUSTER_MAX is just a
batching factor used to reduce CPU consumption. If you make it twice as
big, we run DMA-zone reclaim half as often - it should balance out.
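
(For scale: with the sort of zone sizes Magnus posted - a 4096-page DMA
zone out of roughly 260000 pages total - and assuming SWAP_CLUSTER_MAX is
32, the proposed scaling would give the DMA zone a target of about
4096 * 32 / 260000 ~= 0.5 pages per pass, i.e. it would hardly be scanned
at all, versus a flat SWAP_CLUSTER_MAX today.)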

2005-10-28 02:42:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Hi Andrew!

On Thu, Oct 27, 2005 at 01:43:47PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> > The fair approach would be to have the
> > number of pages to reclaim also relative to zone size.
> >
> > sc->nr_to_reclaim = (zone->present_pages * sc->swap_cluster_max) /
> > total_memory;
>
> You can try it, but that shouldn't matter. SWAP_CLUSTER_MAX is just a
> batching factor used to reduce CPU consumption. If you make it twice as
> big, we run DMA-zone reclaim half as often - it should balance out.

But you're not taking the relationship between DMA and NORMAL zone
into account?

I suppose that a side effect of such a change is that more allocations
will become serviced from the NORMAL/HIGHMEM zones ("more intensively
reclaimed") while fewer allocations will become serviced by the DMA zone
(whose scan/reclaim pressure should now be _much_ lighter than that of
the NORMAL zone), i.e. the DMA zone will be much less often "available"
for GFP_HIGHMEM/GFP_KERNEL allocations, which are the vast majority.

Might be talking BS though.

What else could explain these numbers from Magnus, taking into account
that a large number of pages in the DMA zone are used for kernel text,
etc.? This imbalance seems potentially suboptimal (and results in
unpredictable behaviour depending on which zone pages get allocated
from):

"$ cat /proc/zoneinfo | grep present
present 4096
present 225280
present 30342

$ cat /proc/zoneinfo | grep tscanned
tscanned 151352
tscanned 3480599
tscanned 541466

"tscanned" counts how many pages that has been scanned in each zone
since power on. Executive summary assuming that only LRU pages exist
in the zone:

DMA: each page has been scanned ~37 times
Normal: each page has been scanned ~15 times
HighMem: each page has been scanned ~18 times"

I feel that I'm reaching the point where things should be confirmed
instead of guessed (on my part!).

2005-10-28 03:08:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] Swap migration V3: Overview

Marcelo Tosatti <[email protected]> wrote:
>
> Hi Andrew!
>
> On Thu, Oct 27, 2005 at 01:43:47PM -0700, Andrew Morton wrote:
> > Marcelo Tosatti <[email protected]> wrote:
> > >
> > > The fair approach would be to have the
> > > number of pages to reclaim also relative to zone size.
> > >
> > > sc->nr_to_reclaim = (zone->present_pages * sc->swap_cluster_max) /
> > > total_memory;
> >
> > You can try it, but that shouldn't matter. SWAP_CLUSTER_MAX is just a
> > batching factor used to reduce CPU consumption. If you make it twice as
> > big, we run DMA-zone reclaim half as often - it should balance out.
>
> But you're not taking the relationship between DMA and NORMAL zone
> into account?

We need to be careful to differentiate between page allocation and page
reclaim. In some ways they're coupled, but the VM does attempt to make one
independent of the other.

> I suppose that a side effect of such a change is that more allocations
> will become serviced from the NORMAL/HIGHMEM zones ("more intensively
> reclaimed") while fewer allocations will become serviced by the DMA zone
> (whose scan/reclaim pressure should now be _much_ lighter than that of
> the NORMAL zone), i.e. the DMA zone will be much less often "available"
> for GFP_HIGHMEM/GFP_KERNEL allocations, which are the vast majority.

The use of SWAP_CLUSTER_MAX in the reclaim code shouldn't affect the
inter-zone balancing over in the allocation code. Much.

> Might be talking BS though.
>
> What else could explain these numbers from Magnus, taking into account
> that a large number of pages in the DMA zone are used for kernel text,
> etc.? This imbalance seems potentially suboptimal (and results in
> unpredictable behaviour depending on which zone pages get allocated
> from):
>
> "$ cat /proc/zoneinfo | grep present
> present 4096
> present 225280
> present 30342
>
> $ cat /proc/zoneinfo | grep tscanned
> tscanned 151352
> tscanned 3480599
> tscanned 541466
>
> "tscanned" counts how many pages that has been scanned in each zone
> since power on. Executive summary assuming that only LRU pages exist
> in the zone:
>
> DMA: each page has been scanned ~37 times
> Normal: each page has been scanned ~15 times
> HighMem: each page has been scanned ~18 times"

Yes, I've noticed that.

> I feel that I'm reaching the point where things should be confirmed
> instead of guessed (on my part!).

Need to check the numbers, but I expect you'll find that ZONE_DMA is
basically never used for either __GFP_HIGHMEM or GFP_KERNEL allocations,
due to the watermark thingies.

So it's basically just sitting there, being used by GFP_DMA allocations.
And IIRC there _is_ a batch of GFP_DMA allocations early in boot for
block-related stuff(?). It's all hazy ;)

But that would mean that most of the ZONE_DMA pages are used for
unreclaimable purposes, and only a small proportion of them are on the LRU.
That might cause the arithmetic to perform more scanning down there.
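
(The "watermark thingies" boil down to roughly the check below - a
simplified sketch of what zone_watermark_ok() does with lowmem_reserve[];
the real code also walks the per-order free area lists:)

#include <linux/mmzone.h>

/*
 * Simplified sketch: when an allocation whose "class" zone is
 * classzone_idx (e.g. ZONE_NORMAL or ZONE_HIGHMEM) falls back to a lower
 * zone z, that zone must keep lowmem_reserve[classzone_idx] pages free on
 * top of the usual watermark.  For the 16MB DMA zone that reserve is
 * normally big enough that GFP_KERNEL/__GFP_HIGHMEM allocations
 * practically never land there.
 */
static int zone_ok_for_fallback(struct zone *z, int order, unsigned long mark,
				int classzone_idx)
{
	long free_pages = (long)z->free_pages - (1L << order) + 1;

	return free_pages > (long)(mark + z->lowmem_reserve[classzone_idx]);
}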