2005-02-12 03:29:20

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Overview
--------

The purpose of this set of patches is to introduce (one part of) the
necessary kernel infrastructure to support "manual page migration".
That phrase is intended to describe a facility whereby some user program
(most likely a batch scheduler) is given the responsibility of managing
where jobs run on a large NUMA system. If it turns out that a job needs
to be run on a different set of nodes from where it is running now,
then that user program would invoke this facility to move the job to
the new set of nodes.

We use the word "manual" here to indicate that the facility is invoked
explicitly: the kernel is told where to move things. We distinguish
this approach from the "automatic page migration" facilities that have
been proposed in the past. To us, "automatic page migration" implies
using hardware counters to determine where pages should reside and
having the O/S automatically move misplaced pages. The utility of such
facilities on IRIX, for example, has been mixed, and we are not
currently proposing such a facility for Linux.

The normal sequence of events would be as follows: A job is running
on, say, nodes 5-8, and a higher priority job arrives and the only place
it can be run, for whatever reason, is nodes 5-8. Then the scheduler
would suspend the processes of the existing job (by, for example, sending
them a SIGSTOP) and start the new job on those nodes. At some point in
the future, other nodes become available for use, and at this point the
batch scheduler would invoke the manual page migration facility to move
the processes of the suspended job from nodes 5-8 to the new set of nodes.

Note that not all of the pages of all of the processes will need to (or
should) be moved. For example, pages of shared libraries are likely to be
shared by many processes in the system; these pages should not be moved
merely because a few processes using these libraries have been migrated.
For the moment, we are going to defer the problem of determining which
pages should be moved; a solution to this problem will be the subject
of a subsequent patch set.

So, for now let us assume that we have determined that a particular
set of pages associated with a particular process need to be moved.
The kernel interface that we are proposing is the following:

page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

Here va_start and va_end are assumed to be mapped by the same vma; these
addresses effectively let the caller specify all (or part) of an
address space map as given in /proc/pid/maps. count is the number
of entries in the old_nodes and new_nodes arrays. The effect of this
system call is to cause all pages in the page range specified that are
found to be resident on old_nodes[i] to be moved to new_nodes[i].
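
For concreteness, here is a minimal user-space sketch of how a batch
scheduler might drive this interface; the wrapper, pid, addresses, and
node numbers below are all hypothetical (the syscall number is the
ia64 slot used in patch 7/7):

#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define __NR_page_migrate 1279	/* hypothetical; ia64 slot from patch 7/7 */

static long page_migrate(pid_t pid, unsigned long va_start,
			 unsigned long va_end, int count,
			 short *old_nodes, short *new_nodes)
{
	return syscall(__NR_page_migrate, pid, va_start, va_end,
		       count, old_nodes, new_nodes);
}

int main(void)
{
	pid_t job = 1234;			/* the suspended job */
	short old_nodes[] = { 5, 6, 7, 8 };	/* migrate off of these... */
	short new_nodes[] = { 1, 2, 3, 4 };	/* ...onto these */

	kill(job, SIGSTOP);	/* quiesce the job before moving its pages */
	/* one vma's range, as read from /proc/1234/maps */
	page_migrate(job, 0x60000000000UL, 0x60000020000UL,
		     4, old_nodes, new_nodes);
	kill(job, SIGCONT);
	return 0;
}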

In addition to its use by batch schedulers, we also envision that
this facility could be used by a program to re-arrange the allocation
of its own pages on various nodes of the NUMA system, most likely
to optimize performance of the application during different phases
of its computation.

Implementation Details
----------------------

This patch depends on the implementation of page migration from the
Memory Hotplug Patch (see http://sr71.net/patches; this patch set is
maintained by Dave Hansen of IBM and many other contributors). Recently,
I worked with Dave to rearrange the sequence of the hotplug patches so
that the page migration patch could be applied by itself and then the
rest of the Memory Hotplug patches could be applied on top of that patch.
(In the following and in the descriptions of the other patches, we will
refer to the page migration patch and to the Memory Hotplug patch itself
-- by this we mean the patches available as, for example:

patch-2.6.11-rc2-mm3-mhp1-pm.gz

and the rest of the hotplug patches available in

broken-out-2.6.11-rc2-mm2-mhp1.tar.gz

The latter actually includes the page migration patch, but we will use
the term Memory Hotplug patch to mean the patchset that starts with
patch "A1.1-refactor-setup_memory-i386.patch" in the series file for the
broken-out patches. The page-migration patch consists of the patches
before that; these patches have names that start with "AA-".)

Given this powerful underlying framework, the implementation of manual
page migration is relatively straightforward. There are 7 patches
supplied here; the first 5 of them are cleanup patches of various sorts
for the page migration patch.

Patches 6 and 7 of the series implement the system call described
above.

Limitations of the Current Patch
--------------------------------

This is, after all, an RFC and the current patch is only prototype code.
It is being sent to the list in its current form to get some early
comments back and to allow for careful validation of the approach
that has been selected, before so much code has been written that the
project has solidified and become difficult to change. I welcome the
opportunity for others to examine this patch, suggest improvements,
point out bugs, and comment on coding style or algorithms. I will,
however, be away from the office for the next week, so will likely not
be able to respond until the week of Feb 21st.

There are several things that this patch does not do, however, and
we hope to resolve some of these issues in subsequent versions of the
patchset:

(1) There is no security or authentication checking. Any process
can migrate any pages of any other process. This needs to
be addressed.

(2) We have not figured out yet what to do about the interaction
between page migration and Andi Kleen's memory policy infrastructure.
Presumably the memory policy data structures will have to be
updated either as part of the system call above or through
a new (or existing) system call.

(3) As previously mentioned, we have omitted a glaring detail --
how to determine what pages to migrate. I have an algorithm
and code to solve this problem, but it is still a little
buggy and I wanted to get the ball rolling with what already
existed and seems to work reasonably well.

(4) It is likely that we will add a new operation to the vm_ops
structure -- the "page_migration" routine. The reason for
this is to give each type of memory object a way to migrate
its own pages. We have not included code for this in the
current patch.

(5) There are still some small bugs relating to what happens to
non-present pages. These issues should not hinder evaluation
or discussion of the overall approach, however.

Finally, it is my goal to include the migration cache patch in the
final version of this code; however, at the moment there are some
issues with that patch that are still being worked out, so it has
not been included in this version.

So, with all of the disclaimers and other details out of the
way, we should go on, in subsequent notes, to discuss each of the
7 patches. Remember that only the last 2 are really significant;
the others are mostly cleanup of warnings and the like.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------


2005-02-12 03:26:06

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 3/7] mm: manual page migration -- cleanup 3

Fix a trivial error in include/linux/mmigrate.h

Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===================================================================
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h 2005-02-11 10:08:10.000000000 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h 2005-02-11 11:22:34.000000000 -0800
@@ -1,5 +1,5 @@
-#ifndef _LINUX_MEMHOTPLUG_H
-#define _LINUX_MEMHOTPLUG_H
+#ifndef _LINUX_MMIGRATE_H
+#define _LINUX_MMIGRATE_H

#include <linux/config.h>
#include <linux/mm.h>
@@ -36,4 +36,4 @@ extern void arch_migrate_page(struct pag
static inline void arch_migrate_page(struct page *page, struct page *newpage) {}
#endif

-#endif /* _LINUX_MEMHOTPLUG_H */
+#endif /* _LINUX_MMIGRATE_H */

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:26:16

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 2/7] mm: manual page migration -- cleanup 2

This patch removes some remaining Memory HOTPLUG specific code
from the page migration patch. I have sent Dave Hansen the -R
version of this patch so that this code can be added back
later at the start of the Memory HOTPLUG patches themselves.

In particular, this patch removes some #ifdef CONFIG_MEMORY_HOTPLUG
code from the page migration patch.

Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.11-rc2-mm2/mm/vmalloc.c
===================================================================
--- linux-2.6.11-rc2-mm2.orig/mm/vmalloc.c 2005-02-11 10:08:10.000000000 -0800
+++ linux-2.6.11-rc2-mm2/mm/vmalloc.c 2005-02-11 10:35:47.000000000 -0800
@@ -523,16 +523,7 @@ EXPORT_SYMBOL(__vmalloc);
*/
void *vmalloc(unsigned long size)
{
-#ifdef CONFIG_MEMORY_HOTPLUG
- /*
- * XXXX: This is temprary code, which should be replaced with proper one
- * after the scheme to specify hot removable region has defined.
- * 25/Sep/2004 -- taka
- */
- return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
-#else
return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
-#endif
}

EXPORT_SYMBOL(vmalloc);
Index: linux-2.6.11-rc2-mm2/mm/shmem.c
===================================================================
--- linux-2.6.11-rc2-mm2.orig/mm/shmem.c 2005-02-11 10:08:10.000000000 -0800
+++ linux-2.6.11-rc2-mm2/mm/shmem.c 2005-02-11 10:35:47.000000000 -0800
@@ -93,16 +93,7 @@ static inline struct page *shmem_dir_all
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
*/
-#ifdef CONFIG_MEMORY_HOTPLUG
- /*
- * XXXX: This is temprary code, which should be replaced with proper one
- * after the scheme to specify hot removable region has defined.
- * 25/Sep/2004 -- taka
- */
- return alloc_pages(gfp_mask & ~__GFP_HIGHMEM, PAGE_CACHE_SHIFT-PAGE_SHIFT);
-#else
return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
-#endif
}

static inline void shmem_dir_free(struct page *page)

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:29:14

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 1/7] mm: manual page migration -- cleanup 1

This patch removes some remaining Memory HOTPLUG specific code
from the page migration patch. I have sent Dave Hansen the -R
version of this patch so that this code can be added back
later at the start of the Memory HOTPLUG patches themselves.

In particular, this patch removes VM_IMMOVABLE and MAP_IMMOVABLE.

Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.10-mm1-page-migration/kernel/fork.c
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/kernel/fork.c 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/kernel/fork.c 2005-01-10 09:14:03.000000000 -0800
@@ -211,7 +211,7 @@ static inline int dup_mmap(struct mm_str
if (IS_ERR(pol))
goto fail_nomem_policy;
vma_set_policy(tmp, pol);
- tmp->vm_flags &= ~(VM_LOCKED|VM_IMMOVABLE);
+ tmp->vm_flags &= ~(VM_LOCKED);
tmp->vm_mm = mm;
tmp->vm_next = NULL;
anon_vma_link(tmp);
Index: linux-2.6.10-mm1-page-migration/include/linux/mm.h
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/include/linux/mm.h 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/include/linux/mm.h 2005-01-10 09:14:04.000000000 -0800
@@ -164,7 +164,6 @@ extern unsigned int kobjsize(const void
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
-#define VM_IMMOVABLE 0x01000000 /* Don't place in hot removable area */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6.10-mm1-page-migration/include/linux/mman.h
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/include/linux/mman.h 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/include/linux/mman.h 2005-01-10 10:05:54.000000000 -0800
@@ -61,8 +61,7 @@ calc_vm_flag_bits(unsigned long flags)
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
- _calc_vm_trans(flags, MAP_IMMOVABLE, VM_IMMOVABLE );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
}

#endif /* _LINUX_MMAN_H */
Index: linux-2.6.10-mm1-page-migration/arch/i386/kernel/sys_i386.c
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/arch/i386/kernel/sys_i386.c 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/arch/i386/kernel/sys_i386.c 2005-01-10 09:14:04.000000000 -0800
@@ -70,7 +70,7 @@ asmlinkage long sys_mmap2(unsigned long
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
{
- return do_mmap2(addr, len, prot, flags & ~MAP_IMMOVABLE, fd, pgoff);
+ return do_mmap2(addr, len, prot, flags, fd, pgoff);
}

/*
@@ -101,7 +101,7 @@ asmlinkage int old_mmap(struct mmap_arg_
if (a.offset & ~PAGE_MASK)
goto out;

- err = do_mmap2(a.addr, a.len, a.prot, a.flags & ~MAP_IMMOVABLE,
+ err = do_mmap2(a.addr, a.len, a.prot, a.flags,
a.fd, a.offset >> PAGE_SHIFT);
out:
return err;
Index: linux-2.6.10-mm1-page-migration/include/asm-ppc64/mman.h
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/include/asm-ppc64/mman.h 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/include/asm-ppc64/mman.h 2005-01-10 09:14:04.000000000 -0800
@@ -38,7 +38,6 @@

#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MAP_IMMOVABLE 0x20000

#define MADV_NORMAL 0x0 /* default page-in behavior */
#define MADV_RANDOM 0x1 /* page-in minimum required */
Index: linux-2.6.10-mm1-page-migration/include/asm-i386/mman.h
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/include/asm-i386/mman.h 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/include/asm-i386/mman.h 2005-01-10 09:14:04.000000000 -0800
@@ -22,7 +22,6 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MAP_IMMOVABLE 0x20000

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
Index: linux-2.6.10-mm1-page-migration/fs/aio.c
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/fs/aio.c 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/fs/aio.c 2005-01-10 09:14:04.000000000 -0800
@@ -134,7 +134,7 @@ static int aio_setup_ring(struct kioctx
down_write(&ctx->mm->mmap_sem);
info->mmap_base = do_mmap(NULL, 0, info->mmap_size,
PROT_READ|PROT_WRITE,
- MAP_ANON|MAP_PRIVATE|MAP_IMMOVABLE,
+ MAP_ANON|MAP_PRIVATE,
0);
if (IS_ERR((void *)info->mmap_base)) {
up_write(&ctx->mm->mmap_sem);
Index: linux-2.6.10-mm1-page-migration/include/asm-ia64/mman.h
===================================================================
--- linux-2.6.10-mm1-page-migration.orig/include/asm-ia64/mman.h 2005-01-10 08:46:51.000000000 -0800
+++ linux-2.6.10-mm1-page-migration/include/asm-ia64/mman.h 2005-01-10 09:14:04.000000000 -0800
@@ -30,7 +30,6 @@
#define MAP_NORESERVE 0x04000 /* don't check for reservations */
#define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MAP_IMMOVABLE 0x20000

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:29:20

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 5/7] mm: manual page migration -- cleanup 5

Fix up a switch statement so gcc doesn't complain about it.

Signed-off-by: Ray Bryant <[email protected]>

Index: linux/mm/mmigrate.c
===================================================================
--- linux.orig/mm/mmigrate.c 2005-01-30 11:13:58.000000000 -0800
+++ linux/mm/mmigrate.c 2005-01-30 11:19:33.000000000 -0800
@@ -319,17 +319,17 @@ generic_migrate_page(struct page *page,
/* Wait for all operations against the page to finish. */
ret = migrate_fn(page, newpage, &vlist);
switch (ret) {
- default:
- /* The page is busy. Try it later. */
- goto out_busy;
case -ENOENT:
/* The file the page belongs to has been truncated. */
page_cache_get(page);
page_cache_release(newpage);
newpage->mapping = NULL;
- /* fall thru */
+ break;
case 0:
- /* fall thru */
+ break;
+ default:
+ /* The page is busy. Try it later. */
+ goto out_busy;
}

arch_migrate_page(page, newpage);

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:34:57

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 4/7] mm: manual page migration -- cleanup 4

Add some extern declarations to include/linux/mmigrate.h to
eliminate some "implicit declaration" warnings.

Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===================================================================
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h 2005-02-11 11:23:46.000000000 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h 2005-02-11 11:50:27.000000000 -0800
@@ -17,6 +17,9 @@ extern int page_migratable(struct page *
struct list_head *);
extern struct page * migrate_onepage(struct page *, int nodeid);
extern int try_to_migrate_pages(struct list_head *);
+extern int migration_duplicate(swp_entry_t);
+extern struct page * lookup_migration_cache(int);
+extern int migration_remove_reference(struct page *, int);

#else
static inline int generic_migrate_page(struct page *page, struct page *newpage,

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:35:07

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 6/7] mm: manual page migration -- add node_map arg to try_to_migrate_pages()

To migrate pages from one node to another, we need to tell
try_to_migrate_pages() which nodes we want to migrate off
of and where to migrate the pages found on each such node.

We do this by adding the node_map array argument to
try_to_migrate_pages(); node_map[N] gives the target
node to migrate pages to from node N.

This patch depends on a previous patch I submitted that
adds a node argument to migrate_onepage(); that patch
is currently part of the Memory HOTPLUG page migration
patch.

node_migrate_onepage() is introduced to handle the case
where node_map is NULL (i.e., the caller doesn't care where
we migrate the page, just out of this node) or the system
is not a NUMA system.

Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===================================================================
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h 2005-02-11 11:50:27.000000000 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h 2005-02-11 11:52:50.000000000 -0800
@@ -16,11 +16,26 @@ extern int migrate_page_buffer(struct pa
extern int page_migratable(struct page *, struct page *, int,
struct list_head *);
extern struct page * migrate_onepage(struct page *, int nodeid);
-extern int try_to_migrate_pages(struct list_head *);
+extern int try_to_migrate_pages(struct list_head *, short *);
extern int migration_duplicate(swp_entry_t);
extern struct page * lookup_migration_cache(int);
extern int migration_remove_reference(struct page *, int);

+#ifdef CONFIG_NUMA
+static inline struct page *node_migrate_onepage(struct page *page, short *node_map)
+{
+ if (node_map)
+ return migrate_onepage(page, node_map[page_to_nid(page)]);
+ else
+ return migrate_onepage(page, MIGRATE_NODE_ANY);
+}
+#else
+static inline struct page *node_migrate_onepage(struct page *page, short *node_map)
+{
+ return migrate_onepage(page, MIGRATE_NODE_ANY);
+}
+#endif
+
#else
static inline int generic_migrate_page(struct page *page, struct page *newpage,
int (*fn)(struct page *, struct page *))
Index: linux-2.6.11-rc2-mm2/mm/mmigrate.c
===================================================================
--- linux-2.6.11-rc2-mm2.orig/mm/mmigrate.c 2005-02-11 11:50:40.000000000 -0800
+++ linux-2.6.11-rc2-mm2/mm/mmigrate.c 2005-02-11 11:51:04.000000000 -0800
@@ -502,9 +502,11 @@ out_unlock:
/*
* This is the main entry point to migrate pages in a specific region.
* If a page is inactive, the page may be just released instead of
- * migration.
+ * migration. node_map is supplied in those cases (on NUMA systems)
+ * where the caller wishes to specify to which nodes the pages are
+ * migrated. If node_map is null, the target node is MIGRATE_NODE_ANY.
*/
-int try_to_migrate_pages(struct list_head *page_list)
+int try_to_migrate_pages(struct list_head *page_list, short *node_map)
{
struct page *page, *page2, *newpage;
LIST_HEAD(pass1_list);
@@ -542,7 +544,7 @@ int try_to_migrate_pages(struct list_hea
list_for_each_entry_safe(page, page2, &pass1_list, lru) {
list_del(&page->lru);
if (PageLocked(page) || PageWriteback(page) ||
- IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+ IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
if (page_count(page) == 1) {
/* the page is already unused */
putback_page_to_lru(page_zone(page), page);
@@ -560,7 +562,7 @@ int try_to_migrate_pages(struct list_hea
*/
list_for_each_entry_safe(page, page2, &pass2_list, lru) {
list_del(&page->lru);
- if (IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+ if (IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
if (page_count(page) == 1) {
/* the page is already unused */
putback_page_to_lru(page_zone(page), page);

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 03:35:08

by Ray Bryant

Subject: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

This patch introduces the sys_page_migrate() system call:

sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

Its intent is to cause the pages in the range given that are found on
old_nodes[i] to be moved to new_nodes[i]. Count is the number of
entries in these two arrays of shorts.

Restrictions and limitations of this version:

(1) va_start and va_end must be mapped by the same vma. (The user
can read /proc/pid/maps to find out the appropriate vma ranges.)
This could easily be generalized, but has not been done for the
moment.

(2) There is no capability or authority checking being done here.
Any process can migrate any other process's pages. This will be
fixed in a future version, once we agree on what the authority
model should be.

(3) Eventually, we plan on adding a page_migrate entry to the
vm_operations_struct. The problem is, in general, that only
the object itself knows how to migrate its pages. For the
moment, we are only handling the cases of anonymous private
memory and memory mapped files, which covers practically all
known cases, but there are some other cases peculiar to
SN2 hardware that are not handled by the present code (e.g.
fetch & op storage). So for now, it is sufficient for us
to test the vma->vm_ops pointer; if this is null we are in
the anonymous private case, otherwise we are in the mapped
file case. The mapped file case handles mapped files, shared
anonymous storage, and shared segments.


Signed-off-by: Ray Bryant <[email protected]>

Index: linux-2.6.11-rc2-mm2/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.11-rc2-mm2.orig/arch/ia64/kernel/entry.S 2005-02-11 08:18:58.000000000 -0800
+++ linux-2.6.11-rc2-mm2/arch/ia64/kernel/entry.S 2005-02-11 16:07:27.000000000 -0800
@@ -1581,6 +1581,6 @@ sys_call_table:
data8 sys_ni_syscall
data8 sys_ni_syscall
data8 sys_ni_syscall
- data8 sys_ni_syscall
+ data8 sys_page_migrate // 1279

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.11-rc2-mm2/mm/mmigrate.c
===================================================================
--- linux-2.6.11-rc2-mm2.orig/mm/mmigrate.c 2005-02-11 16:07:27.000000000 -0800
+++ linux-2.6.11-rc2-mm2/mm/mmigrate.c 2005-02-11 16:10:13.000000000 -0800
@@ -588,6 +588,228 @@ int try_to_migrate_pages(struct list_hea
return nr_busy;
}

+static int
+migrate_vma_common(struct list_head *page_list, short *node_map, int count)
+{
+ int pass=0, remains, migrated;
+ struct page *page;
+
+ while(pass<10) {
+
+ remains = try_to_migrate_pages(page_list, node_map);
+
+ if (remains < 0)
+ return remains;
+
+ migrated = 0;
+ if (!list_empty(page_list))
+ list_for_each_entry(page, page_list, lru)
+ migrated++;
+ else {
+ migrated = count;
+ break;
+ }
+
+ pass++;
+
+ migrated = count - migrated;
+
+ /* wait a bit and try again */
+ msleep(10);
+
+ }
+ return migrated;
+}
+
+static int
+migrate_mapped_file_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, size_t va_start,
+ size_t va_end, short *node_map)
+{
+ struct page *page;
+ struct zone *zone;
+ struct address_space *as;
+ int count = 0, nid, ret;
+ LIST_HEAD(page_list);
+ long idx, start_idx, end_idx;
+
+ va_start = va_start & PAGE_MASK;
+ va_end = va_end & PAGE_MASK;
+ start_idx = (va_start - vma->vm_start) >> PAGE_SHIFT;
+ end_idx = (va_end - vma->vm_start) >> PAGE_SHIFT;
+
+ if (!vma->vm_file || !vma->vm_file->f_mapping)
+ BUG();
+
+ as = vma->vm_file->f_mapping;
+
+ for (idx = start_idx; idx <= end_idx; idx++) {
+ page = find_get_page(as, idx);
+ if (page) {
+ page_cache_release(page);
+
+ if (!page_mapcount(page) && !page->mapping)
+ BUG();
+
+ nid = page_to_nid(page);
+ if (node_map[nid] > 0) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) &&
+ __steal_page_from_lru(zone, page)) {
+ count++;
+ list_add(&page->lru, &page_list);
+ } else
+ BUG();
+ spin_unlock_irq(&zone->lru_lock);
+ }
+ }
+ }
+
+ ret = migrate_vma_common(&page_list, node_map, count);
+
+ return ret;
+
+}
+
+static int
+migrate_anon_private_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, size_t va_start,
+ size_t va_end, short *node_map)
+{
+ struct page *page;
+ struct zone *zone;
+ unsigned long vaddr;
+ int count = 0, nid, ret;
+ LIST_HEAD(page_list);
+
+ va_start = va_start & PAGE_MASK;
+ va_end = va_end & PAGE_MASK;
+
+ for (vaddr=va_start; vaddr<=va_end; vaddr += PAGE_SIZE) {
+ spin_lock(&mm->page_table_lock);
+ page = follow_page(mm, vaddr, 0);
+ spin_unlock(&mm->page_table_lock);
+ /*
+ * follow_page has been observed to return pages with zero
+ * mapcount and NULL mapping. Skip those pages as well
+ */
+ if (page && page_mapcount(page) && page->mapping) {
+ nid = page_to_nid(page);
+ if (node_map[nid] > 0) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) &&
+ __steal_page_from_lru(zone, page)) {
+ count++;
+ list_add(&page->lru, &page_list);
+ } else
+ BUG();
+ spin_unlock_irq(&zone->lru_lock);
+ }
+ }
+ }
+
+ ret = migrate_vma_common(&page_list, node_map, count);
+
+ return ret;
+}
+
+void lru_add_drain_per_cpu(void *info) {
+ lru_add_drain();
+}
+
+asmlinkage long
+sys_page_migrate(const pid_t pid, size_t va_start, size_t va_end,
+ const int count, caddr_t old_nodes, caddr_t new_nodes)
+{
+ int i, ret = 0;
+ short *tmp_old_nodes;
+ short *tmp_new_nodes;
+ short *node_map;
+ struct task_struct *task;
+ struct mm_struct *mm = 0;
+ size_t size = count*sizeof(short);
+ struct vm_area_struct *vma, *vma2;
+
+
+ tmp_old_nodes = (short *) kmalloc(size, GFP_KERNEL);
+ tmp_new_nodes = (short *) kmalloc(size, GFP_KERNEL);
+ node_map = (short *) kmalloc(MAX_NUMNODES*sizeof(short), GFP_KERNEL);
+
+ if (!tmp_old_nodes || !tmp_new_nodes || !node_map) {
+ ret = -ENOMEM;
+ goto out_nodec;
+ }
+
+ if (copy_from_user(tmp_old_nodes, old_nodes, size) ||
+ copy_from_user(tmp_new_nodes, new_nodes, size)) {
+ ret = -EFAULT;
+ goto out_nodec;
+ }
+
+ read_lock(&tasklist_lock);
+ task = find_task_by_pid(pid);
+ if (task) {
+ task_lock(task);
+ mm = task->mm;
+ if (mm)
+ atomic_inc(&mm->mm_users);
+ task_unlock(task);
+ } else {
+ ret = -ESRCH;
+ goto out_nodec;
+ }
+ read_unlock(&tasklist_lock);
+ if (!mm) {
+ ret = -EINVAL;
+ goto out_nodec;
+ }
+
+ /*
+ * for now, we require both the start and end addresses to
+ * be mapped by the same vma.
+ */
+ vma = find_vma(mm, va_start);
+ vma2 = find_vma(mm, va_end);
+ if (!vma || !vma2 || (vma != vma2)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* set up the node_map array */
+ for(i=0; i<MAX_NUMNODES; i++)
+ node_map[i] = -1;
+ for(i=0; i<count; i++)
+ node_map[tmp_old_nodes[i]] = tmp_new_nodes[i];
+
+ /* prepare for lru list manipulation */
+ smp_call_function(&lru_add_drain_per_cpu, NULL, 0, 1);
+ lru_add_drain();
+
+ /* actually do the migration */
+ if (vma->vm_ops)
+ ret = migrate_mapped_file_vma(task, mm, vma, va_start, va_end,
+ node_map);
+ else
+ ret = migrate_anon_private_vma(task, mm, vma, va_start, va_end,
+ node_map);
+
+out:
+ atomic_dec(&mm->mm_users);
+
+out_nodec:
+ if (tmp_old_nodes)
+ kfree(tmp_old_nodes);
+ if (tmp_new_nodes)
+ kfree(tmp_new_nodes);
+ if (node_map)
+ kfree(node_map);
+
+ return ret;
+
+}
+
EXPORT_SYMBOL(generic_migrate_page);
EXPORT_SYMBOL(migrate_page_common);
EXPORT_SYMBOL(migrate_page_buffer);

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-12 08:08:53

by Paul Jackson

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Minor comments ... nothing profound.

Ray wrote:
> once we agree on what the authority model should be.

Are the usual kill-like permissions sufficient?
You can migrate the pages of a process if you can kill it.
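
If they are, a sketch of such a check (modeled on the kill permission
test in kernel/signal.c; which capability should override is an open
question) could go right after the find_task_by_pid() lookup:

=========================================================================
	if ((current->euid ^ task->suid) && (current->euid ^ task->uid) &&
	    (current->uid ^ task->suid) && (current->uid ^ task->uid) &&
	    !capable(CAP_KILL))
		return -EPERM;	/* sketch only; real error path must unlock */
=========================================================================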

===

In the following routine, tighten up some vertical spacing,
add { ... } , ...

The migrated and count manipulations are confusing my
feeble brain. Is this thing supposed to return 0 if all
count pages are migrated? Sure seems that it does, as it
returns 'migrated', which is 'count - migrated', but that
migrated is really count, so it returns 'count - count',
which is zero. Huh ... The phrase 'return migrated' would
make me think it returned some count of how many were
migrated on success, not zero.

The variable name 'remains' is rather elaborate for what
looks like a trivial return case. But perhaps it actually
provides a better clue to the return value, which apparently
is the number of pages _not_ migrated successfully.

Think carefully about what each variable represents, and
then use each variable consistently.

And try to avoid the embedded 'return remains'. A function
header comment, saying what this routine does and returns might
be helpful.

=========================================================================
static int
migrate_vma_common(struct list_head *page_list, short *node_map, int count)
{
int pass, remains, migrated;
struct page *page;

for (pass = 0; pass < 10; msleep(10), pass++) {
remains = try_to_migrate_pages(page_list, node_map);
if (remains < 0)
return remains;

migrated = 0;
if (!list_empty(page_list)) {
list_for_each_entry(page, page_list, lru)
migrated++;
} else {
migrated = count;
break;
}
migrated = count - migrated;
}
return migrated;
}
=========================================================================

Better init tmp_new_nodes, node_map to 0, or if tmp_old_nodes fails to
allocate, you might try freeing bogus values for the other two in
sys_page_migrate():

===============================================================
+ short *tmp_old_nodes;
+ short *tmp_new_nodes;
+ short *node_map;
+ ...
+
+
+ tmp_old_nodes = (short *) kmalloc(size, GFP_KERNEL);
+ tmp_new_nodes = (short *) kmalloc(size, GFP_KERNEL);
+ node_map = (short *) kmalloc(MAX_NUMNODES*sizeof(short), GFP_KERNEL);
================================================================

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-12 11:17:32

by Andi Kleen

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Ray Bryant <[email protected]> writes:
> set of pages associated with a particular process need to be moved.
> The kernel interface that we are proposing is the following:
>
> page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

[Only commenting on the interface, haven't read your patches at all]

This is basically mbind() with MPOL_F_STRICT, except that it has a pid
argument. I assume that's for the benefit of your batch scheduler.

But it's not clear to me how and why the batch scheduler should know about
virtual addresses of different processes anyways. Walking
/proc/pid/maps? That's all inherently racy when the process is doing
mmap in parallel. The only way I can think of to do this would be to
check for changes in maps after a full move and loop, but then you risk
livelock.

And you cannot also just specify va_start=0, va_end=~0UL because that
would make the node arrays grow infinitely.

Also is there a good use case why the batch scheduler should only
move individual areas in a process around, not the full process?

I think the only sane way for an external process to move another
around is to do it for the whole process. For that you wouldn't need
most of the arguments, but just a simple move_process_vm call,
or perhaps just a file in /proc where the new node can be written to.

There may be an argument to do this for individual
tmpfs/hugetlbfs/sysv shm segments too, but mbind() already supports
that (just map them from a different process and change the policy there)

For process use you could just do it in mbind() or perhaps
part of the process policy (move page around when touched by process).

-Andi

2005-02-12 12:34:40

by Arjan van de Ven

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Fri, 2005-02-11 at 19:26 -0800, Ray Bryant wrote:
> This patch introduces the sys_page_migrate() system call:
>
> sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

are you really sure you want to expose nodes to userspace via an ABI
this solid and never changing? To me that feels somewhat like too much
of an internal thing to expose that will mean that those internals are
now set in stone due to the interface...


2005-02-12 14:49:04

by Andi Kleen

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Sat, Feb 12, 2005 at 07:34:32AM -0500, Arjan van de Ven wrote:
> On Fri, 2005-02-11 at 19:26 -0800, Ray Bryant wrote:
> > This patch introduces the sys_page_migrate() system call:
> >
> > sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> are you really sure you want to expose nodes to userspace via an ABI
> this solid and never changing? To me that feels somewhat like too much
> of an internal thing to expose that will mean that those internals are
> now set in stone due to the interface...

They're already exposed through mbind/set_mempolicy/get_mempolicy and sysfs
of course.

-Andi

2005-02-12 19:50:36

by Marcelo Tosatti

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> Ray Bryant <[email protected]> writes:
> > set of pages associated with a particular process need to be moved.
> > The kernel interface that we are proposing is the following:
> >
> > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> [Only commenting on the interface, haven't read your patches at all]
>
> This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> argument. I assume that's for the benefit of your batch scheduler.

As far as I understand mbind() is used to set policies to given memory
regions, not move memory regions?

> But it's not clear to me how and why the batch scheduler should know about
> virtual addresses of different processes anyways. Walking
> /proc/pid/maps? That's all inherently racy when the process is doing
> mmap in parallel. The only way I can think of to do this would be to
> check for changes in maps after a full move and loop, but then you risk
> livelock.

True.

There is no problem, however, if all threads belonging to the process are stopped,
as Ray mentions.

So, there won't be memory mapping changes happening at the same time.

Note that the memory migration code which sys_page_migrate() uses moves
running processes to other memory zones, handling truncate, etc.

> And you cannot also just specify va_start=0, va_end=~0UL because that
> would make the node arrays grow infinitely.
>
> Also is there a good use case why the batch scheduler should only
> move individual areas in a process around, not the full process?

Quoting him:

"In addition to its use by batch schedulers, we also envision that
this facility could be used by a program to re-arrange the allocation
of its own pages on various nodes of the NUMA system, most likely
to optimize performance of the application during different phases
of its computation."

Seems doable.

Are there any good examples of optimizations that could be made by
moving pages around except for NUMA?

Does IRIX have anything similar?

> I think the only sane way for an external process to move another
> around is to do it for the whole process. For that you wouldn't need
> most of the arguments, but just a simple move_process_vm call,
> or perhaps just a file in /proc where the new node can be written to.

It seems interesting for a process to move its own vma for optimization
reasons?

> There may be an argument to do this for individual
> tmpfs/hugetlbfs/sysv shm segments too, but mbind() already supports
> that (just map them from a different process and change the policy there)
>
> For process use you could just do it in mbind() or perhaps
> part of the process policy (move page around when touched by process).

Hum, how is that supposed to work? You want to modify the page fault handler?

2005-02-12 20:14:41

by Marcelo Tosatti

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Sat, Feb 12, 2005 at 01:54:26PM -0200, Marcelo Tosatti wrote:
> On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> > Ray Bryant <[email protected]> writes:
> > > set of pages associated with a particular process need to be moved.
> > > The kernel interface that we are proposing is the following:
> > >
> > > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > [Only commenting on the interface, haven't read your patches at all]
> >
> > This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> > argument. I assume that's for the benefit of your batch scheduler.
>
> As far as I understand mbind() is used to set policies to given memory
> regions, not move memory regions?
>
> > But it's not clear to me how and why the batch scheduler should know about
> > virtual addresses of different processes anyways. Walking
> > /proc/pid/maps? That's all inherently racy when the process is doing
> > mmap in parallel. The only way I can think of to do this would be to
> > check for changes in maps after a full move and loop, but then you risk
> > livelock.
>
> True.
>
> There is no problem, however, if all threads belonging to the process are stopped,
> as Ray mentions.
>
> So, there won't be memory mapping changes happening at the same time.
>
> Note that the memory migration code which sys_page_migrate() uses moves
> running processes to other memory zones, handling truncate, etc.
>
> > And you cannot also just specify va_start=0, va_end=~0UL because that
> > would make the node arrays grow infinitely.
> >
> > Also is there a good use case why the batch scheduler should only
> > move individual areas in a process around, not the full process?
>
> Quoting him:
>
> "In addition to its use by batch schedulers, we also envision that
> this facility could be used by a program to re-arrange the allocation
> of its own pages on various nodes of the NUMA system, most likely
> to optimize performance of the application during different phases
> of its computation."
>
> Seems doable.
>
> Are there any good examples of optimizations that could be made by
> moving pages around except for NUMA?

If you have virtually indexed caches, moving pages around can optimize
cache behaviour if the program's access pattern is well known? That is
not a common thing to do - and is architecture dependent anyway.

2005-02-12 20:52:25

by Paul Jackson

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Andi wrote:
> They're already exposed through mbind/set_mempolicy/get_mempolicy and sysfs
> of course.

And soon I hope through cpusets ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-12 21:04:41

by Dave Hansen

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Fri, 2005-02-11 at 19:26 -0800, Ray Bryant wrote:
> This patch introduces the sys_page_migrate() system call:
>
> sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> Its intent is to cause the pages in the range given that are found on
> old_nodes[i] to be moved to new_nodes[i]. Count is the number of
> entries in these two arrays of shorts.

Might it be useful to use nodemasks instead of those arrays? That's
already the interface that mbind() uses, and it probably pays to be
consistent with all of the NUMA syscalls.

There also probably needs to be a bit more coordination between the
other NUMA API and this one. I noticed that, for now, the migration
loop only makes a limited number of passes. It appears that either you
don't require that, once the syscall returns, *all* pages have been
migrated (there could have been allocations done behind the loop), or you
have some way of keeping the process from doing any more allocations.

There might also be some use to making sure that the NUMA binding API
and the migration code agree what is in the affected VMA. Otherwise,
there might be some interesting situations where kswapd is swapping
pages out behind a migration call, and the NUMA API is refilling those
pages with ones that the migration call doesn't agree with.

That's one reason I was looking at the loop to make sure it's only one
pass. I think doing passes until all pages are migrated gives you a
livelock, so the limited number obviously makes sense.

Will you need other APIs to tell how successful the migration request
was? Simply returning how many pages were migrated back from the
syscall doesn't really tell you anything concrete because there could be
kswapd activity or other migration calls that could be messing up the
work from the previous call. Are all of these VMAs meant to be
mlock()ed?

-- Dave

2005-02-12 21:29:23

by Andi Kleen

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Sat, Feb 12, 2005 at 01:54:26PM -0200, Marcelo Tosatti wrote:
> On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> > Ray Bryant <[email protected]> writes:
> > > set of pages associated with a particular process need to be moved.
> > > The kernel interface that we are proposing is the following:
> > >
> > > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > [Only commenting on the interface, haven't read your patches at all]
> >
> > This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> > argument. I assume that's for the benefit of your batch scheduler.
>
> As far as I understand mbind() is used to set policies to given memory
> regions, not move memory regions?

There is a MPOL_F_STRICT flag. Currently it fails when the memory
is not on the right node(s) and the flag is set, but it could as well move.

In fact Steve Longerbeam already did a patch to move in this case,
but it hasn't been merged yet for some reasons.


> > mmap in parallel. The only way I can think of to do this would be to
> > check for changes in maps after a full move and loop, but then you risk
> > livelock.
>
> True.
>
> There is no problem, however, if all threads belonging to the process are stopped,
> as Ray mentions.
>
> So, there won't be memory mapping changes happening at the same time.

Ok. But it's still quite ugly to read /proc/*/maps for this.

>
> > And you cannot also just specify va_start=0, va_end=~0UL because that
> > would make the node arrays grow infinitely.
> >
> > Also is there a good use case why the batch scheduler should only
> > move individual areas in a process around, not the full process?
>
> Quoting him:
>
> "In addition to its use by batch schedulers, we also envision that
> this facility could be used by a program to re-arrange the allocation
> of its own pages on various nodes of the NUMA system, most likely
> to optimize performance of the application during different phases
> of its computation."
>
> Seems doable.

That is what mbind() already supports, just someone needs to hook up
the page moving code with MPOL_F_STRICT.

> Are there any good examples of optimizations that could be made by
> moving pages around except for NUMA?

It's all fundamentally a NUMA thing.

There was some talk of defining fake nodes as fallback pools
to get low latency multimedia allocation; with that it may be useful
too at some point.

-Andi

2005-02-12 21:44:48

by Paul Jackson

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Dave wrote:
> Might it be useful to use nodemasks instead of those arrays?

I don't think he can. A nodemask represents an unordered set of nodes.
He needs (or wants) to pass a <nid, nid> map, mapping the node that each
page might be on, to the node to which it should migrate. A bitmask
doesn't contain enough information to specify that.

Perhaps instead he could pass two node arguments, old and new, with the
migration routines understanding that they were to migrate only pages
found on the old node, to the new node, ignoring other pages.
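
Something like (a sketch):

	page_migrate(pid, va_start, va_end, old_node, new_node);

called once per <old, new> pair.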

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-14 15:34:11

by Robin Holt

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Sat, Feb 12, 2005 at 01:04:22PM -0800, Dave Hansen wrote:
> On Fri, 2005-02-11 at 19:26 -0800, Ray Bryant wrote:
> > This patch introduces the sys_page_migrate() system call:
> >
> > sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> >
> > Its intent is to cause the pages in the range given that are found on
> > old_nodes[i] to be moved to new_nodes[i]. Count is the number of
> > entries in these two arrays of shorts.
>
> Might it be useful to use nodemasks instead of those arrays? That's
> already the interface that mbind() uses, and it probably pays to be
> consistent with all of the NUMA syscalls.

The node mask is a list of allowed nodes. This is intended to be as near
to a one-to-one migration path as possible.

> There also probably needs to be a bit more coordination between the
> other NUMA API and this one. I noticed that, for now, the migration
> loop only makes a limited number of passes. It appears that either you
> don't require that, once the syscall returns, *all* pages have been
> migrated (there could have been allocations done behind the loop), or you
> have some way of keeping the process from doing any more allocations.

It is intended that the process would be stopped during the migration
to simplify considerations such as overlapping destination node lists.

Robin

2005-02-14 15:34:02

by Robin Holt

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> Ray Bryant <[email protected]> writes:
> > set of pages associated with a particular process need to be moved.
> > The kernel interface that we are proposing is the following:
> >
> > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> [Only commenting on the interface, haven't read your patches at all]
>
> This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> argument. I assume that's for the benefit of your batch scheduler.
>
> But it's not clear to me how and why the batch scheduler should know about
> virtual addresses of different processes anyways. Walking
> /proc/pid/maps? That's all inherently racy when the process is doing
> mmap in parallel. The only way I can think of to do this would be to
> check for changes in maps after a full move and loop, but then you risk
> livelock.

For our use, the batch scheduler will give an intermediary program a
list of processes and a series of from-to node pairs. That process would
then ensure all the processes are stopped, scan their VMAs to determine
what regions are mapped by more than one process, which are mapped
by additional processes not in the job, and make this system call for
each of the unique ranges in the job to migrate their pages from one
node to the next. I believe Ray is working on a library and a standalone
program to do this from a command line.
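
To illustrate the scanning half of that leg work, here is a rough
user-space sketch, assuming the usual /proc/pid/maps line format (the
real library must also correlate mappings across all the processes in
the job):

#include <stdio.h>
#include <sys/types.h>

/* enumerate the vma ranges of one stopped process, so the caller
 * can issue one page_migrate() call per unique range */
static int for_each_vma(pid_t pid,
			void (*fn)(unsigned long start, unsigned long end))
{
	char path[64], line[256];
	unsigned long start, end;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			fn(start, end);	/* [start, end) is one vma */
	fclose(f);
	return 0;
}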

>
> And you cannot also just specify va_start=0, va_end=~0UL because that
> would make the node arrays grow infinitely.

Across the job, you could be moving some memory regions multiple times.

>
> Also is there a good use case why the batch scheduler should only
> move individual areas in a process around, not the full process?

Overlapping regions.

>
> I think the only sane way for an external process to move another
> around is to do it for the whole process. For that you wouldn't need
> most of the arguments, but just a simple move_process_vm call,
> or perhaps just a file in /proc where the new node can be written to.

But when you take into consideration multiple processes in a job that
all started from one set of mappings and have since tweaked their
mappings to suit their particular needs, there doesn't appear to
be any way to do it without some form of leg work as described
above.

>
> There may be an argument to do this for individual
> tmpfs/hugetlbfs/sysv shm segments too, but mbind() already supports
> that (just map them from a different process and change the policy there)
>
> For process use you could just do it in mbind() or perhaps
> part of the process policy (move page around when touched by process).

This functionality will be used by the batch scheduler to not only
move the processes' memory to a different set of nodes, but also reduce
the memory usage on the old set of nodes. For that reason, you cannot
rely on touch, as the process that _NEEDS_ the memory moved does not
have control of program flow to ensure that, after the SIGCONT is sent,
the process will touch all of its address space.

Thanks,
Robin Holt

2005-02-14 16:39:23

by Robin Holt

Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Sat, Feb 12, 2005 at 10:29:14PM +0100, Andi Kleen wrote:
> On Sat, Feb 12, 2005 at 01:54:26PM -0200, Marcelo Tosatti wrote:
> > On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> > > Ray Bryant <[email protected]> writes:
> > > > set of pages associated with a particular process need to be moved.
> > > > The kernel interface that we are proposing is the following:
> > > >
> > > > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> > >
> > > [Only commenting on the interface, haven't read your patches at all]
> > >
> > > This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> > > argument. I assume that's for the benefit of your batch scheduler.
> >
> > As far as I understand mbind() is used to set policies to given memory
> > regions, not move memory regions?
>
> There is a MPOL_F_STRICT flag. Currently it fails when the memory
> is not on the right node(s) and the flag is set, but it could as well move.
>
> In fact Steve Longerbeam already did a patch to move in this case,
> but it hasn't been merged yet for some reasons.
>
>
> > > mmap in parallel. The only way I can think of to do this would be to
> > > check for changes in maps after a full move and loop, but then you risk
> > > livelock.
> >
> > True.
> >
> > There is no problem, however, if all threads belonging to the process are stopped,
> > as Ray mentions.
> >
> > So, there won't be memory mapping changes happening at the same time.
>
> Ok. But it's still quite ugly to read /proc/*/maps for this.
>
> >
> > > And you cannot also just specify va_start=0, va_end=~0UL because that
> > > would make the node arrays grow infinitely.
> > >
> > > Also is there a good use case why the batch scheduler should only
> > > move individual areas in a process around, not the full process?
> >
> > Quoting him:
> >
> > "In addition to its use by batch schedulers, we also envision that
> > this facility could be used by a program to re-arrange the allocation
> > of its own pages on various nodes of the NUMA system, most likely
> > to optimize performance of the application during different phases
> > of its computation."
> >
> > Seems doable.
>
> That is what mbind() already supports, just someone needs to hook up
> the page moving code with MPOL_F_STRICT.

But how do you use mbind() to change the memory placement for an
anonymous private mapping used by a vendor-provided executable?

Robin

2005-02-14 18:51:27

by Dave Hansen

Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Mon, 2005-02-14 at 07:52 -0600, Robin Holt wrote:
> The node mask is a list of allowed nodes. This is intended to be as near
> to a one-to-one migration path as possible.

If that's the case, it would make the kernel internals a bit simpler to
only take a "from" and "to" node, instead of those maps. You'll end up
making multiple syscalls, but that shouldn't be a problem.

> > There also probably needs to be a bit more coordination between the
> > other NUMA API and this one. I noticed that, for now, the migration
> > loop only makes a limited number of passes. It appears that either you
> > don't require that, once the syscall returns, *all* pages have been
> > migrated (there could have been allocations done behind the loop), or you
> > have some way of keeping the process from doing any more allocations.
>
> It is intended that the process would be stopped during the migration
> to simplify considerations such as overlapping destination node lists.

Requiring that the process is stopped will somewhat limit the use of
this API outside of the HPC space where so much control can be had over
the processes. I have the feeling that very few other kinds of
applications will be willing to be stopped for the time that it takes
for a set of migrations to occur. But, if stopping the process is going
to be a requirement, having more syscalls that take less time each
should be desirable.

-- Dave

2005-02-14 19:15:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

> But how do you use mbind() to change the memory placement for an anonymous
> private mapping used by a vendor provided executable with mbind()?

For that you use set_mempolicy.

-Andi

2005-02-14 19:19:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

> For our use, the batch scheduler will give an intermediary program a
> list of processes and a series of from-to node pairs. That process would
> then ensure all the processes are stopped, scan their VMAs to determine
> what regions are mapped by more than one process, which are mapped
> by additional processes not in the job, and make this system call for
> each of the unique ranges in the job to migrate their pages from one
> node to the next. I believe Ray is working on a library and a standalone
> program to do this from a command line.

Sounds quite ugly.

Do you have evidence that this is a common use case? (jobs having stuff
mapped from programs not in the job). If not I think it's better
to go with a simple interface, not one that is unusable without
a complex user space library.

If you mean glibc etc. only, then the best solution for that would probably be
to use the (currently unmerged) arbitrary file mempolicy code for this and set
a suitable attribute that prevents moving.

-Andi

2005-02-14 22:02:26

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Mon, Feb 14, 2005 at 10:50:42AM -0800, Dave Hansen wrote:
> On Mon, 2005-02-14 at 07:52 -0600, Robin Holt wrote:
> > The node mask is a list of allowed nodes. This is intended to be as near
> > to a one-to-one migration path as possible.
>
> If that's the case, it would make the kernel internals a bit simpler to
> only take a "from" and "to" node, instead of those maps. You'll end up
> making multiple syscalls, but that shouldn't be a problem.

Then how do you handle overlapping nodes? If I am doing a 5->4, 4->3,
3->2, 2->1 shift in the memory placement and had only a from and to node,
I would end up calling multiple times. This would result in memory shifting
from 5->4 on the first, 4->3 on the second, ... with the end result of
all memory shifting to a single node.

With the array-of-node maps, you make a single pass across the address
space. This results in a clean mapping without the userspace needing to
know which nodes the pages are on.
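
A minimal sketch of the per-page lookup this describes (names are
illustrative, not from the patch set):

    /* Each page is inspected exactly once, so a 5->4, 4->3, 3->2, 2->1
     * shift cannot cascade the way repeated two-node calls would. */
    static int remap_node(int page_node, int count,
                          const int *old_nodes, const int *new_nodes)
    {
        int i;

        for (i = 0; i < count; i++)
            if (old_nodes[i] == page_node)
                return new_nodes[i];
        return page_node;    /* not in the map: leave the page alone */
    }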

On a separate topic, I would guess the syscall time is trivial compared
to the time to walk the page tables.

Thanks,
Robin

2005-02-14 22:23:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Mon, 2005-02-14 at 16:01 -0600, Robin Holt wrote:
> On Mon, Feb 14, 2005 at 10:50:42AM -0800, Dave Hansen wrote:
> > On Mon, 2005-02-14 at 07:52 -0600, Robin Holt wrote:
> > > The node mask is a list of allowed nodes. This is intended to be as near
> > > to a one-to-one migration path as possible.
> >
> > If that's the case, it would make the kernel internals a bit simpler to
> > only take a "from" and "to" node, instead of those maps. You'll end up
> > making multiple syscalls, but that shouldn't be a problem.
>
> Then how do you handle overlapping nodes? If I am doing a 5->4, 4->3,
> 3->2, 2->1 shift in the memory placement and had only a from and to node,
> I would end up calling multiple times. This would result in memory shifting
> from 5->4 on the first, 4->3 on the second, ... with the end result of
> all memory shifting to a single node.

Can you give an example of when you'd actually want to do this?

> On a separate topic, I would guess the syscall time is trivial compared
> to the time to walk the page tables.

I'd certainly agree.

-- Dave

2005-02-14 23:53:39

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Andi Kleen wrote:
>>But how do you use mbind() to change the memory placement for an anonymous
>>private mapping used by a vendor-provided executable?
>
>
> For that you use set_mempolicy.
>
> -Andi

Andi,

If all processes are guaranteed to use the NUMA API for memory placement,
then AFAIK one could, in principle, embed the migration of pages into
the NUMA API as you propose. The problem is that AFAIK most programs
that we run are not using the NUMA API. Instead, they are using first-touch
with the knowledge that such pages will be allocated on the node where they
are first referenced.

Since we have to build a migration facility that will migrate jobs that
use both the NUMA API and the first-touch approach, it seems to me the
only plausible solution is to move the pages via a migration facility
and then, if there are NUMA API control structures found associated with
the moved pages, to update them to represent the new reality. Whether
this happens as an automatic side effect of the migration call or it
happens by issuing a new set_mempolicy() is not clear to me. I would
prefer to just issue a new set_mempolicy(), but somehow the migration
code will have to figure out where this call needs to be executed (i.e.
which pages have an associated NUMA policy). [Thus the disclaimer in
the overview note that we haven't figured out all the interaction with
the memory policy stuff yet.]

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 00:30:47

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Andi Kleen wrote:
> Ray Bryant <[email protected]> writes:
>
>>set of pages associated with a particular process need to be moved.
>>The kernel interface that we are proposing is the following:
>>
>>page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
>
> [Only commenting on the interface, haven't read your patches at all]
>
> This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> argument. I assume that's for the benefit of your batch scheduler.
>
> But it's not clear to me how and why the batch scheduler should know about
> virtual addresses of different processes anyways. Walking
> /proc/pid/maps? That's all inherently racy when the process is doing
> mmap in parallel. The only way I can think of to do this would be to
> check for changes in maps after a full move and loop, but then you risk
> livelock.
>
> And you cannot also just specify va_start=0, va_end=~0UL because that
> would make the node arrays grow infinitely.
>
> Also is there a good use case why the batch scheduler should only
> move individual areas in a process around, not the full process?
>

The batch scheduler interface will be to move entire jobs (groups of
processes) around from one set of nodes to another. But that interface
doesn't work at the kernel level. The problem is that one just can't
ask the kernel to move the entire address space of a process for a number
of reasons:

(1) You really don't want to migrate the code pages of shared libraries
that are mapped into the process address space. This causes a
useless shuffling of pages which really doesn't help system
performance. On the other hand, if a shared library is some
private thing that is only used by the processes being migrated,
then you should move that.

(2) You really only want to migrate pages once. If a file is mapped
into several of the pid's that are being migrated, then you want
to figure this out and issue one call to have it moved wrt one of
the pid's.
(The page migration code from the memory hotplug patch will handle
updating the pte's of the other processes (thank goodness for
rmap...))

(3) In the case where a particular file is mapped into different
processes at different file offsets (and we are migrating both
of the processes), one has to examine the file offsets to figure
out if the mappings overlap or not. If they overlap, then you've
got to issue two calls, each of which describes a non-overlapping
region; both calls taken together would cover the entire range
of pages mapped to the file. Similarly if the ranges do not
overlap.

Figuring all of this out seems to me to be way too complicated to
want to stick into the kernel. Hence we proposed the kernel interface
as discussed in the overview note. This interface would be used by
a user space library, whose batch scheduler interface would look
something like this:

migrate_processes(pid_count, pid_list, node_count, old_nodes, new_nodes);

which is what you are asking for, I think. The library's job
(in addition to suspending all of the processes in the list for
the duration of the migration operation, plus doing some other things
that are specific to sn2 hardware) would be to examine the
/proc/pid/maps entries for each pid that we are
migrating, and figure out from that what portions of which pid's
address spaces need to be migrated so that we satisfy the constraints
given above. I admit that this may be viewed as ugly, but I really
can't figure out a better solution than this without shuffling a
ton of ugly code into the kernel.
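
A sketch of how that library entry point might be driven; the prototype
and the job layout below are illustrative assumptions, not a merged API:

    #include <sys/types.h>

    extern int migrate_processes(int pid_count, const pid_t *pid_list,
                                 int node_count, const int *old_nodes,
                                 const int *new_nodes);

    static int move_job(void)
    {
        pid_t pids[]      = { 4711, 4712 };    /* the job's processes */
        int   old_nodes[] = { 5, 6, 7, 8 };    /* where it runs now   */
        int   new_nodes[] = { 1, 2, 3, 4 };    /* where it should go  */

        return migrate_processes(2, pids, 4, old_nodes, new_nodes);
    }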

One issue that hasn't been addressed is the following: given a
particular entry in /proc/pid/maps, how does one figure out whether
that entry is mapped into some other process in the system, one
that is not in the set of processes to be migrated? One could
scan ALL of the /proc/pid/maps entries, I suppose, but that is
a pretty expensive task on a 512-processor NUMA box. The approach
I would like to follow would be to add a reference count to
/proc/pid/maps. The reference count would tell how many VMAs
point at this particular /proc/pid/maps entry. Using this, if
all of the processes in the set to be migrated account for all
of the references, then this map entry represents an address
range that should be migrated. If there are other references
then you shouldn't migrate the address range.
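
For reference, a minimal sketch of the /proc/pid/maps walk such a
library would start from (the reference count proposed above does not
exist; this shows only the scan itself):

    #include <stdio.h>
    #include <sys/types.h>

    /* Print one line per VMA: address range, permissions, backing file. */
    static void dump_maps(pid_t pid)
    {
        char path[64], line[512];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
        f = fopen(path, "r");
        if (!f)
            return;

        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;
            char perms[8], file[256] = "";

            /* format: start-end perms offset dev inode [path] */
            if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255s",
                       &start, &end, perms, file) >= 3)
                printf("%#lx-%#lx %s %s\n", start, end, perms, file);
        }
        fclose(f);
    }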

Note also that the data so reported represents a performance
optimization, not a correctness issue. If some of the /proc/pid/map
info changes after we have read it and made our decision as
to what address ranges in which PIDs to migrate, the result
may be suboptimal performance. But in most cases that we have
been able to think of where this could happen, it is not that
big of a deal. (The typical example is library private.so
is used by an instance of batch job J1. We decide to migrate
J1. We look at the /proc/pid/maps info and find out that
only processes in J1 reference private.so. So we decide to migrate
private.so. After we read the /proc/pid/maps info, job J2
starts up and it also uses private.so. Well, in this case
there is no good solution anyway, because private.so will
either be on J1's set of nodes or J2's, but not both. So
if we migrate private.so, it will slow down J1 for a bit
while we are migrating it, but in the end, it won't matter.)

> I think the only sane way for an external process to move another
> around is to do it for the whole process. For that you wouldn't need
> most of the arguments, but just a simple move_process_vm call,
> or perhaps just a file in /proc where the new node can be written to.
>

Not so for the reasons given above. You simply cannot move an
entire address space without deciding what to do with all of the
shared stuff. And the shared stuff comprises most of the entries
in /proc/pid/maps for most processes.

> There may be an argument to do this for individual
> tmpfs/hugetlbfs/sysv shm segments too, but mbind() already supports
> that (just map them from a different process and change the policy there)
>

Changing the policy will handle the placement of new pages. Unless
you scan the tmpfs/hugetlbfs/sysv shm segments and look for misplaced
pages, and then migrate them, you won't have moved pages off of the
old nodes, and won't have freed up the storage that this whole
migration thing is really about. We could certainly fix up the
mbind() code to do this, a la what Steve Longerbeam has done, but
that code needs to interface to the page migration code in that case
as well.

If we did this, we still have to have the page migration system call
to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
pages were placed by first touch and for which there used to not be
a memory policy. As discussed in a previous note, we are not in a
position to require that each memory object in the system has an
associated memory policy. AFAIK, very few of our programs are using
the NUMA API to do placement. Instead, I think that most programs
do memory placement by first touch, during initialization. This is,
in part, because most of our codes originate on non-NUMA systems,
and we've typically done just what is necessary to make them
NUMA aware. For this reason, I don't think an approach of embedding
the migration facility into the NUMA API is going to work.

> For process use you could just do it in mbind() or perhaps
> part of the process policy (move page around when touched by process).
>
> -Andi
>


--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 01:19:33

by Steve Longerbeam

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Andi Kleen wrote:

>>For our use, the batch scheduler will give an intermediary program a
>>list of processes and a series of from-to node pairs. That process would
>>then ensure all the processes are stopped, scan their VMAs to determine
>>what regions are mapped by more than one process, which are mapped
>>by additional processes not in the job, and make this system call for
>>each of the unique ranges in the job to migrate their pages from one
>>node to the next. I believe Ray is working on a library and a standalone
>>program to do this from a command line.
>>
>>
>
>Sounds quite ugly.
>
>Do you have evidence that this is a common use case? (jobs having stuff
>mapped from programs not in the job). If not I think it's better
>to go with a simple interface, not one that is unusable without
>a complex user space library.
>
>If you mean glibc etc. only, then the best solution for that would probably be
>to use the (currently unmerged) arbitrary file mempolicy code for this and set
> a suitable attribute that prevents moving.
>
>

Hi Andi, Ray, et al.,

Just want to let you know that I'm still planning to push
my patches to NUMA mempolicy for filemap support and
page migration. I've been swamped with another task at work,
but later this week I will post the latest patches for review.
I haven't been following Ray's manual page migration thread
but will get up-to-speed also, and see how it impacts my patchset
to mempolicy.

Steve

2005-02-15 03:17:35

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Ray wrote:
> [Thus the disclaimer in
> the overview note that we haven't figured out all the interaction with
> the memory policy stuff yet.]

Does the same disclaimer apply to cpusets?

Unless it causes some undue pain, I would think that page migration
should _not_ violate a task's cpuset. I guess this means that a typical
batch manager would move a task to its new cpuset on the new nodes, or
move the cpuset containing some tasks to their new nodes, before asking
the page migrator to drag along the currently allocated pages from the
old location.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 09:17:10

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Paul Jackson wrote:
> Ray wrote:
>
>>[Thus the disclaimer in
>>the overview note that we haven't figured out all the interaction with
>>the memory policy stuff yet.]
>
>
> Does the same disclaimer apply to cpusets?
>
> Unless it causes some undue pain, I would think that page migration
> should _not_ violate a task's cpuset. I guess this means that a typical
> batch manager would move a task to its new cpuset on the new nodes, or
> move the cpuset containing some tasks to their new nodes, before asking
> the page migrator to drag along the currently allocated pages from the
> old location.
>
No, I think we understand the interaction between manual page migration
and cpusets. We've tried to keep the discussion here disjoint from cpusets
for tactical reasons -- we didn't want to tie acceptance of the manual
page migration code to acceptance of cpusets.

The exact ordering of when a task is moved to a new cpuset and when the
migration occurs doesn't matter, AFAIK, if we accept the notion that
a migrated task is in suspended state until after everything associated
with it (including the new cpuset definition) is done.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 10:51:52

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Mon, Feb 14, 2005 at 02:22:54PM -0800, Dave Hansen wrote:
> On Mon, 2005-02-14 at 16:01 -0600, Robin Holt wrote:
> > On Mon, Feb 14, 2005 at 10:50:42AM -0800, Dave Hansen wrote:
> > > On Mon, 2005-02-14 at 07:52 -0600, Robin Holt wrote:
> > > > The node mask is a list of allowed nodes. This is intended to be as near
> > > > to a one-to-one migration path as possible.
> > >
> > > If that's the case, it would make the kernel internals a bit simpler to
> > > only take a "from" and "to" node, instead of those maps. You'll end up
> > > making multiple syscalls, but that shouldn't be a problem.
> >
> > Then how do you handle overlapping nodes? If I am doing a 5->4, 4->3,
> > 3->2, 2->1 shift in the memory placement and had only a from and to node,
> > I would end up calling multiple times. This would result in memory shifting
> > from 5->4 on the first, 4->3 on the second, ... with the end result of
> > all memory shifting to a single node.
>
> Can you give an example of when you'd actually want to do this?

Assume it is moving from 4,5,6,7,8,9 to 2,3,4,5,6,7 because it
wants to move the job off of nodes 8 and 9, which are topologically closer
to 10-15; the job that was running there did not care about node
distances as much, but nodes 2 and 3 were busy when the job was starting.
Batch schedulers will use machines in very interesting ways that you
would never have imagined. Give them the freedom to move a job around,
and you will get some really interesting new behavior.

Given that the first user of this may place it onto a 256 node system,
the chances that they use the same node in the source and destination node
array are very good. If I focus on the word "actually" from above, I
can not give you a precise example of when this was asked for by a
user because this is in the early design phase as opposed to the late
troubleshooting phase. Given the size of the machine we are dealing
with, it is certainly plausible that they will, at some time, ask to
migrate from and to an overlapping set of nodes. I see this as even more
likely given that the decision will be made by their batch scheduler.
This example may be a bit simplistic, but there are certainly many times
where a batch scheduler decides that because of topology, it wants to
move stuff around some.

What is the fundamental opposition to an array of from-to node mappings?
They are not that difficult to follow. They make the expensive traversal
of ptes a single-pass operation. The time to scan the list of from nodes
to locate the node this page belongs to is relatively quick when compared
to the time to scan ptes and will result in probably no cache thrashing
like the long traversal of all ptes in the system required for multiple
system calls. I can not see the node array as anything but the right way
when compared to multiple system calls. What am I missing?


Thanks,
Robin

2005-02-15 11:05:32

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
> which is what you are asking for, I think. The library's job
> (in addition to suspending all of the processes in the list for
> the duration of the migration operation, plus doing some other things
> that are specific to sn2 hardware) would be to examine the

You probably want the batch scheduler to do the suspend/resume as it
may be parking part of the job on nodes that have memory but running
processes of a different job while moving a job out of the way for a
big-mem app that wants to run on one of this job's nodes.

> do memory placement by first touch, during initialization. This is,
> in part, because most of our codes originate on non-NUMA systems,
> and we've typically done just what is necessary to make them

Software Vendors tend to be very reluctant to do things for a single
architecture unless there are clear wins.

Thanks,
Robin

2005-02-15 11:53:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

> (1) You really don't want to migrate the code pages of shared libraries
> that are mapped into the process address space. This causes a
> useless shuffling of pages which really doesn't help system
> performance. On the other hand, if a shared library is some
> private thing that is only used by the processes being migrated,
> then you should move that.

I think the better solution for this would be to finally integrate Steve L.'s
file attribute code (and find some solution to make it persistent,
e.g. using xattrs with a new inode flag) and then "lock" the shared
libraries to their policy using a new attribute flag.

>
> (2) You really only want to migrate pages once. If a file is mapped
> into several of the pid's that are being migrated, then you want
> to figure this out and issue one call to have it moved wrt one of
> the pid's.
> (The page migration code from the memory hotplug patch will handle
> updating the pte's of the other processes (thank goodness for
> rmap...))

I don't get this. Surely the migration code will check if a page
is already in the target node, and when that is the case do nothing.

How could this "double migration" happen?

>
> (3) In the case where a particular file is mapped into different
> processes at different file offsets (and we are migrating both
> of the processes), one has to examine the file offsets to figure
> out if the mappings overlap or not. If they overlap, then you've
> got to issue two calls, each of which describes a non-overlapping
> region; both calls taken together would cover the entire range
> of pages mapped to the file. Similarly if the ranges do not
> overlap.

That sounds like a quite obscure corner case which I'm not sure
is worth all the complexity.

-Andi

2005-02-15 12:16:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

[Sorry, didn't answer everything in your mail the first time.
See previous mail for the beginning.]

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
> migrating, and figure out from that what portions of which pid's
> address spaces need to be migrated so that we satisfy the constraints
> given above. I admit that this may be viewed as ugly, but I really
> can't figure out a better solution than this without shuffling a
> ton of ugly code into the kernel.

I like the concept of marking stuff that shouldn't be migrated
externally (using NUMA policy) better.

>
> One issue that hasn't been addressed is the following: given a
> particular entry in /proc/pid/maps, how does one figure out whether
> that entry is mapped into some other process in the system, one
> that is not in the set of processes to be migrated? One could

[...]

Marking things externally would take care of that.

> If we did this, we still have to have the page migration system call
> to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
> pages were placed by first touch and for which there used to not be
> a memory policy. As discussed in a previous note, we are not in a

You can handle those with mbind(..., MPOL_F_STRICT);
(once it is hooked up to page migration)

Just mmap the tmpfs/shm/hugetlb file in an external program and apply
the policy. That is what numactl supports today too for shm
files like this.

It should work later.
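
A sketch of the external-program trick described here, applied to a
tmpfs/shm file; whether misplaced pages actually move depends on the
page-migration hookup this thread is about, so treat that as an
assumption:

    #include <fcntl.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map the object from a different process and apply the policy there. */
    static int rebind_file(const char *path, int node)
    {
        struct stat st;
        unsigned long nodemask = 1UL << node;
        void *p;
        int fd, rc;

        fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);               /* the mapping keeps the file referenced */
        if (p == MAP_FAILED)
            return -1;

        rc = mbind(p, st.st_size, MPOL_BIND, &nodemask,
                   sizeof(nodemask) * 8, MPOL_MF_STRICT);
        munmap(p, st.st_size);
        return rc;
    }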


-Andi

2005-02-15 12:21:18

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

On Tue, Feb 15, 2005 at 12:53:03PM +0100, Andi Kleen wrote:
> > (2) You really only want to migrate pages once. If a file is mapped
> > into several of the pid's that are being migrated, then you want
> > to figure this out and issue one call to have it moved wrt one of
> > the pid's.
> > (The page migration code from the memory hotplug patch will handle
> > updating the pte's of the other processes (thank goodness for
> > rmap...))
>
> I don't get this. Surely the migration code will check if a page
> is already in the target node, and when that is the case do nothing.
>
> How could this "double migration" happen?

A node is not always equidistant from a cpu. We need to keep node-to-cpu
distances relatively constant between the original and final placement.
There may be a time where you are moving stuff from node 8 to node 4
and stuff from node 12 to node 8. If you scan the vmas for both the
processes in the wrong order you will migrate memory from node 12 to 8
for the second process and then from node 8 to node 4 for the second.

> > (3) In the case where a particular file is mapped into different
> > processes at different file offsets (and we are migrating both
> > of the processes), one has to examine the file offsets to figure
> > out if the mappings overlap or not. If they overlap, then you've
> > got to issue two calls, each of which describes a non-overlapping
> > region; both calls taken together would cover the entire range
> > of pages mapped to the file. Similarly if the ranges do not
> > overlap.
>
> That sounds like a quite obscure corner case which I'm not sure
> is worth all the complexity.

So obscure that nearly every example batch job we looked at had exactly
this circumstance. It turns out that quite a few of those batch jobs
have a parent that maps their working set initially. After the workers
are forked, they map some part of the same data file to different parts
of their own address space. They also commonly map over the top of the
large file mapping that was originally done, leaving us with a jumble of
address space. This really showed the need for a user-space application
to figure the problem out and allow the flexibility to come up with more
advanced migration algorithms.

Thanks,
Robin

2005-02-15 15:12:12

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Robin wrote:
> for the second process and then from node 8 to node 4 for the second.

"for the second ... for the second"

I couldn't make sense of this statement. Should one of those
seconds be a first; what word(s) are elided after the second
"second"?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 15:19:37

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Would it work to have the migration system call take exactly two node
numbers, the old and the new? Have it migrate all pages in the address
space specified that are on the old node to the new node. Leave any
other pages alone. For one thing, this avoids passing a long list of
nodes, for an N-way to N-way migration. And for another thing, it seems
to solve some of the double migration and such issues.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 15:23:32

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Ray wrote:
> The exact ordering of when a task is moved to a new cpuset and when the
> migration occurs doesn't matter, AFAIK, if we accept the notion that
> a migrated task is in suspended state until after everything associated
> with it (including the new cpuset definition) is done.

The existence of _some_ sequence of system calls such that user space
could, if it so chose, do the 'right' thing does not exonerate the
kernel from enforcing its rules on each call.

The kernel certainly does not have a crystal ball that lets it say "ok -
let this violation of my rules pass - I know that the caller will
straighten things out before anything untoward occurs (before
it removes the suspension, in this case)."

In other words, more directly, the kernel must return from each system
call with everything in order, all its rules enforced.

I still think that migration should honor cpusets, unless you can show
me a good reason why that's too cumbersome. At least a migration patch
for *-mm should honor cpusets. When the migration patch goes into
Linus's main tree, then it should honor cpusets there too, if cpusets
are already there. Or if migration goes into Linus's tree before
cpusets, the onus would be on cpusets to add the changes to the
migration code honoring cpusets, when and if cpusets followed along.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 15:41:14

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Robin wrote:
> Given that the first user of this may place it onto a 256 node system,
> the chances that they use the same node in the source and destination node
> array are very good.

Am I parsing this sentence correctly when I read it as stating that we
need to handle the case where the source and destination node sets
overlap (have non-empty intersection)?

> I can not see the node array as anything but the right way
> when compared to multiple system calls.

Variable length arrays across the system call boundary are a pain in the
butt. Especially ones that add what are essentially "new types", in this
case, an array of MAX_NUMNODES node numbers. Odds are well over 50% that
there will be a bug in this area, in our lifetime.

And simplicity is measured more, in my mind, by whether each specific
system call does the essential minimum of work, with clear pre and post
conditions, than by whether the caller is able to make the fewest number
of such calls. Such reduction to the smallest irreducible atoms of work
both ensures that the kernel is best able to maintain order, and that it
can be used in the most flexible, unforseeable patterns possible,
without further kernel changes.

Such a node array call may well make good sense as a library API.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 15:43:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Robin wrote:
> Requiring that the process is stopped will somewhat limit the use of
> this API outside of the HPC space where so much control can be had over
> the processes.

Good point. Hopefully we can find a way to design this system
call so that it does not require suspension. Some uses of it
may well choose to suspend, but that's a user space choice.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 15:49:38

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Robin wrote:
> Then how do you handle overlapping nodes? If I am doing a 5->4, 4->3,
> 3->2, 2->1 shift ...

Then do the shifts in the other order, first 2->1, then 3->2, ...

So now you ask, what if you are doing a rotation? Use a temporary
node: 2->tmp, 3->2, ..., N->(N-1), tmp->N.

So now you ask, what if you are doing a rotation involving _all_
nodes, and have nothing you can use as a temporary node?

Argh I say ... would anyone really do that? Or perhaps it makes
sense to have the system call take a virtual address range (and
hence a pid). In which case, you can do one page at a time, if
need be, and get any foolhardy migration possible.

Or perhaps some integration with Andi's mbind/mempolicy make sense.
I'm not quite following Andi's comments on this, so I can't say
one way or the other if this is good.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 16:23:14

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Tue, Feb 15, 2005 at 07:49:06AM -0800, Paul Jackson wrote:
> Robin wrote:
> > Then how do you handle overlapping nodes? If I am doing a 5->4, 4->3,
> > 3->2, 2->1 shift ...
>
> Then do the shifts in the other order, first 2->1, then 3->2, ...
>
> So now you ask, what if you are doing a rotation? Use a temporary
> node: 2->tmp, 3->2, ..., N->(N-1), tmp->N.

Consider the case where you are moving 248GB of data off of that node
onto a temporary. You have just made that a double copy to save the
difficulty of passing in an array. That seems like it is insane!

>
> So now you ask, what if you are doing a rotation involving _all_
> nodes, and have nothing you can use as a temporary node?

Not necessarily all nodes for the rotation, but you may have no free nodes
in the system aside from the ones you are working with. That will be the
typical case. The batch scheduler will have control of all the nodes
except the nodes that are dedicated to I/O. These will also likely
have less memory on them. The batch scheduler may have any number
of jobs running in small cpusets. At the time of the migration, the
system may only have the nodes from the old and new jobs to work with.
Then you are stuck with a need for the arrays.

>
> Argh I say ... would anyone really do that? Or perhaps it makes
> sense to have the system call take a virtual address range (and
> hence a pid). In which case, you can do one page at a time, if
> need be, and get any foolhardy migration possible.
>
> Or perhaps some integration with Andi's mbind/mempolicy make sense.
> I'm not quite following Andi's comments on this, so I can't say
> one way or the other if this is good.

I think this is more closely related to cpusets, but that was not in when
Ray started working on his stuff. The mem policy stuff does not handle
the immediate need to migrate (at least not that I see) and it does not
preserve node locality for already touched pages. Assume we have a job
which has 16 processes which are doing work on 16 blocks of memory.
The code is designed to first touch the pages it will work with on
startup, rendezvous with the other processes, and then start working.
During its run, it needs access to its block 97% of the time and needs
to read from the other blocks 3% of the time.

With a mem policy, after the "migration" it is a race to see which process
touches a page first, and that determines which node the memory is migrated
to. We need to have a way to migrate the memory which preserves the
placement information the process has already given us.
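
A minimal sketch of the first-touch pattern described above; the rank
and block-size handling are illustrative assumptions:

    #include <string.h>

    /* Each worker zeroes (touches) its own block at startup, so those
     * pages fault in on the node where the worker is running. */
    static void worker_init(char *base, unsigned long block_sz, int my_rank)
    {
        memset(base + (unsigned long)my_rank * block_sz, 0, block_sz);
        /* ... rendezvous with the other workers, then start computing;
         * ~97% of accesses stay in this block, ~3% read the others ... */
    }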

Thanks,
Robin

2005-02-15 16:36:28

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Robin wrote:
> That seems like it is insane!

Thank-you, thank-you. <blush>

What about the suggestion I had that you sort of skipped over, which
amounted to changing the system call from a node array to just one
node:

sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

to:

sys_page_migrate(pid, va_start, va_end, old_node, new_node);

Doesn't that let you do all you need to? Is it insane too?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 17:46:49

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Robin Holt wrote:
> On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
>
>>which is what you are asking for, I think. The library's job
>>(in addition to suspending all of the processes in the list for
>>the duration of the migration operation, plus doing some other things
>>that are specific to sn2 hardware) would be to examine the
>
>
> You probably want the batch scheduler to do the suspend/resume as it
> may be parking part of the job on nodes that have memory but running
> processes of a different job while moving a job out of the way for a
> big-mem app that wants to run on one of this job's nodes.
>

That works as well, and if we keep the majority of the work of
deciding who to migrate where, and what to do when, in a user space
library rather than in the kernel, then we have a lot more flexibility
in, for example, who suspends/resumes the jobs to be migrated.

>
>>do memory placement by first touch, during initialization. This is,
>>in part, because most of our codes originate on non-NUMA systems,
>>and we've typically done just what is necessary to make them
>
>
> Software Vendors tend to be very reluctant to do things for a single
> architecture unless there are clear wins.
>
> Thanks,
> Robin
>


--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 18:18:31

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

Andi Kleen wrote:
>>(1) You really don't want to migrate the code pages of shared libraries
>> that are mapped into the process address space. This causes a
>> useless shuffling of pages which really doesn't help system
>> performance. On the other hand, if a shared library is some
>> private thing that is only used by the processes being migrated,
>> then you should move that.
>
>
> I think the better solution for this would be to finally integrate Steve L.'s
> file attribute code (and find some solution to make it persistent,
> e.g. using xattrs with a new inode flag) and then "lock" the shared
> libraries to their policy using a new attribute flag.
>

I really don't see how that is relevant to the current discussion, which,
AFAIK, is whether the kernel interface should be "migrate an entire process"
versus what I have proposed. What we are trying to avoid here for shared
libraries is two things: (1) don't migrate them needlessly, and (2) don't
even make the migration request if we know that the pages shouldn't be
migrated.

Using Steve Longerbeam's approach avoids (1). But you will still scan the
pte's of the processes to be migrated (if you go with a "migrate the
entire process" approach) and try to migrate them, only to find out that
they are pinned in place. How is that a good thing?

A much simpler way to do this would be to add a list of libraries that
you don't want to be migrated to the migration library that I have
proposed to be the interface between the batch scheduler and the kernel.
Then when the library scans the /proc/pid/maps stuff, it can exclude
those libraries from migration. Furthermore, no migration requests will
even be initiated for those parts of the address space.

Of course, this means maintaining a library list in the migration
library. We may eventually decide to do that. For now, we're following
up on the reference count approach I outlined before.

>
>>(2) You really only want to migrate pages once. If a file is mapped
>> into several of the pid's that are being migrated, then you want
>> to figure this out and issue one call to have it moved wrt one of
>> the pid's.
>> (The page migration code from the memory hotplug patch will handle
>> updating the pte's of the other processes (thank goodness for
>> rmap...))
>
>
> I don't get this. Surely the migration code will check if a page
> is already in the target node, and when that is the case do nothing.
>
> How could this "double migration" happen?

Not so much a double migration, but a double request for migration.
(This is not a correctness, but a performance issue, once again.)
Consider the case of a 300 GB file mapped into 256 pid's. One doesn't
want each pid to try to migrate the file pages. Granted, all after the
first one will find the data already migrated, but if you issue a
migration request for each address space, the others won't know that
the page has been migrated until they have found the page and looked
up its current node. That means doing a find_get_page() for each page
in the mapped file in all 256 address spaces, and 255 of those address
spaces will find the page has already been migrated. How is that
useful? I'd much rather migrate it once from the perspective of
a single address space, and then skip the scanning for pages to
migrate in all of the other address spaces.

>
>
>>(3) In the case where a particular file is mapped into different
>> processes at different file offsets (and we are migrating both
>> of the processes), one has to examine the file offsets to figure
>> out if the mappings overlap or not. If they overlap, then you've
>> got to issue two calls, each of which describes a non-overlapping
>> region; both calls taken together would cover the entire range
>> of pages mapped to the file. Similarly if the ranges do not
>> overlap.
>
>
> That sounds like a quite obscure corner case which I'm not sure
> is worth all the complexity.
>
> -Andi
>
>

So what is your solution when this happens? Make the job non-migratable?
Yes, it may be an obscure case in your view but we've got to handle all of
those cases to make a robust facility that can be used in a production
environment.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 18:26:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

> I really don't see how that is relevant to the current discussion, which,
> AFAIK, is whether the kernel interface should be "migrate an entire process"
> versus what I have proposed. What we are trying to avoid here for shared
> libraries is two things: (1) don't migrate them needlessly, and (2) don't
> even make the migration request if we know that the pages shouldn't be
> migrated.
>
> Using Steve Longerbeam's approach avoids (1). But you will still scan the
> pte's of the processes to be migrated (if you go with a "migrate the
> entire process" approach) and try to migrate them, only to find out that
> they are pinned in place. How is that a good thing?

You don't scan any PTEs, just the mempolicy tree. That is extremely
cheap.

> >> (The page migration code from the memory hotplug patch will handle
> >> updating the pte's of the other processes (thank goodness for
> >> rmap...))
> >
> >
> >I don't get this. Surely the migration code will check if a page
> >is already in the target node, and when that is the case do nothing.
> >
> >How could this "double migration" happen?
>
> Not so much a double migration, but a double request for migration.
> (This is not a correctness, but a performance issue, once again.)
> Consider the case of a 300 GB file mapped into 256 pid's. One doesn't
> want each pid to try to migrate the file pages. Granted, all after the

Again file policy nicely takes care of this.

> first one will find the data already migrated, but if you issue a
> migration request for each address space, the others won't know that
> the page has been migrated until they have found the page and looked
> up its current node. That means doing a find_get_page() for each page
> in the mapped file in all 256 address spaces, and 255 of those address

You just look at the mempolicy extent tree linked from the
address space.

> >
> >>(3) In the case where a particular file is mapped into different
> >> processes at different file offsets (and we are migrating both
> >> of the processes), one has to examine the file offsets to figure
> >> out if the mappings overlap or not. If they overlap, then you've
> >> got to issue two calls, each of which describes a non-overlapping
> >> region; both calls taken together would cover the entire range
> >> of pages mapped to the file. Similarly if the ranges do not
> >> overlap.
> >
> >
> >That sounds like a quite obscure corner case which I'm not sure
> >is worth all the complexity.
> >
> >-Andi
> >
> >
>
> So what is your solution when this happens? Make the job non-migratable?
> Yes, it may be an obscure case in your view but we've got to handle all of
> those cases to make a robust facility that can be used in a production
> environment.

With per file policies you really don't care if there are overlaps or
not. You then care about offsets inside the object, not addresses
in some process virtual memory image. You just set the policy to migrate or
not migrate on the file and set a "lock bit" (that would
need to be added), and then no one else will touch the policy.

-Andi

2005-02-15 18:39:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
> What is the fundamental opposition to an array of from-to node mappings?
> They are not that difficult to follow. They make the expensive traversal
> of ptes a single-pass operation. The time to scan the list of from nodes
> to locate the node this page belongs to is relatively quick when compared
> to the time to scan ptes and will result in probably no cache thrashing
> like the long traversal of all ptes in the system required for multiple
> system calls. I can not see the node array as anything but the right way
> when compared to multiple system calls. What am I missing?

I don't really have any fundamental opposition. I'm just trying to make
sure that there's not a simpler (better) way of doing it. You've
obviously thought about it a lot more than I have, and I'm trying to
understand your process.

As for the execution speed with a simpler system call: yes, it will
likely be slower. However, I'm not sure that the increase in scan time
is all that significant compared to the migration code (it's pretty
slow).

-- Dave

2005-02-15 18:40:22

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
> [Sorry, didn't answer everything in your mail the first time.
> See previous mail for the beginning.]
>
> On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
>
>>migrating, and figure out from that what portions of which pid's
>>address spaces need to be migrated so that we satisfy the constraints
>>given above. I admit that this may be viewed as ugly, but I really
>>can't figure out a better solution than this without shuffling a
>>ton of ugly code into the kernel.
>
>
> I like the concept of marking stuff that shouldn't be migrated
> externally (using NUMA policy) better.
>

I really don't have an objection to that for the case of the shared
libraries in, for example, /lib and /usr/lib. I just worry about making
sure that all of the libraries have been so marked. I can do this
in a much simpler way by just adding a list of "do not migrate stuff"
to the migration library rather than requiring Steve Longerbeam's
API.

>
>>One issue that hasn't been addressed is the following: given a
>>particular entry in /proc/pid/maps, how does one figure out whether
>>that entry is mapped into some other process in the system, one
>>that is not in the set of processes to be migrated? One could
>
>
> [...]
>
> Marking things externally would take care of that.
>

So the default would be that if the file is not marked as "not-migratable",
then the file would be migratable, is that the idea?

>
>>If we did this, we still have to have the page migration system call
>>to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
>>pages were placed by first touch and for which there used to not be
>>a memory policy. As discussed in a previous note, we are not in a
>
>
> You can handle those with mbind(..., MPOL_F_STRICT);
> (once it is hooked up to page migration)

Making memory migration a subset of the NUMA API is not a general
solution. It only works for programs that are using memory policy
to control placement. As I've tried to point out multiple times
before, most programs that I am aware of use placement based on
first-touch. When we migrate such programs, we have to respect
the placement decisions that the program has implicitly made in
this way.

Requiring memory migration to be a subset of the NUMA API is a
non-starter for this reason. We have to follow the approach
of doing the correct migration, followed by fixing up the NUMA
policy to match the new reality. (Perhaps we can do this as
part of memory migration.)

Until ALL programs use the NUMA mempolicy for placement
decisions, we cannot support page migration solely under the NUMA
API.

I don't understand why this is not clear to you. Are you
assuming that you can manufacture a NUMA policy for the new
location of the job that correctly represents the placement
information and topology of the job on the old set of nodes?

>
> Just mmap the tmpfs/shm/hugetlb file in an external program and apply
> the policy. That is what numactl supports today too for shm
> files like this.
>
> It should work later.
>

Wait. As near as I can tell you

>
> -Andi
>


--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 18:54:52

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Dave Hansen wrote:
> On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
>
>>What is the fundamental opposition to an array of from-to node mappings?
>>They are not that difficult to follow. They make the expensive traversal
>>of ptes a single-pass operation. The time to scan the list of from nodes
>>to locate the node this page belongs to is relatively quick when compared
>>to the time to scan ptes and will result in probably no cache thrashing
>>like the long traversal of all ptes in the system required for multiple
>>system calls. I can not see the node array as anything but the right way
>>when compared to multiple system calls. What am I missing?
>
>
> I don't really have any fundamental opposition. I'm just trying to make
> sure that there's not a simpler (better) way of doing it. You've
> obviously thought about it a lot more than I have, and I'm trying to
> understand your process.
>
> As for the execution speed with a simpler system call: yes, it will
> likely be slower. However, I'm not sure that the increase in scan time
> is all that significant compared to the migration code (it's pretty
> slow).
>
> -- Dave
>
>
I'm worried about doing all of those find_get_page() things over and over
when the mapped file we are migrating is large. I suppose one can argue
that that is never going to be the case (e.g., no one in their right mind
would migrate a job with a 300 GB mapped file). So we are back to the
overlapping set of nodes issue. Let me look into this some more.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-15 19:00:14

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> What about the suggestion I had that you sort of skipped over, which
> amounted to changing the system call from a node array to just one
> node:
>
> sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> to:
>
> sys_page_migrate(pid, va_start, va_end, old_node, new_node);
>
> Doesn't that let you do all you need to? Is it insane too?

Migration could be done in most cases and would only fall apart when
there are overlapping node lists and either no nodes are available as
temp space or we are moving large chunks of data.

What is the fundamental concern with passing in an array of integers?
That seems like a fairly easy-to-verify item with very little chance
of breaking. I don't feel the concern that others seem to.

I do see the benefit to those arrays as being a single pass through the
page tables, the ability to migrate without using a temporary node, and
reducing the number of times data is copied when there are overlapping
nodes. To me, those seem to be very compelling reasons when compared
to the potential for a possible problem with an array of integers.

Thanks,
Robin

2005-02-15 20:55:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

In the interest of the size of everyone's inboxes, I mentioned to Ray
that we might move this discussion to a smaller forum while we resolve
some of the outstanding issues. Ray's going to post a followup to
linux-mm, and trim the cc list down. So, if you're still interested,
keep your eyes on linux-mm and we'll continue there.

-- Dave

2005-02-15 21:48:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

> Making memory migration a subset of the NUMA API is not a general
> solution. It only works for programs that are using memory policy
> to control placement. As I've tried to point out multiple times
> before, most programs that I am aware of use placement based on
> first-touch. When we migrate such programs, we have to respect
> the placement decisions that the program has implicitly made in
> this way.

Sorry, but the only real difference between your API and mbind is that
yours has a pid argument.

I think we are talking by each other, here's a more structured
overview of my thinking on the issue.

Main cases:

- Program is NUMA API aware. Fine. It takes care of its own.
- Program is not aware, but is started with a process policy from
numactl/cpusets/batch scheduler. Already covered too in NUMA API.
- Program is not aware and hasn't been started with a policy
(or has and you change your mind) but you want to change it later.
That's the new case we discuss here.

Now, how to change the policy of objects in an already running process.

First, there are some special cases that are already handled or have
existing patches:
- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
the object into a different process and changing the policy there.
- shared libraries and mmaped files in general: this is a generalization of
the previous point. SteveL's patch is the beginning of handling this, although
it needs some more work (xattrs) to make the policy persistent over
memory pressure.

The only case left uncovered is anonymous memory.

You said it would need user space control, but the main reason for
wanting that seems to be to handle the non-anonymous cases which
are already covered above.

My thinking is that the simplest way to handle that is to have a call that
just migrates everything. The main reason for that is that I don't think
external processes should mess with the virtual addresses of another process.
It just feels unclean and has many drawbacks (parsing /proc/*/maps
needs complicated user code, racy, locking difficult).

In kernel space handling full VMs is much easier and safer due to better
locking facilities.

In user space only the process itself really can handle its own virtual
addresses well, and if it does that it can use NUMA API directly anyways.

You argued that it may be costly to walk everything, but I don't see this
as a big problem - first walking mempolicies is not very costly and then
fork() and exit() do exactly this already.

The main missing piece for this would be a way to make policies for
files persistent. One way would be to use xattrs like selinux, but
that may be costly (not sure we want to read xattrs all the time
when reading a file).

A hackish way to do this that already
works would be to do a mlock on one page of the file to keep
the inode pinned. E.g. the batch manager could do this. That's
not very clean, but would probably work.
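
A minimal user-space sketch of that hack, assuming the batch manager
runs with enough privilege to call mlock() (function name invented for
illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map one page of the file and mlock it so the inode (and any
     * policy hung off its address space) stays pinned. The caller
     * keeps fd and the locked mapping open for as long as the
     * policy must persist. */
    int pin_file_inode(const char *path)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open(path, O_RDONLY);
        void *p;

        if (fd < 0)
            return -1;
        p = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED || mlock(p, pagesize) != 0) {
            perror("pin_file_inode");
            close(fd);
            return -1;
        }
        /* ... now set the file's policy; keep fd and mapping open ... */
        return fd;
    }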

-Andi

2005-02-15 22:13:09

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Dr Peter Chubb writes:
> Can page migration be done lazily, instead of all at once?

That might be a useful option. Not my area to comment on.

We would also require, at least as an option, to be able to force the
migration on demand. Some of our big honkin iron parallel jobs run with
a high degree of parallelism, and nearly saturate each node being used.
For jobs like that, it can be better to get everything in place, before
resuming execution.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 22:56:52

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

On Wed, Feb 16, 2005 at 08:58:19AM +1100, Peter Chubb wrote:
> >>>>> "Robin" == Robin Holt <[email protected]> writes:
>
> Robin> On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> >> What about the suggestion I had that you sort of skipped over,
> >> which amounted to changing the system call from a node array to
> >> just one node:
> >>
> >> sys_page_migrate(pid, va_start, va_end, count, old_nodes,
> >> new_nodes);
> >>
> >> to:
> >>
> >> sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> >>
> >> Doesn't that let you do all you need to? Is it insane too?
>
> Robin> Migration could be done in most cases; it would only fall apart
> Robin> when the node lists overlap, no nodes are available as temporary
> Robin> space, and we are not moving large chunks of data.
>
> A possibly stupid suggestion:
>
> Can page migration be done lazily, instead of all at once? Move the
> process, mark its pages as candidates for migration, and when
> the page faults, decide whether to copy across or not...
>
> That way you only copy the pages the process is using, and only copy
> each page once. It makes copy for replication easier in some future
> incarnation, too, because the same basic infrastructure can be used.

I would agree that lazy might be possible, but then we need to keep track
of the desired destination and cannot rely upon first touch, as that
would likely result in scrambling the memory of the application.

I have been very lax in describing how a typical MPI application works.
This method has been in place for years and is commonly accepted practice.

In the MPI model, a set of large mappings is done by the first process.
It then forks some number of worker threads which touch their chunk of
memory and rendezvous with the other workers. Once all workers have
rendezvoused, they are allowed to start their processing. A typical
worker thread will reference its own memory set 85-97% of the time and
reference other memory sets in a read-only fashion the other part
of the time.

It is important to performance that a worker thread's memory remain
as close to its cpu as possible. Any time the memory is on a different
node, the performance of that thread degrades (its memory is further
away), the performance of the thread local to that node is hindered
(its memory controller is busier), and the read-only references of the
neighbors of both of the aforementioned worker threads are hindered
because there is more NUMA activity. On top of that, there is a common
concept in MPI called a barrier: when worker threads complete a work
set, they awaken threads waiting at the barrier associated with that
work set. As a result, by slowing down a single thread you can have a
cascade effect which slows down the entire application significantly
as barriers are missed.

Because of all this, memory placement needs to be thought of as relative
to the worker threads, and it needs to stay consistent before and after
the migration.

Another issue with making it a lazy migrate is that the real impetus
for this facility is to free up memory on a node: a job is stopped on
one node and migrated to a different node, thereby freeing the original
node for a second job that would not otherwise fit. If the original job
kept a section of the machine, the second job would perform too poorly.

Sorry for the long rambling explanation. I guess I will try to
break this into smaller chunks on the upcoming discussion on the
linux-mm list.

Thanks,
Robin

2005-02-15 23:09:23

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

Good explanation, Robin. Thanks.

See y'all on linux-mm.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-15 22:43:41

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Thanks Andi for your effort to present your case more completely.
I agree that there is some 'talking past each other' going on.

Dave Hansen has publically (and Ray privately) sought to
move this discussion to linux-mm (or more specifically,
off lkml for now).

Any chance, Andi, that you could repost this, in response
to Ray's restarting this thread on linux-mm, once he gets
around to that?

I will reserve my response until I see if that works out.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-16 04:11:54

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
>>Making memory migration a subset of page migration is not a general
>>solution. It only works for programs that are using memory policy
>>to control placement. As I've tried to point out multiple times
>>before, most programs that I am aware of use placement based on
>>first-touch. When we migrate such programs, we have to respect
>>the placement decisions that the program has implicitly made in
>>this way.
>
>
> Sorry, but the only real difference between your API and mbind is that
> yours has a pid argument.
>

That may be true, but the internals of the implementations have got
to be pretty different, as near as I can tell. So just because the
APIs are nearly the same doesn't imply that the internals are at
all the same. And I'm convinced that using node masks is an
insufficiently general approach to specifying page migration.
But let's save that discussion for a later note, ok?

> I think we are talking past each other; here's a more structured
> overview of my thinking on the issue.
>

I'm sure that is what is going on, and we have little choice other
than to keep our good humor about this and keep trying until we see
our way clear to a common understanding. :-)

> Main cases:
>
> - Program is NUMA API aware. Fine. It takes care of its own.

Yes, we could migrate this program using a migration facility
embedded in the NUMA API.

> - Program is not aware, but is started with a process policy from
> numactl/cpusets/batch scheduler. Already covered too in NUMA API.

Hmmm.... What about the case where no NUMA API is used, cpusets
are used as containers, and page placement is done by first touch?
Then there is no NUMA API usage whatsoever. I think this is the category
where most of the programs on a large Altix system would fall.
(See more on this below....)

> - Program is not aware and hasn't been started with a policy
> (or has and you change your mind) but you want to change it later.

I'm having a little trouble parsing the "it" in that sentence.
Does that sentence mean "you want to change the NUMA API later"?
What if there never is a NUMA API structure associated with
the program other than the default (local) policy?

The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.
Eventually more programs will use the NUMA API, but I don't think
they do at the present time.

Let me expand on that a bit. What most programs do on Altix is
to do first-touch to get data allocated locally. That is, let's
say you have a big array that your parallel computation is going to
work on. The programmer would sit down and say, I want processor 1
to work on this part of the array, processor 2 on that part, etc.
Then the programmer writes code that causes each processor to touch
the portions of the data array that should be allocated locally on
that processor. Bingo, storage is now allocated the way the user
wants it, and no NUMA API call was ever issued.
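
To make the idiom concrete, here is a minimal sketch of such a program
(names and sizes invented; it assumes the scheduler pins each worker to
a cpu on a distinct node, as is typical on Altix):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NWORKERS 4
    #define CHUNK (1UL << 20)   /* doubles per worker; sized arbitrarily */

    static double *array;

    /* First-touch placement: each worker zeroes (touches) its own
     * slice, so the kernel allocates those pages on the node where
     * that worker runs. No NUMA API call appears anywhere. */
    static void *first_touch(void *arg)
    {
        long id = (long)arg;

        memset(array + id * CHUNK, 0, CHUNK * sizeof(double));
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        long i;

        array = malloc(NWORKERS * CHUNK * sizeof(double));
        if (!array)
            return 1;
        for (i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, first_touch, (void *)i);
        for (i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        /* rendezvous here; each worker then computes on its slice */
        return 0;
    }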

Yes, it is clumsy, but that is because these programs were written
before your NUMA API came into being. Now we simply can't go back
to these people (many of them ISVs) and say "Please rewrite your
code to use the NUMA API." So we are left with a pile of legacy
programs that we have to be able to migrate and that don't have any
NUMA API data structures associated with them. What are we
supposed to do in this case?

We can't necessarily construct a NUMA API that will cause storage
to be allocated as the programmer intended, because we can't fathom
what the programmer was trying to accomplish based on the state
of the program when we go to migrate it. So how would we use
a migration facility embedded into the NUMA API to migrate this
program and maintain its old topology?

That's the fundamental question here. Can you address that
question specifically for me, please?

> That's the new case we discuss here.
>
> Now how to change policy of objects in an already running process.
>

If the running process has a non-trivial mempolicy defined for
all of its address space, then I think I understand this. This
is not where our disconnect lies. The disconnect is in the above, I
think.

> >First, there are already some special cases handled, or handled by
> >existing patches:
> >- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
> >the object into a different process and changing the policy there.
> >- shared libraries and mmaped files in general: this is a generalization
> >of the previous point. SteveL's patch is the beginning of handling this,
> >although
>
> >The only case left uncovered is anonymous memory.
>
> You said it would need user space control, but the main reason for
> wanting that seems to be to handle the non anonymous cases which
> are already covered above.

Yes, so long as the rest of the cases were handled in user space, then
the anonymous memory case has to be handled there as well.

>
> My thinking is the simplest way to handle that is to have a call that just o
> migrates everything. The main reasons for that is that I don't think external
> processes should mess with virtual addresses of another process.
> It just feels unclean and has many drawbacks (parsing /proc/*/maps
> needs complicated user code, racy, locking difficult).
>

Yes, but remember, we are coming from an assumption that migrated processes
are suspended. This may be myopic, but we CAN make this work with the
constraints we have in place. Now if you are arguing for a more general
migration facility that doesn't require the processes to be blocked, well
then I agree with you. The /proc/*/maps approach doesn't work.

So let's go with the idea of dropping the va_start and va_end arguments from
the system call I proposed. Then we get into the kernel and start
scanning the pte's and the page cache for anonymous memory and mapped files,
respectively. For each VMA we have to make a migrate/don't migrate decision.
We also have to accept that the sets of originating and destination nodes
have to be distinct. Otherwise, there is no good way to tell whether or not
a particular page has been migrated. So we have to make that restriction.

Without xattrs, how do we make the migrate/non-migrate decision? Where
do we put the data? Well, we can have some file in the file system that
has file names in it and read that file into the kernel and convert each
file to a device and inode pair. We can then match that against each of
the VMAs and choose not to migrate any VMA that maps a file on the list.
For each anonymous VMA we just migrate the pages.

Sounds like it is doable, but I have this requirement from my IRIX
buddies that I support overlapping sets of nodes in the to and from
node sets. I guess we have to go back and examine that in more detail.

> In kernel space handling full VMs is much easier and safer due to better
> locking facilities.
>
> In user space only the process itself really can handle its own virtual
> addresses well, and if it does that it can use NUMA API directly anyways.
>
> You argued that it may be costly to walk everything, but I don't see this
> as a big problem - first walking mempolicies is not very costly and then
> fork() and exit() do exactly this already.

I'm willing to accept that walking the page table (via follow_page()) or
the file (via find_get_page()) is not that expensive, at least until it
is shown otherwise. We do tend to have big address spaces and lots of
processors associated with them, but I'm willing to accept that we won't
migrate a huge process around very often. (Or at least not often enough
for it to be interesting.) However, if this does turn out to be a performance
problem for us, we will have to come back and re-examine this stuff.

>
> The main missing piece for this would be a way to make policies for
> files persistent. One way would be to use xattrs like selinux, but
> that may be costly (not sure we want to read xattrs all the time
> when reading a file).
>

I'm not sure I want to tie implementation of the page migration
API to getting xattrs into all of the file systems in Linux
(although I suppose we could live with it if we got them into XFS).
Is this really the way to go here? It seems like this would
decrease the likelihood of getting the page migration code
accepted by a significant amount. It introduces a new set of
people (the file system maintainers) whom I have to convince to
make changes. I just don't see that as being a fruitful exercise.

Instead I would propose a magic file to be read at boot time as discussed
above -- that file would contain the names of all files not to be
migrated. The kicker comes here: what do we do if that set
needs to be changed during the course of a single boot? (i.e. someone
adds a new shared library, for example.) I suppose we could have a
sysctl() that would cause that file to be re-read. This would be
a short term solution until xattrs are accepted and/or until Steve
Longerbeam's patch is accepted. Would that be an acceptable short
term kludge?

> A hackish way to do this that already
> works would be to do a mlock on one page of the file to keep
> the inode pinned. E.g. the batch manager could do this. That's
> not very clean, but would probably work.
>
> -Andi
>


--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-17 23:55:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

[Sorry for the late answer.]

On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
> >
> >
> >Sorry, but the only real difference between your API and mbind is that
> >yours has a pid argument.
> >
>
> That may be true, but the internals of the implementations have got
> to be pretty different, as near as I can tell. So just because the

Not necessarily. E.g. Steve's file attribute patch actually
implemented very simple page migration into NUMA API
because he needed it to solve some problems with allocation.
It was even exposed as a new mbind() flag.

> >Main cases:
> >
> >- Program is NUMA API aware. Fine. It takes care of its own.
>
> Yes, we could migrate this program using a migration facility
> embedded in the NUMA API.
>
> >- Program is not aware, but is started with a process policy from
> >numactl/cpusets/batch scheduler. Already covered too in NUMA API.
>
> Hmmm.... What about the case where no NUMA API is used, cpusets

First the NUMA API internally doesn't care that much about this
case. It just considers no policy as "DEFAULT" policy which
just happens to be what you call first-touch.

But there is no fundamental reason you can't change the policy
of an existing program externally. It is already implemented for some
kinds of named objects (shmfs etc.), but it can be extended to
more.

> >- Program is not aware and hasn't been started with a policy
> >(or has and you change your mind) but you want to change it later.

> I'm having a little trouble parsing the "it" in that sentence.
> Does that sentence mean "you want to change the NUMA API later"?

The policy. In this case the policy includes the page placement
(this would be MPOL_F_STRICT).

> What if there never is a NUMA API structure associated with
> the program other than the default (local) policy?

If you have some generic facility to change policy externally
it doesn't matter if there was policy before or not.

> The fundamental disconnect here is that I think that very few
> programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel).
Internally it's all the same.

Hmm, I see perhaps my distinction of these cases with programs
already using NUMA API and not using it was not very useful
and led you off on a tangent. Perhaps we can just drop it.

I think one problem that you have is that you essentially
want to keep DEFAULT policy, but change the nodes.
NUMA API currently doesn't offer a way to do that,
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERRED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a bit ugly,
but not too bad.
>
> Let me expand on that a bit. What most programs do on Altix is
> to do first-touch to get data allocated locally. That is, let's
> say you have a big array that your parallel computation is going to
> work on. The programmer would sit down and say, I want processor 1
> to work on this part of the array, processor 2 on that part, etc.
> Then the programmer writes code that causes each processor to touch
> the portions of the data array that should be allocated locally on
> that processor. Bingo, storage is now allocated the way the user
> wants it, and no NUMA API call was ever issued.

Sure, but NUMA API goes to great pains to handle such programs.
>
> Yes, it is clumsy, but that is because these programs were written
> before your NUMA API came into being. Now we simply can't go back
> to these people (many of them ISVs) and say "Please rewrite your
> code to use the NUMA API." So we are left with a pile of legacy
> programs that we have to be able to migrate and that don't have any
> NUMA API data structures associated with them. What are we
> supposed to do in this case?


>
> We can't necessarily construct a NUMA API that will cause storage
> to be allocated as the programmer intended, because we can't fathom
> what the programmer was trying to accomplish based on the state
> of the program when we go to migrate it. So how would we use
> a migration facility embedded into the NUMA API to migrate this
> program and maintain its old topology?

numactl went to great pains to handle such programs. Take
a look at all the command line options ;-)

If the program is using shm and you applied the patch
to do page migration in mbind() you could handle it right now:

- map the shm segment into the management process.
- change policy with mbind(), triggering page migration
- set back default policy.
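
In code, that sequence might look something like the sketch below.
Hedged: mbind() and the policies are the existing NUMA API, but
MPOL_MF_MOVE_PAGES is a made-up stand-in for whatever flag Steve's
patch uses to migrate already-allocated pages; it is not mainline:

    #include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_DEFAULT (libnuma) */
    #include <sys/shm.h>

    /* HYPOTHETICAL: placeholder for the migration flag from Steve's
     * patch; the real name and value may differ. */
    #define MPOL_MF_MOVE_PAGES 0x4

    int migrate_shm_segment(int shmid, size_t len, int new_node)
    {
        unsigned long mask = 1UL << new_node;
        int ret = -1;
        void *addr = shmat(shmid, NULL, SHM_RDONLY); /* 1. map it here */

        if (addr == (void *)-1)
            return -1;
        /* 2. bind to the new node; the (hypothetical) flag asks for
         * the existing pages to be migrated as part of the call */
        if (mbind(addr, len, MPOL_BIND, &mask,
                  sizeof(mask) * 8, MPOL_MF_MOVE_PAGES) == 0) {
            /* 3. set back DEFAULT so future faults use first-touch */
            mbind(addr, len, MPOL_DEFAULT, NULL, 0, 0);
            ret = 0;
        }
        shmdt(addr);
        return ret;
    }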

For other objects (files etc.) there are patches in the pipeline.

The only hole that's still there is anonymous memory, but I think
we can fill that much more simply than what you're proposing, with
a "migrate whole process except when policy says otherwise" call.



> >That's the new case we discuss here.
> >
> >Now how to change policy of objects in an already running process.
> >
>
> If the running process has a non-trivial mempolicy defined for
> all of its address space, then I think I understand this. This
> is not where our disconnect lies. The disconnect is in the above, I
> think.

No, I was discussing even uncooperative processes. See below.

>
> >First, there are already some special cases handled, or handled by
> >existing patches:
> >- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
> >the object into a different process and changing the policy there.

numactl is an external program.

I designed it originally mostly to handle databases, although many HPC
people I talked to would also be happy with it (it may need more tweaks
to handle this better of course, but I hope it won't end up
as complicated as your Irix command for this ;-) )

> >- shared libraries and mmaped files in general: this is a generalization
> >of
> >the previous point. SteveL's patch is the beginning of handling this,
> >although
> >it needs some more work (xattrs) to make the policy persistent over
> >memory pressure.

Again uncooperative, just set by the administrator system wide.

> >The only case left uncovered is anonymous memory.
> >
> >You said it would need user space control, but the main reason for
> >wanting that seems to be to handle the non anonymous cases which
> >are already covered above.
>
> Yes, so long as the rest of the cases were handled in user space, then
> the anonymous memory case has to be handled there as well.

Handling in user space would mean (at least in my worldview...) setting
NUMA policy for them at some point there. The kernel would then provide
facilities to remember that policy and use it when allocating
anything (or even migrating pages if the policy was set late).

> >
> >My thinking is that the simplest way to handle that is to have a call
> >that just migrates everything. The main reason for that is that I don't
> >think external processes should mess with the virtual addresses of
> >another process. It just feels unclean and has many drawbacks (parsing
> >/proc/*/maps needs complicated user code, is racy, and makes locking
> >difficult).
> >
>
> Yes, but remember, we are coming from an assumption that migrated processes
> are suspended. This may be myopic, but we CAN make this work with the
> constraints we have in place. Now if you are arguing for a more general
> migration facility that doesn't require the processes to be blocked, well
> then I agree with you. The /proc/*/maps approach doesn't work.

It's not just the races; it seems unclean to do a complicated user
space library for something that the kernel can do better because
it has direct access to the data structures.

>
> So let's go with the idea of dropping the va_start and va_end arguments from
> the system call I proposed. Then we get into the kernel and start

That would make the node array infinite, wouldn't it? What happens when
you want to migrate a 1TB process? :-) I think you have to replace
that one with a single target node argument too.

> scanning the pte's and the page cache for anonymous memory and mapped files,
> respectively. For each VMA we have to make a migrate/don't migrate
> decision.
> We also have to accept that the sets of originating and destination nodes
> have to be distinct. Otherwise, there is no good way to tell whether or not
> a particular page has been migrated. So we have to make that restriction.

Hmm. That looks a bit unreliable. Unless you use BIND policy it's
always possible to get memory on the wrong nodes under memory pressure.

I wouldn't like it when the page migration facility doesn't "fix"
those stray allocations later.

Basically it would be random whether a page gets migrated or not
on a busy system.

You could walk the page tables in user space using get_mempolicy()
(I put a hack in there to look up the node of a page - it was
intended for regression testing of libnuma), but it's probably
not worth it. I would just let the kernel migrate everything.
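
For reference, that lookup is in the existing NUMA API: get_mempolicy()
with MPOL_F_NODE | MPOL_F_ADDR returns the node backing the page at a
given address in the calling process (so it only helps a process
inspecting its own address space). A minimal sketch:

    #include <numaif.h> /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR */

    /* Returns the node holding the page mapped at 'addr' in the
     * calling process, or -1 on error / page not present. */
    int node_of_page(void *addr)
    {
        int node = -1;

        if (get_mempolicy(&node, NULL, 0, addr,
                          MPOL_F_NODE | MPOL_F_ADDR) != 0)
            return -1;
        return node;
    }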

You and Robin mentioned some problems with "double migration"
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.

>
> Without xattrs, how do we make the migrate/non-migrate decision? Where
> do we put the data? Well, we can have some file in the file system that

One way would be the hack I proposed: mlock one page of the file
in a daemon, then set the policy. That keeps the inode pinned and the
address space with the policy tree. Not very nice, but would
work right now.

But I think you underestimate xattrs. The infrastructure is really
already quite widespread. They are quite widely used these days;
most Linux file systems support them in some form (ext2/3, reiserfs,
JFS, XFS, ...)

One hole is NFS right now; afaik it only supports ACLs, not
generic xattrs, but the selinux guys are pushing it, so I expect
this will eventually be solved too (selinux needs xattrs
for advanced security).


> has file names in it and read that file into the kernel and convert each
> file to a device and inode pair. We can then match that against each of
> the VMAs and choose not to migrate any VMA that maps a file on the list.

That's quite ugly.

> Sounds like it is doable, but I have this requirement from my IRIX
> buddies that I support overlapping sets of nodes in the to and from
> node sets. I guess we have to go back and examine that in more detail.

That and the double migration (it's still not completely clear to me
what exactly you're trying to solve here)

> I'm willing to accept that walking the page table (via follow_page()) or
> the file (via find_get_page()) is not that expensive, at least until it
> is shown otherwise. We do tend to have big address spaces and lots of
> processors associated with them, but I'm willing to accept that we won't
> migrate a huge process around very often. (Or at least not often enough
> for it to be interesting.) However, if this does turn out to be a
> performance
> problem for us, we will have to come back and re-examine this stuff.

Perhaps you can re-examine fork() and exit() and exec() too while
you're at that. They do exactly the same ;-)

[I actually have some work in the pipeline to make it faster.]

>
> >
> >The main missing piece for this would be a way to make policies for
> >files persistent. One way would be to use xattrs like selinux, but
> >that may be costly (not sure we want to read xattrs all the time
> >when reading a file).
> >
>
> I'm not sure I want to tie implementation of the page migration
> API to getting xattrs into all of the file systems in Linux
> (although I suppose we could live with it if we got them into XFS).

XFS has had xattrs from day one on Irix (I found them in
the earliest design documents on the XFS website). In fact the generic
xattr code in Linux came from SGI as part of the XFS contribution.

Nearly all other important 2.6 file systems have them now too.


> Instead I would propose a magic file to be read at boot time as discussed
> above -- that file would contain the names of all files not to be
> migrated. The kicker comes here: what do we do if that set

Sorry, but I see no way to get such a hack merged.

-Andi

2005-02-18 08:45:32

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
> [Sorry for the late answer.]
>

No problem, remember, I'm supposed to be on vacation, anyway. :-)

Let's start off with at least one thing we can agree on. If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable. (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated. So
the extended attribute may be a little more complicated than
just "don't migrate".)

The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though. It is possible
we can live with the notion that files mapped in from NFS are always
migratable. (I'll need to look into that some more).

> On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
>
>>>
>>>Sorry, but the only real difference between your API and mbind is that
>>>yours has a pid argument.
>>>

OK, so I've "lost the thread" a little bit here. Specifically what
would you propose the API for page migration be? As I read through your note,
I see a couple of different possibilities being considered:

(1) Map each object to be migrated into a management process,
update the object's memory policy to match the new node locations
and then unmap the object. Use the MPOL_F_STRICT argument to mbind() and
the result is that migration happens as part of the call.

(2) Something along the lines of:

page_migrate(pid, old_node, new_node);

or perhaps

page_migrate(pid, old_node_mask, new_node_mask);

or

(3) mbind() with a pid argument?

I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly?) :-)


>
>>The fundamental disconnect here is that I think that very few
>>programs use the NUMA API, and you think that most programs do.
>
>
> All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel).
> Internally it's all the same.

Well, yes, I guess to be more precise I should have said that
very few programs use any NUMA policy other than the DEFAULT
policy. And that they instead make page placement decisions implicitly
using first touch.

>
> Hmm, I see perhaps my distinction of these cases with programs
> already using NUMA API and not using it was not very useful
> and led you off on a tangent. Perhaps we can just drop it.
>
> I think one problem that you have is that you essentially
> want to keep DEFAULT policy, but change the nodes.

Yes, that is correct. This has been exactly my point from the
beginning.

We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate those programs without
requiring them to create a non-DEFAULT policy of some kind.

> NUMA API currently doesn't offer a way to do that,
> not even with Steve's patch that does simple page migration.
> >You only get a migration when you set a BIND or PREFERRED
> >policy, and then it would stay. But I guess you could
> >force that and then set back DEFAULT. It's a bit ugly,
> but not too bad.
>

Very ugly, I think. Particularly if you have to do a lot of
vma splitting to get the correct node placement. (Worst case
is a VMA with nodes interleaved by first touch across a set of
nodes in a way that doesn't match the INTERLEAVE mempolicy.
Then you would have to create a separate VMA for each page
and use the BIND policy. Then after migration you would
have to go through and set the policy back to DEFAULT,
resulting in a lot of vma merges.)

>
>
> Sure, but NUMA API goes to great pains to handle such programs.
>

Yes, it does. But how do we handle legacy NUMA codes that people
use today on our Linux 2.4.21 based Altix kernels? Such programs
don't have access to the NUMA API, so they aren't using it. They
work fine on 2.6 with the DEFAULT memory policy. It seems unreasonable
to go back and require these programs to use "numactl" to solve a problem that
they are already solving without it. And it certainly seems difficult
to require them to use numactl to enable migration of those programs.

(I'm sorry to keep harping on this but I think this is the
heart of the issue we are discussing. Are you of the opinion that
we should require every program that runs on Altix under Linux 2.6 to use numactl?)
>
>>So let's go with the idea of dropping the va_start and va_end arguments from
>>the system call I proposed. Then we get into the kernel and start
>
>
> That would make the node array infinite, wouldn't it? What happens when
> you want to migrate a 1TB process? :-) I think you have to replace
> that one with a single target node argument too.
>

I'm sorry, I don't follow that at all. The node array has nothing to do with
the size of the address range to be migrated. It is not the case that the
ith entry in the node array says what to do with the ith page. Instead, the
old and new node arrays define a mapping of pages: for pages found on
old_node[i], move them to new_node[i]. The count field is the size of those
arrays, not the size of the region being migrated.


> You and Robin mentioned some problems with "double migration"
> with that, but it's still not completely clear to me what
> problem you're solving here. Perhaps that needs to be reexamined.
>

I think the issue here is basically a scalability issue. The problem
we have with most of the proposals that suggest passing in just a single
source and destination node, e.g.:

sys_page_migrate(pid, old_node, new_node);

is that this is basically an O(N**2) operation, where N is the number of
processes in the job. How do we get that? Well, we have to make the
above system call M times for each process, where M is the number of
nodes being migrated. In our environment, we typically pin each process
to a different processor. So we have to make M*N system calls. But
we have two processors per node, so M is roughly N/2, and M*N comes to
N**2/2 system calls; hence it is O(N**2).

(To simplify the discussion, I am making an implicit assumption here
that the majority of the memory involved is shared among all
N processes.)

Now, once a particular shared object has been migrated, the migration
code won't do any additional work, but we will still scan over the page
table entries once per system call, so that is the component that
is done O(N**2) times. It is likely a much smaller component of the
system call than the migration code, which is O(p) in the number of
pages p, but that N**2 factor is a little scary when you have N=128
and a large number of pages.

So what we mean by the phrase "double migration" is the process
of scanning the page tables a second time to try to migrate an
object that a previous call has already moved the underlying pages
for.

Compare this to the system call:

sys_page_migrate(pid, count, old_node_list, new_node_list);

We are then down to O(N) system calls and O(N) page table scans.

But we can do better than that. If we have a system call
of the form

sys_page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);

and the majority of the memory is shared, then we only need to make
one system call and one page table scan. (We just "migrate" the
shared object once.) So the time to do the page table scans disappears
from interest, and the only thing that takes time is the migration code
itself. (Remember, I am making a big simplifying assumption here that most of
the memory is shared among all of the processes of the job, so there is
basically just one object to migrate, hence just one system call. The obvious
generalization to the real case is left as an exercise for the
reader. :-) )
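
To put the calling patterns side by side, here is an illustrative
user-space sketch. The wrapper functions are hypothetical (neither
proposed syscall form exists in mainline); only the loop structure
being counted above is the point:

    #include <sys/types.h>  /* pid_t */

    /* HYPOTHETICAL wrappers for the two proposed call forms. */
    extern int page_migrate_pair(pid_t pid, int old_node, int new_node);
    extern int page_migrate_list(pid_t pid, unsigned long va_start,
                                 unsigned long va_end, int count,
                                 const int *old_nodes, const int *new_nodes);

    /* Single-node-pair form: M calls per process, each rescanning the
     * target's page tables; with M ~ N/2 nodes that is O(N**2) scans. */
    void migrate_job_pairs(const pid_t *pids, int nprocs,
                           const int *old_nodes, const int *new_nodes, int m)
    {
        for (int p = 0; p < nprocs; p++)
            for (int i = 0; i < m; i++)
                page_migrate_pair(pids[p], old_nodes[i], new_nodes[i]);
    }

    /* List form: the whole node mapping is applied in one scan per
     * process, so O(N) calls; if most memory is one shared object, a
     * single call covering its range would do for the whole job. */
    void migrate_job_lists(const pid_t *pids, int nprocs,
                           const int *old_nodes, const int *new_nodes, int m)
    {
        for (int p = 0; p < nprocs; p++)
            page_migrate_list(pids[p], 0, ~0UL, m, old_nodes, new_nodes);
    }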

Passing in an old_node/new_node mask also solves this part of the
problem, but we have some concerns with that approach as well;
that is a detail for a later email exchange.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-18 13:02:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

[Enjoy your vacation]

On Fri, Feb 18, 2005 at 02:38:42AM -0600, Ray Bryant wrote:
>
> Let's start off with at least one thing we can agree on. If xattrs
> are already part of XFS, then it seems reasonable to use an extended
> attribute to mark certain files as non-migratable. (Some further
> thought is going to be required here -- r/o sections of a
> shared library need not be migrated, but r/w sections containing
> program or thread private data would need to be migrated. So
> the extended attribute may be a little more complicated than
> just "don't migrate".)

I assume they would allow marking arbitrary segments with specific
policies, so it should be possible.

An alternative way to handle shared libraries BTW would be to add the ELF
headers Steve did in his patch. And then handle them in user space
in ld.so and let it apply the necessary policy.

This won't work for non-ELF files though.


>
> The fact that NFS doesn't support this means that we will have to
> have some other way to handle files from NFS though. It is possible
> we can live with the notion that files mapped in from NFS are always
> migratable. (I'll need to look into that some more).

I don't know details, but I would assume selinux (and other "advanced security"
people who generally need more security information per file) have plans in
this area too.

> >
> >>>
> >>>Sorry, but the only real difference between your API and mbind is that
> >>>yours has a pid argument.
> >>>
>
> OK, so I've "lost the thread" a little bit here. Specifically what
> would you propose the API for page migration be? As I read through your
> note,
> I see a couple of different possibilities being considered:
>
> (1) Map each object to be migrated into a management process,
> update the object's memory policy to match the new node locations
> and then unmap the object. Use the MPOL_F_STRICT argument to mbind()
> and
> the result is that migration happens as part of the call.
>
> (2) Something along the lines of:
>
> page_migrate(pid, old_node, new_node);
>
> or perhaps
>
> page_migrate(pid, old_node_mask, new_node_mask);

+ node mask length.

I don't like old_node* very much because it's imho unreliable
(because you can usually never fully know on which nodes the old
process was and there can be good reasons to just migrate everything)

I assume the second way would be more flexible, although I found
having node masks for this has the problem that you tend to allocate
most memory on the lowest numbered node because it's not easy to
round-robin over all set nodes (that's an issue in PREFERRED policy
in NUMA API currently). So maybe the simple new_node argument
is preferable.

page_migrate(pid, new_node)

(or putting it into a writable /proc file if you prefer that)

>
> or
>
> (3) mbind() with a pid argument?

That would bring it to 7 arguments, really too much for a system
call (and for a function in general). Also it would mean needing
to know about another process's private addresses again.

Maybe set_mempolicy, but a new call is probably better.

> >NUMA API currently doesn't offer a way to do that,
> >not even with Steve's patch that does simple page migration.
> >You only get a migration when you set a BIND or PREFERRED
> >policy, and then it would stay. But I guess you could
> >force that and then set back DEFAULT. It's a bit ugly,
> >but not too bad.
> >
>
> Very ugly, I think. Particularly if you have to do a lot of

Well, I guess it could be made a new flag that says to
not change the future policy.

> vma splitting to get the correct node placement. (Worst case
> is a VMA with nodes interleaved by first touch across a set of
> nodes in a way that doesn't match the INTERLEAVE mempolicy.
> Then you would have to create a separate VMA for each page
> and use the BIND policy. Then after migration you would
> have to go through and set the policy back to DEFAULT,
> resulting in a lot of vma merges.)

Umm - I hope you don't want to do such tricks from external
processes. If a program does it by itself it can just use interleave
policy.

But I think I now understand why you want this complicated
user space control. You want to preserve relative ordering
on a set of nodes, right?

e.g. job runs threads on nodes 0,1,2,3 and you want it to move
to nodes 4,5,6,7 with all memory staying in the same
distance from the new CPUs as it was from the old CPUs, right?

It explains why you want old_node, you would do
(assuming node mask arguments)

page_migrate(pid, 0, 4)
page_migrate(pid, 1, 5)
...
page_migrate(pid, 3, 7)

keeping the memory in the same relative order. Problem is what happens
when some memory is in some other node due to memory pressure fallbacks.
Your scheme would not migrate this memory at all. While you may
get away with this in your application I think it would make
page migration much less useful in the general case than it could
be. e.g. for a single threaded process it is very useful to just
force all its pages that have been allocated on multiple nodes
to a specific node. I would like to have this option at least,
but with old node it would be rather inefficient. Ok, I guess you could
add a wildcard value for it; that would probably work.

Problem is still that you would need to iterate through all nodes for your
migration scenario (or how would you find out where the job allocated
its old pages?), which is not very nice.

Perhaps node masks would be better and teaching the kernel to handle
relative distances inside the masks transparently while migrating?
Not sure how complicated this would be to implement though.

Supporting interleaving on the new nodes may also be useful; that would
need at least a policy argument too, and masks.

> (I'm sorry to keep harping on this but I think this is the
> heart of the issue we are discussing. Are you of the opinion that
> we should require every program that runs on Altix under Linux 2.6 to use
> numactl?)

Programs usually don't use numactl; administrators do.

If your job runs under the control of a NUMA aware job manager then I don't
see why it couldn't use numactl or do the necessary kernel syscalls directly
on its own.

I don't think every program needs to be linked with libnuma for
NUMA policy, although I think that there should be limits on
how much functionality you try to offer in the external tools.
At some point internal support will be needed.

> I'm sorry, I don't follow that at all. The node array has nothing to do
> with
> the size of the address range to be migrated. It is not the case that the
> ith entry in the node array says what to do with the ith page. Instead, the
> old and new node arrays define a mapping of pages: for pages found on
> old_node[i], move them to new_node[i]. The count field is the size of those
> arrays, not the size of the region being migrated.

Yes, was a misunderstanding from my side.

> Compare this to the system call:
>
> sys_page_migrate(pid, count, old_node_list, new_node_list);
>
> We are then down to O(N) system calls and O(N) page table scans.

Ok. I can see the point of that now.

The main problem I see is how to handle "unknown" nodes. But perhaps
if a wildcard node were defined for this purpose (-1?) it may do.

> But we can do better than that. If we have a system call
> of the form
>
> sys_page_migrate(pid, va_start, va_end, count, old_node_list,
> new_node_list);
>
> and the majority of the memory is shared, then we only need to make
> one system call and one page table scan. (We just "migrate" the
> shared object once.) So the time to do the page table scans disappears

I don't like this because it makes it much more complicated
for user space to use. And you can set separate policies for
shared objects anyways.

-Andi

2005-02-18 16:20:42

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi - what does this line mean:

+ node mask length.

I guess it's the names of the parameters in a proposed
migration system call. Length of what, mask of what,
what does the node mean, huh?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-18 16:21:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi wrote:
> I don't like old_node* very much because it's imho unreliable
> (because you can usually never fully know on which nodes the old
> process was and there can be good reasons to just migrate everything)

That's one way that the arrays of old and new nodes pay off.
You can list any old node that might have a page, and state
which new node that page should go to.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-18 16:24:28

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi wrote:
> e.g. job runs threads on nodes 0,1,2,3 and you want it to move
> to nodes 4,5,6,7 with all memory staying in the same
> distance from the new CPUs as it was from the old CPUs, right?
>
> It explains why you want old_node, you would do
> (assuming node mask arguments)

Yup - my immediately preceding post repeated this - sorry.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-18 16:25:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi wrote:
> Problem is what happens
> when some memory is in some other node due to memory pressure fallbacks.
> Your scheme would not migrate this memory at all.

The arrays of old and new nodes handle this fine.
Include that 'other node' in the array of old nodes,
and the corresponding new node, where those pages
should migrate, in the array of new nodes.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-18 17:09:52

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Here's an interface proposal that may be a middle ground and
should satisfy both small and large system requirements.

The system call interface would be:

page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);

(e.g. the same as before, but please keep reading....)

The following restrictions of my original proposal would be
dropped:

(1) va_start and va_end can span multiple vmas. To migrate
all pages in a process, va_start can be 0UL and va_end
would be MAX_INT. (Equivalently, we could use va_start
and a length, in pages....) We would expect the normal usage
of this call on small systems to be va_start=0, va_end=MAX_INT.
va_start and va_end would be required to be page aligned.

(2) There is no requirement that the pid be suspended before
the system call is issued. Further requirements below
are proposed to handle the allocation of new pages while
the migrate system call is in progress.

(3) Mempolicy data structures will be updated to reflect the
new node locations before any pages are migrated. That
way, if the process allocates new pages before the migration
process is completed, they will be allocated on the new
nodes.

(An alternative: we could require the user to update
the NUMA API data structures to reflect the new reality
before the page_migrate() call is issued. This is consistent
with item (4). If the user doesn't do this, then
there is no guarantee that the page migration call will
actually be able to migrate all pages.)

If any memory policy is DEFAULT, then the pid will need to
be migrated to a cpu associated with one of the new_node_list
nodes before the page_migrate() call. This is so new
allocations will happen in the new_node_list and the
migration call won't miss those pages. The system call
will work correctly without this; it just can't guarantee
that it will migrate all pages from the old_nodes.

(4) If cpusets are in use, the new_node_list must represent
valid nodes to allocate pages from for the cpuset that
pid is currently a member of. This implies that the
pid is moved from its old cpuset to a new cpuset before
the page_migrate() call is issued. Any nodes not part
of the new cpu set will cause the system call to return
with -EINVAL.

(5) If, during the migration process, a page is to be moved to
node N, but the alloc_pages_node() call for node N fails, then the
allocation will fall back to the "nearest" node
in the new_node_list; if that node is full, it falls back
to the next nearest node, and so on (see the sketch
following this list). If none of the nodes has
space, then the migration system call will fail. (Hmmm...
would we unmigrate the pages that had been migrated
this far?? sounds messy.... also, not sure what one
would do about error reporting here so that the caller
could take some corrective action.)

(6) The system call is reserved to root or a pid with
capability CAP_PAGE_MIGRATE.

(7) Mapped files with the extended attribute MIGRATE
set to NONE are not migrated by the system call.
Mapped files with the extended attribute MIGRATE
set to LIB will be handled as follows: r/o
mappings will not be migrated; r/w mappings will
be migrated. If no MIGRATE extended attribute is available,
then the assumption is that the MIGRATE extended
attribute is not set. (Files mapped from NFS
would always be regarded as migratable until
NFS gets extended attributes.)
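
As referenced in item (5), here is a hedged sketch of the fallback
walk. try_alloc_on() and node_distance() are placeholders standing in
for alloc_pages_node() plus its failure check and the kernel's node
distance table; this is illustrative code, not part of any posted patch:

    #include <stddef.h>

    struct page;
    extern struct page *try_alloc_on(int node);    /* placeholder */
    extern int node_distance(int a, int b);        /* placeholder */

    /* Try the target node first; on failure, retry on the remaining
     * nodes of new_node_list from nearest to farthest, failing only
     * when every listed node is full. */
    struct page *alloc_new_page(int target, const int *new_nodes, int count)
    {
        struct page *page = try_alloc_on(target);
        char tried[256] = { 0 };    /* sketch assumes count <= 256 */

        while (page == NULL) {
            int best = -1;

            for (int i = 0; i < count; i++) {
                if (tried[i] || new_nodes[i] == target)
                    continue;
                if (best < 0 ||
                    node_distance(target, new_nodes[i]) <
                    node_distance(target, new_nodes[best]))
                    best = i;
            }
            if (best < 0)
                return NULL;    /* every listed node full: fail */
            tried[best] = 1;
            page = try_alloc_on(new_nodes[best]);
        }
        return page;
    }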

Note that nothing here requires parsing of /proc/pid/maps,
etc. However, very large systems may use the system call
in special ways, e.g.:

(1) They may decide to suspend processes before migration.
(2) They may decide to optimize the migration process by
trying to migrate large shared objects only "once",
in the sense that only one scan of a large shared
object will be done.

Issues of complexity related to the above are reserved for
those systems that choose to use the system call in this way.

Please note, however, that this is a performance optimization
that some systems MAY decide to do. There is NO REQUIREMENT
that any user follow these steps from a correctness point of
view; the page_migrate() system call will still do the correct
thing.

Now, I know that is complicated and a lot of verbiage. But this
would satisfy our requirements, and I think it would address
the concern that the page_migrate() call was built just to
satisfy SGI requirements.

Comments, flames, suggestions, etc., as usual, are all welcome.
--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-18 17:12:02

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:

> You and Robin mentioned some problems with "double migration"
> with that, but it's still not completely clear to me what
> problem you're solving here. Perhaps that needs to be reexamined.
>
>
There is one other case where Robin and I have talked about double
migration. That is the case where the sets of old nodes and new
nodes overlap. Suppose the system call interface
is assumed to be something like:

page_migrate(pid, old_node, new_node);

Then, if one is not careful (and depending on what the complete lists
of old_nodes and new_nodes are), doing something like:

page_migrate(pid, 1, 2);
page_migrate(pid, 2, 3);

can end up actually moving pages from node 1 to node 2,
only to move them again from node 2 to node 3. This is another
form of double migration that we have worried about avoiding.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-18 17:14:19

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi, et al:

I see that several messages have been sent in the interim.
I apologize for being "out of sync", but today is my last
day to go skiing and it is gorgeous outside. I'll try
to catch up and digest everything later.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-19 01:04:41

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
> [Enjoy your vacation]
>

[I am thanks -- or I was -- I go home tomorrow]

> I assume they would allow marking arbitrary segments with specific
> policies, so it should be possible.
>
> An alternative way to handle shared libraries BTW would be to add the ELF
> headers Steve did in his patch. And then handle them in user space
> in ld.so and let it apply the necessary policy.
>
> This won't work for non-ELF files though.
>

Would I then have to get a sign-off from the ld.so maintainer to get that patch
in? :-(

This sounds more general than the xattr thing I was thinking
of (i.e. marking a file as non-migratable or as a library....)

Well, we can work the exact details of this part later.


>>(2) Something along the lines of:
>>
>> page_migrate(pid, old_node, new_node);
>>
>> or perhaps
>>
>> page_migrate(pid, old_node_mask, new_node_mask);
>
>
> + node mask length.
>
> I don't like old_node* very much because it's imho unreliable
> (because you can usually never fully know on which nodes the old
> process was and there can be good reasons to just migrate everything)
>

In our case, it turns out we do know, because the job is running inside of
a cpuset, so it can't allocate memory outside of that cpuset. In
more general scenarios, you are right, you don't know. But this
can be handled with MIGRATE_NODE_ANY (more below).

> I assume the second way would be more flexible, although I found
> having node masks for this has the problem that you tend to allocate
> most memory on the lowest numbered node because it's not easy to
> round-robin over all set nodes (that's an issue in PREFERRED policy
> in NUMA API currently). So maybe the simple new_node argument
> is preferable.
>
> page_migrate(pid, new_node)
>
> (or putting it into a writable /proc file if you prefer that)
>
>
>>or
>>
>>(3) mbind() with a pid argument?
>
>
> That would bring it to 7 arguments, really too much for a system
> call (and a function in general). Also it would mean needing
> to know about other process private addresses again.
>
> Maybe set_mempolicy, but a new call is probably better.

OK, let's assume we have a new call of some kind then.
>
>
> But I think I now understand why you want this complicated
> user space control. You want to preserve relative ordering
> on a set of nodes, right?
>
> e.g. job runs threads on nodes 0,1,2,3 and you want it to move
> to nodes 4,5,6,7 with all memory staying in the same
> distance from the new CPUs as it was from the old CPUs, right?

Yes, that's it: we want the relative distances between the pages
on the new set of nodes to match the distances on the old set of
nodes as much as is possible, or we at least want a sufficiently
powerful system call to let us do this if the correct set of new
nodes is available. This is so the application has the same
level of performance before and after the migration call.

In actuality, what we intend to do is to use this API to migrate
jobs from one cpuset to another; we will require the administrator
to set up the cpusets so that cpusets of the same size are
topologically equivalent. If they don't do that, then performance can
change when a job is migrated.
>
> It explains why you want old_node, you would do
> (assuming node mask arguments)
>
> page_migrate(pid, 0, 4)
> page_migrate(pid, 1, 5)
> ...
> page_migrate(pid, 3, 7)
>
> keeping the memory in the same relative order. Problem is what happens
> when some memory is in some other node due to memory pressure fallbacks.
> Your scheme would not migrate this memory at all. While you may
> get away with this in your application I think it would make
> page migration much less useful in the general case than it could
> be. e.g. for a single threaded process it is very useful to just
> force all its pages that have been allocated on multiple nodes
> to a specific node. I would like to have this option at least,
> but with old node it would be rather inefficient. Ok, I guess you could
> add a wildcard value for it; I guess that would work.
>

The patch that I sent out already defines MIGRATE_NODE_ANY to request
any other available node; this is needed for those cases where memory
hotplug just wants to move the page off of >>this<< node. I don't
see why we couldn't allow this as a value for old node, and it
would mean "migrate all pages" (i.e., MIGRATE_NODE_ANY matches
pages on all nodes).
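
In other words, the per-page decision would look something like the
sketch below; dest_node() is a made-up helper name and -1 is only a
guess at MIGRATE_NODE_ANY's value:

    #define MIGRATE_NODE_ANY (-1)   /* placeholder for the patch's value */

    /*
     * Illustrative only (not the patch code): given a page currently
     * on node 'nid', return the node it should move to, or -1 to
     * leave the page where it is.
     */
    static int dest_node(int nid, int count,
                         const int *old_nodes, const int *new_nodes)
    {
        int i;

        for (i = 0; i < count; i++)
            if (old_nodes[i] == MIGRATE_NODE_ANY || old_nodes[i] == nid)
                return new_nodes[i];
        return -1;
    }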

> Problem is still that you would need to iterate through all nodes for your
> migration scenario (or how would you find out where the job allocated
> its old pages?), which is not very nice.

Agreed. Which is why we really prefer an old_node_list and new_node_list;
then we iterate across pages and make the indicated decision for each
page.

>
> Perhaps node masks would be better and teaching the kernel to handle
> relative distances inside the masks transparently while migrating?
> Not sure how complicated this would be to implement though.
>
> Supporting interleaving on the new nodes may be also useful, that would
> need a policy argument at least too and masks.
>

The worry I have about using node masks is that it is not as general as
old_node,new_node mappings (or preferably, the original proposal I made
of old_node_list, new_node_list). One can't differentiate between the
N! different mappings that a pair of nodemasks (with N bits set in each
mask) represents. So one would have to choose one such map (the canonical
one) where the Ith bit of the first mask maps to the Ith bit of the second.

I believe this limits the kind of cpusets one can define. In particular,
it means that for any two cpusets, if you sort the nodes of each in
ordinal order, corresponding entries of the sorted orders must have
the same topological relationship in each cpuset. Now think of the
communications interconnect as being a tree. One could construct cpuset
A by taking nodes from the left hand side of the tree, and cpuset B
by taking symmetrically chosen nodes from the right hand side of the
tree. It is clear that the two cpusets are topologically equivalent.

If nodes are numbered right to left, then the lowest numbered node in
cpuset A doesn't correspond to the lowest numbered node in cpuset B,
it corresponds to the highest numbered node. So we can't represent
the correct mapping of nodes between cpuset A and cpuset B using
a node mask, we have to use an explicit 1-1 mapping of some kind.
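
To put the same point in code (illustrative values only): both of the
requests below flatten to the same pair of masks, but they ask for
different migrations; only the arrays preserve the pairing.

    int old_a[] = { 4, 5 }, new_a[] = { 8, 9 };   /* 4 -> 8, 5 -> 9 */
    int old_b[] = { 4, 5 }, new_b[] = { 9, 8 };   /* 4 -> 9, 5 -> 8 */
    /* masks: old = {4,5}, new = {8,9} in both cases -- the reversal
       asked for by the second request is lost */
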
>
>>(I'm sorry to keep harping on this but I think this is the
>>heart of the issue we are discussing. Are you of the opinion that
>>we should require every program that runs on Altix under Linux 2.6 to use
>>numactl?)
>
>
> Programs usually don't use numactl; administrators do.
>
> If your job runs under the control of a NUMA aware job manager then I don't
> see why it couldn't use numactl or do the necessary kernel syscalls directly
> on its own.
>

If a job manager is going to use numactl to control placement, then
the job manager has to understand which parts of the address space
the program wants allocated on which node. The latter depends, in
general, on the input data the program reads. So I don't see a good
way for this to happen. Instead, today, it happens by first touch.

Note, in particular, that the mappings that sophisticated HPC
programs use are much more complicated than "Interleave everything".
They will typically be something like: put this part of the address
space on that node, this part over there, that part over there,
interleave this part, and so on. And all of those decisions can
be different every time the input data is changed, since that input
data can control the size of the matrix being analyzed, etc.

Additionally, note that the kind of NUMA aware job manager we use on
our Altix systems is based on cpusets. Jobs typically just request
the number of processors and a cpuset is created as a container for
the job.

Finally, note that we can't force our ISVs to add new NUMA API
calls in order to migrate from our Linux 2.4.21 based kernel to
our Linux 2.6 kernel. We are more or less at their mercy. And
since what they have now works well under 2.4.21, and it works
under 2.6 if we use the DEFAULT mempolicy, I just can't see how
we are going to win that argument, particularly since we are
very close to shipping our first 2.6 based systems.

So we have to figure out a way to migrate NUMA-aware programs
that are using the DEFAULT mempolicy and using first-touch
for memory placement, and we have to figure out how to migrate
them so that application performance before and after the
migration is equivalent.


> I don't think every program needs to be linked with libnuma for
> NUMA policy, although I think that there should be limits on
> how much functionality you try to offer in the external tools.
> At some point internal support will be needed.
>
>
>
>
>>Compare this to the system call:
>>
>>sys_page_migrate(pid, count, old_node_list, new_node_list);
>>
>>We are then down to O(N) system calls and O(N) page table scans.
>
>
> Ok. I can see the point of that now.
>
> The main problem I see is how to handle "unknown" nodes. But perhaps
> if a wild card node was defined for this purpose (-1?) it may do.
>

Right, MIGRATE_NODE_ANY in the new node list means "allocate on
any node". We can define MIGRATE_NODE_ANY in the old node list to
mean "take all pages". In this case, there can only be one
entry in the old and new node lists, so you could gather all pages
for a PID and move them to a single new node.
>
>>But we can do better than that. If we have a system call
>>of the form
>>
>>sys_page_migrate(pid, va_start, va_end, count, old_node_list,
>>new_node_list);
>>
>>and the majority of the memory is shared, then we only need to make
>>one system call and one page table scan. (We just "migrate" the
>>shared object once.) So the time to do the page table scans disappears
>
>
> I don't like this because it makes it much more complicated
> to use for user space. And you can set separate policies for
> shared objects anyways.

Yes, but only programs that care have to use the va_start and
va_end. Programs that want to move everything can specify
0 and MAX_INT there and they are done.

Indeed, we can expose only what we want most users to see in
glibc and leave the underlying system call in its full form
for those programs that need it.

>
> -Andi

But we are at least at the level of agreeing that the new system
call looks something like the following:

migrate_pages(pid, count, old_list, new_list);

right?

That's progress. :-)
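
For concreteness, here is the kind of thing I'd expect a batch manager
to do with it. A sketch only: the extern prototype is just the form
we've been discussing (no such syscall exists yet), and the node
numbers are invented.

    #include <signal.h>
    #include <sys/types.h>

    /* Proposed call -- no such syscall exists yet. */
    extern int migrate_pages(pid_t pid, int count,
                             const int *old_list, const int *new_list);

    static int relocate_job(pid_t pid)
    {
        int old_list[] = { 5, 6, 7, 8 };        /* where the job is  */
        int new_list[] = { 10, 11, 12, 13 };    /* where it's going  */
        int rc;

        kill(pid, SIGSTOP);             /* quiesce the job first */
        rc = migrate_pages(pid, 4, old_list, new_list);
        kill(pid, SIGCONT);
        return rc;
    }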


--
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-20 21:49:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

> >Perhaps node masks would be better and teaching the kernel to handle
> >relative distances inside the masks transparently while migrating?
> >Not sure how complicated this would be to implement though.
> >
> >Supporting interleaving on the new nodes may be also useful, that would
> >need a policy argument at least too and masks.
> >
>
> The worry I have about using node masks is that it is not as general as
> old_node,new_node mappings (or preferably, the original proposal I made
> of old_node_list, new_node_list). One can't differentiate between the

I agree that the node arrays are better for this case.

> >>and the majority of the memory is shared, then we only need to make
> >>one system call and one page table scan. (We just "migrate" the
> >>shared object once.) So the time to do the page table scans disappears
> >
> >
> >I don't like this because it makes it much more complicated
> >to use for user space. And you can set separate policies for
> >shared objects anyways.
>
> Yes, but only programs that care have to use the va_start and
> va_end. Programs who want to move everything can specify
> 0 and MAX_INT there and they are done.

I still think it's fundamentally unclean and racy. External processes
shouldn't mess with virtual addresses of other processes.

> >-Andi
>
> But we are at least at the level of agreeing that the new system
> call looks something like the following:
>
> migrate_pages(pid, count, old_list, new_list);
>
> right?

For the external case probably yes. For internal (process does this
on its own address space) it should be hooked into mbind() too.

-Andi

2005-02-20 22:31:15

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi wrote:
> I still think it's fundamentally unclean and racy. External processes
> shouldn't mess with virtual addresses of other processes.

It's not really messing with (changing) the virtual addresses of
another process. It's messing with the physical placement. It's
using the virtual addresses to help choose which pages to move.

Do you have any better way to suggest, Andi, for a batch manager to
relocate a job? The typical scenario, as Ray explained it to me, is
thus. A lower priority job, after running a while, is displaced by a
higher priority job that needs a large number of nodes. Later on enough
nodes to run the lower priority job become available elsewhere. The
lower priority job can either continue to wait for its original nodes to
come free (after the high priority job finishes) or it can be relocated
to the nodes available now.

How would you recommend that the batch manager move that job to the
nodes that can run it? The layout of allocated memory pages and tasks
for that job must be preserved in order to keep the same performance.
The migration method needs to scale to hundreds, or more, of nodes.

(I'm starting to have visions of vma's having externally visible id's,
in a per-task namespace.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-20 22:35:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

> Do you have any better way to suggest, Andi, for a batch manager to
> relocate a job? The typical scenario, as Ray explained it to me, is

- Give the shared libraries and any other files a suitable policy
(by mapping them and applying mbind)

- Then execute migrate_pages() for the anonymous pages with a suitable
old node -> new node mapping.
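
Roughly like this (an untested sketch; once mbind() is hooked up to
migration, as discussed, this would also move already-allocated pages):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <numaif.h>

    /* Map a file and bind its pages to one node. Sketch only;
     * error handling omitted; assumes libnuma's <numaif.h>. */
    static int bind_file(const char *path, size_t len, int node)
    {
        unsigned long mask = 1UL << node;
        int fd = open(path, O_RDONLY);
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        close(fd);              /* the mapping keeps the file alive */
        return mbind(p, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);
    }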

> How would you recommend that the batch manager move that job to the
> nodes that can run it? The layout of allocated memory pages and tasks
> for that job must be preserved in order to keep the same performance.
> The migration method needs to scale to hundreds, or more, of nodes.

You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly
(in the worst case you could create a small hash table for it
in the kernel, but I'm not sure it's worth it)

-Andi

2005-02-21 01:52:49

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

> - Give the shared libraries and any other files a suitable policy
> (by mapping them and applying mbind)

Ah - I think you've said this before, and I'm being a bit slow.

You're saying that one could horse around with the physical placement of
existing files mapped into another task's space by mapping them into one's
own space and using mbind (once mbind is hooked up to page migration,
to quote one of your earlier posts ;). Ok.

How well does this work with a mapped file if the pages of that file
have been placed (allocated on nodes) using some intricate first-touch
pattern that is only encoded in some inscrutable initialization code of
the application, and that is heavily fragmented, with few contiguous
pages on the same node?

It seems to me that you can't migrate such regions efficiently using the
above explicit mbind'ing -- it could require a vma per page in the
limit. And you can't migrate such regions using a migrate_pages() for
all anonymous pages in a task's space, because these aren't anon pages.

Do you have in mind being able to tag such mapped files with an
attribute that causes their pages to be migrated along with the
anon pages on the migrate_pages() call? That might work ...


> > How would you recommend that the batch manager move that job to the
> > nodes that can run it? ...
>
> You have to walk the full node mapping for each array, but
> even with hundreds of nodes that should not be that costly

I presume if you knew that the job only had pages on certain nodes,
perhaps due to aggressive use of cpusets, that you would only have to
walk those nodes, right?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-21 04:15:50

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
>>
>>But we are at least at the level of agreeing that the new system
>>call looks something like the following:
>>
>>migrate_pages(pid, count, old_list, new_list);
>>
>>right?
>
>
> For the external case probably yes. For internal (process does this
> on its own address space) it should be hooked into mbind() too.
>
> -Andi
>
That makes sense. I will agree to make that part work, too, as part
of this. We will probably do the external case first, because we have
a need for that.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 07:26:46

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
>>Do you have any better way to suggest, Andi, for a batch manager to
>>relocate a job? The typical scenario, as Ray explained it to me, is
>
>
> - Give the shared libraries and any other files a suitable policy
> (by mapping them and applying mbind)
>
> - Then execute migrate_pages() for the anonymous pages with a suitable
> old node -> new node mapping.
>
>
>>How would you recommend that the batch manager move that job to the
>>nodes that can run it? The layout of allocated memory pages and tasks
>>for that job must be preserved in order to keep the same performance.
>>The migration method needs to scale to hundreds, or more, of nodes.
>
>
> You have to walk the full node mapping for each array, but
> even with hundreds of nodes that should not be that costly
> (in the worst case you could create a small hash table for it
> in the kernel, but I'm not sure it's worth it)
>
> -Andi

I'm going to assume that there have been some "crossed emails" here.
I don't think that this is the interface that you and I have been
converging on. As I understood it, we were converging on the following:

(1) extended attributes will be used to mark files as non-migratable
(2) the page_migrate() system call will be defined as:

page_migrate(pid, count, old_nodes, new_nodes);

and it will migrate all pages that are either anonymous or part
of mapped files that are not marked non-migratable.
(3) The mbind() system call with MPOL_MF_STRICT will be hooked up
to the migration code so that it actually causes a migration.
Processes can use this interface to migrate a portion of their own
address space containing a mapped file.

This is different from your reply above, which seems to imply that:

(A) Step 1 is to migrate mapped files using mbind(). I don't understand
how to do this in general, because:
(a) I don't know how to make a non-racy list of the mapped files to
migrate without assuming that the process to be migrated is stopped
and (b) If the mapped file is associated with the DEFAULT memory policy,
and page placement was done by first touch, then it is not clear
how to use mbind() to cause the pages to be migrated, and still
end up with the identical topological placement of pages after
the migration.
(B) Step 2 is to use page_migrate() to migrate just the anonymous pages.
I don't like the restriction of this to just anonymous pages.

Fundamentally, I don't see why (A) is much different from allowing one
process to manipulate the physical storage for another process. It's
just stated in terms of mmap'd objects instead of pid's. So I don't
see why that is fundamentally different from a page_migrate() call
with va_start and va_end arguments.

So I'm going to assume that the agreement was really (1)-(3) above.

The only problem I see with that is the following: Suppose that a user
wants to migrate a portion of their own address space that is composed
of (at least partly) anonymous pages or pages mapped to a file associated
with the DEFAULT memory policy, and we want the pages to be topologically
allocated the same way after the migration as they were before the
migration. How do we do that?

The only way I know how to do the latter is with a system call of the form:

page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

where the permission model is that a process can migrate any process
that it can send a signal to. So a root process can migrate any process,
and a user's process can migrate pages of any pid started by that user.
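
That is, roughly the same check that kill() makes. In sketch form (not
patch code; the uid arguments stand in for current->euid and friends):

    #include <sys/types.h>

    /* Sketch of the intended check, modeled on what kill(2) does. */
    static int may_migrate(uid_t cur_euid, uid_t cur_uid,
                           uid_t tgt_uid, uid_t tgt_suid)
    {
        if (cur_euid == 0)
            return 1;           /* root may migrate anyone */
        return cur_euid == tgt_suid || cur_euid == tgt_uid ||
               cur_uid  == tgt_suid || cur_uid  == tgt_uid;
    }
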
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 07:35:18

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Paul Jackson wrote:
>>
>>You have to walk the full node mapping for each array,
>>even with hundreds of nodes that should not be that costly
>
>
> I presume if you knew that the job only had pages on certain nodes,
> perhaps due to aggressive use of cpusets, that you would only have to
> walk those nodes, right?
>
I don't think Andi was proposing you have to search all of the pages
on a node. I think that the idea was that the (count, old_nodes, new_nodes)
parameters would have to be converted to a full node_map such as is done
in the patch (let's call it "sample code") that I sent out with the
overview that started this whole discussion. node_map[] is MAX_NUMNODES
in length, and node_map[i] gives the node where pages on node i should be
migrated to, or is -1 if we are not migrating pages on this node.

Since we have extended the interface to support -1 as a possible value
in the old_nodes array [where it matches any old node], in that case we
would set node_map[i] = new_node for all values of i.
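
In sketch form, the conversion would be something like this (along the
lines of the sample code, though not taken from it verbatim):

    #define MAX_NUMNODES 256    /* placeholder value for the sketch */

    /* Expand (count, old_nodes, new_nodes) into the full node_map[]
     * described above. */
    static void build_node_map(int count, const int *old_nodes,
                               const int *new_nodes, int *node_map)
    {
        int i;

        for (i = 0; i < MAX_NUMNODES; i++)
            node_map[i] = -1;                   /* not migrating node i */

        if (count == 1 && old_nodes[0] == -1) {
            for (i = 0; i < MAX_NUMNODES; i++)  /* wildcard old node */
                node_map[i] = new_nodes[0];
            return;
        }

        for (i = 0; i < count; i++)
            node_map[old_nodes[i]] = new_nodes[i];
    }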

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 08:38:30

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

All,

Just an update on the idea of migrating a process without suspending
it.

The hard part of the problem here is to make sure that the page_migrate()
system call sees all of the pages to migrate. If the process that is
being migrated can still allocate pages, then the page_migrate() call
may miss some of the pages.

One way to solve this problem is to force the process to start allocating
pages on the new nodes before calling page_migrate(). There are a couple
of subcases:

(1) For memory mapped files with a non-DEFAULT associated memory policy,
one can use mbind() to fixup the memory policy. (This assumes the
Steve Longerbeam patches are applied, as I understand things).

(2) For anonymous pages and memory mapped files with DEFAULT policy,
the allocation depends on the node where the process is running. So
after doing the above, you need to migrate the task to a cpu
associated with one of the new nodes.

The problem with (1) is that it is racy: there is no guaranteed way to get
the list of mapped files for the process while it is still running. A
process can do it for itself, so one way to do this would be to write the
set of new nodes to a /proc/pid file, then send the process a SIG_MIGRATE
signal. Ugly.... (For multithreaded programs, all of the threads have
to be signalled to keep them from mmap()ing new files during the migration.)

(1) could be handled as part of the page_migrate() system call --
make one pass through the address space searching for mempolicy()
data structures, and updating them as necessary. Then make a second
pass through and do the migrations. Any new allocations will then
be done under the new mempolicy, so they won't be missed. But this
still gets us into trouble if the old and new node lists are not
disjoint.

This doesn't handle anonymous memory or mapped files associated with
the DEFAULT policy. A way around that would be to add a target cpu_id
to the page_migrate() system call. Then before doing the first pass
described above, one would do the equivalent of set_sched_affinity()
for the target pid, moving it to the indicated cpu. Once it is known
the pid has moved (how to do that?), we now know anonymous memory and
DEFAULT mempolicy mapped files will be allocated on the nodes associated
with the new cpu. Then we can proceed as discussed in the last paragraph.
Also ugly, due to the extra parameter.

Alternatively, we can just require, for correct execution, that the
invoking code do the set_sched_affinity() first in those cases where
migrating a running task is important.
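
In sketch form, using the proposed migrate_pages() prototype and
glossing over the question of knowing when the task has actually moved:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Proposed call -- no such syscall exists yet. */
    extern int migrate_pages(pid_t pid, int count,
                             const int *old_nodes, const int *new_nodes);

    static int move_running_task(pid_t pid, int new_cpu, int count,
                                 const int *old_nodes, const int *new_nodes)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(new_cpu, &mask);

        /* Pin the task first, so new first-touch allocations
           already land on new_cpu's node ... */
        if (sched_setaffinity(pid, sizeof(mask), &mask) < 0)
            return -1;

        /* ... then migrate what was allocated before the move. */
        return migrate_pages(pid, count, old_nodes, new_nodes);
    }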

Anyway, how important is this, really, for acceptance of a page_migrate()
system call in the community? (That is, how important is it to be
able to migrate a process without suspending it?)
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 09:58:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

On Mon, Feb 21, 2005 at 01:29:41AM -0600, Ray Bryant wrote:
> This is different than your reply above, which seems to imply that:
>
> (A) Step 1 is to migrate mapped files using mbind(). I don't understand
> how to do this in general, because:
> (a) I don't know how to make a non-racy list of the mapped files to
> migrate without assuming that the process to be migrated is
> stopped

That was just a stopgap way to do the migration before you have
xattrs for shared libraries. If you have them it's not needed.

> and (b) If the mapped file is associated with the DEFAULT memory policy,
> and page placement was done by first touch, then it is not clear
> how to use mbind() to cause the pages to be migrated, and still
> end up with the identical topological placement of pages after
> the migration.

It can be done, but it's ugly. But it really was only intended for
the shared libraries.

> (B) Step 2 is to use page_migrate() to migrate just the anonymous pages.
> I don't like the restriction of this to just anonymous pages.

That would be only in this scenario; I agree it doesn't make sense
to add it as a general restriction to the syscall.

>
> Fundamentally, I don't see why (A) is much different from allowing one
> process to manipulate the physical storage for another process. It's
> just stated in terms of mmap'd objects instead of pid's. So I don't
> see why that is fundamentally different from a page_migration() call
> with va_start and va_end arguments.

An mmapped object exists on its own. Its access is fully reference counted, etc.

> The only problem I see with that is the following: Suppose that a user
> wants to migrate a portion of their own address space that is composed
> of (at last partly) anonymous pages or pages mapped to a file associated
> with the DEFAULT memory policy, and we want the pages to be toplogically
> allocated the same way after the migration as they were before the
> migration?

It doesn't seem very realistic to me. When users want to change
their own address space they can use mbind() from the beginning,
and they should know what their memory layout is.

-Andi

2005-02-21 12:03:05

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II


Ray wrote:
> As I understood it, we were converging on the following:
> (1) ...
> (2) ...
> (3) ...
> This is different than your reply above, which seems to imply that:
> (A) ...
> (B) ...

Andi reacted to various details of (A) and (B).

Any chance, Andi, of you directly stating whether you concur
with Ray that you two are converging on (1), (2) and (3)?

I'm afraid my mind reading skills aren't that good.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-21 12:10:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

On Mon, Feb 21, 2005 at 02:42:16AM -0600, Ray Bryant wrote:
> All,
>
> Just an update on the idea of migrating a process without suspending
> it.
>
> The hard part of the problem here is to make sure that the page_migrate()
> system call sees all of the pages to migrate. If the process that is
> being migrated can still allocate pages, then the page_migrate() call
> may miss some of the pages.

I would do an easy 95% solution:

When the process has the default process policy, temporarily set a
preferred policy with the new node.

[this won't work with multiple nodes though, so you have to decide on one
or stop the process if that is unacceptable]

>
> One way to solve this problem is to force the process to start allocating
> pages on the new nodes before calling page_migrate(). There are a couple
> of subcases:
>
> (1) For memory mapped files with a non-DEFAULT associated memory policy,
> one can use mbind() to fixup the memory policy. (This assumes the
> Steve Longerbeam patches are applied, as I understand things).

I would just ignore them. If user space wants to, it can handle it,
but it's probably not worth it.

> (1) could be handled as part of the page_migrate() system call --
> make one pass through the address space searching for mempolicy()
> data structures, and updating them as necessary. Then make a second
> pass through and do the migrations. Any new allocations will then
> be done under the new mempolicy, so they won't be missed. But this
> still gets us into trouble if the old and new node lists are not
> disjoint.

I wouldn't bother fixing up VMA policies.

> This doesn't handle anonymous memory or mapped files associated with
> the DEFAULT policy. A way around that would be to add a target cpu_id

[...]

I would temporarily set a preferred policy as mentioned above.

That only handles a single node, but your solution is no better.

-Andi

2005-02-21 17:09:43

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:


>
> I wouldn't bother fixing up VMA policies.
>
>

How would these policies get changed so that they represent the
reality of the new node location(s) then? Doesn't this have to
happen as part of migrate_pages()?


--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-22 06:39:11

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi,

I went back and did some digging on one of the issues that has dropped
off the list here: the case where the set of old nodes and new
nodes overlap in some way. No one could provide me with a specific
example, but the sense was that "this did happen in certain scenarios".

Part of these scenarios involved situations where a particular job
had to have access to a certain node, because that certain node was
attached to a graphics device, for example. Here is one such
scenario:

Let's suppose that nodes 0-1 of a 64 CPU system have graphics
pipes. To keep it simple, we will assume that there are 2 cpus
per node like an Altix. Let's suppose that jobs arrive as follows:

(1) 32 processor, non-graphics job arrives and gets assigned
cpus 96-127 (nodes 48-63)

(2) A second 32 processor, non-graphics job arrives and is
assigned cpus 64-95 (nodes 32-47)

(3) A 64 processor non-graphics job arrives and gets assigned
cpus 0-63.

(bear with me, please)....

(4) The job on cpus 64-95 terminates. A new 28 processor
job arrives and is assigned cpus 68-95.

(5) A 4 cpu graphics job comes in and we want to assign it to
cpus 0-3 (nodes 0-1) and it has a very high priority, so
we want to migrate the 64 CPU job. The only place left
to migrate it is from cpus 0-63 to cpus 4-67.

(Note that we can't just migrate nodes 0-1 to nodes 32-33, because
for all we know, the program depends on the fact that nodes 0-1
are physically close to [have low latency access to] nodes 2-3.
So moving 0-1 to 32-33 would be a non-topology-preserving
migration.)

Now if we are using a system call of the form

migrate_pages(pid, count, old_node_list, new_node_list);

then we really can't have old_node_list and new_node_list overlap,
unless this is the only process that we are migrating or there is
no shared memory among the pid's. (Neither is very likely for
our workload mix. :-) ).

The reason that this doesn't work is the following: It works
fine for the first pid. The shared segment gets moved to the
new_node_list. But when we call migrate_pages() for the 2nd
pid, we will remigrate the pages that ended up on the nodes
that are in the intersection of the sets of members of the
two lists. (The scanning code has no way to recognize that
the pages have been migrated. It finds pages that are on one
of the old nodes, and migrates them again.) This gets repeated
for each subsequent call. Not pretty. Consider what happens in this
particular case if you do the trivial thing and try:

old_nodes=0 1 2 ... 31
new_nodes=2 3 4 ... 33

Then after 16 processes have been migrated, all of the shared memory
pages of the job are on nodes 32 and 33. (I've assumed the shared
memory is shared among all of the processes of the job.)
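
You can see the cascade with a trivial simulation that follows one
shared page, starting on node 0, through the 64 per-process calls:

    #include <stdio.h>

    /* Each call applies the map i -> i+2 for old nodes 0..31. */
    int main(void)
    {
        int node = 0, call;

        for (call = 0; call < 64; call++)
            if (node <= 31)     /* still matches an old node ... */
                node += 2;      /* ... so it gets migrated again  */

        printf("page ends up on node %d\n", node);  /* prints 32 */
        return 0;
    }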

Now you COULD do multiple migrations to make this work.
In this case, you could do 16 migrations:

step old_nodes new_nodes
1 30 31 32 33
2 28 29 30 31
3 26 27 28 29
...
16 0 1 2 3

During each step, you would have to call migrate_pages() 64 times,
since there are 64 processes involved. (You can't migrate
any more nodes in each step without creating a situation where
pages will be physically migrated twice.) Once again, we are
starting to veer close to O(N**2) behavior here, and we want
to stay away from that.
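
For the record, here is that multi-step schedule written as a loop
(illustrative; still 16 steps times 64 processes):

    #include <sys/types.h>

    /* Proposed call -- no such syscall exists yet. */
    extern int migrate_pages(pid_t pid, int count,
                             const int *old_nodes, const int *new_nodes);

    /* Highest node pair first, so no page ever lands on a node
     * that a later step treats as an old node. */
    static void relocate_in_steps(const pid_t *pids, int npids)
    {
        int step, p;

        for (step = 1; step <= 16; step++) {
            int old_nodes[2] = { 32 - 2 * step, 33 - 2 * step };
            int new_nodes[2] = { 34 - 2 * step, 35 - 2 * step };

            for (p = 0; p < npids; p++)
                migrate_pages(pids[p], 2, old_nodes, new_nodes);
        }
    }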

OK, so what is the alternative? Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form

migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);

with old_node_list = 0 1 2 ... 31
new_node_list = 2 3 4 ... 33

for one of the pid's in the job.

(This is particularly important if the shared region is large.)

Next we could go and move the non-shared memory in each process
using similar calls, repeated one or more times in each process.

Yes, this is ugly, and yes this requires us to parse /proc/pid/maps.
Life is like that sometimes.

Now, I admit that this example is somewhat contrived, and it shows
worst case behavior. But this is not an implausible scenario. And
it shows the difficulties of trying to use a system call of the
form:

migrate_pages(pid, count, old_node_list, new_node_list)

in those cases where the old_node_list and the new_node_list are not
disjoint. Furthermore, it shows how we could end up in a situation
where the old_node_list and the new_node_list overlap.

Jack Steiner pointed out this kind of example to me, and this kind
of example did arise in IRIX, so we believe that it will arise on
Altix and we don't know of a good way around these problems other
than the system call form that includes the va_start and va_end.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-22 06:40:21

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi,

Oops. It's late. The paragraph below in my previous note confused
cpus and nodes. It should have read as follows:

Let's suppose that nodes 0-1 of a 64 node [was: CPU] system have graphics
pipes. To keep it simple, we will assume that there are 2 cpus
per node like an Altix [128 CPUs in this system]. Let's suppose that jobs
arrive as follows:
. . .

Sorry about that.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-22 18:01:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

> OK, so what is the alternative? Well, if we had a va_start and
> va_end (or a va_start and length) we could move the shared object
> once using a call of the form
>
> migrate_pages(pid, va_start, va_end, count, old_node_list,
> new_node_list);
>
> with old_node_list = 0 1 2 ... 31
> new_node_list = 2 3 4 ... 33
>
> for one of the pid's in the job.

I still don't like it. It would be bad to make migrate_pages another
ptrace() [and ptrace at least really enforces a stopped process]

But I can see your point that migrating DEFAULT pages with first-touch
aware applications pretty much needs the old_node, new_node lists.
I just don't think an external process should mess with another process's
VA. But I can see that it makes sense to do this on SHM that
is mapped into a management process.

How about you add the va_start, va_end but only accept them
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...)

BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.

-Andi

2005-02-22 18:04:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

On Mon, Feb 21, 2005 at 11:12:14AM -0600, Ray Bryant wrote:
> Andi Kleen wrote:
>
>
> >
> >I wouldn't bother fixing up VMA policies.
> >
> >
>
> How would these policies get changed so that they represent the
> reality of the new node location(s) then? Doesn't this have to
> happen as part of migrate_pages()?

You might want to change it, but it's a pure policy issue. And
that kind of policy should be in user space. However, I can see
it being ugly to grab the list of policies from user space
(it would need a /proc file).

Perhaps you're right and it's better to do in the kernel.
It just won't be very pretty code to convert all the masks.

-Andi

2005-02-22 18:40:39

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:

>
> How about you add the va_start, va_end but only accept them
> when pid is 0 (= current process). Otherwise enforce with EINVAL
> that they are both 0. This way you could map the
> shared object into the batch manager, migrate it there, then
> mark it somehow to not be migrated further, and then
> migrate the anonymous pages using migrate_pages(pid, ...)
>

We'd have to use up a struct page flag (PG_MIGRATED?) to mark
the page as migrated to keep the call to migrate_pages() for
the anonymous pages from migrating the pages again. Then we'd
have to have some way to clear PG_MIGRATED once all of the
migrate_pages() calls are complete (we can't have the anonymous
page migrate_pages() calls clear the flags, since the second
such call would find the flag clear and remigrate the pages
in the overlapping nodes case.)

How about ignoring the va_start and va_end values unless
either:

pid == current->pid
or current->euid == 0 /* we're root */

I like the first check a bit better than checking for 0. Are
there other system calls that follow that convention (e.g.,
pid == 0 implies current)?

The second check lets a sufficiently responsible task manipulate
other tasks. This task can choose to have the target tasks
suspended before it starts fussing with them.

> BTW it might be better to make va_end a size, just to be more
> symmetric with mlock,madvise,mmap et.al.
>

Yes, that's been pointed out to me before. Let's make it so.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-22 18:50:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

On Tue, Feb 22, 2005 at 12:45:21PM -0600, Ray Bryant wrote:
> Andi Kleen wrote:
>
> >
> >How about you add the va_start, va_end but only accept them
> >when pid is 0 (= current process). Otherwise enforce with EINVAL
> >that they are both 0. This way you could map the
> >shared object into the batch manager, migrate it there, then
> >mark it somehow to not be migrated further, and then
> >migrate the anonymous pages using migrate_pages(pid, ...)
> >
>
> We'd have to use up a struct page flag (PG_MIGRATED?) to mark
> the page as migrated to keep the call to migrate_pages() for
> the anonymous pages from migrating the pages again. Then we'd

I was more thinking of a new mempolicy or a flag for one. A
flag would probably be better. No need to keep state in struct page.

> How about ignoring the va_start and va_end values unless
> either:
>
> pid == current->pid
> or current->euid == 0 /* we're root */
>
> I like the first check a bit better than checking for 0. Are
> there other system calls that follow that convention (e. g.
> pid = 0 implies current?)
>
> The second check lets a sufficiently responsible task manipulate
> other tasks. This task can choose to have the target tasks
> suspended before it starts fussing with them.

I don't like that. The idea behind this restriction is to simplify
things by making sure only processes change their own VM. Letting
root override this doesn't make much sense.

-Andi

2005-02-22 22:01:44

by Ray Bryant

[permalink] [raw]
Subject: Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

Andi Kleen wrote:
>>OK, so what is the alternative? Well, if we had a va_start and
>>va_end (or a va_start and length) we could move the shared object
>>once using a call of the form
>>
>> migrate_pages(pid, va_start, va_end, count, old_node_list,
>> new_node_list);
>>
>>with old_node_list = 0 1 2 ... 31
>> new_node_list = 2 3 4 ... 33
>>
>>for one of the pid's in the job.
>
>
> I still don't like it. It would be bad to make migrate_pages another
> ptrace() [and ptrace at least really enforces a stopped process]
>
> But I can see your point that migrating DEFAULT pages with first-touch
> aware applications pretty much needs the old_node, new_node lists.
> I just don't think an external process should mess with another process's
> VA. But I can see that it makes sense to do this on SHM that
> is mapped into a management process.
>
> How about you add the va_start, va_end but only accept them
> when pid is 0 (= current process). Otherwise enforce with EINVAL
> that they are both 0. This way you could map the
> shared object into the batch manager, migrate it there, then
> mark it somehow to not be migrated further, and then
> migrate the anonymous pages using migrate_pages(pid, ...)
>

There can be mapped files that can't be mapped into the migration task.

Here's an example (courtesy of Jack Steiner):

sprintf(fname, "/tmp/tmp.%d", getpid());
unlink(fname);
fd = open(fname, O_CREAT|O_RDWR, 0600);    /* O_CREAT needs a mode */
ftruncate(fd, bytes);                      /* size the file before use */
p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
unlink(fname);
/* "p" remains valid until unmapped */

The file /tmp/tmp.pid is both mapped and deleted. It can't be opened
by another process to mmap() it, so it can't be mapped into the
migration task, as far as I know. The file does show up in
/proc/pid/maps as shown below (pardon the line splitting):

2000000000270000-2000000000278000 rw-p 00200000 08:13 75498728 \
/lib/tls/libc.so.6.1
2000000000278000-2000000000284000 rw-p 2000000000278000 00:00 0
2000000000300000-2000000000c8c000 rw-s 00000000 08:13 100885287 \
/tmp/tmp.18259 (deleted)
4000000000000000-4000000000008000 r-xp 00000000 00:2a 14688706 \
/home/tulip14/steiner/apps/bigmem/big

Jack says:

"This is a fairly common way to work with scratch map'ed files. Sites that
have very large disk farms but limited swap space frequently do this (or at
least they use to...)"

So while I tend to agree with your concern about manipulating
one process's address space from another, I honestly think we
are stuck, and I don't see a good way around this.

> BTW it might be better to make va_end a size, just to be more
> symmetric with mlock,madvise,mmap et.al.
>

Yes, I agree. Let's make that so.

> -Andi


--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------