Migrate pages from one ccNUMA node to another.
============================================
Version 0.1, 25th of March 2004
By Zoltan Menyhart, Bull S.A. <[email protected]>
The usual GPL applies.
What is it all about ?
----------------------
The old golden days of the Symmetrical Multi-Processor systems are over.
Gone forever.
We are left with (cache coherent) Non Uniform Memory Architectures.
I can see the future.
I can see systems with hundreds, thousands of processors, with less and less
uniform memory architectures.
The "closeness" of a processor to its working set of memory will have the most
important effect on the performance.
You can make use of the forthcoming NUMA APIs to set up your NUMA environment:
to bind processes to (groups of) processors, to define the memory placement
policy, etc.
Yes, the initial placement is very important: it has a tremendous effect on the
performance you obtain.
Yet, what if
- the application changes its behavior over time ?
(which processor uses which part of the memory)
- you have not got the source of the application ?
- you cannot add the NUMA services to it ?
- you are not authorized to touch it ? (e.g. it is a reference benchmark)
Page migration tries to help you out in these situations.
What can this service do ?
--------------------------
- Migrate pages identified by their physical addresses to another NUMA node
- Migrate pages of a virtual user address range to another NUMA node
How can it be used ?
--------------------
1. Hardware assisted migration
..............................
As you can guess, it is very much platform dependent.
I can only give you an example. Any advice on how to define a platform
independent interface will be appreciated.
We've got an Intel IA64 based machine for development / testing.
It consists of 4 "Tiger boxes" connected together by a pair of Scalability Port
Switches. A "Tiger box" is built around a Scalable Node Controller (SNC), and
includes 4 Itanium-2 processors and some Gbytes of memory.
The NUMA factor is 1 : 2.25.
The SNC contains 2048 counters, which allow us to count how many times each of
2048 memory zones is touched from each node during a given observation period.
An "artificial intelligence" can make predictions from these usage statistics
and decide what pages are to be migrated and where.
(Unfortunately, the SNCs are buggy - even version C.1 is - so we have to use
a couple of work-arounds, and much of the work has to be done in software.
This wastes about 10 seconds of CPU time while executing a benchmark of
2 minutes. I hope, one day...)
2. Application driven migration
...............................
An application can exploit the forthcoming NUMA APIs to specify its initial
memory placement policy.
Yet what if the application wants to change its behavior ?
Having the application itself allocate room on the destination node, copy the
data, and finally free the data's original room is not very efficient.
An application can ask the migration service to move a range of its virtual
address space to the destination node.
Example:
A process of an application prepares a huge amount of data and hands it over to
its fellow processes (which happen to be bound to another NUMA node) for their
(almost) exclusive usage.
Migrating a page costs 128 remote accesses (assuming a page size of 16 Kbytes
and a bus transaction size of 128 bytes: 16384 / 128 = 128), plus some
administration. Assuming the consumers of the data will touch the page (cache
misses) a considerable number of times, say more than 1000 times, the migration
clearly pays off.
3. NUMA aware scheduler
.......................
A NUMA aware scheduler tries to keep processes on their "home" node where they
have allocated (most of) their memory. What if the processors in this node are
overloaded while several processors in the other nodes are largely idle ?
Should the scheduler select some other processors in the other nodes to execute
these processes, at the expense of a considerable number of extra inter-node
transactions ?
Or should the scheduler leave the processors in the other nodes doing nothing ?
Or should it move some processes with their memory working set to another node ?
Let's leave this dilemma for the NUMA aware scheduler for the moment.
Once the scheduler has made up its mind, the migration service can move the
working set of memory of the selected processes to their new "home" node.
User mode interface
-------------------
This prototype of the page migration service is implemented as a system call,
the different forms of which are wrapped by use of some small,
static, inline functions.
NAME
migrate_ph_pages - migrate pages to another NUMA node
migrate_virt_addr_range - migrate virtual address range to another node
SYNOPSIS
#include <sys/types.h>
#include "page_migrate.h"
int migrate_ph_pages(
const phaddr_t * const table,
const size_t length,
const int node,
struct _un_success_count_ * const p,
const pid_t pid);
int migrate_virt_addr_range(
const caddr_t address,
const size_t length,
const int node,
struct _un_success_count_ * const p,
const pid_t pid);
DESCRIPTION
The "migrate_ph_pages()" system call is used to migrate pages - their
physical addresses of "phaddr_t" type are given in "table" - to "node".
"length" indicates the number of the physical addresses in "table" and
should not be greater than "PAGE_SIZE / sizeof(phaddr_t)".
Only the pages belonging to the process indicated by "pid" and its
child processes cloned via "clone2(CLONEVM)" are treated, the other
processes' pages are silently ignored.
The "migrate_virt_addr_range()" system call is used to migrate pages of
a virtual address range of "length" starting at "address" to "node".
The virtual address range belongs to the process indicated by "pid" and
to its cloned children. If "pid" is zero then the current
process's virtual address range is moved.
Some statistics are returned via "p":
    struct _un_success_count_ {
            unsigned int    successful;     // Pages successfully migrated
            unsigned int    failed;         // Minor failures
    };
RETURN VALUE
"migrate_ph_pages()" and "migrate_virt_addr_range()" return 0 on
success, or -1 if a major error occurred (in which case, "errno" is set
appropriately). Minor errors are silently ignored (migration continues
with the rest of the pages).
ERRORS
ENODEV: illegal destination node
ESRCH: no process of "pid" can be found
EPERM: no permission
EINVAL: invalid system call parameters
EFAULT: illegal virtual user address
ENOMEM: cannot allocate memory
RESTRICTIONS
We can migrate a page only if it belongs to a single "mm_struct" / PGD,
i.e. it is private to a process or shared with its child processes
cloned via "clone2(CLONE_VM)".
Notes:
- A "major error" prevents us from carrying on the migration, but it is not a
real error for the "victim" application that can continue (it is guaranteed
not to be broken). The pages already migrated are left in their new node.
- Migrating a page shared among other than child processes cloned via
"clone2(CLONEVM)" would require locking all the page owners' PGDs.
I've got serious concerns about locking more than one PGDs:
+ It is not foreseen in the design of the virtual memory management.
+ Obviously, the PGDs have to be "trylock()"-ed in order to avoid dead locks.
However, "trylock()"-ing lots of PGDs, possibly thousands of them, would
lead to starvation problems. A performance enhancement tool consuming so
much in the event of not concluding...
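To make the calling convention concrete, here is a minimal user-mode sketch in
the spirit of the demo programs listed below (see "test/v.c"); the 64 Mbyte
buffer and the target node 1 are arbitrary choices, everything else follows the
interface described above:

    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "page_migrate.h"

    int main(void)
    {
            size_t length = 64 * 1024 * 1024;       /* 64 Mbytes, arbitrary */
            char *buf = malloc(length);
            struct _un_success_count_ counts;

            if (buf == NULL)
                    return 1;
            memset(buf, 0, length);                 /* touch the pages so they exist */

            /* "pid" == 0: operate on the calling process itself */
            if (migrate_virt_addr_range(buf, length, /* node */ 1, &counts, 0) < 0) {
                    perror("migrate_virt_addr_range");
                    return 1;
            }
            printf("migrated %u pages, %u minor failures\n",
                   counts.successful, counts.failed);
            return 0;
    }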
Some figures
------------
One of our customers has an OpenMP benchmark which was used to measure the
machine described above. It uses 1 Gbytes of memory and runs on 16 processors,
on 4 NUMA nodes.
If the benchmark is adapted to our NUMA architecture, then it takes 86 seconds
to complete.
As results are not accepted if obtained by modifying the benchmark in any
way, the best we can do is to use a random or round robin memory allocation
policy. We end up with a locality rate of 25 % and the benchmark executes in 121
seconds.
If we had a zero-overhead migration tool, then - I estimate - it would complete
in 92 seconds (the benchmark starts in a "pessimized" environment, and it takes
time for the locality to ramp up from 25 % to almost 100 %).
Actually it takes 2 to 3 seconds to move 750 Mbytes of memory (on a heavily
loaded machine), reading out the counters of the SNCs and making some quick
decisions take 1 to 2 seconds, and we lose about 10 seconds due to the buggy
SNCs. We end up with 106 seconds.
Some if's
---------
- if the benchmark used more memory, then it would be more expensive to migrate
all of its pages
- if the benchmark ran for longer without changing its memory usage
pattern, then it could spend a greater percentage of its lifetime in a well
localized environment
- if you had a NUMA factor higher than ours, then obviously, you would gain
more in performance by use of the migration service
- if we used Madison processors with 6 Mbytes of cache (twice as much as we have
right now), then the NUMA factor would be masked more efficiently
- if the clock frequency of the processors increases, then you run out of cached
data more quickly and the NUMA factor cuts performance even more
TODOs
------
As I do not have access to machines other than IA64-based ones, any help
testing page migration on other architectures will be appreciated.
I include some demo programs:
.............................
test/ph.c: migrates some of its pages by use of their physical addresses
test/v.c: migrates a part of its virtual address range
test/vmig.c: migrates a part of the virtual address range of "test/victim.c"
test/migstat.c: displays some internal counters if the kernel has been compiled
with "_NEED_STATISTICS_" defined
I'll send the patch in the next letter.
Should the list refuse the patch due to its length, please pick it up at our
anonymous FTP server: ftp://visibull.frec.bull.fr/pub/linux/migration
The patch is against:
patch-2.6.4.-bk4
kdb-v4.3-2.6.3-common-b0
kdb-v4.3-2.6.3-ia64-1
Your remarks will be appreciated.
Zoltan Menyhart
On Fri, Mar 26, 2004 at 10:02:00AM +0100, Zoltan Menyhart wrote:
>
> Migrate pages from one ccNUMA node to another.
> ============================================
>
> 1. Hardware assisted migration
> ..............................
>
We have found that "automatic" migration ends to result in the
system deciding to move the wrong pieces around. Since applications
can be so varied, I would recommend we let the application decide
when it thinks it is beneficial to move a memory range to a nearby
node.
>
> 2. Application driven migration
> ...............................
>
> An application can exploit the forthcoming NUMA APIs to specify its initial
> memory placement policy.
> Yet what if the application wants to change its behavior ?
The placement policy doesn't really fit the bill entirely. We are
currently tracking a problem with the repeatability of a benchmark. We
found that the newer libc we are using changed whether a newly forked
process touches a page before the parent does; as a result, a page
which had been marked COW would, with the old libc, end up on the
child's node for the child and on the parent's node for the parent.
After the update, both copies ended up on the parent's node.
If your syscall would simply do the copy to the destination node
for COW pages, this would have worked terrifically in both cases.
>
> 3. NUMA aware scheduler
> .......................
>
Back to my earlier comment about magic. This is a second tier of
magic. Here we are talking about inferring a reason to migrate based
on memory access patterns, but what if that migration results in
some other process being hurt more than this one is helped?
Honestly, we have beaten on the scheduler quite a bit and the "allocate
memory close to my node" has helped considerably.
One thing that would probably help considerably, in addition to the
syscall you seem to be proposing, would be an addition to the
task_struct. The new field would specify which node to attempt
allocations on. Before doing a fork, the parent would do a
syscall to set this field to the node the child will target. It
would then call fork. The PGDs et al and associated memory, including
the task struct and pages would end up being allocated based upon
that numa node's allocation preference.
What do you think of combining these two items into a single syscall?
>
> User mode interface
> -------------------
>
> This prototype of the page migration service is implemented as a system call,
> the different forms of which are wrapped by use of some small,
> static, inline functions.
>
> NAME
> migrate_ph_pages - migrate pages to another NUMA node
At first, I thought "Wow, this could result in some nice admin tools."
The more I scratch my head on this, the less useful I see it, but
would not argue against it.
> migrate_virt_addr_range - migrate virtual address range to another node
This one sounds good.
Thanks,
Robin Holt
On Fri, 26 Mar 2004 04:39:59 -0600
Robin Holt <[email protected]> wrote:
> One thing that would probably help considerably, in addition to the
> syscall you seem to be proposing, would be an addition to the
> task_struct. The new field would specify which node to attempt
> allocations on. Before doing a fork, the parent would do a
> syscall to set this field to the node the child will target. It
> would then call fork. The PGDs et al and associated memory, including
> the task struct and pages would end up being allocated based upon
> that numa node's allocation preference.
You just described the process policy of NUMA API.
-Andi
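(For illustration, the process policy Andi refers to lets the parent do roughly
the following before fork(); this is a sketch only, using the libnuma names from
the NUMA API patches, and the policy is inherited by the child:)

    #include <numa.h>
    #include <unistd.h>

    /* Sketch only: set a preferred allocation node before fork() so that the
     * child's task structures and the pages it touches end up on that node. */
    pid_t spawn_on_node(int node)
    {
            pid_t pid;

            if (numa_available() < 0)
                    return -1;              /* no NUMA API / no NUMA support */

            numa_set_preferred(node);       /* future allocations prefer "node" */
            pid = fork();
            if (pid != 0)
                    numa_set_localalloc();  /* parent: back to local allocation */
            /* the child (pid == 0) keeps the preferred-node policy and goes on
             * to exec or run its worker */
            return pid;
    }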
Robin Holt wrote:
>
> We have found that "automatic" migration tends to result in the
> system deciding to move the wrong pieces around. Since applications
> can be so varied, I would recommend we let the application decide
> when it thinks it is beneficial to move a memory range to a nearby
> node.
I am not saying it is for every application
(see the paragraph of the "if's").
There are a couple of applications which run for a long time, with
relatively stable memory working sets. And I can help them.
You launch your application with and without migration, and you use it
if you gain enough.
> The placement policy doesn't really fit the bill entirely. We are
> currently tracking a problem with the repeatability of a benchmark. We
> found that the newer libc we are using changed whether a newly forked
> process touches a page before the parent does; as a result, a page
> which had been marked COW would, with the old libc, end up on the
> child's node for the child and on the parent's node for the parent.
> After the update, both copies ended up on the parent's node.
I haven't modified anything in the existing page fault handler,
nor have I changed the placement policy.
With my proposed syscall, you need to specify explicitly where the
pages should go.
> If your syscall would simply do the copy to the destination node
> for COW pages, this would have worked terrifically in both cases.
The COW pages are referenced by more than one PGD (by that of the
parent and by those of its children). As I state in RESTRICTIONS, I skip
these pages.
I think this issue with the COW pages is a fork() - exec()
placement problem; I do not address it with my stuff.
> >
> > 3. NUMA aware scheduler
> > .......................
> >
>
> Back to my earlier comment about magic. This is a second tier of
> magic. Here we are talking about inferring a reason to migrate based
> on memory access patterns, but what if that migration results in
> some other process being hurt more than this one is helped.
>
> Honestly, we have beaten on the scheduler quite a bit and the "allocate
> memory close to my node" has helped considerably.
>
> One thing that would probably help considerably, in addition to the
> syscall you seem to be proposing, would be an addition to the
> task_struct. The new field would specify which node to attempt
> allocations on. Before doing a fork, the parent would do a
> syscall to set this field to the node the child will target. It
> would then call fork. The PGDs et al and associated memory, including
> the task struct and pages would end up being allocated based upon
> that numa node's allocation preference.
>
> What do you think of combining these two items into a single syscall?
I can agree with Robin Holt: it's a NUMA API issue.
I just provide a tool: if someone somehow knows that a piece of memory
would be better placed on another node, I can move it there.
> > NAME
> > migrate_ph_pages - migrate pages to another NUMA node
>
> At first, I thought "Wow, this could result in some nice admin tools."
> The more I scratch my head on this, the less useful I see it, but
> would not argue against it.
We are working on the prototype of a device driver to read out the
"hot page" counters of the n-th Scalable Node Controller
(say: "/dev/snc/n/hotpage").
An "artificial intelligence" can guess what to move and then call this service.
BTW, does someone have a machine with a chipset other than the i82870 ?
Thanks,
Zoltan Menyhart
Szia Zoltan,
I like the approach very much and was hoping that someone would bring
on-demand page migration to Linux.
> - Migrate pages identified by their physical addresses to another NUMA node
You want this only for your "AI" keeping track of the hw counters in
the chipset? I hope you can teach it to keep track of the bandwidth of
all processes on the machine, otherwise it might disturb the processes
more than it helps them... and waste the machine's bandwidth with
migrating pages.
> - Migrate pages of a virtual user address range to another NUMA node
This is good. I'm thinking about the rss/node patches, they would tell
you when you should think about migrating something for a process. My
current usage model would be simpler: for a given mm migrate all pages
currently on node A to node B. But the flexibility of your API will
certainly not remain unused.
...
> BTW, does someone have a machine with a chipset other than the i82870 ?
??? As far as I know SGI, HP, NEC and IBM have all their own NUMA
chipsets for IA64. Was this the question? Are you looking for hardware
counters?
Regards,
Erich
Erich Focht wrote:
>
> Szia Zoltan,
>
> > - Migrate pages identified by their physical addresses to another NUMA node
> You want this only for your "AI" keeping track of the hw counters in
> the chipset? I hope you can teach it to keep track of the bandwidth of
> all processes on the machine, otherwise it might disturb the processes
> more than it helps them... and waste the machine's bandwidth with
> migrating pages.
Szervusz Erich,
I put AI between quotation marks because it is not a real intelligence.
It should not waste much time on being intelligent :-)
At least we have not managed to find a general purpose solution that
could cope with any arbitrary application.
We try to tune parameters like the sampling period / time, how many pages
are checked, when we consider a page "hot", how many % or how many cycles of
distant vs. local access justify the migration...
We try to set up a "profile" for an application.
The launcher of the application takes the profile into account and
the AI evaluates the HW counters according to this profile info.
I think most of the huge applications of our clients will run several
times with different data, but with data of similar behavior.
The clients will have the ability to tune their application profiles.
> > - Migrate pages of a virtual user address range to another NUMA node
> This is good. I'm thinking about the rss/node patches, they would tell
> you when you should think about migrating something for a process. My
> current usage model would be simpler: for a given mm migrate all pages
> currently on node A to node B. But the flexibility of your API will
> certainly not remain unused.
You should migrate only if it is worth the cost of the migration
(private malloc()-ed data, stack, ...).
Do you mean to guess in the kernel which pages to move ?
I need "real information" :-)) to decide: either I ask the HW, or
I wait for the application to tell me its requirements.
> > BTW, does someone have a machine with a chipset other than the i82870 ?
> ??? As far as I know SGI, HP, NEC and IBM have all their own NUMA
> chipsets for IA64. Was this the question? Are you looking for hardware
> counters?
I'd like to know - and someone to try :-)) - how much it costs on another
chip set to measure the page usage statistics, say for 1 Gbytes of memory...
Thanks,
Zoltán
Dave,
Thank you for your remarks.
Sure, I'll do my best to comply with the coding style of the community.
Dave Hansen wrote:
> Before anything else, please take a long look at
> Documentation/CodingStyle. Pay particular attention to the column
> width, indenting, and function sections.
Shall I really stick to the width of 80 characters ?
I have got 89, yet several other files already are much wider.
By the indenting issues, do you mean (not) breaking lines like the ones below ?
if ((pte_val(*pte) & _PFN_MASK) !=
(*p & _PFN_MASK))
(I do have TABs = 8 spaces.)
Cutting the functions into smaller pieces - I'll try to do my best.
> One of the best things about your code is that it uses a lot of
> architecture-independent functions and data structures. The page table
> walks in your patch, for instance, would work on any Linux
> architecture. However, all of this code is in the ia64 arch. Why? Will
> other NUMA architectures not need this page migration functionality?
I made an effort to write it as architecture independent as possible
(I guess only the SAVE_ITC - STORE_DELAY macro pair is to be re-written.)
As I have not even seen any other NUMA machine - I would not post a code that
no one ever tried on the other architectures.
> Also, although it's great
> while you're developing a patch, it's best to try and refrain from
> documenting in comments things that are already a non-tricky part
> of the way that things already work:
> //
> // "pte_page->mapping" points at the victim process'es "mm_struct"
> //
> These comments really just take up space and reduce readability.
This one was for myself :-)
It is difficult to find the golden mean. E.g. if I have not inserted
a comment here:
mm = current->mm;
//
// Actually, there is no need to grab "mm" because it is ours; it won't go
// away in the meantime. As we do not want to ask questions when
// releasing it...
// It is safe just to increment the counter: it is ours.
//
atomic_inc(&mm->mm_users);
I think people could have asked why on Earth I incremented "current->mm->mm_users".
Or another example:
//
// "get_task_mm()" includes "task_lock()" that "nests both inside and outside of
// read_lock(&tasklist_lock)" - as a note in "sched.h" states.
//
As there is no "locking guide", at least I make a note why I think I'm safe.
(I think the main goals are efficiency and quality: how much effort it takes to
add new functionality, to correct a bug or simply to understand the code;
how much chance we have of not crashing the system.
In the old golden days, there were a few gurus who knew the code by heart.
It was efficient for them not to waste time on "useless stuff". Now, as the
Linux community becomes larger and larger, the meaning of efficiency has
changed. I think efficiency should mean how easily people can understand,
add, change and correct things. How can I be efficient if I need to solve
puzzles ? There is no problem with solving puzzles, it is an intellectual
challenge. But how can I know whether some "stuff" is there intentionally or by
chance ? Or whether I have just misunderstood the puzzle ? The quality
requirement argues against puzzles.
Most of my comments are "synchronization points" for those who want to understand
my code and "statements" for those who want to double check whether I've missed
a point.
Another OS that "must not be named" includes thousands of ASSERT()-s and
assert()-s (see also the JFS of Linux), and the lower case assert()-s are kept
even in non-debug mode. Should Linux not include lots of similar "statements" ?)
> I find the comments inside of function and macro calls a bit hard to
> read:
> STORE_DELAY(/* in */ unlock_time, /* out */ new_page_unlock);
I think there is a conceptual problem here, e.g.:
a = 1; b = 2;
Foo(a); Foo_2(&b);
c = a + b; // Is still a == 1 and b == 2 ?
I just read through the code: o.k., we call a "functionality" Foo(a) - assuming
it has a speaking name - and I can see more or less what it does; but is "a"
still equal to 1 on the 3rd line ?
I don't know. If Foo() is a function, then "a" has not been changed; if Foo() is
a macro - who knows ? As far as "b" and Foo_2() are concerned, the "&" is more
than a warning that "b" gets modified.
I have to warn the reader "by hand" that an argument gets modified.
Writing it the way I did is more "space efficient" than writing:
/* Warning: "new_page_unlock" is going to be changed */
STORE_DELAY(unlock_time, new_page_unlock);
> I think every VM hacker has a couple of these functions stashed around
> in various patches for debugging, but they're not really something that
> belongs in the kernel. In general, you should try to remove debugging
> code before posting a patch.
If you want to print out a VMA, where is the debug service for that ?
I agree this is not the best place for it in my code. As the print-out routines
should be kept coherent with the definitions of the respective structures, they
should be some static functions in the .h files.
I'll remove them, but we still need some official facility.
I stopped my source navigator after it had dug up the 1000th "...DEBUG..." :-)
> + case _SIZEOF_STATISTICS_:
> + rc = *(long long *) &_statistics_sizes;
> + break;
>
> I'm sure the statistics are very important, but they're a bit
> intrusive. Can you separate that code out into a file by itself? Are
> they even something that a user would want when they're running
> normally, or is it another debugging feature?
As our "AI" is very much experimental, I really do not know how much
this information is useful. Neither the HW assisted nor the application
driven version knows how many pages are movable, how many of them cannot
be handled at all by my migration mechanism (e.g. you have a SysV SHM).
Other error information or knowing how much it costs can be a feed back
for the "AI". Probably it is useful wen we tune the profile for an
application. We need more experience.
> +migrate_virt_addr_range(
> ...
> + u.ll = syscall(__NR_page_migrate, _VA_RANGE_MIGRATE_,
> + address, length, node, pid);
> ...
> +}
> Making syscalls from inside of the kernel is strongly discouraged. I'm
> not sure what you're trying to do there. You might want to look at some
> existing code like sys_mmap() vs do_mmap().
The wrapper functions like this are after an "#if !defined(__KERNEL__)"
As "types.h" also contains "#ifdef __KERNEL__"...
> +#define __VA(pgdi, pmdi, ptei) (((pgdi) >> (PAGE_SHIFT - 6)) << 61 | \
> + ((pgdi) & ((PTRS_PER_PGD >> 3) - 1)) << PGDIR_SHIFT | \
> + (pmdi) << PMD_SHIFT | (ptei) << PAGE_SHIFT)
>
> There are magic numbers galore in this macro. Would this work?
Well, I just copied what the official "pgd_index()" macro and its fellows do:
pgd_index (unsigned long address)
{
unsigned long region = address >> 61;
unsigned long l1index = (address >> PGDIR_SHIFT) & ((PTRS_PER_PGD >> 3) - 1);
return (region << (PAGE_SHIFT - 6)) | l1index;
}
Me too, I'd like to see more symbolic constants...
But I don't want to be more catholic than the Pope :-)
> #define __VA(pgdi, pmdi, ptei) ((pgdi)*PGDIR_SIZE + \
> (pmdi)*PMD_SIZE + \
> (ptei)*PAGE_SIZE)
> If ia64 doesn't have the _SIZE macros, you can just copy them from
> include/asm-i386/pgtable*.h
Unfortunately, it is not as simple as that.
We've got 5 user regions out of 8 regions, each of size 0x2000000000000000
(2^61 bytes; I don't know what this is called). Only the first 16 Tbytes of
each region are mapped by the PGD-PMD-PTE structure. Then we've got a large
hole that cannot be taken into account by your macro, i.e. the mapped addresses
run 0 .. 0x00000fffffffffff, then 0x2000000000000000 .. 0x20000fffffffffff,
then from 0x4000000000000000, and so on.
Apart from using some nice macros with speaking names instead of "<< 61",
I cannot see any simpler way to do the conversion (as far as IA64 is concerned).
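Just to illustrate the kind of "speaking names" I mean (the constant names below
are invented for the example; they are not in any header), the macro could read:

    #define _REGION_SHIFT           61                      /* bits 61..63: region number   */
    #define _REGION_BITS_IN_PGDI    (PAGE_SHIFT - 6)        /* where pgd_index() keeps them */

    #define __VA(pgdi, pmdi, ptei)                                          \
            (((pgdi) >> _REGION_BITS_IN_PGDI) << _REGION_SHIFT |            \
             ((pgdi) & ((PTRS_PER_PGD >> 3) - 1)) << PGDIR_SHIFT |          \
             (pmdi) << PMD_SHIFT |                                          \
             (ptei) << PAGE_SHIFT)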
> --- 2.6.4.ref/mm/rmap.c Tue Mar 16 10:18:17 2004
> +++ 2.6.4.mig4/mm/rmap.c Thu Mar 25 09:00:13 2004
> ...
> -struct pte_chain {
> - unsigned long next_and_idx;
> - pte_addr_t ptes[NRPTE];
> -} ____cacheline_aligned;
> --- 2.6.4.ref/include/asm-ia64/rmap-locking.h
>
> Exposing the VM internals like that probably isn't going to be
> acceptable. Why was this necessary?
Previously, I was more ambitious and wanted to handle the non-direct-mapped
pages, too. I'll put it back where it was.
For the rest: it's YES.
> There don't appear to be any security checks in your syscall. Should
> all users be allowed to migrate memory around at will from any pid?
Well, whoever migrates someone else's pages cannot read or write the victim's
data, nor can they break it. For more security, I can add something like:
whoever can kill an application can migrate it, too.
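Such a "whoever can kill it can migrate it" check is small; a sketch (not part
of the posted patch) along the lines of the signal permission test, using the
2.6-era task_struct fields, could look like:

    #include <linux/sched.h>
    #include <linux/capability.h>

    static int may_migrate(struct task_struct *victim)
    {
            /* Allow the privileged, and callers whose (effective) uid matches
             * the victim's, much as the kill() permission check does. */
            if (capable(CAP_SYS_NICE))
                    return 0;
            if (current->euid == victim->suid || current->euid == victim->uid ||
                current->uid  == victim->suid || current->uid  == victim->uid)
                    return 0;
            return -EPERM;
    }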
> By not modifying a single line in the existing VM path, your patch
> simply duplicates functionality from that existing code, which I'm not
> sure is any better.
> I think there's a lot of commonality with what the swap code, NUMA page
> migration, and memory removal have to do. However, none of them share
> any code today. I think all of the implementations could benefit from
> making them a bit more generic.
I agree about not duplicating code. I can give the code written by
Hirokazu Takahashi a try and see if it is sufficiently efficient for me.
Thanks,
Zoltán
Dave Hansen wrote:
> > > >Notes: "pte" can be NULL if I do not know it apriori
> > > > I cannot release "mm->page_table_lock" otherwise I have to re-scan the "mm->pgd".
> > >
> > > Re-scan policy would be much better since migrating pages is heavy work.
> > > I don't think that holding mm->page_table_lock for long time would be
> > > good idea.
> >
> > Re-scanning is "cache killer", at least on IA64 with huge user memory size.
> > I have more than 512 Mbytes user memory and its PTEs do not fit into the L2 cache.
> >
> > In my current design, I have the outer loops: PGD, PMD and PTE walking; and once
> > I find a valid PTE, I check it against the list of max. 2048 physical addresses as
> > the inner loop.
> > I reversed them: walking through the list of max. 2048 physical addresses as outer
> > loop and the PGD - PMD - PTE scans as inner loops resulted in 4 to 5 times slower
> > migration.
>
> Could you explain where you're getting these "magic numbers?" I don't
> quite understand the significance of 2048 physical addresses or 512 MB
> of memory.
I use an IA64 machine with a page size of 16 Kbytes.
I allocate a single page (for simplicity) when I copy in from user space
the list of the addresses (2048 of them fit in a page: 16 Kbytes / 8 bytes
per "phaddr_t").
One page of PTEs maps 32 Mbytes (2048 PTEs x 16 Kbytes).
16 such pages (which map 512 Mbytes) eat up my L2 cache of 256 Kbytes
(together with the other data, the stack, the code, etc.).
Sure, I have an L3 cache of at least 3 Mbytes (newer CPUs have bigger L3
caches), but it is much slower than the L2.
> Zoltan, it appears that we have a bit of an inherent conflict with how
> much CPU each of you is expecting to use in the removal and migration
> cases. You're coming from a HPC environment where each CPU cycle is
> valuable, while the people trying to remove memory are probably going to
> be taking CPUs offline soon anyway, and care a bit less about how
> efficient they're being with CPU and cache resources.
Yes I see.
> Could you be a bit more explicit about how expensive (cpu-wise) these
> migrate operations can be?
My machine consists of 4 "Tiger boxes" connected together by a pair of Scalability
Port Switches. A "Tiger box" is built around a Scalable Node Controller (SNC), and
includes 4 Itanium-2 processors and some Gbytes of memory.
The CPU clock runs at 1.3 GHz. Local memory access costs 200 nanosec.
The NUMA factor is 1 : 2.25.
An OpenMP benchmark uses 1 Gbytes of memory and runs on 16 processors,
on 4 NUMA nodes.
If the benchmark is adapted to our NUMA architecture (local allocations by hand),
then it takes 86 seconds to complete.
As results are not accepted if obtained by modifying the benchmark in any
way, the best we can do is to use a random or round robin memory allocation
policy. We end up with a locality rate of 25 % and the benchmark executes in 121
seconds.
If we had a zero-overhead migration tool, then - I estimate - it would complete
in 92 seconds (the benchmark starts in a "pessimized" environment, and it takes
time for the locality to ramp up from 25 % to almost 100 %).
Actually it takes 2 to 3 seconds to move 750 Mbytes of memory (on a heavily
loaded machine), reading out the counters of the SNCs and making some quick
decisions take 1 to 2 seconds, and we lose about 10 seconds due to the buggy
SNCs. We end up with 106 seconds.
(I keep holding the PGD lock; there is no re-scan while moving the max. 2048 pages.)
Inverting the inner and outer loops - i.e. for each page I scan the PGD-PMD-PTE,
and the PGD lock is released between the scans - I get 10 to 12 seconds
spent on the migration (not counting the 1 to 2 seconds of the "AI" + ~10 seconds
due to the buggy SNCs).
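To make the loop order concrete, here is a sketch - not the patch itself - of
the faster nesting, using the generic 2.6 page table helpers: the page tables
are walked once (outer loops), and each valid PTE is checked against the small
table of physical addresses (inner loop):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    static unsigned long
    count_candidate_ptes(struct mm_struct *mm, unsigned long start,
                         unsigned long end, const phaddr_t *table, size_t length)
    {
            unsigned long addr, found = 0;
            size_t i;

            spin_lock(&mm->page_table_lock);
            for (addr = start; addr < end; addr += PAGE_SIZE) {
                    pgd_t *pgd = pgd_offset(mm, addr);
                    pmd_t *pmd;
                    pte_t *pte;

                    if (pgd_none(*pgd))
                            continue;
                    pmd = pmd_offset(pgd, addr);
                    if (pmd_none(*pmd))
                            continue;
                    pte = pte_offset_map(pmd, addr);
                    if (pte_present(*pte))
                            for (i = 0; i < length; i++)    /* inner loop: <= 2048 entries */
                                    if (pte_pfn(*pte) == (table[i] >> PAGE_SHIFT))
                                            found++;
                    pte_unmap(pte);
            }
            spin_unlock(&mm->page_table_lock);
            return found;
    }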
Thanks,
Zoltán