2004-03-26 09:21:39

by Zoltan Menyhart

Subject: Migrate pages from a ccNUMA node to another - patch

diff -Nru 2.6.4.ref/arch/ia64/kernel/acpi.c 2.6.4.mig4/arch/ia64/kernel/acpi.c
--- 2.6.4.ref/arch/ia64/kernel/acpi.c Tue Mar 16 10:18:04 2004
+++ 2.6.4.mig4/arch/ia64/kernel/acpi.c Thu Mar 25 08:58:09 2004
@@ -457,6 +457,7 @@
for (i = 0; i < MAX_PXM_DOMAINS; i++) {
if (pxm_bit_test(i)) {
pxm_to_nid_map[i] = numnodes;
+ node_set_online(numnodes);
nid_to_pxm_map[numnodes++] = i;
}
}
diff -Nru 2.6.4.ref/arch/ia64/kernel/entry.S 2.6.4.mig4/arch/ia64/kernel/entry.S
--- 2.6.4.ref/arch/ia64/kernel/entry.S Tue Mar 16 10:18:04 2004
+++ 2.6.4.mig4/arch/ia64/kernel/entry.S Thu Mar 25 08:58:09 2004
@@ -1518,7 +1518,11 @@
data8 sys_ni_syscall
data8 sys_ni_syscall
data8 sys_ni_syscall // 1275
+#if defined(CONFIG_NUMA)
+ data8 sys_page_migrate // 1276: Migrate pages to another NUMA node
+#else
data8 sys_ni_syscall
+#endif
data8 sys_ni_syscall
data8 sys_ni_syscall
data8 sys_ni_syscall
diff -Nru 2.6.4.ref/arch/ia64/mm/Makefile 2.6.4.mig4/arch/ia64/mm/Makefile
--- 2.6.4.ref/arch/ia64/mm/Makefile Tue Mar 16 10:18:04 2004
+++ 2.6.4.mig4/arch/ia64/mm/Makefile Thu Mar 25 08:58:15 2004
@@ -5,7 +5,7 @@
obj-y := init.o fault.o tlb.o extable.o

obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_NUMA) += numa.o migrate.o
obj-$(CONFIG_DISCONTIGMEM) += discontig.o
ifndef CONFIG_DISCONTIGMEM
obj-y += contig.o
diff -Nru 2.6.4.ref/arch/ia64/mm/migrate.c 2.6.4.mig4/arch/ia64/mm/migrate.c
--- 2.6.4.ref/arch/ia64/mm/migrate.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/arch/ia64/mm/migrate.c Thu Mar 25 08:58:15 2004
@@ -0,0 +1,1274 @@
+/*
+ * Migrate pages from a ccNUMA node to another.
+ * ============================================
+ *
+ * Version 0.1, 31st of March 2004
+ * By Zoltan Menyhart, Bull S.A. <[email protected]>
+ * The usual GPL applies.
+ *
+ * This is Linux, and explanatory comments / error messages are seen
+ * as a sign of weakness :-)))
+ *
+ * O.K. check out "migrate.txt" and "page_migrate.h".
+ */
+
+
+#include <linux/mm.h>
+#include <linux/smp_lock.h>
+#include <linux/pagemap.h>
+#include <linux/rmap-locking.h>
+#include <linux/swap.h>
+#include <linux/vmalloc.h>
+#include <asm/rmap.h>
+#include <asm/tlbflush.h>
+#include <asm/page_migrate.h>
+#if defined(_TEST_)
+#include <linux/delay.h> // For "ia64_get_itc()"
+#endif
+
+#if !defined(CONFIG_DISCONTIGMEM) || !defined(CONFIG_NUMA)
+#error "That's a NUMA stuff"
+#endif
+
+
+/*
+ * Type of a virtual address.
+ */
+typedef unsigned long vaddr_t; // Pointers converted to this type
+
+
+#if defined(_TEST_)
+
+// Set the bits - as defined below - for some kernel messages.
+unsigned int _pr_flag_;
+
+#define PRINT_page 1 // Dump "struct page"-s
+#define PRINT_mm 2 // Dump "struct mm_struct"-s
+#define PRINT_vma 4 // Dump "struct vm_area_struct"-s
+#define PRINT_pte 8 // Show PTE-s, r-maps
+#define PRINT_errors 0x10
+#define PRINT_etc 0x20
+#define PRINT_pgd 0x40 // Show PGD scan
+
+#define PRINT(args...) do { if (_pr_flag_) printk(args); } while (0)
+#define PRINT_ERR(args...) \
+ do { if (_pr_flag_ & PRINT_errors) printk(args); } while (0)
+#define PRINT_ETC(args...) \
+ do { if (_pr_flag_ & PRINT_etc) printk(args); } while (0)
+#define PRINT_PGD(args...) \
+ do { if (_pr_flag_ & PRINT_pgd) printk(args); } while (0)
+
+static const char dest_not_online[] = "Destination node not online\n";
+static const char no_vma[] = "Cannot find VMA for address 0x%lx\n";
+static const char illegal_pid[] = "Illegal PID\n";
+static const char inv_n_addresses[] = "Invalid number of addresses";
+static const char ill_va_alias[] = "v-addr alias in range: 0x%p...0x%p\n";
+static const char no_momory[] = "No more memory\n";
+static const char ill_user_buff[] = "Illegal user buffer address\n";
+
+void dump_mm(const struct mm_struct * const);
+void dump_vma(const struct vm_area_struct * const);
+void dump_page(const char * const, const struct page * const);
+void dump_pte_stuff(const pte_t * const);
+phaddr_t gimme_an_address(const caddr_t);
+
+#define STATIC
+#define INLINE
+
+#else // #if defined(_TEST_)
+
+#define PRINT(args...) do { } while (0)
+#define PRINT_ERR(args...) do { } while (0)
+#define PRINT_ETC(args...) do { } while (0)
+#define PRINT_PGD(args...) do { } while (0)
+
+#define dump_mm(vma) do { } while (0)
+#define dump_vma(vma) do { } while (0)
+#define dump_page(text, page) do { } while (0)
+#define dump_pte_stuff(ptep) do { } while (0)
+
+#define STATIC static
+#define INLINE inline
+
+#endif // #if defined(_TEST_)
+
+
+STATIC INLINE long long
+migr_virt_addr_range(const caddr_t, const size_t, const int, const pid_t);
+
+STATIC INLINE long long
+migr_vaddr_range_2(vaddr_t, const vaddr_t, const int, struct mm_struct * const);
+
+STATIC INLINE int
+migr_1_page_by_pte(pte_t * const, const int, struct mm_struct * const);
+
+STATIC INLINE long long
+batch_migrate(const caddr_t, size_t, const int, const pid_t);
+
+STATIC struct mm_struct *
+look_up_mm(const pid_t);
+
+int
+check_migr_1_page_part_2(struct page * const, struct page * const,
+ struct mm_struct * const, pte_t * const);
+
+int
+get_pages_if_valid(phaddr_t * const, unsigned int);
+
+STATIC INLINE int
+check_pages_if_in_pgd(phaddr_t * const, const size_t, const struct mm_struct * const);
+
+STATIC INLINE int
+check_migrate_1_page(const phaddr_t, const int, struct mm_struct * const);
+
+
+// These are the flags which are copied for the new page:
+#define FLAG_MASK (PG_referenced | PG_uptodate | PG_dirty | PG_active | \
+ PG_highmem | PG_arch_1 | PG_private | PG_writeback | \
+ PG_nosave | PG_mappedtodisk | PG_reclaim | PG_compound)
+
+
+#if defined(_NEED_STATISTICS_)
+
+/*
+ * Statistics are accessed in a non atomic way. Who cares? Just some statistics :-)
+ */
+STATIC struct _statistics_ _statistics;
+STATIC struct _statistics_size_ _statistics_sizes = {sizeof _statistics, MAX_NUMNODES};
+
+#define DECLARE_ITC_VAR(var) unsigned long var
+#define SAVE_ITC(var) var = ia64_get_itc()
+#define STORE_DELAY(var, destination) _statistics.t.destination += ia64_get_itc() - var
+#define COUNT(what) _statistics.c.what++
+#define ERROR_CNT(what) _statistics.e.what++
+#define ERROR_CNT_ADD(var, delta) _statistics.e.var += delta
+
+STATIC INLINE int page_migrate_statistics(const caddr_t, const int);
+
+#else
+
+#define DECLARE_ITC_VAR(var)
+#define SAVE_ITC(var) do { } while (0)
+#define STORE_DELAY(var, destination) do { } while (0)
+#define COUNT(what) do { } while (0)
+#define ERROR_CNT(what) do { } while (0)
+#define ERROR_CNT_ADD(var, delta) do { } while (0)
+
+#endif // #if defined(_NEED_STATISTICS_)
+
+
+/*
+ * Migrate pages from a NUMA node to another (and some other minor services).
+ * (See "migrate.txt" and "page_migrate.h".)
+ *
+ * As usual, "-Exxx" returned on errors.
+ */
+asmlinkage long long
+sys_page_migrate(const int cmd, const caddr_t address, const size_t length,
+ const int node, const pid_t pid)
+{
+ long long rc;
+ DECLARE_ITC_VAR(time); // Total time for "sys_page_migrate()"
+
+ SAVE_ITC(/* out */ time);
+ PRINT("\nsys_page_migrate(%d, 0x%p, 0x%lx, %d, %d): pid = %d\n",
+ cmd, address, length, node, pid, current->pid);
+ switch (cmd){
+ //
+ // Migrate some pages from a NUMA node to another.
+ //
+ case _PHADDR_BATCH_MIGRATE_:
+ if (!node_online(node)){
+ PRINT_ERR(dest_not_online);
+ ERROR_CNT(bad_request);
+ rc = -ENODEV;
+ break;
+ }
+ if (length > PAGE_SIZE / sizeof(phaddr_t)){
+ PRINT_ERR(inv_n_addresses);
+ ERROR_CNT(bad_request);
+ rc = -EINVAL;
+ break;
+ }
+ rc = batch_migrate(/* user buffer */ address, /* buffer */ length,
+ node, pid);
+ break;
+ //
+ // Migrate virtual address range.
+ //
+ case _VA_RANGE_MIGRATE_:
+ if (!node_online(node)){
+ PRINT_ERR(dest_not_online);
+ ERROR_CNT(bad_request);
+ rc = -ENODEV;
+ break;
+ }
+ //
+ // Some architectures do not decode all the MSB-s of virtual addresses
+ // for the PGD, PMD and PTE indices, i.e. they have got holes or aliases
+ // in the virtual address space. Make sure that "length" does not span
+ // over virtual address holes nor create an illegal alias to an
+ // otherwise valid address.
+ //
+ if (__IS_VA_ALIAS((vaddr_t) address, length)){
+ PRINT_ERR(ill_va_alias, address, address + length);
+ ERROR_CNT(non_existent_addr);
+ rc = -EFAULT;
+ break;
+ }
+ rc = migr_virt_addr_range(/* user virtual */ address,
+ /* address range */ length, node, pid);
+ break;
+
+#if defined(_NEED_STATISTICS_)
+ case _STATISTICS_:
+ rc = page_migrate_statistics(/* user buffer */ address,
+ /* ? clear statistics ? */ length != 0);
+ break;
+ case _SIZEOF_STATISTICS_:
+ rc = *(long long *) &_statistics_sizes; // Yeh, I know...
+ break;
+#endif
+#if defined(_TEST_)
+ case _GIMME_AN_ADDRESS_:
+ rc = (long long) gimme_an_address(/* user virtual */ address);
+ break;
+#endif
+ default:
+ ERROR_CNT(bad_request);
+ rc = -EINVAL;
+ break;
+ }
+ STORE_DELAY(/* in */ time, /* out */ total);
+ return rc;
+}
+
+
+/*
+ * Migrate virtual address range of a process.
+ *
+ * Arguments: address: Starting virtual address in a process's address space
+ * length: Length of the address range to be migrated
+ * node: Destination NUMA node
+ * pid: ID of the victim process, "0" means myself
+ *
+ * Returns: On (partial) success, the number of the pages actually migrated is
+ * returned (in form of "struct _un_success_count_").
+ * As usual, "-Exxx" returned on errors.
+ */
+STATIC INLINE long long
+migr_virt_addr_range(const caddr_t address, const size_t length, const int node,
+ const pid_t pid)
+{
+ const vaddr_t ulimit = (vaddr_t) address + length;
+ struct mm_struct *mm;
+ long long rc;
+ struct vm_area_struct *beg_vma;
+ DECLARE_ITC_VAR(vma_time); // Time for "find_vma()"
+ DECLARE_ITC_VAR(mmap_sem); // Time for "down_read(&mm->mmap_sem)"
+ DECLARE_ITC_VAR(pgd_lock); // "spin_lock(&mm->page_table_lock)"
+ DECLARE_ITC_VAR(pgd_unlock); // "spin_unlock(&mm->page_table_lock)"
+
+ if (pid != 0 && pid != current->pid){
+ //
+ // Look up the "mm_struct" belonging to the process ID.
+ //
+ if ((mm = look_up_mm(pid)) == NULL){
+ PRINT_ERR(illegal_pid);
+ ERROR_CNT(bad_request);
+ return -ESRCH;
+ }
+ //
+ // On success, "mm->mm_users" got incremented to make sure that
+ // "mm_struct" does not go away.
+ //
+ } else {
+ mm = current->mm;
+ //
+ // Actually, there is no need to grab "mm": it is ours and won't go
+ // away in the meantime. However, we do not want a special case in
+ // the release path when dropping the reference later on.
+ // It is safe just to increment the counter: it is ours.
+ //
+ atomic_inc(&mm->mm_users);
+ }
+ SAVE_ITC(/* out */ mmap_sem);
+ down_read(&mm->mmap_sem); // Protect the VMA list
+ STORE_DELAY(/* in */ mmap_sem, /* out */ mmap_sem);
+ dump_mm(mm);
+ //
+ // Check if the starting virtual "address" is valid.
+ //
+ SAVE_ITC(/* out */ vma_time);
+ beg_vma = find_vma(mm, (vaddr_t) address); // Look up the first VMA for
+ // which "address < ->vm_end"
+ STORE_DELAY(/* in */ vma_time, /* out */ find_vma);
+ if (beg_vma == NULL || beg_vma->vm_start > (vaddr_t) address){
+ if (beg_vma != NULL)
+ dump_vma(beg_vma);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ PRINT_ERR(no_vma, (vaddr_t) address);
+ ERROR_CNT(non_existent_addr);
+ return -EFAULT;
+ }
+ //
+ // It is safe to start walking the PGD, the PMD and the PTE at "address".
+ //
+ dump_vma(beg_vma);
+ //
+ // We need the page table lock to synchronize with "kswapd"
+ // and the SMP-safe atomic PTE updates.
+ //
+ SAVE_ITC(/* out */ pgd_lock);
+ spin_lock(&mm->page_table_lock);
+ STORE_DELAY(/* in */ pgd_lock, /* out */ pgd_lock);
+ //
+ // Look up pages in the PGD and migrate them one by one.
+ //
+ rc = migr_vaddr_range_2((vaddr_t) address & PAGE_MASK,
+ /* round up */ PAGE_ALIGN(ulimit), node, mm);
+ //
+ // Let the others complete the page fault handler code. They will find the
+ // condition "someone has already installed the PTE" to be TRUE.
+ //
+ SAVE_ITC(/* out */ pgd_unlock);
+ spin_unlock(&mm->page_table_lock);
+ STORE_DELAY(/* in */ pgd_unlock, /* out */ pgd_unlock);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return rc;
+}
+
+
+/*
+ * Migrate virtual address range belonging to a PGD.
+ *
+ * Arguments: address: Starting virtual address in a process's address space
+ * ulimit: The address range is below this upper limit
+ * node: Destination NUMA node
+ * mm: -> victim "mm_struct"
+ *
+ * Returns: On (partial) success, the number of the pages actually migrated is
+ * returned in "struct _un_success_count_".
+ * As usual, "-Exxx" returned on errors.
+ *
+ * Notes: - We've already checked that "[address...ulimit)" is inside the allowed
+ * user virtual address range.
+ * - Caller has to hold "mm->mmap_sem" for read and "mm->page_table_lock".
+ */
+STATIC INLINE long long
+migr_vaddr_range_2(vaddr_t address, const vaddr_t ulimit, const int node,
+ struct mm_struct * const mm)
+{
+ unsigned long g, m, e; // PGD, PMD and PTE indices
+ const pgd_t *pgd;
+ const pmd_t *pmd;
+ pte_t *pte, *pte0;
+ int rc;
+ struct _un_success_count_ count = {0, 0};
+ DECLARE_ITC_VAR(pgd_scan_t); // PGD scan time
+
+ //
+ // We've already checked that it is safe to start walking the PGD, the PMD and the
+ // PTE at "address". We've also checked that "[address...ulimit)" does not span
+ // over virtual address holes nor create an illegal alias to an otherwise
+ // valid address.
+ //
+ g = pgd_index(address); // PGD scan starts here
+ m = pmd_index(address); // The 1st PMD scan starts here
+ e = pte_index(address); // The 1st PTE scan starts here
+ //
+ // Check the user pages only, starting at the PTE corresponding to "address".
+ // Note: "mm->pgd" is an identity mapped virtual address.
+ //
+ SAVE_ITC(/* out */ pgd_scan_t);
+ for (pgd = mm->pgd + g; address < ulimit && g < USER_PTRS_PER_PGD;
+ // Other than the 1st PMD scans start at 0 index
+ m = 0, g++, pgd++){
+ PRINT_PGD("address: 0x%016lx pgd: 0x%p ", address, pgd);
+ PRINT_PGD("g: 0x%lx m: 0x%lx e: 0x%lx\n", g, m, e);
+ PRINT_PGD("__VA():\t 0x%016lx\n", __VA(g, m, e));
+ //
+ // "pgd" contains an identity mapped virtual address.
+ // Migration tolerates holes in the virtual address space.
+ //
+ if (pgd_none(*pgd) || pgd_bad(*pgd)){
+ address &= ~(PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE - 1);
+ address += PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE;
+ continue;
+ }
+ //
+ // "*pgd" is a physical address.
+ //
+ for (pmd = pmd_offset(pgd, 0) + m; m < PTRS_PER_PMD && address < ulimit;
+ // Other than the 1st PTE scans start at 0 index
+ e = 0, m++, pmd++){
+ //
+ // "pmd" contains an identity mapped virtual address.
+ // Migration tolerates holes in the virtual address space.
+ //
+ if (pmd_none(*pmd) || pmd_bad(*pmd)){
+ address &= ~(PTRS_PER_PTE * PAGE_SIZE - 1);
+ address += PTRS_PER_PTE * PAGE_SIZE;
+ continue;
+ }
+ //
+ // "*pmd" is a physical address.
+ //
+ pte0 = pte_offset_map(pmd, 0);
+ //
+ // "pte0" contains some kind of virtual address of the
+ // beginning of a PTE page.
+ //
+ for (pte = pte0 + e; e < PTRS_PER_PTE && address < ulimit;
+ address += PAGE_SIZE, e++, pte++){
+ if (!pte_present(*pte))
+ continue;
+ //
+ // We've found a page... Let's move it.
+ //
+ PRINT("\nVirtual addr:\t0x%016lx\n", __VA(g, m, e));
+ STORE_DELAY(/* in */ pgd_scan_t, /* out */ pgd_scan);
+ if ((rc = migr_1_page_by_pte(pte, node, mm)) < 0){
+ pte_unmap(pte0);
+ return rc;
+ }
+ SAVE_ITC(/* out */ pgd_scan_t);
+ if (rc > 0)
+ count.successful++;
+ else
+ count.failed++;
+ }
+ pte_unmap(pte0);
+ }
+ }
+ STORE_DELAY(/* in */ pgd_scan_t, /* out */ pgd_scan);
+ return *(long long *) &count; // Yeh, I know...
+}
+
+
+/*
+ * Common part of checking & migrating the pages one by one.
+ *
+ * Arguments: src_node: Source NUMA node
+ * old_p: -> old page structure
+ * node: Destination NUMA node
+ * mm: -> victim "mm_struct"
+ * pte: -> PTE of the page to be moved
+ *
+ * Returns: 1: Migration O. K.
+ * 0: Minor error, no actual migration has been done
+ * -Exxx: Catastrophic error
+ *
+ * Notes: - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ * - The old page is "get_page()"-ed on entry to meke sure it does not go
+ * away in the mean time - on return it gets "put_page()"-ed.
+ */
+STATIC int
+common_check_migrate_1_page(const int src_node, struct page * const old_p,
+ const int node, struct mm_struct * const mm, pte_t * const pte)
+{
+ struct page *new_p;
+ int rc;
+ DECLARE_ITC_VAR(alloc_time); // Time for "vmalloc()"
+ DECLARE_ITC_VAR(lock_time); // Time for "lock_page()"
+ DECLARE_ITC_VAR(unlock_time); // Time for "unlock_page()"
+ DECLARE_ITC_VAR(free_time); // "__free_pages()", "page_cache_release()"
+
+ //
+ // Allocate the new page in advance. It is less dangerous
+ // - to have a page "floating around" and then take locks
+ // than
+ // - to acquire some locks (e.g. to be able to check the conditions)
+ // and then allocate the page
+ //
+ // Do not insist on allocating the page...
+ //
+ SAVE_ITC(/* out */ alloc_time);
+ new_p = alloc_pages_node(node, GFP_HIGHUSER | __GFP_NORETRY, 0);
+ STORE_DELAY(/* in */ alloc_time, /* out */ page_alloc);
+ if (new_p == NULL){
+ put_page(old_p);
+ PRINT_ERR("No more memory on node %d\n", node);
+ ERROR_CNT(no_memory);
+ return -ENOMEM;
+ }
+ SAVE_ITC(/* out */ lock_time);
+ lock_page(old_p);
+ STORE_DELAY(/* in */ lock_time, /* out */ page_lock);
+
+ //
+ // Would be too long to do everything here.
+ //
+ rc = check_migr_1_page_part_2(old_p, new_p, mm, pte);
+
+ SAVE_ITC(/* out */ unlock_time);
+ unlock_page(old_p);
+ STORE_DELAY(/* in */ unlock_time, /* out */ page_unlock);
+ PRINT("check_migr_1_page_part_2() returned: %d\n", rc);
+ if (rc == 0){
+ //
+ // The old page was "lru_cache_add_active()"-ed e.g. in
+ // "do_anonymous_page()". As on entry the old page was again
+ // "get_page()"-ed, its reference counter is at least 2 right now.
+ //
+ page_cache_release(old_p);
+ _statistics.count[src_node][node]++;
+ } else{
+ SAVE_ITC(/* out */ free_time);
+ __free_pages(new_p, 0);
+ STORE_DELAY(/* in */ free_time, /* out */ page_free);
+ }
+ //
+ // On success, "put_page()" sets free the old page
+ // (unless someone else got hold of it in the mean time).
+ //
+ SAVE_ITC(/* out */ free_time);
+ put_page(old_p);
+ STORE_DELAY(/* in */ free_time, /* out */ page_free);
+ return rc == 0 ? 1 : 0;
+}
+
+
+/*
+ * Migrate a page identified by its PTE.
+ *
+ * Arguments: pte: -> PTE of the page to be moved
+ * node: Destination NUMA node
+ * mm: -> victim "mm_struct"
+ *
+ * Returns: 1: Success
+ * 0: We cannot cope with this page (it is valid, though)
+ * -Exxx: Fatal errors
+ *
+ * Note: "mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ */
+STATIC INLINE int
+migr_1_page_by_pte(pte_t * const pte, const int node, struct mm_struct * const mm)
+{
+ const phaddr_t old_addr = pte_val(*pte) & _PFN_MASK;
+ const int src_node = paddr_to_nid(old_addr);
+ struct page * const old_p = pfn_to_page(old_addr >> PAGE_SHIFT);
+
+ dump_page("\nOld", old_p);
+ if (node == src_node){
+ PRINT_ETC("Old ph adr:\t0x%016llx old node: %d new node: %d\n",
+ old_addr, src_node, node);
+ return 1; // Done :-)
+ }
+ //
+ // Actually, there is no need to grab the old page because it is sure that it
+ // has been "get_page()"-ed before and we still keep "->page_table_lock".
+ // We are going to invoke "common_check_migrate_1_page()" that is used by the
+ // physical address driven migration, too. This latter - not knowing in advance
+ // whom a page belongs to and what "->page_table_lock" is to take - needs to
+ // grab the old page.
+ // Invoking the common service requires us to do the same.
+ //
+ get_page(old_p); // Should we call "page_cache_get()" ?
+ //
+ // The old page will be "put_page()"-ed.
+ //
+ return common_check_migrate_1_page(src_node, old_p, node, mm, pte);
+}
+
+
+/*
+ * Migrate some pages identified by their physical address from a NUMA node to another.
+ *
+ * Arguments: table: -> the user buffer containing the physical addresses of
+ * the pages to be migrated.
+ * Max. "PAGE_SIZE / sizeof(phaddr_t *)" of them can be
+ * mifgrated at once.
+ * n: Number of the physical page addresses
+ * node: Destination NUMA node
+ * pid: Pages are assumed to belong to this process
+ *
+ * Returns: On (partial) success, the number of the pages actually migrated is
+ * returned (in form of "struct _un_success_count_").
+ * As usual, "-Exxx" returned on errors.
+ */
+STATIC INLINE long long
+batch_migrate(const caddr_t table, size_t n, const int node, const pid_t pid)
+{
+ int rc;
+ phaddr_t *p, *bp;
+ struct mm_struct *mm;
+ struct _un_success_count_ count = { 0, 0};
+ DECLARE_ITC_VAR(alloc_time); // Time for "vmalloc()"
+ DECLARE_ITC_VAR(mmap_sem); // Time for "down_read(&mm->mmap_sem)"
+ DECLARE_ITC_VAR(pgd_lock); // "spin_lock(&mm->page_table_lock)"
+ DECLARE_ITC_VAR(pgd_unlock); // "spin_unlock(&mm->page_table_lock)"
+ DECLARE_ITC_VAR(pgd_scan_t); // PGD scan time
+
+ if (pid != 0 && pid != current->pid){
+ //
+ // Look up the "mm_struct" belonging to the process ID.
+ //
+ if ((mm = look_up_mm(pid)) == NULL){
+ PRINT_ERR(illegal_pid);
+ ERROR_CNT(bad_request);
+ return -ESRCH;
+ }
+ //
+ // On success, "mm->mm_users" got incremented to make sure that
+ // "mm_struct" does not go away.
+ //
+ } else {
+ mm = current->mm;
+ //
+ // Actually, there is no need to grab "mm": it is ours and won't go
+ // away in the meantime. However, we do not want a special case in
+ // the release path when dropping the reference later on.
+ // It is safe just to increment the counter: it is ours.
+ //
+ atomic_inc(&mm->mm_users);
+ }
+ //
+ // Fetch the table of the addresses.
+ //
+ SAVE_ITC(/* out */ alloc_time);
+ bp = vmalloc(PAGE_SIZE);
+ STORE_DELAY(/* in */ alloc_time, /* out */ page_alloc);
+ if (bp == NULL){
+ mmput(mm);
+ PRINT_ERR(no_momory);
+ ERROR_CNT(no_memory);
+ return -ENOMEM;
+ }
+ if (copy_from_user(bp, table, n * sizeof(phaddr_t)) != 0){
+ vfree(bp);
+ mmput(mm);
+ PRINT_ERR(ill_user_buff);
+ ERROR_CNT(bad_request);
+ return -EFAULT;
+ }
+ SAVE_ITC(/* out */ mmap_sem);
+ down_read(&mm->mmap_sem); // Protect the VMA list
+ STORE_DELAY(/* in */ mmap_sem, /* out */ mmap_sem);
+ dump_mm(mm);
+ //
+ // We need the page table lock to synchronize with "kswapd"
+ // and the SMP-safe atomic PTE updates.
+ //
+ SAVE_ITC(/* out */ pgd_lock);
+ spin_lock(&mm->page_table_lock);
+ STORE_DELAY(/* in */ pgd_lock, /* out */ pgd_lock);
+ //
+ // Check to see if the pages are mapped by "mm->pgd" as user pages.
+ // For those which are, call "get_page()" to make sure they do not go away.
+ // "1" will be OR-ed to invalid addresses.
+ //
+ SAVE_ITC(/* out */ pgd_scan_t);
+ rc = check_pages_if_in_pgd(bp, n, mm);
+ STORE_DELAY(/* in */ pgd_scan_t, /* out */ pgd_scan);
+ if (rc >= 0){
+ //
+ // The number of the valid addresses is equal to "rc".
+ //
+ ERROR_CNT_ADD(non_existent_addr, n - rc);
+ for (n = rc, p = bp; n > 0; p++){
+#if defined(_TEST_)
+ if (p - bp >= PAGE_SIZE / sizeof(phaddr_t))
+ panic("\nAddress table overflow\n");
+#endif
+ if (*p & 1) // Address has not been validated
+ continue;
+ //
+ // Check & migrate the next page.
+ // The old page gets "put_page()"-ed.
+ //
+ if ((rc = check_migrate_1_page(*p, node, mm)) < 0)
+ break;
+ else if (rc > 0)
+ count.successful++;
+ else
+ count.failed++;
+ n--; // Decrement for a good (validated) address only
+ }
+ if (rc >= 0)
+ rc = *(long long *) &count; // Yeh, I know...
+ }
+ //
+ // Let the others complete the page fault handler code. They will find the
+ // condition "someone has already installed the PTE" to be TRUE.
+ //
+ SAVE_ITC(/* out */ pgd_unlock);
+ spin_unlock(&mm->page_table_lock);
+ STORE_DELAY(/* in */ pgd_unlock, /* out */ pgd_unlock);
+ up_read(&mm->mmap_sem);
+ vfree(bp);
+ mmput(mm); // Decrement "mm->mm_users" and free
+ // "*mm" if the counter becomes zero
+ return rc;
+}
+
+
+/*
+ * Check & migrate the pages one by one.
+ *
+ * Arguments: address: Physical address of the page to be migrated
+ * node: Destination NUMA node
+ * mm: -> victim "mm_struct"
+ *
+ * Returns: 1: Migration O. K.
+ * 0: Minor error, no actual migration has been done
+ * -Exxx: Catastrophic error
+ *
+ * Notes: - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ * - The old page is "get_page()"-ed on entry to make sure it does not go
+ * away in the meantime - on return it gets "put_page()"-ed.
+ */
+STATIC INLINE int
+check_migrate_1_page(const phaddr_t address, const int node, struct mm_struct * const mm)
+{
+ const int src_node = paddr_to_nid(address);
+ const unsigned long old_pfn = address >> PAGE_SHIFT;
+ struct page *old_p;
+
+ //
+ // Should be revised for the node hot plug :-)
+ //
+#if defined(_TEST_)
+ if (src_node == -1)
+ panic("\nCannot map source address to node\n");
+ if (!node_online(src_node))
+ panic("\nSource node not online\n");
+ if (!pfn_valid(old_pfn))
+ panic("\nNot a valid source pfn\n");
+#endif
+ old_p = pfn_to_page(old_pfn);
+ if (node == src_node){
+ PRINT_ETC("Old ph adr:\t0x%016llx old node: %d new node: %d\n",
+ address, src_node, node);
+ put_page(old_p);
+ return 1; // Done :-)
+ }
+ return common_check_migrate_1_page(src_node, old_p, node, mm, NULL);
+}
+
+
+/*
+ * The real page migration is done here.
+ *
+ * Arguments: old: -> old page structure
+ * new: -> new page structure
+ * mm: -> victim "mm_struct"
+ * pte: -> PTE of the page to be moved
+ *
+ * Returns: Negative values (like -Exxx) indicate errors
+ *
+ * Notes: - The old page has to be "get_page()"-ed and locked.
+ * - Its "pte_chain" has to locked.
+ * - The new page and its "pte_chain" has to locked.
+ * - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ */
+STATIC INLINE int
+page_migrate_2(struct page * const old, struct page * const new,
+ struct mm_struct * const mm, pte_t *pte_p)
+{
+ const struct page *pte_page;
+ struct vm_area_struct *vma;
+ pte_t pte;
+ vaddr_t vaddress;
+ DECLARE_ITC_VAR(vma_time); // Time for "find_vma()"
+ DECLARE_ITC_VAR(flush_tlb_time); // Time for "flush_tlb_page()"
+ DECLARE_ITC_VAR(add_lru_time); // Time for "lru_cache_add_active()"
+ DECLARE_ITC_VAR(copy_time); // Time for "copy_user_highpage()"
+ DECLARE_ITC_VAR(upd_mmu_cache); // Time for "update_mmu_cache()"
+
+ if (!PageDirect(old)){
+ PRINT_ERR("Direct mapped pages only\n");
+ ERROR_CNT(page_type_not_supp);
+ return -EFAULT;
+ }
+ if (pte_p == NULL) // Architecture independent code :-)
+ pte_p = rmap_ptep_map(old->pte.direct);
+ //
+ // "struct page" of the page that hotst the PTE.
+ //
+ pte_page = kmap_atomic_to_page(pte_p); // Architecture independent code :-)
+ //
+ // "pte_page->mapping" points at the victim process'es "mm_struct"
+ //
+#if defined(_TEST_)
+ if (mm != (struct mm_struct *) pte_page->mapping)
+ panic("\nBroken r-map ???\n");
+ dump_mm(mm);
+#endif
+ //
+ // "page->index" has the high bits of the address; the lower bits of the address
+ // are calculated from the offset of the PTE within the page table page.
+ //
+ vaddress = pte_page->index + ((unsigned long) pte_p & ~PAGE_MASK) * PTRS_PER_PTE;
+ PRINT("Virtual addr:\t0x%lx\n", vaddress);
+ //
+ // Double check if the virtual address is still valid.
+ //
+ SAVE_ITC(/* out */ vma_time);
+ vma = find_vma(mm, vaddress); // "vma" cache should help much
+ STORE_DELAY(/* in */ vma_time, /* out */ find_vma);
+ if (vma == NULL || vma->vm_start > vaddress){
+ PRINT_ERR("During mremap() ?\n");
+ ERROR_CNT(page_gone_away);
+ rmap_ptep_unmap(pte_p);
+ return -EFAULT;
+ }
+ dump_vma(vma);
+ //
+ // Nuke the page table entry.
+ //
+ flush_cache_page(vma, vaddress); // Architecture independent code :-)
+ pte = ptep_get_and_clear(pte_p);
+ SAVE_ITC(/* out */ flush_tlb_time);
+ flush_tlb_page(vma, vaddress); // Architecture independent code :-)
+ STORE_DELAY(/* in */ flush_tlb_time, /* out */ flush_tlb);
+ //
+ // From now on, the other CPUs cannot touch the content of the page. Should they
+ // try to, they would observe page faults. They easily pass "mmap_sem" because
+ // they take it for read, too. As we hold "page_table_lock", they queue up in the
+ // page fault handler.
+ //
+ PRINT("Old ph addr:\t0x%016lx\n", page_to_phys(old));
+ PRINT("Old PTE:\t0x%016lx\n", pte_val(pte));
+ PRINT("_PFN_MASK:\t0x%016lx\n", _PFN_MASK);
+ //
+ // Copy some of the page structure.
+ //
+ dump_page("Source", old);
+ new->flags = (new->flags & ~FLAG_MASK) | (old->flags & FLAG_MASK);
+ new->pte.direct = old->pte.direct;
+ SetPageDirect(new); // Direct mapped pages only
+ old->pte.direct = NULL;
+ ClearPageDirect(old);
+ if (PagePrivate(new))
+ new->private = old->private;
+ SAVE_ITC(/* out */ add_lru_time);
+ lru_cache_add_active(new);
+ STORE_DELAY(/* in */ add_lru_time, /* out */ add_lru);
+ dump_page("New", new);
+ //
+ // Here is where the data is copied.
+ //
+ SAVE_ITC(/* out */ copy_time);
+ copy_user_highpage(new, old, vaddress); // Architecture independent code :-)
+ STORE_DELAY(/* in */ copy_time, /* out */ copy);
+ //
+ // The new PTE keeps everything but the PFN.
+ //
+ pte = mk_pte(new, __pgprot((pte_val(pte) & ~_PFN_MASK)));
+ PRINT("New ph addr:\t0x%016lx\nNew PTE:\t0x%016lx\n\n",
+ page_to_phys(new), pte_val(pte));
+ set_pte(pte_p, pte);
+ SAVE_ITC(/* out */ upd_mmu_cache);
+ update_mmu_cache(vma, vaddress, pte); // Architecture independent code :-)
+ STORE_DELAY(/* in */ upd_mmu_cache, /* out */ update_mmu_cache);
+ rmap_ptep_unmap(pte_p);
+ return 0;
+}
+
+
+/*
+ * Some more tests and go on with the page migration.
+ *
+ * Arguments: old: -> old page structure
+ * new: -> new page structure
+ * mm: -> victim "mm_struct"
+ * pte: -> PTE of the page to be moved
+ *
+ * Returns: Negative values indicate errors
+ *
+ * Notes: - The old page has to be "get_page()"-ed and locked.
+ * - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ */
+STATIC INLINE int
+check_migr_1_page_part_2(struct page * const old, struct page * const new,
+ struct mm_struct * const mm, pte_t * const pte)
+{
+ int rc;
+ DECLARE_ITC_VAR(pte_chain_lock_time); // Time for "pte_chain_lock()"
+ DECLARE_ITC_VAR(unlock_time); // Time for "unlock_page()"
+
+ if (PageReserved(old)){
+ PRINT_ERR("What shall I do with a reserved page ?\n");
+ ERROR_CNT(page_type_not_supp);
+ return -ENXIO;
+ }
+ if (PageError(old)){
+ PRINT_ERR("Page has got error(s)\n");
+ ERROR_CNT(errors);
+ return -EIO;
+ }
+ if (!PageUptodate(old)){
+ PRINT_ERR("Page has no valid data ???\n");
+// return -EIO;
+ }
+ if (PageCompound(old)){
+ PRINT_ERR("What shall I do with a compound page ?\n");
+ ERROR_CNT(page_type_not_supp);
+ return -ENXIO;
+ }
+ if (old->mapping != NULL){
+ PRINT_ERR("Anonymous pages only\n");
+ ERROR_CNT(page_type_not_supp);
+ return -ENXIO;
+ }
+ if (PageSwapCache(old)){
+ PRINT_ERR("What shall I do with a page in swap cache ?\n");
+ ERROR_CNT(page_type_not_supp);
+ return -ENXIO;
+ }
+ if (PageHighMem(old)){
+ PRINT_ERR("What shall I do with a HIGHMEM page ?\n");
+ ERROR_CNT(page_type_not_supp);
+ return -ENXIO;
+ }
+ SAVE_ITC(/* out */ pte_chain_lock_time);
+ pte_chain_lock(old);
+ STORE_DELAY(/* in */ pte_chain_lock_time, /* out */ pte_chain_lock);
+ if (!page_mapped(old)){ // Actually means "r-mapped"
+ PRINT_ERR("Page not in r-map\n");
+ pte_chain_unlock(old);
+ ERROR_CNT(page_type_not_supp);
+ return -EFAULT;
+ }
+ //
+ // As nobody else should know about this new page, taking these locks should not
+ // be in conflict with anything.
+ //
+ if (TestSetPageLocked(new))
+ panic("\nSomeone is stealing my new page\n");
+ if (!pte_chain_trylock(new))
+ panic("\nSomeone is stealing my new pte chain\n");
+
+ //
+ // The real page migration.
+ //
+ rc = page_migrate_2(old, new, mm, pte);
+
+ pte_chain_unlock(new);
+ SAVE_ITC(/* out */ unlock_time);
+ unlock_page(new);
+ STORE_DELAY(/* in */ unlock_time, /* out */ new_page_unlock);
+ pte_chain_unlock(old);
+ return rc;
+}
+
+
+/*
+ * Check to see if the pages are mapped by "mm->pgd" as user pages.
+ * For those which are, call "get_page()" to make sure they do not go away.
+ *
+ * Arguments: phaddresses: -> the user buffer containing the physical addresses of
+ * the pages to be migrated
+ * n: Number of the physical page addresses
+ * mm: -> victim "mm_struct"
+ *
+ * Returns: The number of the addresses validated
+ *
+ * Note: Caller has to hold "mm->mmap_sem" for read and "mm->page_table_lock".
+ */
+STATIC INLINE int
+check_pages_if_in_pgd(phaddr_t * const phaddresses, const size_t n,
+ const struct mm_struct * const mm)
+{
+ const pgd_t *pgd;
+ const pmd_t *pmd;
+ const pte_t *pte, *pte0;
+ unsigned long g, m, e;
+ unsigned int i, found = 0;
+ phaddr_t *p;
+
+ //
+ // Mark the addresses as not already validated.
+ //
+ for (i = 0, p = phaddresses; i < n; i++, p++)
+ *p |= 1; // "(1 & _PFN_MASK) == 0"
+ //
+ // Check the user pages only.
+ // Note: "mm->pgd" is an identity mapped virtual address.
+ //
+ for (g = FIRST_USER_PGD_NR, pgd = mm->pgd + g; g < USER_PTRS_PER_PGD;
+ g++, pgd++){
+ if (pgd_none(*pgd) || pgd_bad(*pgd))
+ continue;
+ //
+ // "*pgd" is a physical address.
+ //
+ for (m = 0, pmd = pmd_offset(pgd, 0); m < PTRS_PER_PMD; m++, pmd++){
+ //
+ // "pmd" contains an identity mapped virtual address.
+ //
+ if (pmd_none(*pmd) || pmd_bad(*pmd))
+ continue;
+ //
+ // "*pmd" is a physical address.
+ //
+ pte0 = pte_offset_map(pmd, 0);
+ //
+ // "pte0" contains some kind of virtual address of the
+ // beginning of a PTE page.
+ //
+ for (e = 0, pte = pte0; e < PTRS_PER_PTE; e++, pte++){
+ if (!pte_present(*pte))
+ continue;
+ //
+ // Check this PTE against the list of the addresses.
+ //
+ for (i = 0, p = phaddresses; i < n; i++, p++){
+ if ((pte_val(*pte) & _PFN_MASK) !=
+ (*p & _PFN_MASK))
+ continue;
+ *p &= _PFN_MASK; // Validate the address
+ PRINT("Virtual addr:\t0x%016lx\n",
+ __VA(g, m, e));
+ //
+ // Make sure the page does not go away.
+ // Should we call "page_cache_get()" instead ?
+ //
+ get_page(pfn_to_page(*p >> PAGE_SHIFT));
+ if (++found == n){
+ pte_unmap(pte0);
+ return found;
+ }
+ //
+ // There should be no more than one address on
+ // the list that matches this PTE.
+ //
+ break;
+ }
+ }
+ pte_unmap(pte0);
+ }
+ }
+ return found;
+}
+
+
+/*
+ * Look up an "mm_struct" belonging to a process ID.
+ *
+ * "NULL" is returned on failure.
+ *
+ * Notes: - On success, "->mm_users" gets incremented to make sure that "mm_struct"
+ * does not go away.
+ * - "->mm" of a kernel thread is "NULL"; anyway, we don't dare to touch a
+ * kernel thread
+ */
+STATIC struct mm_struct *
+look_up_mm(const pid_t pid)
+{
+ struct task_struct *p;
+ struct mm_struct *mm;
+ DECLARE_ITC_VAR(time); // "mm" look up time
+
+ SAVE_ITC(/* out */ time);
+ read_lock(&tasklist_lock);
+ if ((p = find_task_by_pid(pid)) == NULL){
+ read_unlock(&tasklist_lock);
+ STORE_DELAY(/* in */ time, /* out */ mm_lookup);
+ return NULL;
+ }
+ //
+ // "get_task_mm()" includes "task_lock()" that "nests both inside and outside of
+ // read_lock(&tasklist_lock)" - as a note in "sched.h" states.
+ //
+ mm = get_task_mm(p); // Can be "NULL" for a kernel thread
+ //
+ // On success, "mm->mm_users" got incremented to make sure that "mm_struct" does
+ // not go away.
+ //
+ read_unlock(&tasklist_lock);
+ STORE_DELAY(/* in */ time, /* out */ mm_lookup);
+ return mm;
+}
+
+
+#if defined(_NEED_STATISTICS_)
+
+
+/*
+ * Fetch and clear the statistics.
+ *
+ * Accessed in a non atomic way. Who cares? Just some statistics :-)
+ */
+STATIC INLINE int
+page_migrate_statistics(const caddr_t vaddress, const int flag)
+{
+ //
+ // Assuming all the CPU-s are clocked at the same frequency.
+ //
+ _statistics.t.cyc_per_usec = local_cpu_data->cyc_per_usec;
+ if (copy_to_user(vaddress, &_statistics, sizeof _statistics) != 0)
+ return -EFAULT;
+ if (flag)
+ memset(&_statistics, 0,sizeof _statistics);
+ return 0;
+}
+
+
+#endif // #if defined(_NEED_STATISTICS_)
+
+
+#if defined(_TEST_)
+
+
+void
+dump_mm(const struct mm_struct * const mm)
+{
+ if (_pr_flag_ & PRINT_mm){
+ PRINT("mm: 0x%p\n", mm);
+ PRINT("mmap: 0x%p mm_rb.rb_node: 0x%p\n", mm->mmap, mm->mm_rb.rb_node);
+ PRINT("mmap_cache: 0x%p free_area_cache: 0x%lx\n", mm->mmap_cache,
+ mm->free_area_cache);
+ PRINT("pgd: 0x%p mm_users: %d mm_count: %d map_count: %d\n",
+ mm->pgd, atomic_read(&mm->mm_users),
+ atomic_read(&mm->mm_count), mm->map_count);
+ PRINT("mmap_sem.count: %d mmap_sem.wait_lock: %d\n", mm->mmap_sem.count,
+ mm->mmap_sem.wait_lock.lock);
+ PRINT("&mmap_sem.wait_list: 0x%p next: 0x%p prev: 0x%p\n",
+ &mm->mmap_sem.wait_list, mm->mmap_sem.wait_list.next,
+ mm->mmap_sem.wait_list.prev);
+ PRINT("page_table_lock: %u\n", mm->page_table_lock.lock);
+ PRINT("&mmlist: 0x%p next: 0x%p prev: 0x%p\n", &mm->mmlist,
+ mm->mmlist.next, mm->mmlist.prev);
+ PRINT("start_code: 0x%lx end_code: 0x%lx\n", mm->start_code,
+ mm->end_code);
+ PRINT("start_data: 0x%lx end_data: 0x%lx\n", mm->start_data,
+ mm->end_data);
+ PRINT("start_brk: 0x%lx brk: 0x%lx start_stack: 0x%lx\n", mm->start_brk,
+ mm->brk, mm->start_stack);
+ PRINT("arg_start: 0x%lx arg_end: 0x%lx\n", mm->arg_start, mm->arg_end);
+ PRINT("env_start: 0x%lx env_end: 0x%lx\n", mm->env_start, mm->env_end);
+ PRINT("rss: 0x%lx total_vm: 0x%lx locked_vm: 0x%lx\n", mm->rss,
+ mm->total_vm, mm->locked_vm);
+ PRINT("def_flags: 0x%lu cpu_vm_mask: 0x%lx\n", mm->def_flags,
+ mm->cpu_vm_mask);
+// unsigned long saved_auxv[40];
+ PRINT("dumpable: %u ", mm->dumpable);
+#ifdef CONFIG_HUGETLB_PAGE
+ PRINT("used_hugetlb: 0x%d ", mm->used_hugetlb);
+#endif
+ PRINT("context: 0x%lu core_waiters: %d\n", mm->context,
+ mm->core_waiters);
+ PRINT("core_startup_done: 0x%p\n", mm->core_startup_done);
+ PRINT("core_done.done: %d ", mm->core_done.done);
+ PRINT("core_done.wait.lock: %u\n", mm->core_done.wait.lock.lock);
+ PRINT("&core_done.wait.task_list: 0x%p next: 0x%p prev: 0x%p\n",
+ &mm->core_done.wait.task_list, mm->core_done.wait.task_list.next,
+ mm->core_done.wait.task_list.prev);
+ PRINT("ioctx_list_lock.read_counter: %d ",
+ mm->ioctx_list_lock.read_counter);
+ PRINT("ioctx_list_lock.write_lock: %d\n",
+ mm->ioctx_list_lock.write_lock);
+ PRINT("ioctx_list: 0x%p &default_kioctx: 0x%p\n\n", mm->ioctx_list,
+ &mm->default_kioctx);
+ }
+}
+
+
+void
+dump_vma(const struct vm_area_struct * const vma)
+{
+ if (_pr_flag_ & PRINT_vma){
+ PRINT("vm_area: 0x%p\n", vma);
+ PRINT("mm: 0x%p\n", vma->vm_mm);
+ PRINT("start: 0x%lx end: 0x%lx\n", vma->vm_start, vma->vm_end);
+ PRINT("next: 0x%p &rb: 0x%p\n", vma->vm_next, &vma->vm_rb);
+ PRINT("prot: 0x%lx flags: 0x%lx\n", pgprot_val(vma->vm_page_prot),
+ vma->vm_flags);
+ PRINT("&shared: 0x%p next: 0x%p prev: 0x%p\n", &vma->shared,
+ vma->shared.next, vma->shared.prev);
+ PRINT("ops: 0x%p private: 0x%p\n", vma->vm_ops, vma->vm_private_data);
+ PRINT("file: 0x%p pgoff: 0x%lx\n\n", vma->vm_file, vma->vm_pgoff);
+ }
+}
+
+
+void
+dump_page(const char * const text, const struct page * const p)
+{
+ if (_pr_flag_ & PRINT_page){
+ PRINT("%s page struct: 0x%p\n", text, p);
+ PRINT("flags: 0x%lx count: 0x%x\n", p->flags, atomic_read(&p->count));
+ PRINT("&list: 0x%p next: 0x%p prev: 0x%p\n", &p->list,
+ p->list.next, p->list.prev);
+ PRINT("mapping: 0x%p index: 0x%lx\n", p->mapping, p->index);
+ PRINT("&lru: 0x%p next: 0x%p prev: 0x%p\n", &p->lru,
+ p->lru.next, p->lru.prev);
+ if (PageDirect(p))
+ PRINT("pte.direct: 0x%p", p->pte.direct);
+ else
+ PRINT("pte.chain: 0x%p", p->pte.chain);
+ PRINT(" private: 0x%lx\n", p->private);
+#if defined(WANT_PAGE_VIRTUAL)
+ PRINT("virtual: 0x%p\n", p->virtual);
+#endif
+ PRINT("\n");
+ }
+}
+
+
+void
+dump_pte_stuff(const pte_t * const pte_addr)
+{
+ if (_pr_flag_ & PRINT_pte){
+ // "struct page" of the page that hosts the PTE.
+ const struct page * const pte_page = kmap_atomic_to_page(pte_addr);
+
+ PRINT("pte_paddr: 0x%p pte: 0x%016lx\n", pte_addr, pte_val(*pte_addr));
+ dump_page("\npte", pte_page);
+ }
+}
+
+
+#define _DATA_ 0x6000000000000000UL // User data segment
+
+
+/*
+ * Give me a valid physical address (if "vaddress == -1"),
+ * otherwise translate a user mode virtual address to a physical one.
+ */
+phaddr_t
+gimme_an_address(caddr_t vaddress)
+{
+ const struct vm_area_struct *vma;
+ const pgd_t *pgd;
+ const pmd_t *pmd;
+ const pte_t *pte;
+ phaddr_t phaddress = -EFAULT;
+
+ if (vaddress == (caddr_t) -1)
+ vaddress = (caddr_t) _DATA_;
+ PRINT("Virtual addr:\t0x%016lx\n", (vaddr_t) vaddress);
+ down_read(&current->mm->mmap_sem);
+ vma = find_vma(current->mm, (vaddr_t) vaddress);
+ if (vma == NULL || vma->vm_start > (vaddr_t) vaddress){
+ up_read(&current->mm->mmap_sem);
+ return -EFAULT;
+ }
+ spin_lock(&current->mm->page_table_lock);
+ do {
+ pgd = pgd_offset(current->mm, (vaddr_t) vaddress);
+ if (pgd_none(*pgd) || pgd_bad(*pgd))
+ break;
+ pmd = pmd_offset(pgd, (vaddr_t) vaddress);
+ if (pmd_none(*pmd) || pmd_bad(*pmd))
+ break;
+ pte = pte_offset_map(pmd, (vaddr_t) vaddress);
+ if (!pte_present(*pte)){
+ pte_unmap(pte);
+ break;
+ }
+ phaddress = pte_pfn(*pte) << PAGE_SHIFT;
+ pte_unmap(pte);
+ } while (0);
+ spin_unlock(&current->mm->page_table_lock);
+ up_read(&current->mm->mmap_sem);
+ PRINT("Physical addr:\t0x%016llx\n", (long long) phaddress);
+ return phaddress;
+}
+
+
+#endif // #if defined(_TEST_)
diff -Nru 2.6.4.ref/include/asm-generic/rmap.h 2.6.4.mig4/include/asm-generic/rmap.h
--- 2.6.4.ref/include/asm-generic/rmap.h Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig4/include/asm-generic/rmap.h Thu Mar 25 08:59:42 2004
@@ -87,4 +87,45 @@
}
#endif

+/*
+ * Shared pages have a chain of pte_chain structures, used to locate
+ * all the mappings to this page. We only need a pointer to the pte
+ * here, the page struct for the page table page contains the process
+ * it belongs to and the offset within that process.
+ *
+ * We use an array of pte pointers in this structure to minimise cache misses
+ * while traversing reverse maps.
+ */
+#define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long))/sizeof(pte_addr_t))
+
+/*
+ * next_and_idx encodes both the address of the next pte_chain and the
+ * offset of the highest-index used pte in ptes[].
+ */
+struct pte_chain {
+ unsigned long next_and_idx;
+ pte_addr_t ptes[NRPTE];
+} ____cacheline_aligned;
+
+static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain)
+{
+ return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE);
+}
+
+static inline struct pte_chain *pte_chain_ptr(unsigned long pte_chain_addr)
+{
+ return (struct pte_chain *)(pte_chain_addr & ~NRPTE);
+}
+
+static inline int pte_chain_idx(struct pte_chain *pte_chain)
+{
+ return pte_chain->next_and_idx & NRPTE;
+}
+
+static inline unsigned long
+pte_chain_encode(struct pte_chain *pte_chain, int idx)
+{
+ return (unsigned long)pte_chain | idx;
+}
+
#endif /* _GENERIC_RMAP_H */
diff -Nru 2.6.4.ref/include/asm-ia64/page_migrate.h 2.6.4.mig4/include/asm-ia64/page_migrate.h
--- 2.6.4.ref/include/asm-ia64/page_migrate.h Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/include/asm-ia64/page_migrate.h Thu Mar 25 08:59:42 2004
@@ -0,0 +1,245 @@
+#define _TEST_
+#define _NEED_STATISTICS_
+
+
+/*
+ * Migrate pages from a NUMA node to another.
+ * ==========================================
+ *
+ * Version 0.1, 23rd of March 2004
+ * By Zoltan Menyhart, Bull S.A. <[email protected]>
+ * The usual GPL applies.
+ *
+ * (See "migrate.txt".)
+ *
+ * System call syntax:
+ *
+ * long long sys_page_migrate(int command, caddr_t address, size_t length,
+ * int node, pid_t pid);
+ *
+ * On error "-1" is returned and "errno" holds the error code.
+ *
+ * The following commands are available:
+ */
+enum {
+/*
+ * - Return a physical address.
+ * (testing only, the kernel has to be compiled with "#define _TEST_")
+ */
+ _GIMME_AN_ADDRESS_,
+/*
+ * On entry, if "address" is a valid virtual address in the address space of the
+ * current task with an existing backing page, then its physical address is returned;
+ * if it is equal to "-1L", then the system finds a valid physical address on its own.
+ * The other arguments are don't care.
+ *
+ * - Fetch and clear the statistics.
+ */
+ _STATISTICS_,
+/*
+ * "address" is a pointer to the user's buffer. If "length != 0" then having bben
+ * fetched, the statistics get cleared. The other arguments are don't care.
+ *
+ * - Obtain the size of the statistics structure (see "struct _statistics_size_"):
+ */
+ _SIZEOF_STATISTICS_,
+/*
+ * The arguments are don't care.
+ *
+ * - Batch migrate pages from a NUMA node to another.
+ */
+ _PHADDR_BATCH_MIGRATE_,
+/*
+ * "address" points at the user table containing the physical address of the pages to
+ * be migrated. "length" is the number of the physical addresses in the buffer. Max.
+ * "PAGE_SIZE / sizeof(phaddr_t)" of them can be mifgrated at once.
+ * "node" is the destination NUMA node.
+ * Addresses are assumed to belong to the process indicated by "pid".
+ * The number of the pages actually migrated is returned
+ * (see "struct _un_success_count_).
+ *
+ * - Migrate virtual address range of a process:
+ */
+ _VA_RANGE_MIGRATE_,
+/*
+ * "sddress" is the starting virtual address in a process'es address space.
+ * "length" is the length of the address range to be migrated
+ * Addresses are assumed to belong to the process indicated by "pid".
+ * The number of the pages actually migrated is returned
+ * (see "struct _un_success_count_).
+ */
+};
+
+
+/*
+ * Type of a physical address -- hopefully enough for all architectures.
+ * (We allow negative values, too, for indicating some errors.)
+ */
+typedef long long phaddr_t;
+
+
+struct _un_success_count_ {
+ unsigned int successful; // Pages successfully migrated
+ unsigned int failed; // Minor failures
+};
+
+
+struct _statistics_size_ {
+ unsigned int sizeof_statistics; // sizeof(struct _statistics_)
+ unsigned int max_nodes; // MAX_NUMNODES
+};
+
+
+/*
+ * Statistics are accessed in a non atomic way. Who cares? Just some statistics :-)
+ */
+struct _statistics_ {
+ struct { // Error counters
+ unsigned long non_existent_addr;
+ unsigned long page_gone_away;
+ unsigned long busy;
+ unsigned long bad_request;
+ unsigned long no_memory; // On the target node
+ unsigned long page_type_not_supp;
+ unsigned long errors; // "PageError(page)" is set
+ } e;
+ struct { // Clock ticks
+ unsigned long total;
+ unsigned long page_alloc;
+ unsigned long page_free;
+ unsigned long page_lock;
+ unsigned long new_page_unlock;
+ unsigned long page_unlock;
+ unsigned long validation;
+ unsigned long pgd_scan;
+ unsigned long pgd_lock;
+ unsigned long pgd_unlock;
+ unsigned long mm_list_lock;
+ unsigned long mmap_sem;
+ unsigned long pte_chain_lock;
+ unsigned long find_vma;
+ unsigned long flush_tlb;
+ unsigned long add_lru;
+ unsigned long copy;
+ unsigned long update_mmu_cache;
+ unsigned long mm_lookup;
+ unsigned long cyc_per_usec;
+ unsigned long perfbullctl;
+ unsigned long pci_cfg_rd;
+ unsigned long pci_cfg_wr;
+ } t;
+ struct { // Event counters
+ unsigned long mm_hit;
+ unsigned long pgd_scan;
+ unsigned long perfbullctl;
+ unsigned long pci_cfg_rd;
+ unsigned long pci_cfg_wr;
+ } c;
+#if defined(__KERNEL__)
+ unsigned long count[MAX_NUMNODES][MAX_NUMNODES];
+#else
+ unsigned long count[0][0];
+#endif
+};
+
+
+#if !defined(__KERNEL__)
+
+
+#include <unistd.h>
+#include <sys/types.h>
+
+#if !defined(__NR_page_migrate)
+#define __NR_page_migrate 1276
+#endif
+
+
+/*
+ * Migrate some pages of the process of PID.
+ */
+static inline int
+migrate_ph_pages(const phaddr_t * const table, const size_t length, const int node,
+ struct _un_success_count_ * const p, const pid_t pid)
+{
+ union {
+ long long ll;
+ struct _un_success_count_ s;
+ } u;
+
+ u.ll = syscall(__NR_page_migrate, _PHADDR_BATCH_MIGRATE_,
+ table, length, node, pid);
+ if (u.ll == -1)
+ return -1;
+ if (p != NULL){
+ p->successful = u.s.successful;
+ p->failed = u.s.failed;
+ }
+ return 0;
+}
+
+
+/*
+ * Migrate virtual address range of the process of PID.
+ */
+static inline int
+migrate_virt_addr_range(const caddr_t address, const size_t length, const int node,
+ struct _un_success_count_ * const p, const pid_t pid)
+{
+ union {
+ long long ll;
+ struct _un_success_count_ s;
+ } u;
+
+ u.ll = syscall(__NR_page_migrate, _VA_RANGE_MIGRATE_,
+ address, length, node, pid);
+ if (u.ll == -1)
+ return -1;
+ if (p != NULL){
+ p->successful = u.s.successful;
+ p->failed = u.s.failed;
+ }
+ return 0;
+}
+
+
+/*
+ * Obtain the size of the statistics structure.
+ */
+static inline int
+get_stat_sizes(struct _statistics_size_ * const p)
+{
+ union {
+ long long ll;
+ struct _statistics_size_ s;
+ } u;
+
+ u.ll = syscall(__NR_page_migrate, _SIZEOF_STATISTICS_, 0, 0, 0, 0);
+ if (u.ll == -1)
+ return -1;
+ if (p != NULL)
+ *p = u.s;
+ return 0;
+}
+
+
+/*
+ * Fetch and clear the statistics.
+ */
+static inline int
+get_staistics(struct _statistics_ * const p, const long slear_flag)
+{
+ return syscall(__NR_page_migrate, _STATISTICS_, p, slear_flag, 0, 0);
+}
+
+
+/*
+ * Return a physical address.
+ */
+static inline phaddr_t
+gimme_a_ph_address(const caddr_t p)
+{
+ return syscall(__NR_page_migrate, _GIMME_AN_ADDRESS_, p, 0, 0, 0);
+}
+
+
+#endif
diff -Nru 2.6.4.ref/include/asm-ia64/pgtable.h 2.6.4.mig4/include/asm-ia64/pgtable.h
--- 2.6.4.ref/include/asm-ia64/pgtable.h Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig4/include/asm-ia64/pgtable.h Thu Mar 25 08:59:42 2004
@@ -112,6 +112,28 @@
#define PTRS_PER_PTE (__IA64_UL(1) << (PAGE_SHIFT-3))

/*
+ * The IA64 architecture does not decode all the MSB-s of virtual addresses for PGD, PMD
+ * and PTE indices, i.e. IA64 has got holes or aliases in the virtual address space.
+ * These def's are provided to check to see if an "address" -- "length" pair spans over
+ * virtual address holes or it creates illegal alias to an otherwise valid address.
+ * (User mode only.)
+ */
+#define __VA_BITS_PER_REGION (PAGE_SHIFT - 3 - 3 + /* PGD low index */ \
+ 2 * (PAGE_SHIFT - 3) + /* PMD and PTE indices */ \
+ PAGE_SHIFT) /* The page itself */
+#define __VA_ALIAS_MASK ((1UL << __VA_BITS_PER_REGION) - 1)
+#define __IS_VA_ALIAS(address, length) \
+ ((~__VA_ALIAS_MASK & (address)) != \
+ (~__VA_ALIAS_MASK & ((address) + (length) - 1)))
+
+/*
+ * Virtual address composed by use of PGD, PMD and PTE indices:
+ */
+#define __VA(pgdi, pmdi, ptei) (((pgdi) >> (PAGE_SHIFT - 6)) << 61 | \
+ ((pgdi) & ((PTRS_PER_PGD >> 3) - 1)) << PGDIR_SHIFT | \
+ (pmdi) << PMD_SHIFT | (ptei) << PAGE_SHIFT)
+
+/*
* All the normal masks have the "page accessed" bits on, as any time
* they are used, the page is accessed. They are cleared only by the
* page-out routines.
@@ -325,8 +347,10 @@
(init_mm.pgd + (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)))

/* Find an entry in the second-level page table.. */
-#define pmd_offset(dir,addr) \
- ((pmd_t *) pgd_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+#define pmd_index(addr) \
+ (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+#define pmd_offset(dir, addr) \
+ ((pmd_t *) pgd_page(*(dir)) + pmd_index(addr))

/*
* Find an entry in the third-level page table. This looks more complicated than it
diff -Nru 2.6.4.ref/include/asm-ia64/rmap-locking.h 2.6.4.mig4/include/asm-ia64/rmap-locking.h
--- 2.6.4.ref/include/asm-ia64/rmap-locking.h Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/include/asm-ia64/rmap-locking.h Thu Mar 25 08:59:42 2004
@@ -0,0 +1,25 @@
+/*
+ * include/linux/rmap-locking.h
+ *
+ * Locking primitives for exclusive access to a page's reverse-mapping
+ * pte chain.
+ */
+
+#include <linux/slab.h>
+
+struct pte_chain;
+extern kmem_cache_t *pte_chain_cache;
+
+#define pte_chain_lock(page) bit_spin_lock(PG_chainlock, &page->flags)
+#define pte_chain_trylock(page) bit_spin_trylock(PG_chainlock, &page->flags)
+#define pte_chain_unlock(page) bit_spin_unlock(PG_chainlock, &page->flags)
+
+struct pte_chain *pte_chain_alloc(int gfp_flags);
+void __pte_chain_free(struct pte_chain *pte_chain);
+
+static inline void pte_chain_free(struct pte_chain *pte_chain)
+{
+ if (pte_chain)
+ __pte_chain_free(pte_chain);
+}
+
diff -Nru 2.6.4.ref/include/linux/rmap-locking.h 2.6.4.mig4/include/linux/rmap-locking.h
--- 2.6.4.ref/include/linux/rmap-locking.h Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig4/include/linux/rmap-locking.h Thu Mar 25 08:59:42 2004
@@ -11,6 +11,7 @@
extern kmem_cache_t *pte_chain_cache;

#define pte_chain_lock(page) bit_spin_lock(PG_chainlock, &page->flags)
+#define pte_chain_trylock(page) bit_spin_trylock(PG_chainlock, &page->flags)
#define pte_chain_unlock(page) bit_spin_unlock(PG_chainlock, &page->flags)

struct pte_chain *pte_chain_alloc(int gfp_flags);
diff -Nru 2.6.4.ref/mm/rmap.c 2.6.4.mig4/mm/rmap.c
--- 2.6.4.ref/mm/rmap.c Tue Mar 16 10:18:17 2004
+++ 2.6.4.mig4/mm/rmap.c Thu Mar 25 09:00:13 2004
@@ -46,40 +46,9 @@
* We use an array of pte pointers in this structure to minimise cache misses
* while traversing reverse maps.
*/
-#define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long))/sizeof(pte_addr_t))
-
-/*
- * next_and_idx encodes both the address of the next pte_chain and the
- * offset of the highest-index used pte in ptes[].
- */
-struct pte_chain {
- unsigned long next_and_idx;
- pte_addr_t ptes[NRPTE];
-} ____cacheline_aligned;

kmem_cache_t *pte_chain_cache;

-static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain)
-{
- return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE);
-}
-
-static inline struct pte_chain *pte_chain_ptr(unsigned long pte_chain_addr)
-{
- return (struct pte_chain *)(pte_chain_addr & ~NRPTE);
-}
-
-static inline int pte_chain_idx(struct pte_chain *pte_chain)
-{
- return pte_chain->next_and_idx & NRPTE;
-}
-
-static inline unsigned long
-pte_chain_encode(struct pte_chain *pte_chain, int idx)
-{
- return (unsigned long)pte_chain | idx;
-}
-
/*
* pte_chain list management policy:
*
diff -Nru 2.6.4.ref/test/migstat.c 2.6.4.mig4/test/migstat.c
--- 2.6.4.ref/test/migstat.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/migstat.c Thu Mar 25 09:02:00 2004
@@ -0,0 +1,130 @@
+/*
+ * Display and reset page migration statistics.
+ *
+ * Usage: migstat [-c]
+ */
+
+
+#include <stdio.h>
+#include <string.h> // For "strcmp()"
+#include <errno.h>
+#include <malloc.h>
+#include "page_migrate.h"
+
+#define CONV(x) x, (x * mult + div / 2) / div, (x * mult + div / 2) / div / 1000
+
+extern int errno;
+
+struct _statistics_ *sp;
+struct _statistics_size_ ss;
+
+main(const int argc, const char * const argv[])
+{
+ int from, to;
+ unsigned long *p;
+ unsigned long ok = 0;
+ unsigned long mult = 1, div = 1;
+ unsigned long time;
+ int clear_flag = 0;
+
+ if (argc == 2 && strcmp(argv[1], "-c") == 0)
+ clear_flag = 1;
+ else if (argc != 1){
+ fprintf(stderr, "Usage: %s [-c]\n", argv[0]);
+ return 1;
+ }
+ if (get_stat_sizes(&ss) < 0){
+ perror("get_stat_sizes()");
+ return 1;
+ }
+ if ((sp = malloc(ss.sizeof_statistics)) == NULL){
+ fprintf(stderr, "malloc(%d) failed\n", ss.sizeof_statistics);
+ return 1;
+ }
+ if (get_staistics(sp, clear_flag) < 0){
+ perror("get_staistics()");
+ return 1;
+ }
+ printf("\nError counters:\n");
+ if (sp->e.non_existent_addr != 0)
+ printf("non_existent_addr: %ld\n", sp->e.non_existent_addr);
+ if (sp->e.page_gone_away != 0)
+ printf("page_gone_away: %ld\n", sp->e.page_gone_away);
+ if (sp->e.busy != 0)
+ printf("busy: %ld\n", sp->e.busy);
+ if (sp->e.bad_request != 0)
+ printf("bad_request: %ld\n", sp->e.bad_request);
+ if (sp->e.no_memory != 0)
+ printf("no_memory: %ld\n", sp->e.no_memory);
+ if (sp->e.page_type_not_supp != 0)
+ printf("page_type_not_supp: %ld\n", sp->e.page_type_not_supp);
+ if (sp->e.errors != 0)
+ printf("page errors: %ld\n", sp->e.errors);
+ printf("Total: %ld\n", sp->e.non_existent_addr +
+ sp->e.page_gone_away + sp->e.busy + sp->e.bad_request +
+ sp->e.no_memory + sp->e.page_type_not_supp + sp->e.errors);
+
+ printf("\n\tMigrated to:\n");
+ printf("From:\t");
+ for (to = 0; to < ss.max_nodes; to++)
+ printf("%d:%c", to, to < ss.max_nodes - 1 ? '\t' : '\n');
+ p = &sp->count[0][0];
+ for (from = 0; from < ss.max_nodes; from++){
+ printf("%d:\t", from);
+ for (to = 0; to < ss.max_nodes; p++, to++){
+ ok += *p;
+ if (from == to && *p == 0)
+ printf("-");
+ else
+ printf("%lu", *p);
+ printf("%c", to < ss.max_nodes - 1 ? '\t' : '\n');
+ }
+ }
+ printf("Total: %ld\n\n", ok);
+
+ div = sp->t.cyc_per_usec;
+ printf(" Clock ticks: Microsec: Millisec:\n");
+ printf("total: %12ld %10ld %8ld\n", CONV(sp->t.total));
+ printf("page_alloc: %12ld %10ld %8ld\n", CONV(sp->t.page_alloc));
+ printf("page_free: %12ld %10ld %8ld\n", CONV(sp->t.page_free));
+ printf("page_lock: %12ld %10ld %8ld\n", CONV(sp->t.page_lock));
+ printf("page_unlock: %12ld %10ld %8ld\n", CONV(sp->t.page_unlock));
+ printf("new_pg_unlock: %12ld %10ld %8ld\n", CONV(sp->t.new_page_unlock));
+ printf("validation: %12ld %10ld %8ld\n", CONV(sp->t.validation));
+ printf("pgd_scan: %12ld %10ld %8ld\n", CONV(sp->t.pgd_scan));
+ printf("pgd_lock: %12ld %10ld %8ld\n", CONV(sp->t.pgd_lock));
+ printf("pgd_unlock: %12ld %10ld %8ld\n", CONV(sp->t.pgd_unlock));
+ printf("mm_list_lock: %12ld %10ld %8ld\n", CONV(sp->t.mm_list_lock));
+ printf("mmap_sem: %12ld %10ld %8ld\n", CONV(sp->t.mmap_sem));
+ printf("pte_chain_lock: %12ld %10ld %8ld\n", CONV(sp->t.pte_chain_lock));
+ printf("find_vma: %12ld %10ld %8ld\n", CONV(sp->t.find_vma));
+ printf("flush_tlb: %12ld %10ld %8ld\n", CONV(sp->t.flush_tlb));
+ printf("add_lru: %12ld %10ld %8ld\n", CONV(sp->t.add_lru));
+ printf("copy: %12ld %10ld %8ld\n", CONV(sp->t.copy));
+ printf("upd_mmu_cache: %12ld %10ld %8ld\n", CONV(sp->t.update_mmu_cache));
+ time = sp->t.total - sp->t.page_alloc - sp->t.page_free -
+ sp->t.page_lock - sp->t.page_unlock - sp->t.new_page_unlock -
+ sp->t.validation - // sp->t.pgd_unlock - sp->t.mmap_sem -
+ // sp->t.pgd_scan - sp->t.pgd_lock - sp->t.mmlist_lock -
+ sp->t.pte_chain_lock - sp->t.find_vma - sp->t.flush_tlb -
+ sp->t.add_lru - sp->t.copy - sp->t.update_mmu_cache;
+ printf("Where is %12ld %10ld %8ld ?\n", CONV(time));
+
+ printf("cyc_per_usec: %11ld\n", sp->t.cyc_per_usec);
+
+ if (sp->c.pgd_scan != 0){
+ printf("\npgd_scan:\t\t%11ld\n", sp->c.pgd_scan);
+ printf("mm_hit:\t\t\t%11ld\nmiss:\t\t\t%11ld\n", sp->c.mm_hit,
+ sp->e.non_existent_addr + ok - sp->c.mm_hit);
+ }
+
+ if (sp->c.perfbullctl != 0){
+ printf("\npci_cfg_rd:\t%11ld\t %10ld\n", CONV(sp->t.pci_cfg_rd));
+ printf("pci_cfg_rd count:\t%11ld\n", sp->c.pci_cfg_rd);
+ printf("pci_cfg_wr:\t%11ld\t %10ld\n", CONV(sp->t.pci_cfg_wr));
+ printf("pci_cfg_wr count:\t%11ld\n", sp->c.pci_cfg_wr);
+ printf("perfbullctl:\t%11ld\t %10ld\n", CONV(sp->t.perfbullctl));
+ printf("perfbullctl count:\t%11ld\n", sp->c.perfbullctl);
+ }
+ return 0;
+}
+
diff -Nru 2.6.4.ref/test/ph.c 2.6.4.mig4/test/ph.c
--- 2.6.4.ref/test/ph.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/ph.c Thu Mar 25 09:02:00 2004
@@ -0,0 +1,94 @@
+/*
+ * Demo: migrate some of our pages identified by their physical addresses.
+ */
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include "page_migrate.h"
+
+#if !defined(PAGE_SIZE)
+#define PAGE_SIZE (16 * 1024)
+#endif
+
+#define MMAPSIZE (1024 * 1024 * 256)
+
+phaddr_t address;
+extern int errno;
+phaddr_t table[PAGE_SIZE / sizeof(phaddr_t)];
+struct _un_success_count_ u_s;
+
+
+size_t
+fill(volatile void *p)
+{
+ size_t count = 123;
+ size_t i;
+
+ for (i = 0; i < count; i++, p += PAGE_SIZE){
+ * (unsigned long *) p = 0xdeadbeefL;
+ if ((address = gimme_a_ph_address((void *) p)) < 0)
+ break;
+ table[i] = address;
+ }
+ printf("# addresses: %d\n", i);
+ return i;
+}
+
+
+mig(volatile void *p, int node)
+{
+ int rc;
+ size_t count;
+
+ count = fill(p);
+ rc = migrate_ph_pages(table, count, node, &u_s, 0);
+ printf("\nmig(..., %d): rc = %ld errno = %d *p: 0x%lx\n", node, rc, errno,
+ * (unsigned long *) p);
+ printf("successful: %d failed: %d\n", u_s.successful, u_s.failed);
+ if (rc < 0){
+ perror("migrate_virt_addr_range()");
+ exit(-1);
+ }
+ address = gimme_a_ph_address((void *) p);
+ printf("\nmig(..., %d): ph address = 0x%016llx\n", node, address);
+ if (address < 0){
+ perror("gimme_a_ph_address()");
+ exit(-1);
+ }
+}
+
+
+main()
+{
+ volatile void *p;
+
+ p = mmap(NULL, MMAPSIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (p == MAP_FAILED){
+ perror("\nmmap()");
+ return 1;
+ }
+ /*
+ * No backing page => should fail.
+ */
+ printf("\nmain(): ph address = 0x%llx\n", address);
+ * (unsigned long *) p = 0xdeadbeef03L;
+ /*
+ * Now there should be a backing page.
+ */
+ address = gimme_a_ph_address((void *) p);
+ printf("\nmain(): ph address = 0x%016llx\n", address);
+ if (address < 0){
+ perror("gimme_a_ph_address()");
+ return 1;
+ }
+ mig(p, 0);
+ mig(p, 1);
+ mig(p, 2);
+ mig(p, 3);
+ return 0;
+}
+
diff -Nru 2.6.4.ref/test/v.c 2.6.4.mig4/test/v.c
--- 2.6.4.ref/test/v.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/v.c Thu Mar 25 09:02:00 2004
@@ -0,0 +1,78 @@
+/*
+ * Demo: migrate some of its own virtual address range.
+ */
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include "page_migrate.h"
+
+#define MMAPSIZE (1024 * 1024 * 256)
+
+phaddr_t address;
+extern int errno;
+
+struct _un_success_count_ u_s;
+
+
+mig(volatile void *p, int node)
+{
+ int rc;
+
+ rc = migrate_virt_addr_range((caddr_t) p, MMAPSIZE, node, &u_s, 0);
+ printf("\nmig(..., %d): rc = %ld errno = %d *p: 0x%lx\n", node, rc, errno,
+ * (unsigned long *) p);
+ printf("successful: %d failed: %d\n", u_s.successful, u_s.failed);
+ if (rc < 0){
+ perror("migrate_virt_addr_range()");
+ exit(-1);
+ }
+ address = gimme_a_ph_address((void *) p);
+ printf("\nmig(..., %d): ph address = 0x%016llx\n", node, address);
+ if (address < 0){
+ perror("gimme_a_ph_address()");
+ exit(-1);
+ }
+}
+
+
+main()
+{
+ volatile void *p0, *p;
+
+ p0 = p = mmap(NULL, MMAPSIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (p == MAP_FAILED){
+ perror("\nmmap()");
+ return 1;
+ }
+ /*
+	 * Make sure 2 pages exist.
+ */
+ * (unsigned long *) p = 0xdeadbeef01L;
+ p += 1024 * 16;
+ * (unsigned long *) p = 0xdeadbeef02L;
+ address = gimme_a_ph_address((void *) p);
+ /*
+ * No backing page => should fail.
+ */
+ p += 1024 * 64;
+ printf("\nmain(): ph address = 0x%llx\n", address);
+ * (unsigned long *) p = 0xdeadbeef03L;
+ /*
+ * Now there should be a backing page.
+ */
+ address = gimme_a_ph_address((void *) p);
+ printf("\nmain(): ph address = 0x%016llx\n", address);
+ if (address < 0){
+ perror("gimme_a_ph_address()");
+ return 1;
+ }
+ mig(p0, 0);
+ mig(p0, 1);
+ mig(p0, 2);
+ mig(p0, 3);
+ return 0;
+}
diff -Nru 2.6.4.ref/test/victim.c 2.6.4.mig4/test/victim.c
--- 2.6.4.ref/test/victim.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/victim.c Thu Mar 25 09:02:00 2004
@@ -0,0 +1,36 @@
+/*
+ * Victim process for "vmig".
+ */
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+
+#define MMAPSIZE (1024 * 1024 * 1024L)
+#define N MMAPSIZE / sizeof(long)
+
+
+main()
+{
+ int i;
+ volatile long *p0, *p;
+ long sum0, sum;
+
+ printf("victim: pid = %d\n", getpid());
+ p0 = p = mmap(NULL, MMAPSIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (p == MAP_FAILED){
+ perror("\nmmap()");
+ return 1;
+ }
+ printf("address: %p, size: 0x%lx\n", p, MMAPSIZE);
+ for (i = 0, sum0 = 0; i < N; i++)
+ sum0 += *p++ = random();
+ do {
+ for (i = 0, sum = 0, p = p0; i < N; i++)
+ sum += *p++;
+ printf("\nvictim: pid = %d, sum: %ld\n", getpid(), sum);
+ } while (sum0 == sum);
+}
diff -Nru 2.6.4.ref/test/vmig.c 2.6.4.mig4/test/vmig.c
--- 2.6.4.ref/test/vmig.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/vmig.c Thu Mar 25 09:02:00 2004
@@ -0,0 +1,36 @@
+/*
+ * Migrate the victim process by hand.
+ */
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include "page_migrate.h"
+
+// Who cares for the SH library ?
+#define SH_ADDRESS (2UL << 60)
+#define SH_SIZE (16UL * 1024 * 1024 * 1024 * 1024)
+
+struct _un_success_count_ u_s;
+
+main(const int argc, const char * const argv[])
+{
+ int node;
+ pid_t pid;
+ int rc;
+
+ if (argc != 3){
+ fprintf(stderr, "usage: vmig <pid> <node>\n");
+ return 1;
+ }
+ pid = atoi(argv[1]);
+ node = atoi(argv[2]);
+ rc = migrate_virt_addr_range((caddr_t) SH_ADDRESS, SH_SIZE, node, &u_s, pid);
+ if (rc < 0)
+ perror("migrate_virt_addr_range()");
+ else
+ printf("successful: %d failed: %d\n", u_s.successful, u_s.failed);
+ return 0;
+}


Attachments:
mig-2.6.4-bk4-2004-march-25 (64.20 kB)

2004-03-26 17:21:26

by Dave Hansen

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Have you considered any common ground your patch might share with the
people doing memory hotplug?

http://people.valinux.co.jp/~iwamoto/mh.html

They have a similar problem to your migration that occurs when a user
wants to remove a whole or partial NUMA node.
[email protected]

Is your code something that you'd like to see go into the mainline 2.6
or 2.7 kernel?

Also, please don't spam-encode your address when sending to the list.
It just makes it harder for people to send feedback.

-- Dave

2004-03-30 08:28:49

by IWAMOTO Toshihiro

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hi Zoltan,

At Fri, 26 Mar 2004 09:20:46 -0800,
Dave Hansen wrote:
>
> Have you considered any common ground your patch might share with the
> people doing memory hotplug?
>
> http://people.valinux.co.jp/~iwamoto/mh.html
>
> They have a similar problem to your migration that occurs when a user
> wants to remove a whole or partial NUMA node.
> [email protected]

Processes must be migrated to other nodes when a node is being
removed. Conversely, processes may be migrated from other nodes when
a node is added. I'm not familiar with NUMA things, and I think our
team doesn't have a particular solution. If you have some idea,
that's great.

BTW, it seems page migration can use my remap_onepage function. Our
code can move most kinds of pages including hugetlbfs pages and page
caches.

--
IWAMOTO Toshihiro

2004-03-30 09:04:37

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hello,

> > Have you considered any common ground your patch might share with the
> > people doing memory hotplug?
> >
> > http://people.valinux.co.jp/~iwamoto/mh.html
> >
> > They have a similar problem to your migration that occurs when a user
> > wants to remove a whole or partial NUMA node.
> > [email protected]
>
> Processes must be migrated to other nodes when a node is being
> removed. Conversely, processes may be migrated from other nodes when
> a node is added. I'm not familiar with NUMA things, and I think our
> team doesn't have a particular solution. If you have some idea,
> that's great.
>
> BTW, it seems page migration can use my remap_onepage function. Our
> code can move most kinds of pages including hugetlbfs pages and page
> caches.

I believe his patch will interest you since most of the code is
independent of cpu architecture and it also covers mmaped files,
shmem, ramdisk, mlocked pages and so on.

We will post new version of the memory hotplug patches in a week.

Thank you,
Hirokazu Takahashi.

2004-03-30 11:20:11

by Zoltan Menyhart

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hirokazu Takahashi wrote:
>
> Hello,
>
> > > Have you considered any common ground your patch might share with the
> > > people doing memory hotplug?
> > >
> > > http://people.valinux.co.jp/~iwamoto/mh.html
> > >
> > > They have a similar problem to your migration that occurs when a user
> > > wants to remove a whole or partial NUMA node.
> > > [email protected]
> >
> > Processes must be migrated to other nodes when a node is being
> > removed. Conversely, processes may be migrated from other nodes when
> > a node is added. I'm not familiar with NUMA things, and I think our
> > team doesn't have a particular solution. If you have some idea,
> > that's great.
> >
> > BTW, it seems page migration can use my remap_onepage function. Our
> > code can move most kinds of pages including hugetlbfs pages and page
> > caches.
>
> I believe his patch will interest you since most of the code is
> independent of cpu architecture and it also covers mmaped files,
> shmem, ramdisk, mlocked pages and so on.
>
> We will post new version of the memory hotplug patches in a week.
>
> Thank you,
> Hirokazu Takahashi.

I am afraid the "remap_onepage()" function + the modifications necessary
at some other places are too much for me :-)

You do a couple of retries and waits. I cannot afford to spend that much
overhead for something that is just a performance optimization.

I can understand that if you want to remove a node / memory module, then you
have to succeed by all means, you have to handle all kinds of pages,
the performance is not at a premium.

Regards,

Zoltán Menyhárt

2004-03-30 11:38:09

by Zoltan Menyhart

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Dave Hansen wrote:
>
> Have you considered any common ground your patch might share with the
> people doing memory hotplug?
>
> http://people.valinux.co.jp/~iwamoto/mh.html
>
> They have a similar problem to your migration that occurs when a user
> wants to remove a whole or partial NUMA node.
> [email protected]

Comparing my stuff to their work, I just do some small performance enhancements:

- I do not modify a single line on the existing VM paths - if my stuff has no
improvement for you, then you will not be obliged to pay any overhead
- I do not insist on succeeding by all means :-)) ... that would block the execution of the application
while the resources are not available
- I handle only the simplest case: private anonymous pages (...a single PTE...)

- IWAMOTO Toshihiro provides a complete "fool proof" solution with an obligation to
succeed in the migration

> Is your code something that you'd like to see go into the mainline 2.6
> or 2.7 kernel?

Since someone is asking...

Thanks,

Zoltán

2004-03-30 12:07:15

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hello,

Zoltán Menyhárt wrote:

> > > > Have you considered any common ground your patch might share with the
> > > > people doing memory hotplug?
> > > >
> > > > http://people.valinux.co.jp/~iwamoto/mh.html
> > > >
> > > > They have a similar problem to your migration that occurs when a user
> > > > wants to remove a whole or partial NUMA node.
> > > > [email protected]
> > >
> > > Processes must be migrated to other nodes when a node is being
> > > removed. Conversely, processes may be migrated from other nodes when
> > > a node is added. I'm not familiar with NUMA things, and I think our
> > > team doesn't have a particular solution. If you have some idea,
> > > that's great.
> > >
> > > BTW, it seems page migration can use my remap_onepage function. Our
> > > code can move most kinds of pages including hugetlbfs pages and page
> > > caches.
> >
> > I believe his patch will interest you since most of the code is
> > independent of cpu architecture and it also covers mmaped files,
> > shmem, ramdisk, mlocked pages and so on.
> >
> > We will post new version of the memory hotplug patches in a week.
> >
> > Thank you,
> > Hirokazu Takahashi.
>
> I am afraid the "remap_onepage()" function + the modifications necessary
> at some other places are too much for me :-)
>
> You do a couple of retries and waits. I cannot afford to spend that much
> overhead for something that is just a performance optimization.

I understand what you want to do. Page migration is meaningless if the
cost of it is high.

> I can understand that if you want to remove a node / memory module, then you
> have to succeed by all means, you have to handle all kinds of pages,
> the performance is not at a premium.
>
> Regards,
>
> Zoltán Menyhárt

It's not hard to add a "no-retry-mode" to the "remap_onepage()" function
if you want. It may skip migrating some pages if they are accessed
heavily. In particular, if you only want to care about anonymous pages,
they will be handled very well.

Thank you,
Hirokazu Takahashi.

2004-03-30 14:31:27

by Zoltan Menyhart

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hirokazu Takahashi wrote:

[...]
>
> It's not hard to add a "no-retry-mode" to the "remap_onepage()" function
> if you want. It may skip migrating some pages if they are accessed
> heavily. In particular, if you only want to care about anonymous pages,
> they will be handled very well.

Well, why not give it a try?
Yet your code is not really easy to read. :-)
I do not dare to adapt it on my own; I am afraid of breaking something.
Could you please provide me with a modified version of your "remap_onepage()"?
Can we move to 2.6.4?

In addition to "no-retry-mode", I need to specify where the new page
should be allocated from.

Here is my interface I need to implement with "remap_onepage()":

/*
* Common part of checking & migrating the pages one by one.
*
* Arguments: src_node: Source NUMA node
* old_p: -> old page structure
* node: Destination NUMA node
* mm: -> victim "mm_struct"
* pte: -> PTE of the page to be moved
*
* Returns: 1: Migration O. K.
* 0: Minor error, no actual migration has been done
* -Exxx: Catastrophic error
*
* Notes: - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
* - The old page is "get_page()"-ed on entry to make sure it does not go
* away in the mean time - on return it gets "put_page()"-ed.
*/
int
common_check_migrate_1_page(const int src_node, struct page * const old_p,
const int node, struct mm_struct * const mm, pte_t * const pte)

Notes: "pte" can be NULL if I do not know it a priori
       I cannot release "mm->page_table_lock"; otherwise I have to re-scan the "mm->pgd".
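
For illustration only, a hypothetical caller of this interface would look roughly
like the fragment below (2.6-era primitives; whether "mmap_sem" is taken for read
or for write here is an assumption, and the page-table walk that produced "page"
and "pte" is left out):

	down_read(&mm->mmap_sem);		/* assumption: a read hold is enough	*/
	spin_lock(&mm->page_table_lock);

	/* ... walk mm->pgd, find "page" and its "pte" ... */

	get_page(page);				/* pinned, as required on entry		*/
	rc = common_check_migrate_1_page(src_node, page, dst_node, mm, pte);
	/* the callee does the matching put_page() before returning		*/

	spin_unlock(&mm->page_table_lock);
	up_read(&mm->mmap_sem);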

Thanks,

Zoltán Menyhárt

2004-03-30 15:20:30

by Dave Hansen

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

On Tue, 2004-03-30 at 03:39, Zoltan Menyhart wrote:
> Dave Hansen wrote:
> >
> > Have you considered any common ground your patch might share with the
> > people doing memory hotplug?
>
> Comparing my stuff to their work, I just do some small performance enhancements:
>
> - I do not modify a single line on the existing VM paths - if my stuff has no
> improvement for you, then you will not be obliged to pay any overhead
...
> - I handle only the simplest case: private anonymous pages (...a single PTE...)

By not modifying a single line in the existing VM path, your patch
simply duplicates functionality from that existing code, which I'm not
sure is any better.

I think there's a lot of commonality with what the swap code, NUMA page
migration, and memory removal have to do. However, none of them share
any code today. I think all of the implementations could benefit from
making them a bit more generic.

-- Dave

2004-03-30 15:59:38

by Dave Hansen

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

On Tue, 2004-03-30 at 03:39, Zoltan Menyhart wrote:
> Dave Hansen wrote:
> > Is your code something that you'd like to see go into the mainline 2.6
> > or 2.7 kernel?
>
> Since someone is asking...

Before anything else, please take a long look at
Documentation/CodingStyle. Pay particular attention to the column
width, indenting, and function sections.

One of the best things about your code is that it uses a lot of
architecture-independent functions and data structures. The page table
walks in your patch, for instance, would work on any Linux
architecture. However, all of this code is in the ia64 arch. Why? Will
other NUMA architectures not need this page migration functionality?

It's great that you are commenting so many things, but normal Linux
style is to use C-style comments, not C++. Also, although it's great
while you're developing a patch, it's best to try and refrain from
documenting in comments things that are already a non-tricky part
of the way things work:
//
// "pte_page->mapping" points at the victim process'es "mm_struct"
//
These comments really just take up space and reduce readability.

I find the comments inside of function and macro calls a bit hard to
read:
STORE_DELAY(/* in */ unlock_time, /* out */ new_page_unlock);


void
dump_mm(const struct mm_struct * const mm)
{
...

void
dump_vma(const struct vm_area_struct * const vma)
{


I think every VM hacker has a couple of these functions stashed around
in various patches for debugging, but they're not really something that
belongs in the kernel. In general, you should try to remove debugging
code before posting a patch.


+ case _SIZEOF_STATISTICS_:
+ rc = *(long long *) &_statistics_sizes;
+ break;

I'm sure the statistics are very important, but they're a bit
intrusive. Can you separate that code out into a file by itself? Are
they even something that a user would want when they're running
normally, or is it another debugging feature?

+#if defined(CONFIG_NUMA)
+ data8 sys_page_migrate // 1276: Migrate pages
to another NUMA node
+#else
data8 sys_ni_syscall
+#endif

See cond_syscall. Basically you declare a weak symbol and override it
later if necessary.
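
For reference, cond_syscall() in the 2.6 tree is roughly the following; the
sys_page_migrate line is only a hypothetical example of how this patch would use it:

	/* include/linux/linkage.h (2.6-era, roughly): a weak alias that resolves
	 * to sys_ni_syscall unless a real implementation is linked in. */
	#define cond_syscall(x) asmlinkage long x(void) \
		__attribute__((weak, alias("sys_ni_syscall")));

	/* kernel/sys.c -- hypothetical entry for this patch: */
	cond_syscall(sys_page_migrate)

The syscall-table slot in entry.S can then name sys_page_migrate unconditionally;
if migrate.o is not built, the weak alias falls back to sys_ni_syscall.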

+obj-$(CONFIG_NUMA) += numa.o migrate.o

Can you separate this out under its own config option?


+asmlinkage long long
+sys_page_migrate(const int cmd, const caddr_t address, const size_t.
...
+ switch (cmd){
...
+ case _PHADDR_BATCH_MIGRATE_:
...
+ case _VA_RANGE_MIGRATE_:
...
+ case _STATISTICS_:
...
+ case _GIMME_AN_ADDRESS_:

This smells strongly of an ioctl. If there really are 2 distinct kinds
of memory removal operations, then go ahead and make 2 different
syscalls. As for the _STATISTICS_ and _GIMME_AN_ADDRESS_, they really
shouldn't be there at all. They're just abusing the syscall.

+migrate_virt_addr_range(
...
+ u.ll = syscall(__NR_page_migrate, _VA_RANGE_MIGRATE_,
+ address, length, node, pid);
...
+}

Making syscalls from inside of the kernel is strongly discouraged. I'm
not sure what you're trying to do there. You might want to look at some
existing code like sys_mmap() vs do_mmap().
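
A sketch of the split Dave means (the names below are illustrative, not the
patch's actual entry points):

	/* in-kernel worker: other kernel code calls this directly */
	long do_page_migrate(unsigned long addr, size_t len, int node);

	/* thin user-visible wrapper, analogous to sys_mmap() vs. do_mmap() */
	asmlinkage long
	sys_page_migrate(unsigned long addr, size_t len, int node)
	{
		return do_page_migrate(addr, len, node);
	}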

+#define __VA(pgdi, pmdi, ptei) (((pgdi) >> (PAGE_SHIFT - 6)) << 61 | \
+ ((pgdi) & ((PTRS_PER_PGD >> 3) - 1)) << PGDIR_SHIFT | \
+ (pmdi) << PMD_SHIFT | (ptei) << PAGE_SHIFT)

There are magic numbers galore in this macro. Would this work?

#define __VA(pgdi, pmdi, ptei) ((pgdi)*PGDIR_SIZE + \
(pmdi)*PMD_SIZE + \
(ptei)*PAGE_SIZE)
If ia64 doesn't have the _SIZE macros, you can just copy them from
include/asm-i386/pgtable*.h
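
For reference, the i386 definitions he is pointing at are essentially:

	/* include/asm-i386/pgtable.h (2.6-era): */
	#define PMD_SIZE	(1UL << PMD_SHIFT)
	#define PMD_MASK	(~(PMD_SIZE - 1))
	#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
	#define PGDIR_MASK	(~(PGDIR_SIZE - 1))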

-- 2.6.4.ref/mm/rmap.c Tue Mar 16 10:18:17 2004
+++ 2.6.4.mig4/mm/rmap.c Thu Mar 25 09:00:13 2004
...
-struct pte_chain {
- unsigned long next_and_idx;
- pte_addr_t ptes[NRPTE];
-} ____cacheline_aligned;
-- 2.6.4.ref/include/asm-ia64/rmap-locking.h


Exposing the VM internals like that probably isn't going to be
acceptable. Why was this necessary?

--- 2.6.4.ref/test/vmig.c Thu Jan 1 01:00:00 1970
+++ 2.6.4.mig4/test/vmig.c Thu Mar 25 09:02:00 2004

If you need userspace code to demonstrate how to use your patch, it's
probably best to post it separately instead of including it in the
patch. Someone might mistake it for kernel code.

I'm sure I missed some things, but it's hard to look at the patch in
depth functionally before it is cleaned up a bit.

I look forward to seeing an updated version.

-- Dave

2004-03-30 16:37:55

by Dave Hansen

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

On Tue, 2004-03-30 at 07:58, Dave Hansen wrote:
> I'm sure I missed some things, but it's hard to look at the patch in
> depth functionally before it is cleaned up a bit.

One thing I forgot...

There don't appear to be any security checks in your syscall. Should
all users be allowed to migrate memory around at will from any pid?

-- Dave

2004-04-03 02:58:04

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hello,

>> It's not hard to add a "no-retry-mode" to the "remap_onepage()" function
>> if you want. It may skip migrating some pages if they are accessed
>> heavily. In particular, if you only want to care about anonymous pages,
>> they will be handled very well.
>
>Well, why not give it a try?
>Yet your code is not really easy to read. :-)
>I do not dare to adapt it on my own; I am afraid of breaking something.
>Could you please provide me with a modified version of your "remap_onepage()"?
>Can we move to 2.6.4?

Iwamoto and I are working on this. We'll post it soon.

>In addition to "no-retry-mode", I need to specify where the new page
>should be allocated from.
>
>Here is my interface I need to implement with "remap_onepage()":

I guess arguments src_node, mm and pte would be redundant since
they can be looked up from old_p with the reverse mapping scheme.

>/*
> * Common part of checking & migrating the pages one by one.
> *
> * Arguments: src_node: Source NUMA node
> * old_p: -> old page structure
> * node: Destination NUMA node
> * mm: -> victim "mm_struct"
> * pte: -> PTE of the page to be moved
> *
> * Returns: 1: Migration O. K.
> * 0: Minor error, no actual migration has been done
> * -Exxx: Catastrophic error
> *
> * Notes: - "mm->page_table_lock" and "mm->mmap_sem" have to be held.
> * - The old page is "get_page()"-ed on entry to make sure it does not go
> * away in the mean time - on return it gets "put_page()"-ed.
> */
>int
>common_check_migrate_1_page(const int src_node, struct page * const old_p,
> const int node, struct mm_struct * const mm, pte_t * const pte)
>
>Notes: "pte" can be NULL if I do not know it a priori
>       I cannot release "mm->page_table_lock"; otherwise I have to re-scan the "mm->pgd".

A re-scan policy would be much better since migrating pages is heavy work.
I don't think that holding mm->page_table_lock for a long time would be
a good idea.

What do you think about the following algorithm (a rough sketch follows the list):
1. get mm->page_table_lock
2. choose some pages.
3. release mm->page_table_lock
4. call remap_onepage() against each page.
5. go to step 1 if there remain pages to be migrated.
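
A minimal sketch of that loop, assuming a hypothetical select_pages_to_migrate()
helper and treating remap_onepage() purely schematically (its real signature lives
in the hotplug patch, not here):

	#define NR_BATCH	64	/* arbitrary batch size for the sketch */

	/* hypothetical: pick up to "max" candidate pages while the lock is held */
	int select_pages_to_migrate(struct mm_struct *mm, struct page **batch, int max);

	static int migrate_some_pages(struct mm_struct *mm, int dst_node)
	{
		struct page *batch[NR_BATCH];
		int n, i, moved = 0;

		do {
			spin_lock(&mm->page_table_lock);			/* step 1 */
			n = select_pages_to_migrate(mm, batch, NR_BATCH);	/* step 2 */
			spin_unlock(&mm->page_table_lock);			/* step 3 */

			for (i = 0; i < n; i++)					/* step 4 */
				if (remap_onepage(batch[i], dst_node) == 0)
					moved++;
		} while (n == NR_BATCH);					/* step 5 */

		return moved;
	}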

2004-04-05 15:07:56

by Zoltan Menyhart

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hirokazu Takahashi wrote:

> I guess arguments src_node, mm and pte would be redundant since
> they can be looked up from old_p with the reverse mapping scheme.

In my version 0.2, I can do with only the following arguments:
* node: Destination NUMA node
* mm: -> victim "mm_struct"
* pte: -> PTE of the page to be moved
(If I have "mm" at hand, why not use it? Why not avoid fetching the r-map
page struct?)

> >Notes: "pte" can be NULL if I do not know it a priori
> >       I cannot release "mm->page_table_lock"; otherwise I have to re-scan the "mm->pgd".
>
> A re-scan policy would be much better since migrating pages is heavy work.
> I don't think that holding mm->page_table_lock for a long time would be
> a good idea.

Re-scanning is "cache killer", at least on IA64 with huge user memory size.
I have more than 512 Mbytes user memory and its PTEs do not fit into the L2 cache.

In my current design, I have the outer loops: PGD, PMD and PTE walking; and once
I find a valid PTE, I check it against the list of max. 2048 physical addresses as
the inner loop.
I reversed them: walking through the list of max. 2048 physical addresses as outer
loop and the PGD - PMD - PTE scans as inner loops resulted in 4 to 5 times slower
migration.
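
Schematically, the loop nesting described above is the following (2.6-era walk;
locking, huge pages and the actual migration call are omitted, and "table" /
"n_addrs" stand for the user-supplied list of physical addresses):

	static void scan_vma(struct mm_struct *mm, struct vm_area_struct *vma,
			     const unsigned long *table, int n_addrs)
	{
		unsigned long addr;
		int i;

		for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
			pgd_t *pgd = pgd_offset(mm, addr);	/* outer: page tables */
			pmd_t *pmd;
			pte_t *pte;

			if (pgd_none(*pgd))
				continue;
			pmd = pmd_offset(pgd, addr);
			if (pmd_none(*pmd))
				continue;
			pte = pte_offset_map(pmd, addr);
			if (pte_present(*pte)) {
				unsigned long pa = pte_pfn(*pte) << PAGE_SHIFT;

				for (i = 0; i < n_addrs; i++)	/* inner: the address list */
					if (table[i] == pa) {
						/* ... migrate this page ... */
						break;
					}
			}
			pte_unmap(pte);
		}
	}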

> What do you think about the following algorithm:
> 1. get mm->page_table_lock
> 2. choose some pages.
> 3. release mm->page_table_lock
> 4. call remap_onepage() against each page.
> 5. go to step 1 if there remain pages to be migrated.

I want to move the most frequently used pages - at least with the HW assisted
hot page detection.
I take "mm->page_table_lock", I nuke the PTE. We've got a good chance that the CPU
using the page observes a page fault almost immediately. It enters the page fault
handler and gets blocked by "mm->page_table_lock". If I released the lock, the CPU
could continue and realize that there is nothing to do, the page fault has already
been repaired. In the meantime, I am the one waiting for "mm->page_table_lock".
At worst this scenario happens 2048 times.
If I keep the lock, the victim CPU enters the page fault handler only once.

I think what we should do is to "pull" pages into a node rather than
"push them out", for two reasons:
- the recipient CPU executes the migration instead of busy-waiting for the lock
- there is a chance that the recipient CPU will find the migrated data useful
in its cache

Regards,

Zoltán Menyhárt

2004-04-05 15:41:32

by Dave Hansen

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

On Mon, 2004-04-05 at 08:07, Zoltan Menyhart wrote:
> Hirokazu Takahashi wrote:
>
> > I guess arguments src_node, mm and pte would be redundant since
> > they can be looked up from old_p with the reverse mapping scheme.
>
> In my version 0.2, I can do with only the following arguments:
> * node: Destination NUMA node
> * mm: -> victim "mm_struct"
> * pte: -> PTE of the page to be moved
> (If I have "mm" at hand, why not use it? Why not avoid fetching the r-map
> page struct?)

That's a good point. There is at least some cost (at least 1 lock)
associated with walking the rmap chains. If it can be avoided, it might
as well be.

But, if someone needs the "no walk" interface, just wrap the function:

foo(page)
{
rmap_results = get_rmap_stuff(page);
__foo(page, rmap_results);
}

__foo(page, rmap_results)
{
...
}

> > >Notes: "pte" can be NULL if I do not know it a priori
> > >       I cannot release "mm->page_table_lock"; otherwise I have to re-scan the "mm->pgd".
> >
> > A re-scan policy would be much better since migrating pages is heavy work.
> > I don't think that holding mm->page_table_lock for a long time would be
> > a good idea.
>
> Re-scanning is "cache killer", at least on IA64 with huge user memory size.
> I have more than 512 Mbytes user memory and its PTEs do not fit into the L2 cache.
>
> In my current design, I have the outer loops: PGD, PMD and PTE walking; and once
> I find a valid PTE, I check it against the list of max. 2048 physical addresses as
> the inner loop.
> I reversed them: walking through the list of max. 2048 physical addresses as outer
> loop and the PGD - PMD - PTE scans as inner loops resulted in 4 to 5 times slower
> migration.

Could you explain where you're getting these "magic numbers?" I don't
quite understand the significance of 2048 physical addresses or 512 MB
of memory.

Zoltan, it appears that we have a bit of an inherent conflict with how
much CPU each of you is expecting to use in the removal and migration
cases. You're coming from an HPC environment where each CPU cycle is
valuable, while the people trying to remove memory are probably going to
be taking CPUs offline soon anyway, and care a bit less about how
efficient they're being with CPU and cache resources.

Could you be a bit more explicit about how expensive (cpu-wise) these
migrate operations can be?

-- Dave

2004-04-08 13:32:43

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Migrate pages from a ccNUMA node to another - patch

Hello,

> > I guess arguments src_node, mm and pte would be redundant since
> > they can be looked up from old_p with the reverse mapping scheme.
>
> In my version 0.2, I can do with only the following arguments:
> * node: Destination NUMA node
> * mm: -> victim "mm_struct"
> * pte: -> PTE of the page to be moved
> (If I have "mm" at hand, why not use it? Why not avoid fetching the r-map
> page struct?)
>
> > >Notes: "pte" can be NULL if I do not know it a priori
> > >       I cannot release "mm->page_table_lock"; otherwise I have to re-scan the "mm->pgd".
> >
> > A re-scan policy would be much better since migrating pages is heavy work.
> > I don't think that holding mm->page_table_lock for a long time would be
> > a good idea.
>
> Re-scanning is "cache killer", at least on IA64 with huge user memory size.
> I have more than 512 Mbytes user memory and its PTEs do not fit into the L2 cache.
>
> In my current design, I have the outer loops: PGD, PMD and PTE walking; and once
> I find a valid PTE, I check it against the list of max. 2048 physical addresses as
> the inner loop.
> I reversed them: walking through the list of max. 2048 physical addresses as outer
> loop and the PGD - PMD - PTE scans as inner loops resulted in 4 to 5 times slower
> migration.

I've been thinking about it.

I guess our page remap patches would be overkill for your purpose.
The point of our patches is that they:
1. Block new access to a specified page.
2. Wait for the page to go into a quiescent state.
3. Copy the data from the page to a new page and exchange them.

In my understanding you want to handle only anonymous pages
which don't have a backing store yet. This means that you only need
step 3.
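
As a rough illustration, "step 3 only" for a single, already-quiescent private
anonymous page could look like this in 2.6-era terms (the function name is made
up; error handling, the pte_chain/rmap bookkeeping and any statistics are left out):

	static int copy_one_anon_page(struct vm_area_struct *vma, unsigned long addr,
				      pte_t *ptep, struct page *old, int dst_node)
	{
		struct page *new;
		pte_t old_pte, new_pte;

		new = alloc_pages_node(dst_node, GFP_HIGHUSER, 0);
		if (new == NULL)
			return -ENOMEM;

		old_pte = ptep_get_and_clear(ptep);	/* unmap the old page	*/
		flush_tlb_page(vma, addr);

		copy_highpage(new, old);		/* move the data	*/

		new_pte = mk_pte(new, vma->vm_page_prot);
		if (pte_dirty(old_pte))
			new_pte = pte_mkdirty(new_pte);
		if (pte_write(old_pte))
			new_pte = pte_mkwrite(new_pte);
		set_pte(ptep, new_pte);
		update_mmu_cache(vma, addr, new_pte);

		lru_cache_add_active(new);		/* let the VM see it	*/
		put_page(old);				/* drop the old mapping's reference */
		return 0;
	}

In the migration patch the quiescing role of steps 1 and 2 is played by nuking
the PTE while "mm->page_table_lock" is held, as described later in this thread.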

> > What do you think about the following algorithm:
> > 1. get mm->page_table_lock
> > 2. choose some pages.
> > 3. release mm->page_table_lock
> > 4. call remap_onepage() against each page.
> > 5. go to step 1 if there remain pages to be migrated.
>
> I want to move the most frequently used pages - at least with the HW assisted
> hot page detection.
> I take "mm->page_table_lock", I nuke the PTE. We've got a good chance that the CPU
> using the page observes a page fault almost immediately. It enters the page fault
> handler and gets blocked by "mm->page_table_lock". If I released the lock, the CPU
> could continue and realize that there is nothing to do, the page fault has already
> been repaired. In the meantime, I am the one waiting for "mm->page_table_lock".

If you use the HW assisted hot page detection, just notify our remapping
functions of hot pages directly. Everything would be handled well, and
the page fault handler would be blocked by the PG_locked bit.