2005-01-05 00:45:30

by Ray Bryant

Subject: page migration patchset

Andrew,

Dave Hansen and I have reordered the memory hotplug patchset so that the page
migration patches occur first. This allows us to create a standalone page
migration patchset (on top of which the rest of the memory hotplug patches
apply). A snapshot of these patches is available at:

http://sr71.net/patches/2.6.10/2.6.10-mm1-mhp-test7/

A number of us are interested in using the page migration patchset by itself:

(1) Myself, for a manual page migration project I am working on. (This
is for migrating jobs from one set of nodes to another under batch
scheduler control).
(2) Marcello, for his memory defragmentation work.
(3) Of course, the memory hotplug project itself.

(there are probably other "users" that I have not enumerated here).

Unfortunately, none of these "users" of the page migration patchset are ready
to be merged into -mm yet.

The question at the moment is, "Would you be interested in merging the
page migration patchset now, or should we wait until one or more of (1) to
(3) above is also ready for merging?"

(Historically, lkml has waited for a user of new functionality before merging
that functionality, so I expect that to be your answer; in that case, please
consider this note to be a preliminary notice that we will be submitting
such patches for merging in the next month or so. :-) )
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------


2005-01-05 02:07:13

by Andi Kleen

Subject: Re: page migration patchset

Ray Bryant <[email protected]> writes:

> http://sr71.net/patches/2.6.10/2.6.10-mm1-mhp-test7/
>
> A number of us are interested in using the page migration patchset by itself:
>
> (1) Myself, for a manual page migration project I am working on. (This
> is for migrating jobs from one set of nodes to another under batch
> scheduler control).
> (2) Marcello, for his memory defragmentation work.
> (3) Of course, the memory hotplug project itself.
>
> (there are probably other "users" that I have not enumerated here).

Could you coordinate that with Steve Longerbeam (cc'ed) ?

He has a NUMA API extension ready to be merged into -mm* that also
does kind of page migration when changing the policies of files.

-Andi

2005-01-05 03:18:50

by Ray Bryant

Subject: Re: page migration patchset

Andi Kleen wrote:
> Ray Bryant <[email protected]> writes:
>
>
>>http://sr71.net/patches/2.6.10/2.6.10-mm1-mhp-test7/
>>
>>A number of us are interested in using the page migration patchset by itself:
>>
>>(1) Myself, for a manual page migration project I am working on. (This
>> is for migrating jobs from one set of nodes to another under batch
>> scheduler control).
>>(2) Marcello, for his memory defragmentation work.
>>(3) Of course, the memory hotplug project itself.
>>
>>(there are probably other "users" that I have not enumerated here).
>
>
> Could you coordinate that with Steve Longerbeam (cc'ed) ?
>
> He has a NUMA API extension ready to be merged into -mm* that also
> does kind of page migration when changing the policies of files.
>
> -Andi
>
>
Yes, Steve's patch tries to move page cache pages that are found to be
allocated in the "wrong" place. (See remove_invalid_filemap_page() in his
patch of 11/02/2004 on lkml). But if the page is found to be busy, the code
gives up, as near as I can tell.

If the page migration patch were merged, Steve could call
migrate_onepage(page, node) to move the page to the correct node, even if it
is busy [hopefully his code can "wait" at that point; I haven't looked into it
further to see if that is the case.]

[This is really the page migration patch plus a small patch of
mine that adds the node argument to migrate_onepage(), and that I hope will
get merged into the page migration patch shortly.]
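
Just to illustrate what I mean, here is a rough sketch (not code from either
patch; the helper name is made up, and I'm only assuming the
migrate_onepage(page, node) interface mentioned above):

/*
 * Hypothetical sketch: drop a misplaced pagecache page if it is idle,
 * otherwise fall back to migrating it in place with migrate_onepage()
 * from the page migration patchset.
 */
static int fixup_misplaced_page(struct page *page,
				struct address_space *mapping,
				pgoff_t pgoff, int target_node)
{
	/* cheap path: an unused page can simply be dropped and then
	 * reallocated later on the right node */
	if (invalidate_mapping_pages(mapping, pgoff, pgoff) > 0)
		return 0;

	/* busy page (locked, dirty, mapped, ...): migrate it instead */
	return migrate_onepage(page, target_node);
}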

Other than that, I don't see a big intersection between the two patches.
Steve, do you see anything else where we need to coordinate?

On the other hand, there is some work to be done wrt memory policies
and page migration. For the project I am working on, we need to be able
to move all of the pages used by a process on one set of nodes to another
set of nodes. At some point during this process we will need to update
the memory policy for that process. For Steve's patch, we will
similarly need to update the policy associated with files associated with
the process, I would think; otherwise new pages will get allocated on the
old set of nodes, which is something we don't want. Sounds like some
new interfaces will have to be developed here. Does that make sense
to you, Andi and Steve?

My personal preference would be to keep as much of this as possible
under user space control; that is, rather than having a big autonomous
system call that migrates pages and then updates policy information,
I'd prefer to split the work into several smaller system calls that
are issued by a user space program responsible for coordinating the
process migration as a series of steps, e. g.:

(1) suspend the process via SIGSTOP
(2) update the mempolicy information
(3) migrate the process's pages
(4) migrate the process to the new CPUs via sched_setaffinity()
(5) resume the process via SIGCONT

that way the user program actually implements the process memory
migration functionality rather than having it all done by the kernel.
This also lets the user (or sysadmin) modify or add new steps to
the page migration to satisfy local requirements without having to
modify the kernel.
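
As a rough user-space sketch of that sequence (the mempolicy-update and
page-migration system calls in steps (2) and (3) do not exist yet, so they
appear only as placeholder comments):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* sketch of the user-space coordinator described above */
static int migrate_job(pid_t pid, const cpu_set_t *new_cpus)
{
	if (kill(pid, SIGSTOP))				/* (1) suspend */
		return -1;

	/* (2) update the mempolicy information  -- new syscall, TBD */
	/* (3) migrate the process's pages       -- new syscall, TBD */

	if (sched_setaffinity(pid, sizeof(*new_cpus), new_cpus))
		return -1;				/* (4) new CPUs */

	return kill(pid, SIGCONT);			/* (5) resume */
}
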
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-05 18:44:34

by Steve Longerbeam

Subject: Re: page migration patchset

Hi everyone,

Ray Bryant wrote:

> Andi Kleen wrote:
>
>> Ray Bryant <[email protected]> writes:
>>
>>
>>> http://sr71.net/patches/2.6.10/2.6.10-mm1-mhp-test7/
>>>
>>> A number of us are interested in using the page migration patchset
>>> by itself:
>>>
>>> (1) Myself, for a manual page migration project I am working on.
>>> (This
>>> is for migrating jobs from one set of nodes to another under batch
>>> scheduler control).
>>> (2) Marcello, for his memory defragmentation work.
>>> (3) Of course, the memory hotplug project itself.
>>>
>>> (there are probably other "users" that I have not enumerated here).
>>
>>
>>
>> Could you coordinate that with Steve Longerbeam (cc'ed) ?
>> He has a NUMA API extension ready to be merged into -mm* that also
>> does kind of page migration when changing the policies of files.
>>
>> -Andi
>>
>>
> Yes, Steve's patch tries to move page cache pages that are found to be
> allocated in the "wrong" place. (See remove_invalid_filemap_page() in
> his
> patch of 11/02/2004 on lkml). But if the page is found to be busy,
> the code
> gives up, as near as I can tell.


correct, my patch is using invalidate_mapping_pages(), which doesn't wait
for a locked pagecache page.

>
> If the page migration patch were merged, Steve could call
> migrate_onepage(page,node) to move the page to the correct node. even
> if it
> is busy [hopefully his code can "wait" at that point, I haven't looked
> into it further to see if that is the case.]


sounds good to me. And it can wait, since remove_invalid_filemap_page() is
called at syscall time, so the syscall will just block.

>
> [This is really the page migration patch plus a small patch of
> mine that adds the node argument to migrate_onepage(), and that I
> hope will
> get merged into the page migration patch shortly]
>
> Other than that, I don't see a big intersection between the two patches.
> Steve, do you see anything else where we need to coordinate?


well, I need to study the page migration patch more (this is the
first time I've heard of it). But it sounds as if my patch and the
page migration patch are complementary.

>
> On the other hand, there is some work to be done wrt memory policies
> and page migration. For the project I am working on, we need to be able
> to move all of the pages used by a process on one set of nodes to another
> set of nodes. At some point during this process we will need to update
> the memory policy for that process. For Steve's patch, we will
> similarly need to update the policy associated with files associated with
> the process, I would think; otherwise new pages will get allocated on the
> old set of nodes, which is something we don't want. Sounds like some
> new interfaces will have to be developed here. Does that make sense
> to you, Andi and Steve?


yes.

>
> My personal preference would be to keep as much of this as possible
> under user space control; that is, rather than having a big autonomous
> system call that migrates pages and then updates policy information,
> I'd prefer to split the work into several smaller system calls that
> are issued by a user space program responsible for coordinating the
> process migration as a series of steps, e. g.:
>
> (1) suspend the process via SIGSTOP
> (2) update the mempolicy information
> (3) migrate the process's pages
> (4) migrate the process to the new CPUs via sched_setaffinity()
> (5) resume the process via SIGCONT
>

steps 2 and 3 can be accomplished by a call to mbind() and
specifying MPOL_MF_MOVE. And since mbind() takes an
address range, you could probably migrate pages and change
the policies for all of the process' mappings in a single mbind()
call.
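
Something like this, roughly (just a sketch -- MPOL_MF_MOVE is the new flag
from my patch and isn't in <numaif.h> yet, and the node mask below is only
an example):

#include <numaif.h>		/* mbind(), MPOL_BIND, MPOL_MF_STRICT (libnuma) */

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* new flag from the filemap-policy patch */
#endif

/* rebind an existing mapping to nodes 2-3 and move any pages that
 * were already allocated somewhere else */
static long rebind_and_move(void *start, unsigned long len)
{
	unsigned long nodemask = (1UL << 2) | (1UL << 3);

	return mbind(start, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, MPOL_MF_STRICT | MPOL_MF_MOVE);
}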

Note that Andrew had to drop my patch from 2.6.10, because
the 4-level page tables feature was re-implemented using a
different interface, which broke my patch. So Andrew asked me
to re-do the patch for inclusion in 2.6.11. That gives us ~2 months
to work on integrating the page migration and NUMA mempolicy
filemap patches.

Ray, btw it is beneficial that I can work with you on this, because I have
no access to true NUMA machines. My testing of the filemap mempolicy
patch has only been on a UP discontiguous memory system. I assume
you've got access to Altix machines at SGI to do testing and benchmarking
of my filemap patch and your migration patches.

Steve

2005-01-05 19:23:39

by Ray Bryant

Subject: Re: page migration patchset

Hi Steve!

Steve Longerbeam wrote:

>
> well, I need to study the page migration patch more (this is the
> first time I've heard of it). But it sounds as if my patch and the
> page migration patch are complementary.
>

Did you get the url from the last email?

http://sr71.net/patches/2.6.10/2.6.10-mm1-mhp-test7/page_migration/

>>
>> On the other hand, there is some work to be done wrt memory policies
>> and page migration. For the project I am working on, we need to be able
>> to move all of the pages used by a process on one set of nodes to another
>> set of nodes. At some point during this process we will need to update
>> the memory policy for that process. For Steve's patch, we will
>> similarly need to update the policy associated with files associated with
>> the process, I would think; otherwise new pages will get allocated on the
>> old set of nodes, which is something we don't want. Sounds like some
>> new interfaces will have to be developed here. Does that make sense
>> to you, Andi and Steve?
>
> yes.
>
>>
>> My personal preference would be to keep as much of this as possible
>> under user space control; that is, rather than having a big autonomous
>> system call that migrates pages and then updates policy information,
>> I'd prefer to split the work into several smaller system calls that
>> are issued by a user space program responsible for coordinating the
>> process migration as a series of steps, e. g.:
>>
>> (1) suspend the process via SIGSTOP
>> (2) update the mempolicy information
>> (3) migrate the process's pages
>> (4) migrate the process to the new CPUs via sched_setaffinity()
>> (5) resume the process via SIGCONT
>>
>
> steps 2 and 3 can be accomplished by a call to mbind() and
> specifying MPOL_MF_MOVE. And since mbind() takes an
> address range, you could probably migrate pages and change
> the policies for all of the process' mappings in a single mbind()
> call.
>

Interesting, I hadn't thought of that. I'll look at that.

> Note that Andrew had to drop my patch from 2.6.10, because
> the 4-level page tables feature was re-implemented using a
> different interface, which broke my patch. So Andrew asked me
> to re-do the patch for inclusion in 2.6.11. That gives us ~2 months
> to work on integrating the page migration and NUMA mempolicy
> filemap patches.

Sounds like a plan.

>
> Ray, btw it is beneficial that I can work with you on this, because I have
> no access to true NUMA machines. My testing of the filemap mempolicy
> patch has only been on a UP discontiguous memory system. I assume
> you've got access to Altix machines at SGI to do testing and benchmarking
> of my filemap patch and your migration patches.
>
> Steve
>
>

Oh yeah, I have access to a >>few<< Altix systems. :-)

I'd be happy to test your patches on Altix. I have another project sitting
on the back burner to get page cache allocated (by default) in round-robin
memory for Altix; I need to see how to integrate this with your work (which
is how this was all left a few months back when I got pulled off to work on
the latest release for Altix). So that is another area for collaboration.

Is the latest version of your patch the one from lkml dated 11/02/2004?


--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-05 20:56:23

by Hugh Dickins

Subject: Re: page migration patchset

Hi Steve,

On Wed, 5 Jan 2005, Steve Longerbeam wrote:
>
> Note that Andrew had to drop my patch from 2.6.10, because
> the 4-level page tables feature was re-implemented using a
> different interface, which broke my patch. So Andrew asked me
> to re-do the patch for inclusion in 2.6.11. That gives us ~2 months
> to work on integrating the page migration and NUMA mempolicy
> filemap patches.

Something I found odd about your patch was that you had filemap.c
doing NUMA policy one way (via mapping->policy), but left shmem.c
doing NUMA policy another way (via info->policy). I was preparing
a patch against 2.6.10-rc3-mm1 to clean that up, but got diverted.

Or was I missing a significant distinction?

I seem also to have concluded that destroy_inode ought always to
do the mpol_free_shared_policy itself rather than leaving it to the
filesystem's ->destroy_inode; but offhand can't remember my reasoning
(just to match alloc_inode doing its _init, or a more vital reason?).
Does that make sense to you?

Below is the patch I was working on then (like you I don't have NUMA,
so it was only build tested): would you like to factor it into yours,
or would you prefer me to come along and add it to -mm after yours
has gone in?

I did think your patch would be better split into two (if not more):
the (straightforward) implementation of mapping->policy, and then
the (more complex) page migration business on top of that.

There is more cleanup I'd like to do (or even better, let someone
else do!) in that area: not originating in your patch, but I loathe
the way the vma interface demands construction of a temporary struct
vm_area_struct (pvma) on the stack to get things done - to me that
just indicates the interface is wrong. The user interface must of
course stay, but should be better handled internally. Separate job.

(And I still don't know what should be done about NUMA policy versus
swap: it has not been anyone's priority, but swapin_readahead's NUMA
belief that swap is laid out linearly following vmas is quite wrong.
Should page migration be used instead? Should swap be divided into
per-node extents? Does swap readahead really serve a useful purpose,
or could we just delete that code? Should NUMA policy on a file be
determining NUMA policy on private swap copies of that file? Feel
free to ignore these questions, they're really not on your track;
but I can't glance at that code without wondering, and someone
reading this mail might have better ideas.)

Hugh

--- 2.6.10-rc3-mm1/fs/inode.c 2004-12-14 11:15:38.000000000 +0000
+++ linux/fs/inode.c 2004-12-14 12:02:01.751655096 +0000
@@ -178,12 +178,11 @@ void destroy_inode(struct inode *inode)
if (inode_has_buffers(inode))
BUG();
security_inode_free(inode);
+ mpol_free_shared_policy(&inode->i_mapping->policy);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
- else {
- mpol_free_shared_policy(&inode->i_mapping->policy);
+ else
kmem_cache_free(inode_cachep, (inode));
- }
}
EXPORT_SYMBOL(destroy_inode);

--- 2.6.10-rc3-mm1/include/linux/mempolicy.h 2004-12-14 11:15:39.000000000 +0000
+++ linux/include/linux/mempolicy.h 2004-12-14 12:02:01.784650080 +0000
@@ -156,6 +156,9 @@ struct page *alloc_page_shared_policy(un
extern void numa_default_policy(void);
extern void numa_policy_init(void);

+int generic_file_set_policy(struct vm_area_struct *, struct mempolicy *);
+struct mempolicy *generic_file_get_policy(struct vm_area_struct *, unsigned long);
+
#else

struct mempolicy {};
--- 2.6.10-rc3-mm1/include/linux/mm.h 2004-12-14 11:15:39.000000000 +0000
+++ linux/include/linux/mm.h 2004-12-14 12:02:01.834642480 +0000
@@ -555,15 +555,10 @@ extern void show_free_areas(void);
#ifdef CONFIG_SHMEM
struct page *shmem_nopage(struct vm_area_struct *vma,
unsigned long address, int *type);
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
-struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
- unsigned long addr);
int shmem_lock(struct file *file, int lock, struct user_struct *user);
#else
#define shmem_nopage filemap_nopage
#define shmem_lock(a, b, c) ({0;}) /* always in memory, no need to lock */
-#define shmem_set_policy(a, b) (0)
-#define shmem_get_policy(a, b) (NULL)
#endif
struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);

--- 2.6.10-rc3-mm1/include/linux/shmem_fs.h 2004-10-18 22:56:50.000000000 +0100
+++ linux/include/linux/shmem_fs.h 2004-12-14 12:02:01.835642328 +0000
@@ -14,7 +14,6 @@ struct shmem_inode_info {
unsigned long alloced; /* data pages alloced to file */
unsigned long swapped; /* subtotal assigned to swap */
unsigned long next_index; /* highest alloced index + 1 */
- struct shared_policy policy; /* NUMA memory alloc policy */
struct page *i_indirect; /* top indirect blocks page */
swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* first blocks */
struct list_head swaplist; /* chain of maybes on swap */
--- 2.6.10-rc3-mm1/ipc/shm.c 2004-12-14 11:15:39.000000000 +0000
+++ linux/ipc/shm.c 2004-12-14 12:02:01.885634728 +0000
@@ -168,8 +168,8 @@ static struct vm_operations_struct shm_v
.close = shm_close, /* callback for when the vm-area is released */
.nopage = shmem_nopage,
#ifdef CONFIG_NUMA
- .set_policy = shmem_set_policy,
- .get_policy = shmem_get_policy,
+ .set_policy = generic_file_set_policy,
+ .get_policy = generic_file_get_policy,
#endif
};

--- 2.6.10-rc3-mm1/mm/shmem.c 2004-12-14 11:15:40.000000000 +0000
+++ linux/mm/shmem.c 2004-12-14 12:02:01.930627888 +0000
@@ -879,10 +879,10 @@ static struct page *shmem_swapin_async(s
return page;
}

-struct page *shmem_swapin(struct shmem_inode_info *info, swp_entry_t entry,
- unsigned long idx)
+struct page *shmem_swapin(struct address_space *mapping,
+ swp_entry_t entry, unsigned long idx)
{
- struct shared_policy *p = &info->policy;
+ struct shared_policy *p = &mapping->policy;
int i, num;
struct page *page;
unsigned long offset;
@@ -898,27 +898,13 @@ struct page *shmem_swapin(struct shmem_i
lru_add_drain(); /* Push any new pages onto the LRU now */
return shmem_swapin_async(p, entry, idx);
}
-
-static struct page *
-shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info,
- unsigned long idx)
-{
- return alloc_page_shared_policy(gfp, &info->policy, idx);
-}
#else
-static inline struct page *
-shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+static inline struct page *shmem_swapin(struct address_space *mapping,
+ swp_entry_t entry, unsigned long idx)
{
swapin_readahead(entry, 0, NULL);
return read_swap_cache_async(entry, NULL, 0);
}
-
-static inline struct page *
-shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
- unsigned long idx)
-{
- return alloc_page(gfp);
-}
#endif

/*
@@ -980,7 +966,7 @@ repeat:
inc_page_state(pgmajfault);
*type = VM_FAULT_MAJOR;
}
- swappage = shmem_swapin(info, swap, idx);
+ swappage = shmem_swapin(mapping, swap, idx);
if (!swappage) {
spin_lock(&info->lock);
entry = shmem_swp_alloc(info, idx, sgp);
@@ -1092,9 +1078,7 @@ repeat:

if (!filepage) {
spin_unlock(&info->lock);
- filepage = shmem_alloc_page(mapping_gfp_mask(mapping),
- info,
- idx);
+ filepage = page_cache_alloc(mapping, idx);
if (!filepage) {
shmem_unacct_blocks(info->flags, 1);
shmem_free_blocks(inode, 1);
@@ -1206,24 +1190,6 @@ static int shmem_populate(struct vm_area
return 0;
}

-#ifdef CONFIG_NUMA
-int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
-{
- struct inode *i = vma->vm_file->f_dentry->d_inode;
- return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
-}
-
-struct mempolicy *
-shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
-{
- struct inode *i = vma->vm_file->f_dentry->d_inode;
- unsigned long idx;
-
- idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
-}
-#endif
-
int shmem_lock(struct file *file, int lock, struct user_struct *user)
{
struct inode *inode = file->f_dentry->d_inode;
@@ -1293,7 +1259,6 @@ shmem_get_inode(struct super_block *sb,
case S_IFREG:
inode->i_op = &shmem_inode_operations;
inode->i_fop = &shmem_file_operations;
- mpol_shared_policy_init(&info->policy);
break;
case S_IFDIR:
inode->i_nlink++;
@@ -1303,11 +1268,6 @@ shmem_get_inode(struct super_block *sb,
inode->i_fop = &simple_dir_operations;
break;
case S_IFLNK:
- /*
- * Must not load anything in the rbtree,
- * mpol_free_shared_policy will not be called.
- */
- mpol_shared_policy_init(&info->policy);
break;
}
} else if (sbinfo) {
@@ -2026,10 +1986,6 @@ static struct inode *shmem_alloc_inode(s

static void shmem_destroy_inode(struct inode *inode)
{
- if ((inode->i_mode & S_IFMT) == S_IFREG) {
- /* only struct inode is valid if it's an inline symlink */
- mpol_free_shared_policy(&SHMEM_I(inode)->policy);
- }
kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
}

@@ -2135,8 +2091,8 @@ static struct vm_operations_struct shmem
.nopage = shmem_nopage,
.populate = shmem_populate,
#ifdef CONFIG_NUMA
- .set_policy = shmem_set_policy,
- .get_policy = shmem_get_policy,
+ .set_policy = generic_file_set_policy,
+ .get_policy = generic_file_get_policy,
#endif
};


2005-01-05 23:03:16

by Steve Longerbeam

Subject: Re: page migration patchset

Source: MontaVista Software, Inc., Steve Longerbeam <[email protected]>
Type: Enhancement
Disposition: merge to kernel.org
Acked-by: Andi Kleen <[email protected]>
Description:
Patches NUMA mempolicy to allow policies for file mappings. Also adds
a new mbind() flag that attempts to move existing anonymous and
filemap pages that do not satisfy a mapping's policy (MPOL_MF_MOVE).

diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/fs/cachefs/block.c linux-2.6.10-rc1-mm5/fs/cachefs/block.c
--- linux-2.6.10-rc1-mm5.orig/fs/cachefs/block.c 2004-11-12 10:23:13.000000000 -0800
+++ linux-2.6.10-rc1-mm5/fs/cachefs/block.c 2004-11-15 10:59:33.116735936 -0800
@@ -374,7 +374,7 @@
mapping = super->imisc->i_mapping;

ret = -ENOMEM;
- newpage = page_cache_alloc_cold(mapping);
+ newpage = page_cache_alloc_cold(mapping, block->bix);
if (!newpage)
goto error;

diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/fs/inode.c linux-2.6.10-rc1-mm5/fs/inode.c
--- linux-2.6.10-rc1-mm5.orig/fs/inode.c 2004-11-12 10:23:12.000000000 -0800
+++ linux-2.6.10-rc1-mm5/fs/inode.c 2004-11-12 10:25:43.000000000 -0800
@@ -152,6 +152,7 @@
mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+ mpol_shared_policy_init(&mapping->policy);

/*
* If the block_device provides a backing_dev_info for client
@@ -179,8 +180,10 @@
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
- else
+ else {
+ mpol_free_shared_policy(&inode->i_mapping->policy);
kmem_cache_free(inode_cachep, (inode));
+ }
}
EXPORT_SYMBOL(destroy_inode);

diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/include/linux/fs.h linux-2.6.10-rc1-mm5/include/linux/fs.h
--- linux-2.6.10-rc1-mm5.orig/include/linux/fs.h 2004-11-12 10:24:05.000000000 -0800
+++ linux-2.6.10-rc1-mm5/include/linux/fs.h 2004-11-12 10:25:43.000000000 -0800
@@ -18,6 +18,7 @@
#include <linux/cache.h>
#include <linux/prio_tree.h>
#include <linux/kobject.h>
+#include <linux/mempolicy.h>
#include <asm/atomic.h>

struct iovec;
@@ -352,6 +353,7 @@
struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+ struct shared_policy policy; /* page alloc policy */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/include/linux/mempolicy.h linux-2.6.10-rc1-mm5/include/linux/mempolicy.h
--- linux-2.6.10-rc1-mm5.orig/include/linux/mempolicy.h 2004-11-12 10:24:05.000000000 -0800
+++ linux-2.6.10-rc1-mm5/include/linux/mempolicy.h 2004-11-12 10:25:43.000000000 -0800
@@ -22,6 +22,8 @@

/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+#define MPOL_MF_MOVE (1<<1) /* Attempt to move pages in mapping that do
+ not satisfy policy */

#ifdef __KERNEL__

@@ -149,7 +151,8 @@
void mpol_free_shared_policy(struct shared_policy *p);
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
unsigned long idx);
-
+struct page *alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+ unsigned long idx);
extern void numa_default_policy(void);
extern void numa_policy_init(void);

@@ -215,6 +218,13 @@
#define vma_policy(vma) NULL
#define vma_set_policy(vma, pol) do {} while(0)

+static inline struct page *
+alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+ unsigned long idx)
+{
+ return alloc_pages(gfp, 0);
+}
+
static inline void numa_policy_init(void)
{
}
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/include/linux/page-flags.h linux-2.6.10-rc1-mm5/include/linux/page-flags.h
--- linux-2.6.10-rc1-mm5.orig/include/linux/page-flags.h 2004-11-12 10:23:55.000000000 -0800
+++ linux-2.6.10-rc1-mm5/include/linux/page-flags.h 2004-11-12 10:25:43.000000000 -0800
@@ -75,6 +75,8 @@
#define PG_swapcache 16 /* Swap page: swp_entry_t in private */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
+#define PG_sharedpolicy 19 /* Page was allocated for a file
+ mapping using a shared_policy */


/*
@@ -293,6 +295,10 @@
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)

+#define PageSharedPolicy(page) test_bit(PG_sharedpolicy, &(page)->flags)
+#define SetPageSharedPolicy(page) set_bit(PG_sharedpolicy, &(page)->flags)
+#define ClearPageSharedPolicy(page) clear_bit(PG_sharedpolicy, &(page)->flags)
+
#ifdef CONFIG_SWAP
#define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
#define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/include/linux/pagemap.h linux-2.6.10-rc1-mm5/include/linux/pagemap.h
--- linux-2.6.10-rc1-mm5.orig/include/linux/pagemap.h 2004-11-12 10:23:56.000000000 -0800
+++ linux-2.6.10-rc1-mm5/include/linux/pagemap.h 2004-11-12 10:25:43.000000000 -0800
@@ -50,14 +50,24 @@
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

-static inline struct page *page_cache_alloc(struct address_space *x)
+
+static inline struct page *__page_cache_alloc(struct address_space *x,
+ unsigned long idx,
+ unsigned int gfp_mask)
+{
+ return alloc_page_shared_policy(gfp_mask, &x->policy, idx);
+}
+
+static inline struct page *page_cache_alloc(struct address_space *x,
+ unsigned long idx)
{
- return alloc_pages(mapping_gfp_mask(x), 0);
+ return __page_cache_alloc(x, idx, mapping_gfp_mask(x));
}

-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+ unsigned long idx)
{
- return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+ return __page_cache_alloc(x, idx, mapping_gfp_mask(x)|__GFP_COLD);
}

typedef int filler_t(void *, struct page *);
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/mm/filemap.c linux-2.6.10-rc1-mm5/mm/filemap.c
--- linux-2.6.10-rc1-mm5.orig/mm/filemap.c 2004-11-12 10:25:07.000000000 -0800
+++ linux-2.6.10-rc1-mm5/mm/filemap.c 2004-11-12 10:25:43.000000000 -0800
@@ -586,7 +586,8 @@
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page = __page_cache_alloc(mapping, index,
+ gfp_mask);
if (!cached_page)
return NULL;
}
@@ -679,7 +680,7 @@
return NULL;
}
gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
- page = alloc_pages(gfp_mask, 0);
+ page = __page_cache_alloc(mapping, index, gfp_mask);
if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
page_cache_release(page);
page = NULL;
@@ -866,7 +867,7 @@
* page..
*/
if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
+ cached_page = page_cache_alloc_cold(mapping, index);
if (!cached_page) {
desc->error = -ENOMEM;
goto out;
@@ -1129,7 +1130,7 @@
struct page *page;
int error;

- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, offset);
if (!page)
return -ENOMEM;

@@ -1519,9 +1520,35 @@
return page->mapping->a_ops->page_mkwrite(page);
}

+
+#ifdef CONFIG_NUMA
+int generic_file_set_policy(struct vm_area_struct *vma,
+ struct mempolicy *new)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ return mpol_set_shared_policy(&mapping->policy, vma, new);
+}
+
+struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ unsigned long idx;
+
+ idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ return mpol_shared_policy_lookup(&mapping->policy, idx);
+}
+#endif
+
+
struct vm_operations_struct generic_file_vm_ops = {
.nopage = filemap_nopage,
.populate = filemap_populate,
+#ifdef CONFIG_NUMA
+ .set_policy = generic_file_set_policy,
+ .get_policy = generic_file_get_policy,
+#endif
};

struct vm_operations_struct generic_file_vm_mkwr_ops = {
@@ -1580,7 +1607,7 @@
page = find_get_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
+ cached_page = page_cache_alloc_cold(mapping, index);
if (!cached_page)
return ERR_PTR(-ENOMEM);
}
@@ -1662,7 +1689,7 @@
page = find_lock_page(mapping, index);
if (!page) {
if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
+ *cached_page = page_cache_alloc(mapping, index);
if (!*cached_page)
return NULL;
}
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/mm/mempolicy.c linux-2.6.10-rc1-mm5/mm/mempolicy.c
--- linux-2.6.10-rc1-mm5.orig/mm/mempolicy.c 2004-11-12 10:25:07.000000000 -0800
+++ linux-2.6.10-rc1-mm5/mm/mempolicy.c 2004-11-16 10:14:06.135753597 -0800
@@ -2,6 +2,7 @@
* Simple NUMA memory policy for the Linux kernel.
*
* Copyright 2003,2004 Andi Kleen, SuSE Labs.
+ * Copyright 2004 Steve Longerbeam, MontaVista Software.
* Subject to the GNU Public License, version 2.
*
* NUMA policy allows the user to give hints in which node(s) memory should
@@ -47,15 +48,28 @@
*/

/* Notebook:
- fix mmap readahead to honour policy and enable policy for any page cache
- object
- statistics for bigpages
- global policy for page cache? currently it uses process policy. Requires
- first item above.
+ Page cache pages can now be policied, by adding a shared_policy tree to
+ inodes (actually located in address_space). One entry in the tree for
+ each mapped region of a file. Generic files now have set_policy and
+ get_policy methods in generic_file_vm_ops [stevel].
+
+ Added a page-move feature, whereby existing pte-mapped or filemap
+ pagecache pages that are/can be mapped to the given virtual memory
+ region, that do not satisfy the NUMA policy, are moved to a new
+ page that satisfies the policy. Enabled by the new mbind flag
+ MPOL_MF_MOVE [stevel].
+
+ statistics for bigpages.
+
+ global policy for page cache? currently it uses per-file policies in
+ address_space (see first item above).
+
handle mremap for shared memory (currently ignored for the policy)
grows down?
+
make bind policy root only? It can trigger oom much faster and the
kernel is not always grateful with that.
+
could replace all the switch()es with a mempolicy_ops structure.
*/

@@ -66,6 +80,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/nodemask.h>
#include <linux/cpuset.h>
#include <linux/gfp.h>
@@ -76,6 +91,9 @@
#include <linux/init.h>
#include <linux/compat.h>
#include <linux/mempolicy.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/pgalloc.h>
#include <asm/uaccess.h>

static kmem_cache_t *policy_cache;
@@ -236,33 +254,225 @@
return policy;
}

-/* Ensure all existing pages follow the policy. */
+
+/* Return effective policy for a VMA */
+static struct mempolicy *
+get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = current->mempolicy;
+
+ if (vma) {
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ pol = vma->vm_ops->get_policy(vma, addr);
+ else if (vma->vm_policy &&
+ vma->vm_policy->policy != MPOL_DEFAULT)
+ pol = vma->vm_policy;
+ }
+ if (!pol)
+ pol = &default_policy;
+ return pol;
+}
+
+
+/* Find secondary valid nodes for an allocation */
+static int __mpol_node_valid(int nid, struct mempolicy *pol)
+{
+ switch (pol->policy) {
+ case MPOL_PREFERRED:
+ case MPOL_DEFAULT:
+ case MPOL_INTERLEAVE:
+ return 1;
+ case MPOL_BIND: {
+ struct zone **z;
+ for (z = pol->v.zonelist->zones; *z; z++)
+ if ((*z)->zone_pgdat->node_id == nid)
+ return 1;
+ return 0;
+ }
+ default:
+ BUG();
+ return 0;
+ }
+}
+
+int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr)
+{
+ return __mpol_node_valid(nid, get_vma_policy(vma, addr));
+}
+
+/*
+ * The given page doesn't match a file mapped VMA's policy. If the
+ * page is unused, remove it from the page cache, so that a new page
+ * can be later reallocated to the cache using the correct policy.
+ * Returns 0 if the page was removed from the cache, < 0 if failed.
+ *
+ * We use invalidate_mapping_pages(), which doesn't try very hard.
+ * It won't remove pages which are locked (won't wait for a lock),
+ * dirty, under writeback, or mapped by pte's. All the latter are
+ * valid checks for us, but we might be able to improve our success
+ * by waiting for a lock.
+ */
+static int
+remove_invalid_filemap_page(struct page * page,
+ struct vm_area_struct *vma,
+ pgoff_t pgoff)
+{
+ /*
+ * the page in the cache is not in any of the nodes this
+ * VMA's policy wants it to be in. Can we remove it?
+ */
+ if (!PageSharedPolicy(page) &&
+ invalidate_mapping_pages(vma->vm_file->f_mapping,
+ pgoff, pgoff) > 0) {
+ PDprintk("removed cache page in node %ld, "
+ "pgoff=%lu, for %s\n",
+ page_to_nid(page), pgoff,
+ vma->vm_file->f_dentry->d_name.name);
+ return 0;
+ }
+
+ /*
+ * the page is being used by other pagetable mappings,
+ * or is currently locked, dirty, or under writeback.
+ */
+ PDprintk("could not remove cache page in node %ld, "
+ "pgoff=%lu, for %s\n",
+ page_to_nid(page), pgoff,
+ vma->vm_file->f_dentry->d_name.name);
+ return -EIO;
+}
+
+/*
+ * The given page doesn't match a VMA's policy. Allocate a new
+ * page using the policy, copy contents from old to new, free
+ * the old page, map in the new page. This looks a lot like a COW.
+ */
+static int
+move_invalid_page(struct page * page, struct mempolicy *pol,
+ struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page * new_page;
+ struct vm_area_struct pvma;
+ pte_t *page_table;
+ pte_t entry;
+
+ PDprintk("moving anon page in node %ld, address=%08lx\n",
+ page_to_nid(page), addr);
+
+ if (!PageReserved(page))
+ page_cache_get(page);
+ spin_unlock(&mm->page_table_lock);
+ if (unlikely(anon_vma_prepare(vma)))
+ goto err_no_mem;
+
+ /* Create a pseudo vma that just contains the policy */
+ memset(&pvma, 0, sizeof(struct vm_area_struct));
+ pvma.vm_end = PAGE_SIZE;
+ pvma.vm_pgoff = vma->vm_pgoff;
+ pvma.vm_policy = pol;
+ new_page = alloc_page_vma(GFP_HIGHUSER, &pvma, addr);
+ if (!new_page)
+ goto err_no_mem;
+
+ copy_user_highpage(new_page, page, addr);
+
+ spin_lock(&mm->page_table_lock);
+ page_table = pte_offset_map(pmd, addr);
+ if (!PageReserved(page))
+ page_remove_rmap(page);
+
+ flush_cache_page(vma, addr);
+ entry = pte_mkdirty(mk_pte(new_page, vma->vm_page_prot));
+ if (likely(vma->vm_flags & VM_WRITE))
+ entry = pte_mkwrite(entry);
+ ptep_establish(vma, addr, page_table, entry);
+ update_mmu_cache(vma, addr, entry);
+ lru_cache_add_active(new_page);
+ page_add_anon_rmap(new_page, vma, addr);
+
+ pte_unmap(page_table);
+ page_cache_release(page); /* release our ref on the old page */
+ page_cache_release(page); /* release our pte ref on the old page */
+ return 0;
+
+ err_no_mem:
+ spin_lock(&mm->page_table_lock);
+ return -ENOMEM;
+}
+
+/* Ensure all existing pages in a VMA follow the policy. */
static int
-verify_pages(struct mm_struct *mm,
- unsigned long addr, unsigned long end, unsigned long *nodes)
+move_verify_pages(struct vm_area_struct *vma, struct mempolicy *pol,
+ unsigned long flags)
{
- while (addr < end) {
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr;
+ unsigned long start = vma->vm_start;
+ unsigned long end = vma->vm_end;
+
+ if (!(flags & (MPOL_MF_MOVE | MPOL_MF_STRICT)))
+ return 0;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
struct page *p;
pte_t *pte;
pmd_t *pmd;
pgd_t *pgd;
pml4_t *pml4;
+ int err;
+
+ /*
+ * first, if this is a file mapping and we are moving pages,
+ * check for invalid page cache pages, and if they are unused,
+ * remove.
+ */
+ if (vma->vm_ops && vma->vm_ops->nopage) {
+ struct address_space *mapping =
+ vma->vm_file->f_mapping;
+ unsigned long pgoff =
+ ((addr - vma->vm_start) >> PAGE_CACHE_SHIFT) +
+ vma->vm_pgoff;
+
+ p = find_get_page(mapping, pgoff);
+ if (p) {
+ err = 0;
+ if (!__mpol_node_valid(page_to_nid(p), pol)) {
+ if (!(flags & MPOL_MF_MOVE))
+ err = -EIO;
+ else
+ err = remove_invalid_filemap_page(
+ p,vma,pgoff);
+ }
+ page_cache_release(p); /* find_get_page */
+ if (err && (flags & MPOL_MF_STRICT))
+ return err;
+ }
+ }
+
+ /*
+ * Now let's see if there is a pte-mapped page that doesn't
+ * satisfy the policy. Because of the above, we can be sure
+ * from here that, if there is a VMA page that's pte-mapped
+ * and it belongs to the page cache, it either satisfies the
+ * policy, or we don't mind if it doesn't (MF_STRICT not set).
+ */
+ spin_lock(&mm->page_table_lock);
pml4 = pml4_offset(mm, addr);
if (pml4_none(*pml4)) {
- unsigned long next = (addr + PML4_SIZE) & PML4_MASK;
- if (next > addr)
- break;
- addr = next;
+ spin_unlock(&mm->page_table_lock);
continue;
}
pgd = pml4_pgd_offset(pml4, addr);
+
if (pgd_none(*pgd)) {
- addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
+ spin_unlock(&mm->page_table_lock);
continue;
}
pmd = pmd_offset(pgd, addr);
if (pmd_none(*pmd)) {
- addr = (addr + PMD_SIZE) & PMD_MASK;
+ spin_unlock(&mm->page_table_lock);
continue;
}
p = NULL;
@@ -271,19 +481,29 @@
p = pte_page(*pte);
pte_unmap(pte);
if (p) {
- unsigned nid = page_to_nid(p);
- if (!test_bit(nid, nodes))
- return -EIO;
+ err = 0;
+ if (!__mpol_node_valid(page_to_nid(p), pol)) {
+ if (!(flags & MPOL_MF_MOVE))
+ err = -EIO;
+ else
+ err = move_invalid_page(p, pol, vma,
+ addr, pmd);
+ }
+ if (err && (flags & MPOL_MF_STRICT)) {
+ spin_unlock(&mm->page_table_lock);
+ return err;
+ }
}
- addr += PAGE_SIZE;
+ spin_unlock(&mm->page_table_lock);
}
+
return 0;
}

/* Step 1: check the range */
static struct vm_area_struct *
check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
- unsigned long *nodes, unsigned long flags)
+ struct mempolicy *policy, unsigned long flags)
{
int err;
struct vm_area_struct *first, *vma, *prev;
@@ -297,9 +517,8 @@
return ERR_PTR(-EFAULT);
if (prev && prev->vm_end < vma->vm_start)
return ERR_PTR(-EFAULT);
- if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
- err = verify_pages(vma->vm_mm,
- vma->vm_start, vma->vm_end, nodes);
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_STRICT)) {
+ err = move_verify_pages(vma, policy, flags);
if (err) {
first = ERR_PTR(err);
break;
@@ -366,12 +585,13 @@
DECLARE_BITMAP(nodes, MAX_NUMNODES);
int err;

- if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+ if ((flags & ~(unsigned long)(MPOL_MF_STRICT | MPOL_MF_MOVE)) ||
+ mode > MPOL_MAX)
return -EINVAL;
if (start & ~PAGE_MASK)
return -EINVAL;
if (mode == MPOL_DEFAULT)
- flags &= ~MPOL_MF_STRICT;
+ flags &= ~(MPOL_MF_STRICT | MPOL_MF_MOVE);
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
end = start + len;
if (end < start)
@@ -391,7 +611,7 @@
mode,nodes[0]);

down_write(&mm->mmap_sem);
- vma = check_range(mm, start, end, nodes, flags);
+ vma = check_range(mm, start, end, new, flags);
err = PTR_ERR(vma);
if (!IS_ERR(vma))
err = mbind_range(vma, start, end, new);
@@ -624,24 +844,6 @@

#endif

-/* Return effective policy for a VMA */
-static struct mempolicy *
-get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
-{
- struct mempolicy *pol = current->mempolicy;
-
- if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
- pol = vma->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy &&
- vma->vm_policy->policy != MPOL_DEFAULT)
- pol = vma->vm_policy;
- }
- if (!pol)
- pol = &default_policy;
- return pol;
-}
-
/* Return a zonelist representing a mempolicy */
static struct zonelist *zonelist_policy(unsigned gfp, struct mempolicy *policy)
{
@@ -882,28 +1084,6 @@
return 0;
}

-/* Find secondary valid nodes for an allocation */
-int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr)
-{
- struct mempolicy *pol = get_vma_policy(vma, addr);
-
- switch (pol->policy) {
- case MPOL_PREFERRED:
- case MPOL_DEFAULT:
- case MPOL_INTERLEAVE:
- return 1;
- case MPOL_BIND: {
- struct zone **z;
- for (z = pol->v.zonelist->zones; *z; z++)
- if ((*z)->zone_pgdat->node_id == nid)
- return 1;
- return 0;
- }
- default:
- BUG();
- return 0;
- }
-}

/*
* Shared memory backing store policy support.
@@ -1023,10 +1203,14 @@
/* Take care of old policies in the same range. */
while (n && n->start < end) {
struct rb_node *next = rb_next(&n->nd);
- if (n->start >= start) {
- if (n->end <= end)
+ if (n->start == start && n->end == end &&
+ mpol_equal(n->policy, new->policy)) {
+ /* the same shared policy already exists, just exit */
+ goto out;
+ } else if (n->start >= start) {
+ if (n->end <= end) {
sp_delete(sp, n);
- else
+ } else
n->start = end;
} else {
/* Old policy spanning whole new range. */
@@ -1052,6 +1236,7 @@
}
if (new)
sp_insert(sp, new);
+ out:
spin_unlock(&sp->lock);
if (new2) {
mpol_free(new2->policy);
@@ -1103,6 +1288,37 @@
spin_unlock(&p->lock);
}

+struct page *
+alloc_page_shared_policy(unsigned gfp, struct shared_policy *sp,
+ unsigned long idx)
+{
+ struct page *page;
+ struct mempolicy * shared_pol = NULL;
+
+ if (sp->root.rb_node) {
+ struct vm_area_struct pvma;
+ /* Create a pseudo vma that just contains the policy */
+ memset(&pvma, 0, sizeof(struct vm_area_struct));
+ pvma.vm_end = PAGE_SIZE;
+ pvma.vm_pgoff = idx;
+ shared_pol = mpol_shared_policy_lookup(sp, idx);
+ pvma.vm_policy = shared_pol;
+ page = alloc_page_vma(gfp, &pvma, 0);
+ mpol_free(pvma.vm_policy);
+ } else {
+ page = alloc_pages(gfp, 0);
+ }
+
+ if (page) {
+ if (shared_pol)
+ SetPageSharedPolicy(page);
+ else
+ ClearPageSharedPolicy(page);
+ }
+
+ return page;
+}
+
/* assumes fs == KERNEL_DS */
void __init numa_policy_init(void)
{
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/mm/readahead.c linux-2.6.10-rc1-mm5/mm/readahead.c
--- linux-2.6.10-rc1-mm5.orig/mm/readahead.c 2004-11-12 10:25:08.000000000 -0800
+++ linux-2.6.10-rc1-mm5/mm/readahead.c 2004-11-12 10:30:08.000000000 -0800
@@ -246,7 +246,7 @@
continue;

read_unlock_irq(&mapping->tree_lock);
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold(mapping, page_offset);
read_lock_irq(&mapping->tree_lock);
if (!page)
break;
diff -Nuar -X /home/stevel/dontdiff linux-2.6.10-rc1-mm5.orig/mm/shmem.c linux-2.6.10-rc1-mm5/mm/shmem.c
--- linux-2.6.10-rc1-mm5.orig/mm/shmem.c 2004-11-12 10:25:07.000000000 -0800
+++ linux-2.6.10-rc1-mm5/mm/shmem.c 2004-11-12 10:25:43.000000000 -0800
@@ -903,16 +903,7 @@
shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info,
unsigned long idx)
{
- struct vm_area_struct pvma;
- struct page *page;
-
- memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
- pvma.vm_pgoff = idx;
- pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
- mpol_free(pvma.vm_policy);
- return page;
+ return alloc_page_shared_policy(gfp, &info->policy, idx);
}
#else
static inline struct page *


Attachments:
mempol-2.6.10-rc1-mm5.filemap-policy.patch (23.41 kB)

2005-01-05 23:16:41

by Ray Bryant

Subject: Re: page migration patchset

Steve Longerbeam wrote:

>
> you mean like a global mempolicy for the page cache? This shouldn't
> be difficult to integrate with my patch, ie. when allocating a page
> for the cache, first check if the mapping object has a policy (my patch),
> if not, then check if there is a global pagecache policy (your patch).
>

Yes, I think that's exactly what I am thinking of.

I'll take a look at your patch and see what develops. :-)
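
Roughly the fallback order I have in mind when allocating a page cache page
(sketch only -- "pagecache_shared_policy" is a made-up name for the global
policy object, and the real hooks would live in page_cache_alloc() /
alloc_page_shared_policy()):

/* hypothetical global policy object for the page cache (made-up name) */
extern struct shared_policy pagecache_shared_policy;

static struct page *pagecache_alloc_policied(struct address_space *mapping,
					     unsigned long idx, unsigned int gfp)
{
	/* 1) per-file policy attached to the mapping (Steve's patch) */
	if (mapping->policy.root.rb_node)
		return alloc_page_shared_policy(gfp, &mapping->policy, idx);

	/* 2) global page cache policy, e.g. round-robin across nodes */
	if (pagecache_shared_policy.root.rb_node)
		return alloc_page_shared_policy(gfp,
						&pagecache_shared_policy, idx);

	/* 3) otherwise the normal allocation path */
	return alloc_pages(gfp, 0);
}
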
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-06 14:43:13

by Andi Kleen

Subject: Re: page migration patchset

On Wed, Jan 05, 2005 at 03:56:29PM -0800, Steve Longerbeam wrote:
> Hugetlbfs is also defining its own shared policy RB tree in its
> inode info struct, but it doesn't seem to be used, just initialized
> and freed at alloc/destroy inode time. Does anyone know why that
> is there? A place-holder for future hugetlbfs mempolicy support?
> If so, it can be removed and use the generic_file policies instead.

You need lazy hugetlbfs to use it (= allocate at page fault time,
not mmap time). Otherwise the policy can never be applied. I implemented
my own version of lazy allocation for SLES9, but when I wanted to
merge it into mainline, some other people told me they had a much better
singing&dancing lazy hugetlb patch. So I waited for them, but they
never went forward with their stuff and their code seems to be dead
now. So this is still a dangling end :/

If nothing happens soon regarding the "other" hugetlb code I will
forward port my SLES9 code. It already has NUMA policy support.

For now you can remove the hugetlb policy code from mainline if you
want, it would be easy to readd it when lazy hugetlbfs is merged.

>
> >(And I still don't know what should be done about NUMA policy versus
> >swap: it has not been anyone's priority, but swapin_readahead's NUMA
> >belief that swap is laid out linearly following vmas is quite wrong.
> >Should page migration be used instead? Should swap be divided into
> >per-node extents? Does swap readahead really serve a useful purpose,
> >or could we just delete that code? Should NUMA policy on a file be
> >determining NUMA policy on private swap copies of that file?

It's on my TODO list, but I haven't had time to work on it. But Steve's
simple-minded page migration is probably the right way to fix it anyway;
once it is in, it just needs some extension.

Basically you would delete the code and then later migrate the pages.
Not very nice, but I didn't come up with a better design so far.

-Andi

2005-01-06 15:59:13

by Ray Bryant

Subject: Re: page migration patchset

Andi Kleen wrote:

>
> You need lazy hugetlbfs to use it (= allocate at page fault time,
> not mmap time). Otherwise the policy can never be applied. I implemented
> my own version of lazy allocation for SLES9, but when I wanted to
> merge it into mainline, some other people told me they had a much better
> singing&dancing lazy hugetlb patch. So I waited for them, but they
> never went forward with their stuff and their code seems to be dead
> now. So this is still a dangling end :/
>
> If nothing happens soon regarding the "other" hugetlb code I will
> forward port my SLES9 code. It already has NUMA policy support.

Andi,

I too have been frustrated by this process. I think Christoph Lameter
at SGI is looking at forward porting the "old" lazy hugetlbpage allocation
code. Of course, the proof is in the "doing" of this and I am not sure
what other priorities he has at the moment.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-06 18:03:32

by Christoph Lameter

Subject: Re: page migration patchset

On Thu, 6 Jan 2005, Ray Bryant wrote:

> > If nothing happens soon regarding the "other" hugetlb code I will
> > forward port my SLES9 code. It already has NUMA policy support.
> I too have been frustrated by this process. I think Christoph Lameter
> at SGI is looking at forward porting the "old" lazy hugetlbpage allocation
> code. Of course, the proof is in the "doing" of this and I am not sure
> what other priorities he has at the moment.

Sorry, I did not have time to continue the hugetlb stuff in the face of other
things that came up in the fall. I ported the stuff to 2.6.10 yesterday,
but it still needs some rework.

Could you send me the most up-to-date version of the SLES9 stuff, including
any unintegrated changes? I can work through this next week, I believe, and
post a new huge pages patch.

2005-01-06 19:38:48

by Andi Kleen

Subject: Re: page migration patchset

> Sorry I did not have time to continue the huge stuff in face of other
> things that came up in the fall. I ported the stuff to 2.6.10 yesterday
> but it still needs some rework.
>
> Could you sent me the most up to date version of the SLES9 stuff including
> any unintegrated changes? I can work though this next week I believe and
> post a new huge pages patch.

It's a lot of split out patches. I put a tarball of all
of them at ftp.suse.com:/pub/people/ak/huge/huge.tar.gz
They probably depend on other patches in SLES9. The patches
are not very cleanly split, since we just fixed problems
one by one without refactoring later.

Also, there is at least one mysterious bug in it that probably
would need to be fixed before merging :/ I hope this
gets resolved soon, however. Other than that issue, they're well
tested and work fine.

-Andi

2005-01-06 22:33:35

by William Lee Irwin III

Subject: Re: page migration patchset

On Thu, Jan 06, 2005 at 03:43:07PM +0100, Andi Kleen wrote:
> If nothing happens soon regarding the "other" hugetlb code I will
> forward port my SLES9 code. It already has NUMA policy support.
> For now you can remove the hugetlb policy code from mainline if you

This is not specifically directed at Andi...

I am rather unhappy with various activities ``surrounding'' hugetlb,
as I've received less than zero assistance in bughunting or fielding
problem reports or actual end-user requests. Instead, there are feature
patches, which of course ``compete with'' what I've already written
myself (completely crushed by private permavetoes before they were ever
posted, of course) and screw me out of credit for having done anything
while I slave away to fix bugs.

There is a relatively consistent pattern of my being steamrolled over
that I'm rather sick of. Of course, this all exploits the fact that there
can be no like response, as the entire force of the attack is
derived from upstream backing, which, once obtained by one party, is
forever unavailable to others.

There are several problems occurring beyond what appears to be a very
strong social bias against my own work.

The first is strong architectural bias and weak or absent architectural
code sweeps on the parts of contributors. This has caused
nonfunctionality and bugs apparent upon inspection in the ``less
favored'' architectures.

The second is zero bugfixing or cross-architecture testing activity
apart from my own. This and the first in conjunction cast serious
doubt upon and increase the need for heavy critical examination of
so-called ``consolidation'' patches.

The third is inattention to backward compatibility. The operational
characteristics of hugetlb, however odd they may be, are effectively
set in stone by the requirement for backward compatibility. The mmap()
vs. truncate() behavioral changes were the first of the deviations from
this, and were rather ugly ones that put hugetlb at great variance with
all normal filesystems in its lack of support for expanding truncate
and bizarre expansion behavior during mmap(), which were neither
forward nor backward compatible. Changes in COW behavior directly
threaten to create a similar backward compatibility nightmare, where
zero consideration of such has yet been given.

The fourth is the inattention to outstanding issues in need of repair.
For instance, hugetlb's locking, inherited from the system call code,
desperately needs to be normalized, and no individual attention has
been given to this by those with purportedly vested interests in
hugetlb, though apparently numerous locking rearrangements have
appeared while inappropriately mixed with other changes.

The fifth is that many of the patches I've been sent are apparently
predicated on the assumption that the authors are exempt from
compliance with Documentation/CodingStyle.

Sixth is that patch presentations have overall been poor enough to
consider them well below general kernel standards. This includes both
poor changelogging and lacking separation of distinct behavioral
changes into distinct patches.

These six issues act in concert to severely aggravate preexisting
chaos with no effort whatsoever expended on the parts of contributors
to mitigate or correct it.

Obviously, I have no recourse, otherwise there would be no credible
threat of this kind of end-run tactic succeeding, and I've apparently
already been circumvented by pushing the things to distros anyway. So
I can do no more than kindly ask you to address issues 1-6 in your
patch presentations.

Not that I expect anyone to listen. No one ever has before. In fact,
given the precedents, it's more likely for this to provoke verbal and
several other kinds of retaliation than any kind of cooperation or
ostensibly useful effect. The only rational motive for this post is to
leave some kind of public record that I've been screwed over, unlike
the various other instances where I silently ``took it''. In all other
respects I will be heavily penalized for it.


-- wli

2005-01-06 23:14:03

by Andrew Morton

Subject: Re: page migration patchset

William Lee Irwin III <[email protected]> wrote:
>
> There is a relatively consistent pattern of my being steamrolled over
> that I'm rather sick of.

That's news to me.

I do recall some months ago that there were a whole bunch of patches doing
a whole bunch of stuff and I was concerned that there was an absence of a
central coordinating role. But then everything went quiet.

If you have time/inclination to marshal the hugetlb efforts then for
heaven's sake, send in a MAINTAINERS record and let's roll the sleeves up.

> and I've apparently
> already been circumvented by pushing the things to distros anyway.

aaargh. A pox upon the people who did that. Well if they find their
upstream later breaks things then tough luck.

Look. All interested parties should subscribe to linux-mm
([email protected]). Let's get all the patches on the table there and
work through them, asap. We know how to do this.

2005-01-06 23:19:16

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page migration patchset

William Lee Irwin III <[email protected]> wrote:
>> There is a relatively consistent pattern of my being steamrolled over
>> I'm rather sick of.

On Thu, Jan 06, 2005 at 03:08:42PM -0800, Andrew Morton wrote:
> That's news to me.
> I do recall some months ago that there were a whole bunch of patches doing
> a whole bunch of stuff and I was concerned that there was an absence of a
> central coordinating role. But then everything went quiet.
> If you have time/inclination to marshal the hugetlb efforts then for
> heaven's sake, send in a MAINTAINERS record and let's roll the sleeves up.


I'm being at least sometimes deferred to for hugetlb maintenance.
I also originally wrote the fs methods, and generally get stuck
working on it on a regular basis. So here is a MAINTAINERS entry
reflecting that.


Index: mm2-2.6.10/MAINTAINERS
===================================================================
--- mm2-2.6.10.orig/MAINTAINERS 2005-01-06 09:42:03.000000000 -0800
+++ mm2-2.6.10/MAINTAINERS 2005-01-06 15:10:53.586581112 -0800
@@ -979,6 +979,11 @@
M: [email protected]
S: Maintained

+HUGETLB FILESYSTEM
+P: William Irwin
+M: [email protected]
+S: Maintained
+
I2C AND SENSORS DRIVERS
P: Greg Kroah-Hartman
M: [email protected]

2005-01-06 23:24:52

by Ray Bryant

[permalink] [raw]
Subject: Re: page migration patchset

William Lee Irwin III wrote:
> On Thu, Jan 06, 2005 at 03:43:07PM +0100, Andi Kleen wrote:
>
>>If nothing happens soon regarding the "other" hugetlb code I will
>>forward port my SLES9 code. It already has NUMA policy support.
>>For now you can remove the hugetlb policy code from mainline if you
>
>
> This is not specifically directed at Andi...
>

Who is it directed at then?

<snip>

> Obviously, I have no recourse, otherwise there would be no credible
> threat of this kind of end-run tactic succeeding, and I've apparently
> already been circumvented by pushing the things to distros anyway. So
> I can do no more than kindly ask you to address issues 1-6 in your
> patch presentations.
>

And who does "you" refer to here?

I'd point out that one of the reasons we have Christoph Lameter working
on this is that he is better at working with cross architecture type
stuff than I am, since I have neither the skills nor interest to do
such things (I'm much too focused on Altix specific problems).

So, I guess the question is, do you, wli, have allocate hugetlbpage on
fault code available somewhere that we, SGI, have somehow stepped on,
ignored, or not properly given credit for? SGI has a strong requirement
to eliminate the current "allocate hugetlb pages at mmap() time",
single-threaded allocation. (We have sold machines where it would
take thousands of seconds to complete that operation as it is
currently coded in the mainline.)

We need the allocate on fault hugetlbpage code. We worked quite hard
to get that code to behave the same way wrt out of memory failures as the
existing code. To say that we didn't worry about backwards
compatibility there (at least in that regard) is simply absurd.

But I care not where this code comes from. If it works, meets our
scaling requirements, and can get accepted into the mainline, then
I am all for it. And I will happily give credit where credit is
due.

However, at the present time it appears that if we want this code in the
mainline, we will have to bring it up to level and push it upstream,
and that is what Christoph is working on.

When that happens, the code is subject to review and we look forward
to working with you to resolve your concerns (1)-(6) wrt to those
patches.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-06 23:40:53

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page migration patchset

On Thu, Jan 06, 2005 at 03:43:07PM +0100, Andi Kleen wrote:
>>> If nothing happens soon regarding the "other" hugetlb code I will
>>> forward port my SLES9 code. It already has NUMA policy support.
>>> For now you can remove the hugetlb policy code from mainline if you

William Lee Irwin III wrote:
>> This is not specifically directed at Andi...

On Thu, Jan 06, 2005 at 05:21:22PM -0600, Ray Bryant wrote:
> Who is it directed at then?
> And who does "you" refer to here?

The set of people who have contributed hugetlb patches, plus akpm.


William Lee Irwin III wrote:
>> Obviously, I have no recourse, otherwise there would be no credible
>> threat of this kind of end-run tactic succeeding, and I've apparently
>> already been circumvented by pushing the things to distros anyway. So
>> I can do no more than kindly ask you to address issues 1-6 in your
>> patch presentations.

On Thu, Jan 06, 2005 at 05:21:22PM -0600, Ray Bryant wrote:
> I'd point out that one of the reasons we have Christoph Lameter working
> on this is that he is better at working with cross architecture type
> stuff than I am, since I have neither the skills nor interest to do
> such things (I'm much too focused on Altix specific problems).

Not all points apply to all patches, of course. I've seen the
separation-of-functions problems more in Christoph's patches than
portability issues, not to say the latter are nonexistent. The most
prominent case of a portability issue was in Kenneth Chen's patches.


On Thu, Jan 06, 2005 at 05:21:22PM -0600, Ray Bryant wrote:
> So, I guess the question is, do you, wli, have allocate hugetlbpage on
> fault code available somewhere that we, SGI, have somehow stepped on,
> ignored, or not properly given credit for? SGI has a strong requirement
> to eliminate the current "allocate hugetlb pages at mmap() time",
> single-threaded allocation. (We have sold machines where it would
> take thousands of seconds to complete that operation as it is
> currently coded in the mainline.)

That bit was largely directed at akpm, regarding an off-list discussion.


On Thu, Jan 06, 2005 at 05:21:22PM -0600, Ray Bryant wrote:
> We need the allocate on fault hugetlbpage code. We worked quite hard
> to get that code to behave the same way wrt out of memory failures as the
> existing code. To say that we didn't worry about backwards
> compatibility there (at least in that regard) is simply absurd.
> But I care not where this code comes from. If it works, meets our
> scaling requirements, and can get accepted into the mainline, then
> I am all for it. And I will happily give credit where credit is
> due.

The backward compatibility concerns apply to a past patch, the
expand-on-mmap() code, and the COW patches. Zero-fill-on-demand patches
haven't posed particular problems there.


On Thu, Jan 06, 2005 at 05:21:22PM -0600, Ray Bryant wrote:
> However, at the present time it appears that if we want this code in the
> mainline, we will have to bring it up to level and push it upstream,
> and that is what Christoph is working on.
> When that happens, the code is subject to review and we look forward
> to working with you to resolve your concerns (1)-(6) wrt to those
> patches.

My hurt feelings about my own code won't have an impact on this.

What has been holding up the show is a serious bug that has not yet
been resolved. I'm probably going to give up on that bug
being fixed before other things are done and just do the heavy review
and testing, particularly as I'm in a unique position to have access to
all of the arches supporting hugetlb.


-- wli

2005-01-06 23:54:27

by Anton Blanchard

[permalink] [raw]
Subject: Re: page migration patchset


> The second is zero bugfixing or cross-architecture testing activity
> apart from my own. This and the first in conjunction cast serious
> doubt upon and increase the need for heavy critical examination of
> so-called ``consolidation'' patches.

OK, let's get moving on the bug fixing. I know of one outstanding hugetlb
bug which is the one you have been working on.

Can we have a complete bug report on it so the rest of us can try to assist?

Anton

2005-01-06 23:45:33

by Steve Longerbeam

[permalink] [raw]
Subject: Re: page migration patchset

Andi Kleen wrote:

>On Wed, Jan 05, 2005 at 03:56:29PM -0800, Steve Longerbeam wrote:
>
>
>>Hugetlbfs is also defining its own shared policy RB tree in its
>>inode info struct, but it doesn't seem to be used, just initialized
>>and freed at alloc/destroy inode time. Does anyone know why that
>>is there? A place-holder for future hugetlbfs mempolicy support?
>>If so, it can be removed and use the generic_file policies instead.
>>
>>
>
>You need lazy hugetlbfs to use it (= allocate at page fault time,
>not mmap time). Otherwise the policy can never be applied. I implemented
>my own version of lazy allocation for SLES9, but when I wanted to
>merge it into mainline some other people told they had a much better
>singing&dancing lazy hugetlb patch. So I waited for them, but they
>never went forward with their stuff and their code seems to be dead
>now. So this is still a dangling end :/
>
>If nothing happens soon regarding the "other" hugetlb code I will
>forward port my SLES9 code. It already has NUMA policy support.
>
>For now you can remove the hugetlb policy code from mainline if you
>want, it would be easy to readd it when lazy hugetlbfs is merged.
>
>

if you don't mind I'd like to. Sounds as if lazy hugetlbfs would be able to
make use of the generic file mapping->policy instead of a hugetlb-specific
policy anyway. Same goes for shmem.

Steve


2005-01-07 00:10:03

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page migration patchset

At some point in the past, I wrote:
>> The second is zero bugfixing or cross-architecture testing activity
>> apart from my own. This and the first in conjunction cast serious
>> doubt upon and increase the need for heavy critical examination of
>> so-called ``consolidation'' patches.

On Fri, Jan 07, 2005 at 10:53:00AM +1100, Anton Blanchard wrote:
> OK, let's get moving on the bug fixing. I know of one outstanding hugetlb
> bug which is the one you have been working on.
> Can we have a complete bug report on it so the rest of us can try to assist?

The one-sentence summary is that a triple fault causing a machine reset
occurs while hugetlb is in use during a long-running regression test for
the Oracle database on both EM64T and x86-64. Thus far attempts to
produce isolated testcases have not been successful. The test involves
duplicating a database across two database instances.

My current work on this consists largely of attempting to get access to
debugging equipment and/or simulators to carry out post-mortem analysis.
I've recently been informed that some of this will be provided to me.


-- wli

2005-01-07 00:38:39

by Andi Kleen

[permalink] [raw]
Subject: Re: page migration patchset

> Can we have a complete bug report on it so the rest of us can try to assist?

The crash bug seems to be x86-64 specific.

-Andi

2005-01-07 00:14:07

by William Lee Irwin III

[permalink] [raw]
Subject: Re: page migration patchset

Andi Kleen wrote:
>> You need lazy hugetlbfs to use it (= allocate at page fault time,
>> not mmap time). Otherwise the policy can never be applied. I implemented
>> my own version of lazy allocation for SLES9, but when I wanted to
>> merge it into mainline some other people told they had a much better
>> singing&dancing lazy hugetlb patch. So I waited for them, but they
>> never went forward with their stuff and their code seems to be dead
>> now. So this is still a dangling end :/
>> If nothing happens soon regarding the "other" hugetlb code I will
>> forward port my SLES9 code. It already has NUMA policy support.
>> For now you can remove the hugetlb policy code from mainline if you
>> want, it would be easy to readd it when lazy hugetlbfs is merged.

On Thu, Jan 06, 2005 at 03:43:39PM -0800, Steve Longerbeam wrote:
> if you don't mind I'd like to. Sounds as if lazy hugetlbfs would be
> able to make use of the generic file mapping->policy instead of a
> hugetlb-specific policy anyway. Same goes for shmem.

If Andi's comments refer to my work, it already got permavetoed.

Anyway, using the vma's is a minor change. Please include this as a
patch separate from other changes (fault handling, consolidations, etc.)


-- wli

2005-01-11 15:36:39

by Ray Bryant

[permalink] [raw]
Subject: Re: page migration patchset

Andi and Steve,

Steve Longerbeam wrote:
<snip>

>>
>> My personal preference would be to keep as much of this as possible
>> under user space control; that is, rather than having a big autonomous
>> system call that migrates pages and then updates policy information,
>> I'd prefer to split the work into several smaller system calls that
>> are issued by a user space program responsible for coordinating the
>> process migration as a series of steps, e. g.:
>>
>> (1) suspend the process via SIGSTOP
>> (2) update the mempolicy information
>> (3) migrate the process's pages
>> (4) migrate the process to the new cpu via set_schedaffinity()
>> (5) resume the process via SIGCONT
>>
>
> steps 2 and 3 can be accomplished by a call to mbind() and
> specifying MPOL_MF_MOVE. And since mbind() takes an
> address range, you could probably migrate pages and change
> the policies for all of the process' mappings in a single mbind()
> call.

OK, I just got around to looking into this suggestion. Unfortunately,
it doesn't look as if this will do what I want. I need to be able to
conserve the topology of the application when it is migrated (required
to give the application the same performance in its new location that
it got in its old location). So, I need to be able to say "take the
pages on this node and move them to that node". The sys_mbind() call
doesn't have the necessary arguments to do this. I'm thinking of
something like:

migrate_process_pages(pid, numnodes, oldnodelist, newnodelist);

This would scan the address space of process pid, and each page that
is found on oldnodelist[i] would be moved to node newnodelist[i].

Pages that are found to be swapped out would be handled as follows:
Add the original node id to either the swap pte or the swp_entry_t.
Swap in will be modified to allocate the page on the same node it
came from. Then, as part of migrate_process_pages, all that would
be done for swapped out pages would be to change the "original node"
field to point at the new node.

However, I could probably do both steps (2) and (3) as part of the
migrate_process_pages() call.

Does this all seem reasonable?
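
To make the interface concrete, here is a rough sketch in C of what I
have in mind (purely illustrative -- the name, argument types, and
semantics below are just my current thinking, not code from any posted
patch):

/*
 * Hypothetical interface only -- nothing here exists in any patch yet.
 * For each page of process <pid> found on oldnodes[i], migrate it to
 * newnodes[i], preserving the relative node topology of the job.
 */
long migrate_process_pages(pid_t pid, int numnodes,
                           const int *oldnodes, const int *newnodes);

/* Example: move a two-node job from nodes {4,5} to nodes {8,9}. */
int oldnodes[] = { 4, 5 };
int newnodes[] = { 8, 9 };

if (migrate_process_pages(pid, 2, oldnodes, newnodes) < 0)
        perror("migrate_process_pages");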

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-11 19:03:13

by Steve Longerbeam

[permalink] [raw]
Subject: Re: page migration patchset

Ray Bryant wrote:

> Andi and Steve,
>
> Steve Longerbeam wrote:
> <snip>
>
>>>
>>> My personal preference would be to keep as much of this as possible
>>> under user space control; that is, rather than having a big autonomous
>>> system call that migrates pages and then updates policy information,
>>> I'd prefer to split the work into several smaller system calls that
>>> are issued by a user space program responsible for coordinating the
>>> process migration as a series of steps, e. g.:
>>>
>>> (1) suspend the process via SIGSTOP
>>> (2) update the mempolicy information
>>> (3) migrate the process's pages
>>> (4) migrate the process to the new cpu via set_schedaffinity()
>>> (5) resume the process via SIGCONT
>>>
>>
>> steps 2 and 3 can be accomplished by a call to mbind() and
>> specifying MPOL_MF_MOVE. And since mbind() takes an
>> address range, you could probably migrate pages and change
>> the policies for all of the process' mappings in a single mbind()
>> call.
>
>
> OK, I just got around to looking into this suggestion. Unfortunately,
> it doesn't look as if this will do what I want. I need to be able to
> conserve the topology of the application when it is migrated (required
> to give the application the same performance in its new location that
> it got in its old location).


I see what you mean: unless the requested address range exactly
fits within an existing vma, existing vma's will get split up.

> So, I need to be able to say "take the
> pages on this node and move them to that node". The sys_mbind() call
> doesn't have the necessry arguments to do this. I'm thinking of
> something like:
>
> migrate_process_pages(pid, numnodes, oldnodelist, newnodelist);
>
> This would scan the address space of process pid, and each page that
> is found on oldnodelist[i] would be moved to node newnodelist[i].


right, that's something I'd be interested in as well. In fact, an address
range is not ideal for me either - what I really need is an API that
allows me to specify a single existing vma (or all the process'
regions in your case) that is to have its policy changed and resident
pages migrated, without changing the topology (e.g. splitting vma's).

>
> Pages that are found to be swapped out would be handled as follows:
> Add the original node id to either the swap pte or the swp_entry_t.
> Swap in will be modified to allocate the page on the same node it
> came from. Then, as part of migrate_process_pages, all that would
> be done for swapped out pages would be to change the "original node"
> field to point at the new node.


isn't this already taken care of? read_swap_cache_async() is given
a vma, and passes it to alloc_page_vma(). So if you have earlier
changed the policy for that vma, the new policy will be used
when allocating the page during the swap in.
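
For reference, the relevant bit of the 2.6 swap-in path looks roughly
like this (a paraphrased sketch from memory, not a verbatim copy of
mm/swap_state.c):

/* Sketch: read_swap_cache_async() receives the faulting vma and address,
 * and the new page is allocated via alloc_page_vma(), which consults the
 * vma's mempolicy -- so a policy changed before swap-in takes effect here. */
struct page *read_swap_cache_async(swp_entry_t entry,
                struct vm_area_struct *vma, unsigned long addr)
{
        struct page *page;

        page = find_get_page(&swapper_space, entry.val);
        if (page)
                return page;    /* already in the swap cache */

        /* NUMA placement decided here, based on the vma's policy */
        page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
        ...
}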

Steve

2005-01-11 19:29:30

by Ray Bryant

[permalink] [raw]
Subject: Re: page migration patchset

Steve Longerbeam wrote:


>
> isn't this already taken care of? read_swap_cache_async() is given
> a vma, and passes it to alloc_page_vma(). So if you have earlier
> changed the policy for that vma, the new policy will be used
> when allocating the page during the swap in.
>
> Steve
>

What if the policy associated with a vma is the default policy?
Then the page will be swapped in on the node that took the page
fault -- this is >>probably<< correct in most cases, but if a
page is accessed from several nodes, and predominantly accessed
from a particular node, it can end up moving due to being swapped
out, and that is probably not what the application intended.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-11 20:59:48

by Steve Longerbeam

[permalink] [raw]
Subject: Re: page migration patchset

Ray Bryant wrote:

> Steve Longerbeam wrote:
>
>
>>
>> isn't this already taken care of? read_swap_cache_async() is given
>> a vma, and passes it to alloc_page_vma(). So if you have earlier
>> changed the policy for that vma, the new policy will be used
>> when allocating the page during the swap in.
>>
>> Steve
>>
>
> What if the policy associated with a vma is the default policy?


then read_swap_cache_async() would probably allocate pages for
the swap readin from the wrong nodes, but then migrate_process_pages
would move those to the correct nodes later. But if migrate_process_pages
is called *before* swap readin, the policies will be changed and
read_swap_cache_async() would allocate from the correct nodes.

Maybe I'm missing something, but let me rephrase my argument.
If read_swap_cache_async() is called *before* the vma policies are
changed, the pages will most likely be allocated from the wrong nodes but
will then be migrated to the correct nodes during the
policy-change-and-page-migrate syscall, and if the swap readin happens
*after* the syscall, the page allocations will use the new policies.

Steve



2005-01-12 12:36:36

by Robin Holt

[permalink] [raw]
Subject: Re: page migration patchset

On Tue, Jan 11, 2005 at 09:38:02AM -0600, Ray Bryant wrote:
> Pages that are found to be swapped out would be handled as follows:
> Add the original node id to either the swap pte or the swp_entry_t.
> Swap in will be modified to allocate the page on the same node it
> came from. Then, as part of migrate_process_pages, all that would
> be done for swapped out pages would be to change the "original node"
> field to point at the new node.
>
> However, I could probably do both steps (2) and (3) as part of the
> migrate_process_pages() call.

I don't think we need to worry about the swap case. Let's keep the
changes small and build when we see problems. The normal swap
out/in mechanism should handle nearly all the page migration issues
you are concerned with.

Just my 2 cents,
Robin

2005-01-12 18:13:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: page migration patchset

On Wed, 12 Jan 2005, Robin Holt wrote:
> On Tue, Jan 11, 2005 at 09:38:02AM -0600, Ray Bryant wrote:
> > Pages that are found to be swapped out would be handled as follows:
> > Add the original node id to either the swap pte or the swp_entry_t.
> > Swap in will be modified to allocate the page on the same node it
> > came from. Then, as part of migrate_process_pages, all that would
> > be done for swapped out pages would be to change the "original node"
> > field to point at the new node.
> >
> > However, I could probably do both steps (2) and (3) as part of the
> > migrate_process_pages() call.
>
> I don't think we need to worry about the swap case. Let's keep the
> changes small and build when we see problems. The normal swap
> out/in mechanism should handle nearly all the page migration issues
> you are concerned with.

I don't think so: swapin_readahead hasn't a clue what nodes to allocate
from; swap just isn't arranged in the predictable way that the NUMA code
there currently pretends (which Andi acknowledges).

Ray's suggestion above makes sense to me, though there may be other ways.

The simplest solution, which most appeals to me, is to delete
swapin_readahead altogether - it's based on the principle "well,
if I'm going to read something from the disk, I might as well read
adjacent pages in one go, there's a ghost of a chance that some of
the others might be useful soon too, and if we're lucky, pushing
other pages out of cache to make way for these might pay off".

Which probably is a win in some workloads, but I wonder how often.
Though doing the hard work of endless research to establish the
truth doesn't appeal to me at all!

Hugh

2005-01-12 18:43:55

by Ray Bryant

[permalink] [raw]
Subject: Re: page migration patchset

Hugh Dickins wrote:
> On Wed, 12 Jan 2005, Robin Holt wrote:
>
>>On Tue, Jan 11, 2005 at 09:38:02AM -0600, Ray Bryant wrote:
>>
>>>Pages that are found to be swapped out would be handled as follows:
>>>Add the original node id to either the swap pte or the swp_entry_t.
>>>Swap in will be modified to allocate the page on the same node it
>>>came from. Then, as part of migrate_process_pages, all that would
>>>be done for swapped out pages would be to change the "original node"
>>>field to point at the new node.
>>>
>>>However, I could probably do both steps (2) and (3) as part of the
>>>migrate_process_pages() call.
>>
>>I don't think we need to worry about the swap case. Let's keep the
>>changes small and build when we see problems. The normal swap
>>out/in mechanism should handle nearly all the page migration issues
>>you are concerned with.

At the moment, this discussion is moot (for my application at least).
For our workloads, we almost never swap, so we are going to ignore migrating
swapped-out pages until such time as we see a performance need for it.

If that point ever comes, we will have to solve this problem then.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-12 19:00:10

by Andrew Morton

[permalink] [raw]
Subject: Re: page migration patchset

Hugh Dickins <[email protected]> wrote:
>
> The simplest solution, which most appeals to me, is to delete
> swapin_readahead altogether - it's based on the principle "well,
> if I'm going to read something from the disk, I might as well read
> adjacent pages in one go, there's a ghost of a chance that some of
> the others might be useful soon too, and if we're lucky, pushing
> other pages out of cache to make way for these might pay off".
>
> Which probably is a win in some workloads, but I wonder how often.

Our current way of allocating swap can cause us to end up with little
correlation between adjacent pages on-disk. But this can be improved. The
old swapspace-layout-improvements patch was designed to fix that up, but
needs more testing and tuning.

It clusters pages on-disk via their virtual address.
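
For reference, with 4 KB pages the chunk arithmetic in the scan_swap_map()
rewrite below works out as follows (illustrative numbers, not part of the
patch itself):

  CHUNK_SHIFT = 20 - PAGE_SHIFT = 20 - 12 = 8, so a chunk is 2^8 = 256
  pages = 1 MB of swap.  A device of si->max pages is split into
  nchunks = si->max >> 8 chunks (e.g. a 1 GB swap device gives 1024 chunks).
  The chunk is picked by hash_long(cookie + (index & CHUNK_MASK)) % nchunks,
  and the low 8 bits of the page's vma offset (index & ~CHUNK_MASK) give
  the starting slot within that chunk, so virtually-adjacent pages land in
  adjacent slots of the same chunk.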

diff -puN mm/vmscan.c~swapspace-layout-improvements mm/vmscan.c
--- 25/mm/vmscan.c~swapspace-layout-improvements 2004-06-03 21:32:51.087602712 -0700
+++ 25-akpm/mm/vmscan.c 2004-06-03 21:32:51.102600432 -0700
@@ -381,8 +381,11 @@ static int shrink_list(struct list_head
* XXX: implement swap clustering ?
*/
if (PageAnon(page) && !PageSwapCache(page)) {
+ void *cookie = page->mapping;
+ pgoff_t index = page->index;
+
page_map_unlock(page);
- if (!add_to_swap(page))
+ if (!add_to_swap(page, cookie, index))
goto activate_locked;
page_map_lock(page);
}
diff -puN mm/swap_state.c~swapspace-layout-improvements mm/swap_state.c
--- 25/mm/swap_state.c~swapspace-layout-improvements 2004-06-03 21:32:51.089602408 -0700
+++ 25-akpm/mm/swap_state.c 2004-06-03 21:32:51.103600280 -0700
@@ -137,8 +137,12 @@ void __delete_from_swap_cache(struct pag
*
* Allocate swap space for the page and add the page to the
* swap cache. Caller needs to hold the page lock.
+ *
+ * We attempt to lay pages out on swap so that virtually-contiguous pages are
+ * contiguous on-disk. To do this we utilise page->index (offset into vma) and
+ * page->mapping (the anon_vma's address).
*/
-int add_to_swap(struct page * page)
+int add_to_swap(struct page *page, void *cookie, pgoff_t index)
{
swp_entry_t entry;
int pf_flags;
@@ -148,7 +152,7 @@ int add_to_swap(struct page * page)
BUG();

for (;;) {
- entry = get_swap_page();
+ entry = get_swap_page(cookie, index);
if (!entry.val)
return 0;

diff -puN include/linux/swap.h~swapspace-layout-improvements include/linux/swap.h
--- 25/include/linux/swap.h~swapspace-layout-improvements 2004-06-03 21:32:51.090602256 -0700
+++ 25-akpm/include/linux/swap.h 2004-06-03 21:32:51.104600128 -0700
@@ -193,7 +193,7 @@ extern int rw_swap_page_sync(int, swp_en
extern struct address_space swapper_space;
#define total_swapcache_pages swapper_space.nrpages
extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *);
+extern int add_to_swap(struct page *page, void *cookie, pgoff_t index);
extern void __delete_from_swap_cache(struct page *);
extern void delete_from_swap_cache(struct page *);
extern int move_to_swap_cache(struct page *, swp_entry_t);
@@ -210,7 +210,7 @@ extern int total_swap_pages;
extern unsigned int nr_swapfiles;
extern struct swap_info_struct swap_info[];
extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(void *cookie, pgoff_t index);
extern int swap_duplicate(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
@@ -259,7 +259,7 @@ static inline int remove_exclusive_swap_
return 0;
}

-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(void *cookie, pgoff_t index)
{
swp_entry_t entry;
entry.val = 0;
diff -puN mm/shmem.c~swapspace-layout-improvements mm/shmem.c
--- 25/mm/shmem.c~swapspace-layout-improvements 2004-06-03 21:32:51.092601952 -0700
+++ 25-akpm/mm/shmem.c 2004-06-03 21:32:51.108599520 -0700
@@ -744,7 +744,7 @@ static int shmem_writepage(struct page *
struct shmem_inode_info *info;
swp_entry_t *entry, swap;
struct address_space *mapping;
- unsigned long index;
+ pgoff_t index;
struct inode *inode;

BUG_ON(!PageLocked(page));
@@ -756,7 +756,7 @@ static int shmem_writepage(struct page *
info = SHMEM_I(inode);
if (info->flags & VM_LOCKED)
goto redirty;
- swap = get_swap_page();
+ swap = get_swap_page(mapping, index);
if (!swap.val)
goto redirty;

diff -puN mm/swapfile.c~swapspace-layout-improvements mm/swapfile.c
--- 25/mm/swapfile.c~swapspace-layout-improvements 2004-06-03 21:32:51.094601648 -0700
+++ 25-akpm/mm/swapfile.c 2004-06-03 23:40:44.396082512 -0700
@@ -25,6 +25,7 @@
#include <linux/rmap.h>
#include <linux/security.h>
#include <linux/backing-dev.h>
+#include <linux/hash.h>

#include <asm/pgtable.h>
#include <asm/tlbflush.h>
@@ -83,71 +84,51 @@ void swap_unplug_io_fn(struct backing_de
up_read(&swap_unplug_sem);
}

-static inline int scan_swap_map(struct swap_info_struct *si)
-{
- unsigned long offset;
- /*
- * We try to cluster swap pages by allocating them
- * sequentially in swap. Once we've allocated
- * SWAPFILE_CLUSTER pages this way, however, we resort to
- * first-free allocation, starting a new cluster. This
- * prevents us from scattering swap pages all over the entire
- * swap partition, so that we reduce overall disk seek times
- * between swap pages. -- sct */
- if (si->cluster_nr) {
- while (si->cluster_next <= si->highest_bit) {
- offset = si->cluster_next++;
- if (si->swap_map[offset])
- continue;
- si->cluster_nr--;
- goto got_page;
- }
- }
- si->cluster_nr = SWAPFILE_CLUSTER;
+int akpm;

- /* try to find an empty (even not aligned) cluster. */
- offset = si->lowest_bit;
- check_next_cluster:
- if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
- {
- int nr;
- for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
- if (si->swap_map[nr])
- {
- offset = nr+1;
- goto check_next_cluster;
- }
- /* We found a completly empty cluster, so start
- * using it.
- */
- goto got_page;
- }
- /* No luck, so now go finegrined as usual. -Andrea */
- for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) {
- if (si->swap_map[offset])
+/*
+ * We divide the swapdev into 1024 kilobyte chunks. We use the cookie and the
+ * upper bits of the index to select a chunk and the rest of the index as the
+ * offset into the selected chunk.
+ */
+#define CHUNK_SHIFT (20 - PAGE_SHIFT)
+#define CHUNK_MASK (-1UL << CHUNK_SHIFT)
+
+static int
+scan_swap_map(struct swap_info_struct *si, void *cookie, pgoff_t index)
+{
+ unsigned long chunk;
+ unsigned long nchunks;
+ unsigned long block;
+ unsigned long scan;
+
+ nchunks = si->max >> CHUNK_SHIFT;
+ chunk = 0;
+ if (nchunks)
+ chunk = hash_long((unsigned long)cookie + (index & CHUNK_MASK),
+ BITS_PER_LONG) % nchunks;
+
+ block = (chunk << CHUNK_SHIFT) + (index & ~CHUNK_MASK);
+
+ for (scan = 0; scan < si->max; scan++, block++) {
+ if (block == si->max)
+ block = 0;
+ if (block == 0)
continue;
- si->lowest_bit = offset+1;
- got_page:
- if (offset == si->lowest_bit)
- si->lowest_bit++;
- if (offset == si->highest_bit)
- si->highest_bit--;
- if (si->lowest_bit > si->highest_bit) {
- si->lowest_bit = si->max;
- si->highest_bit = 0;
- }
- si->swap_map[offset] = 1;
- si->inuse_pages++;
+ if (si->swap_map[block])
+ continue;
+ si->swap_map[block] = 1;
nr_swap_pages--;
- si->cluster_next = offset+1;
- return offset;
+ if (akpm)
+ printk("cookie:%p, index:%lu, chunk:%lu nchunks:%lu "
+ "block:%lu\n",
+ cookie, index, chunk, nchunks, block);
+ return block;
}
- si->lowest_bit = si->max;
- si->highest_bit = 0;
return 0;
}

-swp_entry_t get_swap_page(void)
+swp_entry_t get_swap_page(void *cookie, pgoff_t index)
{
struct swap_info_struct * p;
unsigned long offset;
@@ -166,7 +147,7 @@ swp_entry_t get_swap_page(void)
p = &swap_info[type];
if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
swap_device_lock(p);
- offset = scan_swap_map(p);
+ offset = scan_swap_map(p, cookie, index);
swap_device_unlock(p);
if (offset) {
entry = swp_entry(type,offset);
diff -puN kernel/power/swsusp.c~swapspace-layout-improvements kernel/power/swsusp.c
--- 25/kernel/power/swsusp.c~swapspace-layout-improvements 2004-06-03 21:32:51.096601344 -0700
+++ 25-akpm/kernel/power/swsusp.c 2004-06-03 21:32:51.112598912 -0700
@@ -317,7 +317,7 @@ static int write_suspend_image(void)
for (i=0; i<nr_copy_pages; i++) {
if (!(i%100))
printk( "." );
- if (!(entry = get_swap_page()).val)
+ if (!(entry = get_swap_page(NULL, i)).val)
panic("\nNot enough swapspace when writing data" );

if (swapfile_used[swp_type(entry)] != SWAPFILE_SUSPEND)
@@ -334,7 +334,7 @@ static int write_suspend_image(void)
cur = (union diskpage *)((char *) pagedir_nosave)+i;
BUG_ON ((char *) cur != (((char *) pagedir_nosave) + i*PAGE_SIZE));
printk( "." );
- if (!(entry = get_swap_page()).val) {
+ if (!(entry = get_swap_page(NULL, i)).val) {
printk(KERN_CRIT "Not enough swapspace when writing pgdir\n" );
panic("Don't know how to recover");
free_page((unsigned long) buffer);
@@ -356,7 +356,7 @@ static int write_suspend_image(void)
BUG_ON (sizeof(struct suspend_header) > PAGE_SIZE-sizeof(swp_entry_t));
BUG_ON (sizeof(union diskpage) != PAGE_SIZE);
BUG_ON (sizeof(struct link) != PAGE_SIZE);
- if (!(entry = get_swap_page()).val)
+ if (!(entry = get_swap_page(NULL, i)).val)
panic( "\nNot enough swapspace when writing header" );
if (swapfile_used[swp_type(entry)] != SWAPFILE_SUSPEND)
panic("\nNot enough swapspace for header on suspend device" );
diff -puN kernel/power/pmdisk.c~swapspace-layout-improvements kernel/power/pmdisk.c
--- 25/kernel/power/pmdisk.c~swapspace-layout-improvements 2004-06-03 21:32:51.098601040 -0700
+++ 25-akpm/kernel/power/pmdisk.c 2004-06-03 21:32:51.113598760 -0700
@@ -206,7 +206,7 @@ static int write_swap_page(unsigned long
swp_entry_t entry;
int error = 0;

- entry = get_swap_page();
+ entry = get_swap_page(NULL, addr >> PAGE_SHIFT);
if (swp_offset(entry) &&
swapfile_used[swp_type(entry)] == SWAPFILE_SUSPEND) {
error = rw_swap_page_sync(WRITE, entry,
_

2005-01-14 13:55:46

by Tim Schmielau

[permalink] [raw]
Subject: swapspace layout improvements advocacy

On Wed, 12 Jan 2005, Andrew Morton wrote:

> Our current way of allocating swap can cause us to end up with little
> correlation between adjacent pages on-disk. But this can be improved. The
> old swapspace-layout-improvements patch was designed to fix that up, but
> needs more testing and tuning.
>
> It clusters pages on-disk via their virtual address.

2.6 seems in dire need of such a patch.

I recently found out that 2.6 kernels degrade horribly when going into
swap. On my dual PIII-850 with as little as 256 mb ram, I can easily
demonstrate that by opening about 40-50 instances of konqueror with large
tables, many images and such things. When the machine is 80-120 mb into
the 256 mb swap partition, it becomes almost unusable. Even the desktop
background picture needs ~20 sec to update, not to mention any windows'
contents. And you can literally hear the reason for it: the hard disk is
seeking like crazy.

I've applied Ingo Molnar's swapspace-layout-improvements-2.6.9-rc1-bk12-A1
port of the patch to a 2.6.11-rc1 kernel, and it handles the same workload
much more smoothly. It's slow, but you can work with it.

I just wonder why no one else has complained yet. Are systems with tight memory
constraints so uncommon these days?

Tim

2005-01-14 18:16:28

by Andrew Morton

[permalink] [raw]
Subject: Re: swapspace layout improvements advocacy

Tim Schmielau <[email protected]> wrote:
>
> I recently found out that 2.6 kernels degrade horribly when going into
> swap. On my dual PIII-850 with as little as 256 mb ram, I can easily
> demonstrate that by opening about 40-50 instances of konquerer with large
> tables, many images and such things. When the machine is into 80-120 mb of
> the 256 mb swap partition, it becomes almost unusable. Even the desktop
> background picture needs ~20sec to update, not to talk about any windows'
> contents. And you can literally hear the reason for it: the harddisk is
> seeking like crazy.
>
> I've applied Ingo Molnar's swapspace-layout-improvements-2.6.9-rc1-bk12-A1
> port of the patch to a 2.6.11-rc1 kernel, and it handles the same workload
> much more smoothly. It's slow, but you can work with it.

Well I'm surprised. I ran a couple of silly tests and wasn't able to
demonstrate any benefit. But I didn't persist at all due to general inbox
overload :(

> I just wonder why no one else has complained yet.

They're all too polite?

> Are systems with tight memory constraints so uncommon these days?

Relatively, but I think we do have some fairly technical people on this
list who push their systems that hard, which is appreciated. I'll add the
patch to the -mm lineup for a while..

2005-01-14 22:53:27

by Barry K. Nathan

[permalink] [raw]
Subject: Re: swapspace layout improvements advocacy

On Fri, Jan 14, 2005 at 02:55:27PM +0100, Tim Schmielau wrote:
> 2.6 seems in dire need of such a patch.
>
> I recently found out that 2.6 kernels degrade horribly when going into
> swap. On my dual PIII-850 with as little as 256 mb ram, I can easily
[snip]

I haven't tried the patch in question (unless it's in any Fedora
kernels), but I've noticed that the single biggest step to improve
swapping performance in 2.6 is to use the CFQ scheduler, not the AS
scheduler. (That's also why Red Hat/Fedora kernels use CFQ as the
default scheduler.)

-Barry K. Nathan <[email protected]>

2005-01-15 01:40:38

by Alan

[permalink] [raw]
Subject: Re: swapspace layout improvements advocacy

On Gwe, 2005-01-14 at 22:52, Barry K. Nathan wrote:
> I haven't tried the patch in question (unless it's in any Fedora
> kernels), but I've noticed that the single biggest step to improve
> swapping performance in 2.6 is to use the CFQ scheduler, not the AS
> scheduler. (That's also why Red Hat/Fedora kernels use CFQ as the
> default scheduler.)

Definitely the case. Beyond that, though, there is a lot wrong with our
swap logic, at least when looked at from the point of view of modern IDE disks.

We'd be much better off IMHO with log-structured swap so we can swap out
fast (which is normally time critical: get the crap onto disk and get us back
running). The log tidier would then merge groups of pages either by va
or by read reference if they page in and back out again.

There's probably an MSc or maybe a PhD lurking for someone in this sort
of area 8)

2005-01-15 02:27:10

by Tim Schmielau

[permalink] [raw]
Subject: Re: swapspace layout improvements advocacy

On Fri, 14 Jan 2005, Barry K. Nathan wrote:

> I haven't tried the patch in question (unless it's in any Fedora
> kernels), but I've noticed that the single biggest step to improve
> swapping performance in 2.6 is to use the CFQ scheduler, not the AS
> scheduler. (That's also why Red Hat/Fedora kernels use CFQ as the
> default scheduler.)

Yes, I also use CFQ. Didn't dare to try with the anticipatory scheduler.

Since Andrew was so sceptical, I tested a bit more, and it's not as easily
reproducible as I wrote. There don't seem to be any problems on a freshly
booted system. It only happens after using the system for some time. But I
think that's only natural, as the virtual->physical mapping needs to be
disturbed before seeing any problems.

Tim

2005-01-15 08:55:18

by Pasi Savolainen

[permalink] [raw]
Subject: Re: swapspace layout improvements advocacy

* Barry K. Nathan <[email protected]>:
> On Fri, Jan 14, 2005 at 02:55:27PM +0100, Tim Schmielau wrote:
>> 2.6 seems in due need of such a patch.
>>
>> I recently found out that 2.6 kernels degrade horribly when going into
>> swap. On my dual PIII-850 with as little as 256 mb ram, I can easily
> [snip]
>
> I haven't tried the patch in question (unless it's in any Fedora
> kernels), but I've noticed that the single biggest step to improve
> swapping performance in 2.6 is to use the CFQ scheduler, not the AS
> scheduler. (That's also why Red Hat/Fedora kernels use CFQ as the
> default scheduler.)

I've not tried the patch yet, but with 1G mem / 1G swap, when I finally
hit swap and quit the program that uses it (straw, gimp..), the machine
will stop responding for 10-15 sec.
I have 'elevator=cfq' in boot command and this is SMP. Last seen
yesterday when shutting down 2.6.10-rc2-mm4 to try out latest -mm.
I did vmstat runs a while ago but didn't see anything really out of
line; maybe it didn't get data either (or I don't know what is normal).


--
Psi -- <http://www.iki.fi/pasi.savolainen>