When tmpfs has the memory policy interleaved it always starts allocating at each file at node 0.
When there are many small files the lower nodes fill up disproportionately.
My proposed solution is to start a file at a randomly chosen node.
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: [email protected]
Signed-off-by: Nathan T Zimmer <[email protected]>
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 79ab255..38eda26 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -17,6 +17,7 @@ struct shmem_inode_info {
char *symlink; /* unswappable short symlink */
};
struct shared_policy policy; /* NUMA memory alloc policy */
+ int node_offset; /* bias for interleaved nodes */
struct list_head swaplist; /* chain of maybes on swap */
struct list_head xattr_list; /* list of shmem_xattr */
struct inode vfs_inode;
diff --git a/mm/shmem.c b/mm/shmem.c
index f99ff3e..58ef512 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -819,7 +819,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
/* Create a pseudo vma that just contains the policy */
pvma.vm_start = 0;
- pvma.vm_pgoff = index;
+ pvma.vm_pgoff = index + info->node_offset;
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
@@ -1153,6 +1153,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
inode->i_fop = &shmem_file_operations;
mpol_shared_policy_init(&info->policy,
shmem_get_sbmpol(sbinfo));
+ info->node_offset = node_random(&node_online_map);
break;
case S_IFDIR:
inc_nlink(inode);
On 05/23/2012 09:28 AM, Nathan Zimmer wrote:
>
> When tmpfs has the memory policy interleaved it always starts allocating at each file at node 0.
> When there are many small files the lower nodes fill up disproportionately.
> My proposed solution is to start a file at a randomly chosen node.
>
> Cc: Christoph Lameter<[email protected]>
> Cc: Nick Piggin<[email protected]>
> Cc: Hugh Dickins<[email protected]>
> Cc: Lee Schermerhorn<[email protected]>
> Cc: [email protected]
> Signed-off-by: Nathan T Zimmer<[email protected]>
Acked-by: Rik van Riel <[email protected]>
On Wed, 23 May 2012 13:28:21 +0000
Nathan Zimmer <[email protected]> wrote:
>
> When tmpfs has the memory policy interleaved it always starts allocating at each file at node 0.
> When there are many small files the lower nodes fill up disproportionately.
> My proposed solution is to start a file at a randomly chosen node.
>
> ...
>
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -17,6 +17,7 @@ struct shmem_inode_info {
> char *symlink; /* unswappable short symlink */
> };
> struct shared_policy policy; /* NUMA memory alloc policy */
> + int node_offset; /* bias for interleaved nodes */
> struct list_head swaplist; /* chain of maybes on swap */
> struct list_head xattr_list; /* list of shmem_xattr */
> struct inode vfs_inode;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f99ff3e..58ef512 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -819,7 +819,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>
> /* Create a pseudo vma that just contains the policy */
> pvma.vm_start = 0;
> - pvma.vm_pgoff = index;
> + pvma.vm_pgoff = index + info->node_offset;
> pvma.vm_ops = NULL;
> pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
>
> @@ -1153,6 +1153,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
> inode->i_fop = &shmem_file_operations;
> mpol_shared_policy_init(&info->policy,
> shmem_get_sbmpol(sbinfo));
> + info->node_offset = node_random(&node_online_map);
> break;
> case S_IFDIR:
> inc_nlink(inode);
The patch seems a bit arbitrary and hacky. It would have helped if you
had fully described how it works, and why this implementation was
chosen.
- Why alter (actually, lie about!) the offset-into-file? Could we
have similarly perturbed the address arg to alloc_page_vma() to do
the spreading?
- The patch is dependent upon MPOL_INTERLEAVE being in effect, isn't
it? How do we guarantee that it is in force here?
- We look up the policy via mpol_shared_policy_lookup() using the
unperturbed index. Why? Should we be using index+info->node_offset
there?
On Wed, May 23, 2012 at 03:20:11PM -0700, Andrew Morton wrote:
> On Wed, 23 May 2012 13:28:21 +0000
> Nathan Zimmer <[email protected]> wrote:
>
> >
> > When tmpfs has the memory policy interleaved it always starts allocating at each file at node 0.
> > When there are many small files the lower nodes fill up disproportionately.
> > My proposed solution is to start a file at a randomly chosen node.
> >
> > ...
> >
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -17,6 +17,7 @@ struct shmem_inode_info {
> > char *symlink; /* unswappable short symlink */
> > };
> > struct shared_policy policy; /* NUMA memory alloc policy */
> > + int node_offset; /* bias for interleaved nodes */
> > struct list_head swaplist; /* chain of maybes on swap */
> > struct list_head xattr_list; /* list of shmem_xattr */
> > struct inode vfs_inode;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f99ff3e..58ef512 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -819,7 +819,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> >
> > /* Create a pseudo vma that just contains the policy */
> > pvma.vm_start = 0;
> > - pvma.vm_pgoff = index;
> > + pvma.vm_pgoff = index + info->node_offset;
> > pvma.vm_ops = NULL;
> > pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
> >
> > @@ -1153,6 +1153,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
> > inode->i_fop = &shmem_file_operations;
> > mpol_shared_policy_init(&info->policy,
> > shmem_get_sbmpol(sbinfo));
> > + info->node_offset = node_random(&node_online_map);
> > break;
> > case S_IFDIR:
> > inc_nlink(inode);
>
> The patch seems a bit arbitrary and hacky. It would have helped if you
> had fully described how it works, and why this implementation was
> chosen.
>
The patch attempt to spread out the node usage by starting files at nodes other
then 0. node_offset is set to a random node when the inode is allocated.
> - Why alter (actually, lie about!) the offset-into-file? Could we
> have similarly perturbed the address arg to alloc_page_vma() to do
> the spreading?
>
Using the address arg would be better. It also makes clear that we should
still be using the index for looking up the memory policy.
> - The patch is dependent upon MPOL_INTERLEAVE being in effect, isn't
> it? How do we guarantee that it is in force here?
>
The node_offset is only used when MPOL_INTERLEAVE is in effect. However
node_offset is set unconditionally. It would be quite easy to only generate
the offset when the policy is set to interleave.
> - We look up the policy via mpol_shared_policy_lookup() using the
> unperturbed index. Why? Should we be using index+info->node_offset
> there?
>
This concern should be obviated using the address arg instead of 'altering' the
vm_pgoff.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>