When tmpfs has the memory policy interleaved it always starts allocating at each
file at node 0. When there are many small files the lower nodes fill up
disproportionately.
This patch attempts to spread out node usage by starting files at nodes other
then 0. I disturbed the addr parameter since alloc_pages_vma will only use it
when the policy is MPOL_INTERLEAVE. Random was picked over using another
variable which would require some sort of contention management.
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: [email protected]
Signed-off-by: Nathan T Zimmer <[email protected]>
---
include/linux/shmem_fs.h | 1 +
mm/shmem.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index bef2cf0..cfe8a34 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -17,6 +17,7 @@ struct shmem_inode_info {
char *symlink; /* unswappable short symlink */
};
struct shared_policy policy; /* NUMA memory alloc policy */
+ unsigned long node_offset; /* bias for interleaved nodes */
struct list_head swaplist; /* chain of maybes on swap */
struct list_head xattr_list; /* list of shmem_xattr */
struct inode vfs_inode;
diff --git a/mm/shmem.c b/mm/shmem.c
index d576b84..69a47fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
/*
* alloc_page_vma() will drop the shared policy reference
*/
- return alloc_page_vma(gfp, &pvma, 0);
+ return alloc_page_vma(gfp, &pvma, info->node_offset << PAGE_SHIFT );
}
#else /* !CONFIG_NUMA */
#ifdef CONFIG_TMPFS
@@ -1357,6 +1357,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
inode->i_fop = &shmem_file_operations;
mpol_shared_policy_init(&info->policy,
shmem_get_sbmpol(sbinfo));
+ info->node_offset = node_random(&node_online_map);
break;
case S_IFDIR:
inc_nlink(inode);
--
1.6.0.2
On Thu, 31 May 2012 09:39:17 -0500
Nathan Zimmer <[email protected]> wrote:
> When tmpfs has the memory policy interleaved it always starts allocating at each
> file at node 0. When there are many small files the lower nodes fill up
> disproportionately.
> This patch attempts to spread out node usage by starting files at nodes other
> then 0. I disturbed the addr parameter since alloc_pages_vma will only use it
> when the policy is MPOL_INTERLEAVE. Random was picked over using another
> variable which would require some sort of contention management.
The patch title is a bit scummy ;) It describes a kernel problem, not
the patch. I renamed it to "tmpfs: implement NUMA node interleaving".
It looks nice and simple
> Cc: [email protected]
We could probably sneak this past Greg, but should we? It's a feature
and a performance enhancement. Such things are not normally added to
-stable. If there were some nice performance improvements in workloads
which our users care about then I guess we could backport it.
But you've provided us with no measurements at all, hence no reason to
backport it.
(5/31/12 10:39 AM), Nathan Zimmer wrote:
> When tmpfs has the memory policy interleaved it always starts allocating at each
> file at node 0. When there are many small files the lower nodes fill up
> disproportionately.
> This patch attempts to spread out node usage by starting files at nodes other
> then 0. I disturbed the addr parameter since alloc_pages_vma will only use it
> when the policy is MPOL_INTERLEAVE. Random was picked over using another
> variable which would require some sort of contention management.
>
> Cc: Christoph Lameter<[email protected]>
> Cc: Nick Piggin<[email protected]>
> Cc: Hugh Dickins<[email protected]>
> Cc: Lee Schermerhorn<[email protected]>
> Acked-by: Rik van Riel<[email protected]>
> Cc: [email protected]
> Signed-off-by: Nathan T Zimmer<[email protected]>
> ---
> include/linux/shmem_fs.h | 1 +
> mm/shmem.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index bef2cf0..cfe8a34 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -17,6 +17,7 @@ struct shmem_inode_info {
> char *symlink; /* unswappable short symlink */
> };
> struct shared_policy policy; /* NUMA memory alloc policy */
> + unsigned long node_offset; /* bias for interleaved nodes */
> struct list_head swaplist; /* chain of maybes on swap */
> struct list_head xattr_list; /* list of shmem_xattr */
> struct inode vfs_inode;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d576b84..69a47fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> /*
> * alloc_page_vma() will drop the shared policy reference
> */
> - return alloc_page_vma(gfp,&pvma, 0);
> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
3rd argument of alloc_page_vma() is an address. This is type error.
> }
> #else /* !CONFIG_NUMA */
> #ifdef CONFIG_TMPFS
> @@ -1357,6 +1357,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
> inode->i_fop =&shmem_file_operations;
> mpol_shared_policy_init(&info->policy,
> shmem_get_sbmpol(sbinfo));
> + info->node_offset = node_random(&node_online_map);
> break;
> case S_IFDIR:
> inc_nlink(inode);
On Thu, 31 May 2012 16:09:15 -0400
KOSAKI Motohiro <[email protected]> wrote:
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> > /*
> > * alloc_page_vma() will drop the shared policy reference
> > */
> > - return alloc_page_vma(gfp,&pvma, 0);
> > + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>
> 3rd argument of alloc_page_vma() is an address. This is type error.
Well, it's an unsigned long...
But yes, it is conceptually wrong and *looks* weird. I think we can
address that by overcoming our peculair aversion to documenting our
code, sigh. This?
--- a/mm/shmem.c~tmpfs-implement-numa-node-interleaving-fix
+++ a/mm/shmem.c
@@ -927,9 +927,12 @@ static struct page *shmem_alloc_page(gfp
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
/*
- * alloc_page_vma() will drop the shared policy reference
+ * alloc_page_vma() will drop the shared policy reference.
+ *
+ * To avoid allocating all tmpfs pages on node 0, we fake up a virtual
+ * address based on this file's predetermined preferred node.
*/
- return alloc_page_vma(gfp, &pvma, info->node_offset << PAGE_SHIFT );
+ return alloc_page_vma(gfp, &pvma, info->node_offset << PAGE_SHIFT);
}
#else /* !CONFIG_NUMA */
#ifdef CONFIG_TMPFS
_
(5/31/12 4:25 PM), Andrew Morton wrote:
> On Thu, 31 May 2012 16:09:15 -0400
> KOSAKI Motohiro<[email protected]> wrote:
>
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>>> /*
>>> * alloc_page_vma() will drop the shared policy reference
>>> */
>>> - return alloc_page_vma(gfp,&pvma, 0);
>>> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>>
>> 3rd argument of alloc_page_vma() is an address. This is type error.
>
> Well, it's an unsigned long...
>
> But yes, it is conceptually wrong and *looks* weird. I think we can
> address that by overcoming our peculair aversion to documenting our
> code, sigh. This?
Sorry, no.
addr agrument of alloc_pages_vma() have two meanings.
1) interleave node seed
2) look-up key of shmem policy
I think this patch break (2). shmem_get_policy(pol, addr) assume caller honor to
pass correct address.
Oh, yes. *NOW*, we are discussing shmem policy removing. but it haven't be removed.
Please don't break.
On Thu, May 31, 2012 at 04:35:53PM -0400, KOSAKI Motohiro wrote:
> (5/31/12 4:25 PM), Andrew Morton wrote:
>> On Thu, 31 May 2012 16:09:15 -0400
>> KOSAKI Motohiro<[email protected]> wrote:
>>
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>>>> /*
>>>> * alloc_page_vma() will drop the shared policy reference
>>>> */
>>>> - return alloc_page_vma(gfp,&pvma, 0);
>>>> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>>>
>>> 3rd argument of alloc_page_vma() is an address. This is type error.
>>
>> Well, it's an unsigned long...
>>
>> But yes, it is conceptually wrong and *looks* weird. I think we can
>> address that by overcoming our peculair aversion to documenting our
>> code, sigh. This?
>
> Sorry, no.
>
> addr agrument of alloc_pages_vma() have two meanings.
>
> 1) interleave node seed
> 2) look-up key of shmem policy
>
> I think this patch break (2). shmem_get_policy(pol, addr) assume caller honor to
> pass correct address.
But the pseudo vma we generated in shmem_alloc_page the vm_ops are set to NULL.
So get_vma_policy will return the policy provided by the pseudo vma and not reach
the shmem_get_policy.
(6/1/12 10:24 AM), Nathan Zimmer wrote:
> On Thu, May 31, 2012 at 04:35:53PM -0400, KOSAKI Motohiro wrote:
>> (5/31/12 4:25 PM), Andrew Morton wrote:
>>> On Thu, 31 May 2012 16:09:15 -0400
>>> KOSAKI Motohiro<[email protected]> wrote:
>>>
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>>>>> /*
>>>>> * alloc_page_vma() will drop the shared policy reference
>>>>> */
>>>>> - return alloc_page_vma(gfp,&pvma, 0);
>>>>> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>>>>
>>>> 3rd argument of alloc_page_vma() is an address. This is type error.
>>>
>>> Well, it's an unsigned long...
>>>
>>> But yes, it is conceptually wrong and *looks* weird. I think we can
>>> address that by overcoming our peculair aversion to documenting our
>>> code, sigh. This?
>>
>> Sorry, no.
>>
>> addr agrument of alloc_pages_vma() have two meanings.
>>
>> 1) interleave node seed
>> 2) look-up key of shmem policy
>>
>> I think this patch break (2). shmem_get_policy(pol, addr) assume caller honor to
>> pass correct address.
>
> But the pseudo vma we generated in shmem_alloc_page the vm_ops are set to NULL.
> So get_vma_policy will return the policy provided by the pseudo vma and not reach
> the shmem_get_policy.
yes, and it is bug source. we may need to change soon. I guess the right way is
to make vm_ops->interleave and interleave_nid uses it if povided.
btw, I don't think node_random() is good idea. it is random(pid + jiffies + cycle).
current->cpuset_mem_spread_rotor is per-thread value. but you now need per-inode
interleave offset. maybe, just inode addition is enough. Why do you need randomness?
On Fri, Jun 01, 2012 at 01:22:15PM -0400, KOSAKI Motohiro wrote:
> (6/1/12 10:24 AM), Nathan Zimmer wrote:
>> On Thu, May 31, 2012 at 04:35:53PM -0400, KOSAKI Motohiro wrote:
>>> (5/31/12 4:25 PM), Andrew Morton wrote:
>>>> On Thu, 31 May 2012 16:09:15 -0400
>>>> KOSAKI Motohiro<[email protected]> wrote:
>>>>
>>>>>> --- a/mm/shmem.c
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>>>>>> /*
>>>>>> * alloc_page_vma() will drop the shared policy reference
>>>>>> */
>>>>>> - return alloc_page_vma(gfp,&pvma, 0);
>>>>>> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>>>>>
>>>>> 3rd argument of alloc_page_vma() is an address. This is type error.
>>>>
>>>> Well, it's an unsigned long...
>>>>
>>>> But yes, it is conceptually wrong and *looks* weird. I think we can
>>>> address that by overcoming our peculair aversion to documenting our
>>>> code, sigh. This?
>>>
>>> Sorry, no.
>>>
>>> addr agrument of alloc_pages_vma() have two meanings.
>>>
>>> 1) interleave node seed
>>> 2) look-up key of shmem policy
>>>
>>> I think this patch break (2). shmem_get_policy(pol, addr) assume caller honor to
>>> pass correct address.
>>
>> But the pseudo vma we generated in shmem_alloc_page the vm_ops are set to NULL.
>> So get_vma_policy will return the policy provided by the pseudo vma and not reach
>> the shmem_get_policy.
>
> yes, and it is bug source. we may need to change soon. I guess the right way is
> to make vm_ops->interleave and interleave_nid uses it if povided.
>
If we provide vm_ops then won't shmem_get_policy get called?
That would be an issue since shmem_get_policy assumes vm_file is non NULL.
> btw, I don't think node_random() is good idea. it is random(pid + jiffies + cycle).
> current->cpuset_mem_spread_rotor is per-thread value. but you now need per-inode
> interleave offset. maybe, just inode addition is enough. Why do you need randomness?
>
I don't really need the randomness, the rotor should be good enough.
The correct way to get that is cpuset_mem_spread_node(), yes?
Also apologies for such a delay in my response.
(6/19/12 7:21 PM), Nathan Zimmer wrote:
> On Fri, Jun 01, 2012 at 01:22:15PM -0400, KOSAKI Motohiro wrote:
>> (6/1/12 10:24 AM), Nathan Zimmer wrote:
>>> On Thu, May 31, 2012 at 04:35:53PM -0400, KOSAKI Motohiro wrote:
>>>> (5/31/12 4:25 PM), Andrew Morton wrote:
>>>>> On Thu, 31 May 2012 16:09:15 -0400
>>>>> KOSAKI Motohiro<[email protected]> wrote:
>>>>>
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>> @@ -929,7 +929,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>>>>>>> /*
>>>>>>> * alloc_page_vma() will drop the shared policy reference
>>>>>>> */
>>>>>>> - return alloc_page_vma(gfp,&pvma, 0);
>>>>>>> + return alloc_page_vma(gfp,&pvma, info->node_offset<< PAGE_SHIFT );
>>>>>>
>>>>>> 3rd argument of alloc_page_vma() is an address. This is type error.
>>>>>
>>>>> Well, it's an unsigned long...
>>>>>
>>>>> But yes, it is conceptually wrong and *looks* weird. I think we can
>>>>> address that by overcoming our peculair aversion to documenting our
>>>>> code, sigh. This?
>>>>
>>>> Sorry, no.
>>>>
>>>> addr agrument of alloc_pages_vma() have two meanings.
>>>>
>>>> 1) interleave node seed
>>>> 2) look-up key of shmem policy
>>>>
>>>> I think this patch break (2). shmem_get_policy(pol, addr) assume caller honor to
>>>> pass correct address.
>>>
>>> But the pseudo vma we generated in shmem_alloc_page the vm_ops are set to NULL.
>>> So get_vma_policy will return the policy provided by the pseudo vma and not reach
>>> the shmem_get_policy.
>>
>> yes, and it is bug source. we may need to change soon. I guess the right way is
>> to make vm_ops->interleave and interleave_nid uses it if povided.
>
> If we provide vm_ops then won't shmem_get_policy get called?
> That would be an issue since shmem_get_policy assumes vm_file is non NULL.
>
>> btw, I don't think node_random() is good idea. it is random(pid + jiffies + cycle).
>> current->cpuset_mem_spread_rotor is per-thread value. but you now need per-inode
>> interleave offset. maybe, just inode addition is enough. Why do you need randomness?
>
> I don't really need the randomness, the rotor should be good enough.
> The correct way to get that is cpuset_mem_spread_node(), yes?
I think that's good idea too.