2007-05-16 05:47:54

by dean gaudet

[permalink] [raw]
Subject: 2.6.21 numa policy and huge pages not working

prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB and
the interleave policy would be respected. as of 2.6.21 it doesn't seem to
respect the policy on SHM_HUGETLB request.

see test program below.

output from pre-2.6.21:

2ab196200000 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 N2=8 N3=8
2ab19a200000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096

output from 2.6.21:

2b49b1c00000 default file=/10\040(deleted) huge dirty=32 N3=32
2b49b5c00000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096

was this an intentional behaviour change? it seems to be only affecting
SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.)

run with "numactl --interleave=all ./shmtest"

-dean

/* shmtest.c */

#include <sys/mman.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>


static void *alloc_arena_shm(size_t arena_size, unsigned flags)
{
FILE *fh;
char buf[512];
size_t huge_page_size;
char *p;
int shmid;
void *arena;

// find Hugepagesize in /proc/meminfo
if ((fh = fopen("/proc/meminfo", "r")) == NULL) {
perror("open(/proc/meminfo)");
exit(1);
}
for (;;) {
if (fgets(buf, sizeof(buf)-1, fh) == NULL) {
fprintf(stderr, "didn't find Hugepagesize in /proc/meminfo");
exit(1);
}
buf[sizeof(buf)-1] = '\0';
if (strncmp(buf, "Hugepagesize:", 13) == 0) break;
}
p = strchr(buf, ':') + 1;
huge_page_size = strtoul(p, 0, 0) * 1024;
fclose(fh);

// round the size up to multiple of huge_page_size
arena_size = (arena_size + huge_page_size - 1) & ~(huge_page_size - 1);

shmid = shmget(IPC_PRIVATE, arena_size, IPC_CREAT|IPC_EXCL|flags|0600);
if (shmid == -1) {
perror("shmget");
exit(1);
}

arena = shmat(shmid, NULL, 0);
if (arena == (void *)-1) {
perror("shmat");
exit(1);
}

if (shmctl(shmid, IPC_RMID, 0) == -1) {
perror("shmctl warning");
}

return arena;
}

int main(int argc, char **argv)
{
char buf[1024];
const size_t sz = 64*1024*1024;
void *arena = alloc_arena_shm(sz, SHM_HUGETLB);
memset(arena, 0, sz);
snprintf(buf, sizeof(buf), "grep ^%llx /proc/%d/numa_maps", (unsigned long long)arena, (int)getpid());
system(buf);

arena = alloc_arena_shm(sz, 0);
memset(arena, 0, sz);
snprintf(buf, sizeof(buf), "grep ^%llx /proc/%d/numa_maps", (unsigned long long)arena, (int)getpid());
system(buf);

return 0;
}


2007-05-16 06:12:17

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.6.21 numa policy and huge pages not working

On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote:
> prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB and
> the interleave policy would be respected. as of 2.6.21 it doesn't seem to
> respect the policy on SHM_HUGETLB request.
> see test program below.
> output from pre-2.6.21:
> 2ab196200000 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 N2=8 N3=8
> 2ab19a200000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> output from 2.6.21:
> 2b49b1c00000 default file=/10\040(deleted) huge dirty=32 N3=32
> 2b49b5c00000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> was this an intentional behaviour change? it seems to be only affecting
> SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.)
> run with "numactl --interleave=all ./shmtest"

This was not intentional. I'll search for where it broke.


-- wli

2007-06-09 18:06:29

by dean gaudet

[permalink] [raw]
Subject: Re: 2.6.21 numa policy and huge pages not working

On Tue, 15 May 2007, William Lee Irwin III wrote:

> On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote:
> > prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB and
> > the interleave policy would be respected. as of 2.6.21 it doesn't seem to
> > respect the policy on SHM_HUGETLB request.
> > see test program below.
> > output from pre-2.6.21:
> > 2ab196200000 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 N2=8 N3=8
> > 2ab19a200000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > output from 2.6.21:
> > 2b49b1c00000 default file=/10\040(deleted) huge dirty=32 N3=32
> > 2b49b5c00000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > was this an intentional behaviour change? it seems to be only affecting
> > SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.)
> > run with "numactl --interleave=all ./shmtest"
>
> This was not intentional. I'll search for where it broke.

any luck? i just tested with 2.6.22-rc4 and it's still broken... hmm
maybe i should learn how to git bisect.

-dean

2007-06-10 04:10:59

by dean gaudet

[permalink] [raw]
Subject: Re: 2.6.21 numa policy and huge pages not working

On Tue, 15 May 2007, William Lee Irwin III wrote:

> On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote:
> > prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB and
> > the interleave policy would be respected. as of 2.6.21 it doesn't seem to
> > respect the policy on SHM_HUGETLB request.
> > see test program below.
> > output from pre-2.6.21:
> > 2ab196200000 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 N2=8 N3=8
> > 2ab19a200000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > output from 2.6.21:
> > 2b49b1c00000 default file=/10\040(deleted) huge dirty=32 N3=32
> > 2b49b5c00000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > was this an intentional behaviour change? it seems to be only affecting
> > SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.)
> > run with "numactl --interleave=all ./shmtest"
>
> This was not intentional. I'll search for where it broke.

ok i've narrowed it some... maybe.

in commit 8ef8286689c6b5bc76212437b85bdd2ba749ee44 things work fine, numa
policy is respected...

the very next commit bc56bba8f31bd99f350a5ebfd43d50f411b620c7 breaks shm
badly causing the test program to oops the kernel.

commit 516dffdcd8827a40532798602830dfcfc672294c fixes that breakage but
numa policy is no longer respected.

i've added the authors of those two commits to the recipient list and
reattached the test program. hopefully someone can shed light on the
problem.

-dean


Attachments:
shmtest.c (1.75 kB)

2007-06-10 04:51:04

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.6.21 numa policy and huge pages not working

On Sat, Jun 09, 2007 at 09:10:51PM -0700, dean gaudet wrote:
> ok i've narrowed it some... maybe.
> in commit 8ef8286689c6b5bc76212437b85bdd2ba749ee44 things work fine, numa
> policy is respected...
> the very next commit bc56bba8f31bd99f350a5ebfd43d50f411b620c7 breaks shm
> badly causing the test program to oops the kernel.
> commit 516dffdcd8827a40532798602830dfcfc672294c fixes that breakage but
> numa policy is no longer respected.
> i've added the authors of those two commits to the recipient list and
> reattached the test program. hopefully someone can shed light on the
> problem.

Thanks, this helps a lot.


-- wli

2007-06-11 16:23:30

by Adam Litke

[permalink] [raw]
Subject: Re: 2.6.21 numa policy and huge pages not working

On Sat, 2007-06-09 at 21:10 -0700, dean gaudet wrote:
> On Tue, 15 May 2007, William Lee Irwin III wrote:
>
> > On Tue, May 15, 2007 at 10:41:06PM -0700, dean gaudet wrote:
> > > prior to 2.6.21 i could "numactl --interleave=all" and use SHM_HUGETLB and
> > > the interleave policy would be respected. as of 2.6.21 it doesn't seem to
> > > respect the policy on SHM_HUGETLB request.
> > > see test program below.
> > > output from pre-2.6.21:
> > > 2ab196200000 interleave=0-3 file=/2\040(deleted) huge dirty=32 N0=8 N1=8 N2=8 N3=8
> > > 2ab19a200000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > > output from 2.6.21:
> > > 2b49b1c00000 default file=/10\040(deleted) huge dirty=32 N3=32
> > > 2b49b5c00000 default file=/SYSV00000000\040(deleted) dirty=16384 active=0 N0=4096 N1=4096 N2=4096 N3=4096
> > > was this an intentional behaviour change? it seems to be only affecting
> > > SHM_HUGETLB allocations. (i haven't tested hugetlbfs yet.)
> > > run with "numactl --interleave=all ./shmtest"
> >
> > This was not intentional. I'll search for where it broke.
>
> ok i've narrowed it some... maybe.

Thanks a lot for the detailed information. I am on it.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2007-06-11 21:35:10

by Adam Litke

[permalink] [raw]
Subject: [shm][hugetlb] Fix get_policy for stacked shared memory files

Here's another breakage as a result of shared memory stacked files :(

The NUMA policy for a VMA is determined by checking the following (in the order
given):

1) vma->vm_ops->get_policy() (if defined)
2) vma->vm_policy (if defined)
3) task->mempolicy (if defined)
4) Fall back to default_policy

By switching to stacked files for shared memory, get_policy() is now always set
to shm_get_policy which is a wrapper function. This causes us to stop at step
1, which yields NULL for hugetlb instead of task->mempolicy which was the
previous (and correct) result.

This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
wrapped vm_ops. Andi and Christoph, does this look right to you?

Signed-off-by: Adam Litke <[email protected]>

diff --git a/ipc/shm.c b/ipc/shm.c
index 4fefbad..8d2672d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)

if (sfd->vm_ops->get_policy)
pol = sfd->vm_ops->get_policy(vma, addr);
- else
+ else if (vma->vm_policy)
pol = vma->vm_policy;
+ else
+ pol = current->mempolicy;
return pol;
}
#endif

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2007-06-12 03:36:25

by dean gaudet

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On Mon, 11 Jun 2007, Adam Litke wrote:

> Here's another breakage as a result of shared memory stacked files :(
>
> The NUMA policy for a VMA is determined by checking the following (in the order
> given):
>
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
>
> By switching to stacked files for shared memory, get_policy() is now always set
> to shm_get_policy which is a wrapper function. This causes us to stop at step
> 1, which yields NULL for hugetlb instead of task->mempolicy which was the
> previous (and correct) result.
>
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops. Andi and Christoph, does this look right to you?

thanks for the patch -- seems to do the trick for me. it seems like it
would be a candidate for stable series as well.

-dean

2007-06-12 03:50:52

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On Mon, Jun 11, 2007 at 04:34:54PM -0500, Adam Litke wrote:
> Here's another breakage as a result of shared memory stacked files :(
> The NUMA policy for a VMA is determined by checking the following (in
> the order given):
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
> By switching to stacked files for shared memory, get_policy() is now
> always set to shm_get_policy which is a wrapper function. This
> causes us to stop at step 1, which yields NULL for hugetlb instead of
> task->mempolicy which was the previous (and correct) result.
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops. Andi and Christoph, does this look right to you?
> Signed-off-by: Adam Litke <[email protected]>

Thanks for fielding this. The fix is certainly clear enough.

Acked-by: William Irwin <[email protected]>


-- wli

2007-06-12 04:30:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On Mon, 11 Jun 2007 16:34:54 -0500 Adam Litke <[email protected]> wrote:

> Here's another breakage as a result of shared memory stacked files :(
>
> The NUMA policy for a VMA is determined by checking the following (in the order
> given):
>
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
>
> By switching to stacked files for shared memory, get_policy() is now always set
> to shm_get_policy which is a wrapper function. This causes us to stop at step
> 1, which yields NULL for hugetlb instead of task->mempolicy which was the
> previous (and correct) result.
>
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops. Andi and Christoph, does this look right to you?
>

Can we just double-check the refcounting please?

> index 4fefbad..8d2672d 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)
>
> if (sfd->vm_ops->get_policy)
> pol = sfd->vm_ops->get_policy(vma, addr);

afacit this takes a ref on the underlying policy

> - else
> + else if (vma->vm_policy)
> pol = vma->vm_policy;
> + else
> + pol = current->mempolicy;

but these two do not.

> return pol;
> }
> #endif

Is is all correct?

2007-06-12 04:47:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On Mon, Jun 11, 2007 at 09:30:20PM -0700, Andrew Morton wrote:
> Can we just double-check the refcounting please?

The refcounting for mpol's doesn't look good in general. I'm more
curious as to what releases the refcounts. alloc_page_vma(), for
instance, does get_vma_policy() which eventually takes a reference,
without ever releasing the reference it acquires. get_vma_policy()
itself uses a similar idiom to that used in aglitke's patch. I think
mpol refcounting needs to be addressed elsewhere besides this patch.


-- wli

2007-06-12 06:22:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

Adam Litke <[email protected]> writes:

> Here's another breakage as a result of shared memory stacked files :(
>
> The NUMA policy for a VMA is determined by checking the following (in the order
> given):
>
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
>
> By switching to stacked files for shared memory, get_policy() is now always set
> to shm_get_policy which is a wrapper function. This causes us to stop at step
> 1, which yields NULL for hugetlb instead of task->mempolicy which was the
> previous (and correct) result.
>
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops. Andi and Christoph, does this look right to you?

I'm confused.

I agree that the behavior you describe is correct.
However I only see two code paths were get_policy is called and
both of them take a NULL result and change it to task->mempolicy:

>From mm/mempolicy.c

> long do_get_mempolicy(int *policy, nodemask_t *nmask,
> unsigned long addr, unsigned long flags)
> {
> int err;
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> struct mempolicy *pol = current->mempolicy;
>
> cpuset_update_task_memory_state();
> if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> return -EINVAL;
> if (flags & MPOL_F_ADDR) {
> down_read(&mm->mmap_sem);
> vma = find_vma_intersection(mm, addr, addr+1);
> if (!vma) {
> up_read(&mm->mmap_sem);
> return -EFAULT;
> }
> if (vma->vm_ops && vma->vm_ops->get_policy)
> pol = vma->vm_ops->get_policy(vma, addr);
> else
> pol = vma->vm_policy;
> } else if (addr)
> return -EINVAL;
>
> if (!pol)
> pol = &default_policy;



> /* Return effective policy for a VMA */
> static struct mempolicy * get_vma_policy(struct task_struct *task,
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = task->mempolicy;
>
> if (vma) {
> if (vma->vm_ops && vma->vm_ops->get_policy)
> pol = vma->vm_ops->get_policy(vma, addr);
> else if (vma->vm_policy &&
> vma->vm_policy->policy != MPOL_DEFAULT)
> pol = vma->vm_policy;
> }
> if (!pol)
> pol = &default_policy;
> return pol;
> }


Does this perhaps need to be:
> Signed-off-by: Adam Litke <[email protected]>
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 4fefbad..8d2672d 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
> *vma, unsigned long addr)

+ pol = NULL;
>
> if (sfd->vm_ops->get_policy)
> pol = sfd->vm_ops->get_policy(vma, addr);
> - else
> + else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
> pol = vma->vm_policy;
> return pol;
> }
> #endif

Sorry I'm just a little dense at the moment.

Eric

2007-06-12 06:57:26

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On Tue, Jun 12, 2007 at 12:20:52AM -0600, Eric W. Biederman wrote:
> Does this perhaps need to be:
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index 4fefbad..8d2672d 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
>> *vma, unsigned long addr)
>>
>> + pol = NULL;
>>
>> if (sfd->vm_ops->get_policy)
>> pol = sfd->vm_ops->get_policy(vma, addr);
>> - else
>> + else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
>> pol = vma->vm_policy;
>> return pol;

Those paths are above the level where shm_get_policy() is called.
It may be that shm_get_policy() doesn't need to recapitulate them
if it's only ever called through such codepaths. It's not clear to
me whether that's intended as an invariant or is coincidental and
not guaranteed for future callsites.


-- wli

2007-06-12 10:20:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files


> Can we just double-check the refcounting please?
>
> > index 4fefbad..8d2672d 100644
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct *vma, unsigned long addr)
> >
> > if (sfd->vm_ops->get_policy)
> > pol = sfd->vm_ops->get_policy(vma, addr);
>
> afacit this takes a ref on the underlying policy
>
> > - else
> > + else if (vma->vm_policy)
> > pol = vma->vm_policy;
> > + else
> > + pol = current->mempolicy;
>
> but these two do not.
>
> > return pol;
> > }
> > #endif
>
> Is is all correct?

No it looks broken.

-Andi

2007-06-12 14:33:16

by Adam Litke

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

On 6/12/07, Eric W. Biederman <[email protected]> wrote:
> Adam Litke <[email protected]> writes:
>
> > Here's another breakage as a result of shared memory stacked files :(
> >
> > The NUMA policy for a VMA is determined by checking the following (in the order
> > given):
> >
> > 1) vma->vm_ops->get_policy() (if defined)
> > 2) vma->vm_policy (if defined)
> > 3) task->mempolicy (if defined)
> > 4) Fall back to default_policy
> >
> > By switching to stacked files for shared memory, get_policy() is now always set
> > to shm_get_policy which is a wrapper function. This causes us to stop at step
> > 1, which yields NULL for hugetlb instead of task->mempolicy which was the
> > previous (and correct) result.
> >
> > This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> > wrapped vm_ops. Andi and Christoph, does this look right to you?
>
> I'm confused.
>
> I agree that the behavior you describe is correct.
> However I only see two code paths were get_policy is called and
> both of them take a NULL result and change it to task->mempolicy:

The coffee hasn't quite absorbed yet, but don't those two code paths
take a NULL result from get_policy() and turn it into default_policy,
not task->mempolicy?

> From mm/mempolicy.c
>
> > long do_get_mempolicy(int *policy, nodemask_t *nmask,
> > unsigned long addr, unsigned long flags)
> > {
> > int err;
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma = NULL;
> > struct mempolicy *pol = current->mempolicy;
> >
> > cpuset_update_task_memory_state();
> > if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> > return -EINVAL;
> > if (flags & MPOL_F_ADDR) {
> > down_read(&mm->mmap_sem);
> > vma = find_vma_intersection(mm, addr, addr+1);
> > if (!vma) {
> > up_read(&mm->mmap_sem);
> > return -EFAULT;
> > }
> > if (vma->vm_ops && vma->vm_ops->get_policy)
> > pol = vma->vm_ops->get_policy(vma, addr);
> > else
> > pol = vma->vm_policy;
> > } else if (addr)
> > return -EINVAL;
> >
> > if (!pol)
> > pol = &default_policy;
>
>
>
> > /* Return effective policy for a VMA */
> > static struct mempolicy * get_vma_policy(struct task_struct *task,
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > struct mempolicy *pol = task->mempolicy;
> >
> > if (vma) {
> > if (vma->vm_ops && vma->vm_ops->get_policy)
> > pol = vma->vm_ops->get_policy(vma, addr);
> > else if (vma->vm_policy &&
> > vma->vm_policy->policy != MPOL_DEFAULT)
> > pol = vma->vm_policy;
> > }
> > if (!pol)
> > pol = &default_policy;
> > return pol;
> > }
>
>
> Does this perhaps need to be:
> > Signed-off-by: Adam Litke <[email protected]>
> >
> > diff --git a/ipc/shm.c b/ipc/shm.c
> > index 4fefbad..8d2672d 100644
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
> > *vma, unsigned long addr)
>
> + pol = NULL;
> >
> > if (sfd->vm_ops->get_policy)
> > pol = sfd->vm_ops->get_policy(vma, addr);
> > - else
> > + else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
> > pol = vma->vm_policy;
> > return pol;
> > }
> > #endif

afaict this would provide no way for pol to be set to task->mempolicy
for hugetlb per my comment above.

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center

2007-06-12 18:23:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

"Adam Litke" <[email protected]> writes:

> On 6/12/07, Eric W. Biederman <[email protected]> wrote:
>> Adam Litke <[email protected]> writes:
>>
>> > Here's another breakage as a result of shared memory stacked files :(
>> >
>> > The NUMA policy for a VMA is determined by checking the following (in the
> order
>> > given):

Where that doesn't appear to be what the current code does.
If there is a VMA it doesn't appear that we check task->mempolicy.

>> > 1) vma->vm_ops->get_policy() (if defined)
>> > 2) vma->vm_policy (if defined)
>> > 3) task->mempolicy (if defined)
>> > 4) Fall back to default_policy
>> >
>> > By switching to stacked files for shared memory, get_policy() is now always
> set
>> > to shm_get_policy which is a wrapper function. This causes us to stop at
> step
>> > 1, which yields NULL for hugetlb instead of task->mempolicy which was the
>> > previous (and correct) result.
>> >
>> > This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for
> the
>> > wrapped vm_ops. Andi and Christoph, does this look right to you?
>>
>> I'm confused.
>>
>> I agree that the behavior you describe is correct.
>> However I only see two code paths were get_policy is called and
>> both of them take a NULL result and change it to task->mempolicy:
>
> The coffee hasn't quite absorbed yet, but don't those two code paths
> take a NULL result from get_policy() and turn it into default_policy,
> not task->mempolicy?

True, a NULL is turned into default_policy. However the basic
confusion remains. I don't see how we ever return task->mempolicy
on those paths if we have a vma, and those appear to be the only
call sites of get_policy.

>> From mm/mempolicy.c
>>
>> > long do_get_mempolicy(int *policy, nodemask_t *nmask,
>> > unsigned long addr, unsigned long flags)
>> > {
>> > int err;
>> > struct mm_struct *mm = current->mm;
>> > struct vm_area_struct *vma = NULL;
>> > struct mempolicy *pol = current->mempolicy;
>> >
>> > cpuset_update_task_memory_state();
>> > if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>> > return -EINVAL;
>> > if (flags & MPOL_F_ADDR) {
>> > down_read(&mm->mmap_sem);
>> > vma = find_vma_intersection(mm, addr, addr+1);
>> > if (!vma) {
>> > up_read(&mm->mmap_sem);
>> > return -EFAULT;
>> > }
>> > if (vma->vm_ops && vma->vm_ops->get_policy)
>> > pol = vma->vm_ops->get_policy(vma, addr);
>> > else
>> > pol = vma->vm_policy;
>> > } else if (addr)
>> > return -EINVAL;
>> >
>> > if (!pol)
>> > pol = &default_policy;
>>
>>
>>
>> > /* Return effective policy for a VMA */
>> > static struct mempolicy * get_vma_policy(struct task_struct *task,
>> > struct vm_area_struct *vma, unsigned long addr)
>> > {
>> > struct mempolicy *pol = task->mempolicy;
>> >
>> > if (vma) {
>> > if (vma->vm_ops && vma->vm_ops->get_policy)
>> > pol = vma->vm_ops->get_policy(vma, addr);
>> > else if (vma->vm_policy &&
>> > vma->vm_policy->policy != MPOL_DEFAULT)
>> > pol = vma->vm_policy;
>> > }
>> > if (!pol)
>> > pol = &default_policy;
>> > return pol;
>> > }
>>
>>
>> Does this perhaps need to be:
>> > Signed-off-by: Adam Litke <[email protected]>
>> >
>> > diff --git a/ipc/shm.c b/ipc/shm.c
>> > index 4fefbad..8d2672d 100644
>> > --- a/ipc/shm.c
>> > +++ b/ipc/shm.c
>> > @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
>> > *vma, unsigned long addr)
>>
>> + pol = NULL;
>> >
>> > if (sfd->vm_ops->get_policy)
>> > pol = sfd->vm_ops->get_policy(vma, addr);
>> > - else
>> > + else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
>> > pol = vma->vm_policy;
>> > return pol;
>> > }
>> > #endif
>
> afaict this would provide no way for pol to be set to task->mempolicy
> for hugetlb per my comment above.

Being dense I don't see a way for us to have ever returned task->mempolicy.

So perhaps something else changed. Or something was made more
consistent and another bug was revealed.

I don't truly see the bug in shm_get_policy. I can see why it would
not have the expected affect but that is different.

Eric