2012-10-08 15:10:00

by Dave Jones

[permalink] [raw]
Subject: mpol_to_str revisited.

Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
a user triggerable BUG in mempolicy.

Ben Hutchings pointed out to me that my change introduced a potential leak
of stack contents to userspace, because none of the callers check the return value.

This patch adds the missing return checking, and also clears the buffer beforehand.

Reported-by: Ben Hutchings <[email protected]>
Cc: [email protected]
Signed-off-by: Dave Jones <[email protected]>

---
unanswered question: why are the buffer sizes here different ? which is correct?


diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
--- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400
+++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400
@@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
struct mm_walk walk = {};
struct mempolicy *pol;
int n;
+ int ret;
char buffer[50];

if (!mm)
@@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
walk.mm = mm;

pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
- mpol_to_str(buffer, sizeof(buffer), pol, 0);
+ memset(buffer, 0, sizeof(buffer));
+ ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
+ if (ret < 0)
+ return 0;
+
mpol_cond_put(pol);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);
diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c
--- src/git-trees/kernel/linux/mm/shmem.c 2012-10-02 15:49:51.977277944 -0400
+++ linux-dj/mm/shmem.c 2012-10-04 19:32:28.862949907 -0400
@@ -885,13 +885,15 @@ redirty:
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];
+ int ret;

if (!mpol || mpol->mode == MPOL_DEFAULT)
return; /* show nothing */

- mpol_to_str(buffer, sizeof(buffer), mpol, 1);
-
- seq_printf(seq, ",mpol=%s", buffer);
+ memset(buffer, 0, sizeof(buffer));
+ ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
+ if (ret > 0)
+ seq_printf(seq, ",mpol=%s", buffer);
}

static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)


2012-10-08 15:16:02

by Dave Jones

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, Oct 08, 2012 at 11:09:49AM -0400, Dave Jones wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
>
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
>
> This patch adds the missing return checking, and also clears the buffer beforehand.
>
> Reported-by: Ben Hutchings <[email protected]>
> Cc: [email protected]
> Signed-off-by: Dave Jones <[email protected]>
>
> ---
> unanswered question: why are the buffer sizes here different ? which is correct?

A further unanswered question is how the state got so screwed up that we hit that
default case at all. Looking at the original report: https://lkml.org/lkml/2012/9/6/356
What's in RAX looks suspiciously like left-over slab poison.

If pol->mode was poisoned, that smells like we have a race where policy is getting freed
while another process is reading it.

Am I missing something, or is there no locking around that at all ?

Dave

2012-10-08 20:35:57

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, 8 Oct 2012, Dave Jones wrote:

> unanswered question: why are the buffer sizes here different ? which is correct?
>

Given the current set of mempolicy modes and flags, it's 34, but this can
change if new modes or flags are added with longer names. I see no reason
why shmem shouldn't round up to the nearest power-of-2 of 64 like it
already does, but 50 is certainly safe as well in task_mmu.c.

> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
> --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400
> +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400
> @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
> struct mm_walk walk = {};
> struct mempolicy *pol;
> int n;
> + int ret;
> char buffer[50];
>
> if (!mm)
> @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
> walk.mm = mm;
>
> pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> - mpol_to_str(buffer, sizeof(buffer), pol, 0);
> + memset(buffer, 0, sizeof(buffer));
> + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
> + if (ret < 0)
> + return 0;

We should need the mpol_cond_put(pol) here before returning.

> +
> mpol_cond_put(pol);
>
> seq_printf(m, "%08lx %s", vma->vm_start, buffer);
> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c
> --- src/git-trees/kernel/linux/mm/shmem.c 2012-10-02 15:49:51.977277944 -0400
> +++ linux-dj/mm/shmem.c 2012-10-04 19:32:28.862949907 -0400
> @@ -885,13 +885,15 @@ redirty:
> static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
> {
> char buffer[64];
> + int ret;
>
> if (!mpol || mpol->mode == MPOL_DEFAULT)
> return; /* show nothing */
>
> - mpol_to_str(buffer, sizeof(buffer), mpol, 1);
> -
> - seq_printf(seq, ",mpol=%s", buffer);
> + memset(buffer, 0, sizeof(buffer));
> + ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
> + if (ret > 0)
> + seq_printf(seq, ",mpol=%s", buffer);
> }
>
> static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)

2012-10-08 20:46:44

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, 8 Oct 2012, Dave Jones wrote:

> If pol->mode was poisoned, that smells like we have a race where policy is getting freed
> while another process is reading it.
>
> Am I missing something, or is there no locking around that at all ?
>

The only thing that is held during the read() is a reference to the
task_struct so it doesn't disappear from under us. The protection needed
for a task's mempolicy, however, is task_lock() and that is not held.

2012-10-08 20:52:24

by Dave Jones

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, Oct 08, 2012 at 01:35:42PM -0700, David Rientjes wrote:

> > unanswered question: why are the buffer sizes here different ? which is correct?
> >
> Given the current set of mempolicy modes and flags, it's 34, but this can
> change if new modes or flags are added with longer names. I see no reason
> why shmem shouldn't round up to the nearest power-of-2 of 64 like it
> already does, but 50 is certainly safe as well in task_mmu.c.

Ok. I'll leave that for now.

> > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
> > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400
> > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400
> > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
> > struct mm_walk walk = {};
> > struct mempolicy *pol;
> > int n;
> > + int ret;
> > char buffer[50];
> >
> > if (!mm)
> > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
> > walk.mm = mm;
> >
> > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> > - mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > + memset(buffer, 0, sizeof(buffer));
> > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > + if (ret < 0)
> > + return 0;
>
> We should need the mpol_cond_put(pol) here before returning.

good catch. I'll respin the patch later with this changed.

thanks,

Dave

2012-10-09 00:33:22

by Ben Hutchings

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, 2012-10-08 at 11:09 -0400, Dave Jones wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
>
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
>
> This patch adds the missing return checking, and also clears the buffer beforehand.
>
> Reported-by: Ben Hutchings <[email protected]>

I was wearing my other hat at the time ([email protected]).

> Cc: [email protected]
> Signed-off-by: Dave Jones <[email protected]>
>
> ---
> unanswered question: why are the buffer sizes here different ? which is correct?
[...]

Further question: why even use an intermediate buffer on the stack?
Both callers want to write the result to a seq_file. Should mpol_str()
then be replaced with a seq_mpol()?

Ben.

--
Ben Hutchings
Who are all these weirdos? - David Bowie, about L-Space IRC channel #afp


Attachments:
signature.asc (828.00 B)
This is a digitally signed message part

2012-10-16 00:48:36

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, 8 Oct 2012, Dave Jones wrote:

> > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
> > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400
> > > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400
> > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
> > > struct mm_walk walk = {};
> > > struct mempolicy *pol;
> > > int n;
> > > + int ret;
> > > char buffer[50];
> > >
> > > if (!mm)
> > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
> > > walk.mm = mm;
> > >
> > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> > > - mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > > + memset(buffer, 0, sizeof(buffer));
> > > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > > + if (ret < 0)
> > > + return 0;
> >
> > We should need the mpol_cond_put(pol) here before returning.
>
> good catch. I'll respin the patch later with this changed.
>

Did you get a chance to fix this issue?

2012-10-16 02:35:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, Oct 8, 2012 at 11:09 AM, Dave Jones <[email protected]> wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
>
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
>
> This patch adds the missing return checking, and also clears the buffer beforehand.

I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. we should
close a race (or kill remain ref count leak) if we still have.
Because of, this patch makes unstable /proc output and might lead to
userland confusing.

2012-10-16 03:58:37

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:

> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.

It's certainly not a complete fix, but I think it's a much better result
of the race, i.e. we don't panic anymore, we simply fail the read()
instead.

> we should
> close a race (or kill remain ref count leak) if we still have.

As I mentioned earlier in the thread, the read() is done here on a task
while only a reference to the task_struct is taken and we do not hold
task_lock() which is required for task->mempolicy. Once that is fixed,
mpol_to_str() should never be called for !task->mempolicy so it will never
need to return -EINVAL in such a condition.

2012-10-16 05:10:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes <[email protected]> wrote:
> On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:
>
>> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>
> It's certainly not a complete fix, but I think it's a much better result
> of the race, i.e. we don't panic anymore, we simply fail the read()
> instead.

Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
to caller complex. That's not good and have no worth.

>> we should
>> close a race (or kill remain ref count leak) if we still have.
>
> As I mentioned earlier in the thread, the read() is done here on a task
> while only a reference to the task_struct is taken and we do not hold
> task_lock() which is required for task->mempolicy. Once that is fixed,
> mpol_to_str() should never be called for !task->mempolicy so it will never
> need to return -EINVAL in such a condition.

I agree that's obviously a bug and we should fix it.

2012-10-16 06:10:14

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
> >
> > It's certainly not a complete fix, but I think it's a much better result
> > of the race, i.e. we don't panic anymore, we simply fail the read()
> > instead.
>
> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
> to caller complex. That's not good and have no worth.
>

Before: the kernel panics, all workloads cease.
After: the file shows garbage, all workloads continue.

This is better, in my opinion, but at best it's only a judgment call and
has no effect on anything.

I agree it would be better to respect the return value of mpol_to_str()
since there are other possible error conditions other than a freed
mempolicy, but let's not consider reverting 80de7c3138. It is obviously
not a full solution to the problem, though, and we need to serialize with
task_lock().

Dave, are you interested in coming up with a patch?

2012-10-16 23:39:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Tue, Oct 16, 2012 at 2:10 AM, David Rientjes <[email protected]> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>> >
>> > It's certainly not a complete fix, but I think it's a much better result
>> > of the race, i.e. we don't panic anymore, we simply fail the read()
>> > instead.
>>
>> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
>> to caller complex. That's not good and have no worth.
>>
>
> Before: the kernel panics, all workloads cease.
> After: the file shows garbage, all workloads continue.
>
> This is better, in my opinion, but at best it's only a judgment call and
> has no effect on anything.

Kernel panics help to find our serious mistake.


> I agree it would be better to respect the return value of mpol_to_str()
> since there are other possible error conditions other than a freed
> mempolicy, but let's not consider reverting 80de7c3138. It is obviously
> not a full solution to the problem, though, and we need to serialize with
> task_lock().

Sorry no. I will have to revert it. mempolicy have already a lot of
meaningless complex and bring us a lot of problems. I haven't
seen any reason adding more.


> Dave, are you interested in coming up with a patch?

2012-10-17 00:12:53

by David Rientjes

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
> >> to caller complex. That's not good and have no worth.
> >>
> >
> > Before: the kernel panics, all workloads cease.
> > After: the file shows garbage, all workloads continue.
> >
> > This is better, in my opinion, but at best it's only a judgment call and
> > has no effect on anything.
>
> Kernel panics help to find our serious mistake.
>

Kernel panics are not your little debugging tool to let users suffer
through for non-fatal issues.

> > I agree it would be better to respect the return value of mpol_to_str()
> > since there are other possible error conditions other than a freed
> > mempolicy, but let's not consider reverting 80de7c3138. It is obviously
> > not a full solution to the problem, though, and we need to serialize with
> > task_lock().
>
> Sorry no. I will have to revert it.

Feel free to revert anything you wish in your own tree, I couldn't care
less. If you try to propose it upstream, Andrew will surely ask you to
justify the BUG(), good luck on that.

I'll reply to this message with the fix that I think is best.

2012-10-17 00:31:28

by David Rientjes

[permalink] [raw]
Subject: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

When reading /proc/pid/numa_maps, it's possible to return the contents of
the stack where the mempolicy string should be printed if the policy gets
freed from beneath us.

This happens because mpol_to_str() may return an error the
stack-allocated buffer is then printed without ever being stored.

There are two possible error conditions in mpol_to_str():

- if the buffer allocated is insufficient for the string to be stored,
and

- if the mempolicy has an invalid mode.

The first error condition is not triggered in any of the callers to
mpol_to_str(): at least 50 bytes is always allocated on the stack and this
is sufficient for the string to be written. A future patch should convert
this into BUILD_BUG_ON() since we know the maximum strlen possible, but
that's not -rc material.

The second error condition is possible if a race occurs in dropping a
reference to a task's mempolicy causing it to be freed during the read().
The slab poison value is then used for the mode and mpol_to_str() returns
-EINVAL.

This race is only possible because get_vma_policy() believes that
mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
does not hold mm->mmap_sem when dropping the reference or setting
task->mempolicy to NULL: it uses task_lock(task) instead.

Thus, it's required for the caller of a task mempolicy to hold
task_lock(task) while grabbing the mempolicy and reading it. Callers with
a vma policy store their mempolicy earlier and can simply increment the
reference count so it's guaranteed not to be freed.

Reported-by: Dave Jones <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++--
mm/mempolicy.c | 5 ++---
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
struct vm_area_struct *vma = v;
struct numa_maps *md = &numa_priv->md;
struct file *file = vma->vm_file;
+ struct task_struct *task = proc_priv->task;
struct mm_struct *mm = vma->vm_mm;
struct mm_walk walk = {};
struct mempolicy *pol;
@@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
+ task_lock(task);
+ pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
+ task_unlock(task);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

@@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
seq_printf(m, " heap");
} else {
- pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
+ pid_t tid = vm_is_stack(task, vma, is_pid);
if (tid != 0) {
/*
* Thread stack in /proc/PID/task/TID/maps or
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b78fb9..d04a8a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
*
* Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies
- * are protected by the task's mmap_sem, which must be held for read by
- * the caller.
+ * Current or other task's task mempolicy and non-shared vma policies must be
+ * protected by task_lock(task) by the caller.
* Shared policies [those marked as MPOL_F_SHARED] require an extra reference
* count--added by the get_policy() vm_op, as appropriate--to protect against
* freeing by another task. It is the caller's responsibility to free the

2012-10-17 01:34:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: mpol_to_str revisited.

On Tue, Oct 16, 2012 at 8:12 PM, David Rientjes <[email protected]> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
>> >> to caller complex. That's not good and have no worth.
>> >>
>> >
>> > Before: the kernel panics, all workloads cease.
>> > After: the file shows garbage, all workloads continue.
>> >
>> > This is better, in my opinion, but at best it's only a judgment call and
>> > has no effect on anything.
>>
>> Kernel panics help to find our serious mistake.
>
> Kernel panics are not your little debugging tool to let users suffer
> through for non-fatal issues.

use after free is fatal, no doubt.

>
>> > I agree it would be better to respect the return value of mpol_to_str()
>> > since there are other possible error conditions other than a freed
>> > mempolicy, but let's not consider reverting 80de7c3138. It is obviously
>> > not a full solution to the problem, though, and we need to serialize with
>> > task_lock().
>>
>> Sorry no. I will have to revert it.
>
> Feel free to revert anything you wish in your own tree, I couldn't care
> less. If you try to propose it upstream, Andrew will surely ask you to
> justify the BUG(), good luck on that.

Yeah.
I'm ok just remove both BUG() and EINVAL, but current situation (i.e. ignoring
EINVAL by caller) is surely bad. So, just revert is best IMHO.


>
> I'll reply to this message with the fix that I think is best.

2012-10-17 01:38:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes <[email protected]> wrote:
> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error the
> stack-allocated buffer is then printed without ever being stored.
>
> There are two possible error conditions in mpol_to_str():
>
> - if the buffer allocated is insufficient for the string to be stored,
> and
>
> - if the mempolicy has an invalid mode.
>
> The first error condition is not triggered in any of the callers to
> mpol_to_str(): at least 50 bytes is always allocated on the stack and this
> is sufficient for the string to be written. A future patch should convert
> this into BUILD_BUG_ON() since we know the maximum strlen possible, but
> that's not -rc material.
>
> The second error condition is possible if a race occurs in dropping a
> reference to a task's mempolicy causing it to be freed during the read().
> The slab poison value is then used for the mode and mpol_to_str() returns
> -EINVAL.
>
> This race is only possible because get_vma_policy() believes that
> mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
> does not hold mm->mmap_sem when dropping the reference or setting
> task->mempolicy to NULL: it uses task_lock(task) instead.
>
> Thus, it's required for the caller of a task mempolicy to hold
> task_lock(task) while grabbing the mempolicy and reading it. Callers with
> a vma policy store their mempolicy earlier and can simply increment the
> reference count so it's guaranteed not to be freed.
>
> Reported-by: Dave Jones <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> fs/proc/task_mmu.c | 7 +++++--
> mm/mempolicy.c | 5 ++---
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
> struct vm_area_struct *vma = v;
> struct numa_maps *md = &numa_priv->md;
> struct file *file = vma->vm_file;
> + struct task_struct *task = proc_priv->task;
> struct mm_struct *mm = vma->vm_mm;
> struct mm_walk walk = {};
> struct mempolicy *pol;
> @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
> walk.private = md;
> walk.mm = mm;
>
> - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> + task_lock(task);
> + pol = get_vma_policy(task, vma, vma->vm_start);
> mpol_to_str(buffer, sizeof(buffer), pol, 0);
> mpol_cond_put(pol);
> + task_unlock(task);
>
> seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>
> @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
> } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
> seq_printf(m, " heap");
> } else {
> - pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
> + pid_t tid = vm_is_stack(task, vma, is_pid);
> if (tid != 0) {
> /*
> * Thread stack in /proc/PID/task/TID/maps or
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0b78fb9..d04a8a5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
> *
> * Returns effective policy for a VMA at specified address.
> * Falls back to @task or system default policy, as necessary.
> - * Current or other task's task mempolicy and non-shared vma policies
> - * are protected by the task's mmap_sem, which must be held for read by
> - * the caller.
> + * Current or other task's task mempolicy and non-shared vma policies must be
> + * protected by task_lock(task) by the caller.

This is not correct. mmap_sem is needed for protecting vma. task_lock()
is needed to close vs exit race only when task != current. In other word,
caller must held both mmap_sem and task_lock if task != current.




> * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
> * count--added by the get_policy() vm_op, as appropriate--to protect against
> * freeing by another task. It is the caller's responsibility to free the

2012-10-17 01:49:06

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 0b78fb9..d04a8a5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
> > *
> > * Returns effective policy for a VMA at specified address.
> > * Falls back to @task or system default policy, as necessary.
> > - * Current or other task's task mempolicy and non-shared vma policies
> > - * are protected by the task's mmap_sem, which must be held for read by
> > - * the caller.
> > + * Current or other task's task mempolicy and non-shared vma policies must be
> > + * protected by task_lock(task) by the caller.
>
> This is not correct. mmap_sem is needed for protecting vma. task_lock()
> is needed to close vs exit race only when task != current. In other word,
> caller must held both mmap_sem and task_lock if task != current.
>

The comment is specifically addressing non-shared vma policies, you do not
need to hold mmap_sem to access another thread's mempolicy.

2012-10-17 01:53:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes <[email protected]> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 0b78fb9..d04a8a5 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
>> > *
>> > * Returns effective policy for a VMA at specified address.
>> > * Falls back to @task or system default policy, as necessary.
>> > - * Current or other task's task mempolicy and non-shared vma policies
>> > - * are protected by the task's mmap_sem, which must be held for read by
>> > - * the caller.
>> > + * Current or other task's task mempolicy and non-shared vma policies must be
>> > + * protected by task_lock(task) by the caller.
>>
>> This is not correct. mmap_sem is needed for protecting vma. task_lock()
>> is needed to close vs exit race only when task != current. In other word,
>> caller must held both mmap_sem and task_lock if task != current.
>
> The comment is specifically addressing non-shared vma policies, you do not
> need to hold mmap_sem to access another thread's mempolicy.

I didn't say old comment is true. I just only your new comment also false.

2012-10-17 04:05:50

by Dave Jones

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote:

> - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> + task_lock(task);
> + pol = get_vma_policy(task, vma, vma->vm_start);
> mpol_to_str(buffer, sizeof(buffer), pol, 0);
> mpol_cond_put(pol);
> + task_unlock(task);

This seems to cause some fallout for me..

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
#0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
#1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
#2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
[<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
[<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
[<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
[<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
[<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
[<ffffffff81254fa3>] show_numa_map+0x163/0x610
[<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
[<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
[<ffffffff81255483>] show_pid_numa_map+0x13/0x20
[<ffffffff8120c902>] traverse+0xf2/0x230
[<ffffffff8120cd8b>] seq_lseek+0xab/0x120
[<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
[<ffffffff816ca088>] tracesys+0xe1/0xe6


same problem, different syscall..


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3
3 locks on stack by trinity-child3/21996:
#0: held: (&p->lock){+.+.+.}, instance: ffff88008d712c08, at: [<ffffffff8120ce3d>] seq_read+0x3d/0x3e0
#1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
#2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32
Call Trace:
[<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
[<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
[<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
[<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
[<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
[<ffffffff81254fa3>] show_numa_map+0x163/0x610
[<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
[<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
[<ffffffff81255483>] show_pid_numa_map+0x13/0x20
[<ffffffff8120c902>] traverse+0xf2/0x230
[<ffffffff8120d14b>] seq_read+0x34b/0x3e0
[<ffffffff8120ce00>] ? seq_lseek+0x120/0x120
[<ffffffff811e751a>] do_loop_readv_writev+0x5a/0x90
[<ffffffff811e7851>] do_readv_writev+0x1c1/0x1e0
[<ffffffff810b0de1>] ? get_parent_ip+0x11/0x50
[<ffffffff811e7905>] vfs_readv+0x35/0x60
[<ffffffff811e7b72>] sys_preadv+0xc2/0xe0
[<ffffffff816ca088>] tracesys+0xe1/0xe6

2012-10-17 05:24:37

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 17 Oct 2012, Dave Jones wrote:

> BUG: sleeping function called from invalid context at kernel/mutex.c:269
> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
> 3 locks on stack by trinity-child2/8558:
> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
> Call Trace:
> [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
> [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
> [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
> [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
> [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
> [<ffffffff81254fa3>] show_numa_map+0x163/0x610
> [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
> [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
> [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
> [<ffffffff8120c902>] traverse+0xf2/0x230
> [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
> [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
> [<ffffffff816ca088>] tracesys+0xe1/0xe6
>

Hmm, looks like we need to change the refcount semantics entirely. We'll
need to make get_vma_policy() always take a reference and then drop it
accordingly. This work sif get_vma_policy() can grab a reference while
holding task_lock() for the task policy fallback case.

Comments on this approach?
---
fs/proc/task_mmu.c | 4 +---
include/linux/mm.h | 3 +--
mm/hugetlb.c | 4 ++--
mm/mempolicy.c | 41 ++++++++++++++++++++++-------------------
4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
- mpol_cond_put(pol);
- task_unlock(task);
+ __mpol_put(pol);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -216,8 +216,7 @@ struct vm_operations_struct {
* get_policy() op must add reference [mpol_get()] to any policy at
* (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure
* in mm/mempolicy.c will do this automatically.
- * get_policy() must NOT add a ref if the policy at (vma,addr) is not
- * marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
+ * vma policies are protected by the mmap_sem.
* If no [shared/vma] mempolicy exists at the addr, get_policy() op
* must return NULL--i.e., do not "fallback" to task or system default
* policy.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
}
}

- mpol_cond_put(mpol);
+ __mpol_put(mpol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

err:
- mpol_cond_put(mpol);
+ __mpol_put(mpol);
return NULL;
}

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
*
* Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task. It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
*/
struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
- struct mempolicy *pol = task->mempolicy;
+ struct mempolicy *pol;
+
+ task_lock(task);
+ pol = task->mempolicy;
+ mpol_get(pol);
+ task_unlock(task);

if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
addr);
- if (vpol)
+ if (vpol) {
+ mpol_put(pol);
pol = vpol;
+ if (!mpol_needs_cond_ref(pol))
+ mpol_get(pol);
+ }
} else if (vma->vm_policy) {
+ mpol_put(pol);
pol = vma->vm_policy;
-
- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
+ mpol_get(pol);
}
}
- if (!pol)
+ if (!pol) {
pol = &default_policy;
+ mpol_get(pol);
+ }
return pol;
}

@@ -1919,7 +1921,7 @@ retry_cpuset:
unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
+ __mpol_put(pol);
page = alloc_page_interleave(gfp, order, nid);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
@@ -1943,6 +1945,7 @@ retry_cpuset:
*/
page = __alloc_pages_nodemask(gfp, order, zl,
policy_nodemask(gfp, pol));
+ __mpol_put(pol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

2012-10-17 05:43:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

(2012/10/17 14:24), David Rientjes wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>> 3 locks on stack by trinity-child2/8558:
>> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
>> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>> Call Trace:
>> [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>> [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>> [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>> [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>> [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>> [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>> [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>> [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>> [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>> [<ffffffff8120c902>] traverse+0xf2/0x230
>> [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>> [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>> [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely. We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly. This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?

I think this refcounting is better than using task_lock().

Thanks,
-Kame

2012-10-17 08:49:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki
<[email protected]> wrote:
> (2012/10/17 14:24), David Rientjes wrote:
>>
>> On Wed, 17 Oct 2012, Dave Jones wrote:
>>
>>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>>> 3 locks on stack by trinity-child2/8558:
>>> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at:
>>> [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>>> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at:
>>> [<ffffffff81254437>] m_start+0xa7/0x190
>>> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance:
>>> ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>>> Call Trace:
>>> [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>>> [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>>> [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>>> [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>>> [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>>> [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>>> [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>>> [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>>> [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>>> [<ffffffff8120c902>] traverse+0xf2/0x230
>>> [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>>> [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>>> [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>>
>>
>> Hmm, looks like we need to change the refcount semantics entirely. We'll
>> need to make get_vma_policy() always take a reference and then drop it
>> accordingly. This work sif get_vma_policy() can grab a reference while
>> holding task_lock() for the task policy fallback case.
>>
>> Comments on this approach?
>
>
> I think this refcounting is better than using task_lock().

I don't think so. get_vma_policy() is used from fast path. In other
words, number of
atomic ops is sensible for allocation performance. Instead, I'd like
to use spinlock
for shared mempolicy instead of mutex.

2012-10-17 18:14:43

by Dave Jones

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
> > BUG: sleeping function called from invalid context at kernel/mutex.c:269
>
> Hmm, looks like we need to change the refcount semantics entirely. We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly. This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?

Seems to be surviving my testing at least..

Dave

2012-10-17 19:21:24

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 17 Oct 2012, Dave Jones wrote:

> On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
> > On Wed, 17 Oct 2012, Dave Jones wrote:
> >
> > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
> >
> > Hmm, looks like we need to change the refcount semantics entirely. We'll
> > need to make get_vma_policy() always take a reference and then drop it
> > accordingly. This work sif get_vma_policy() can grab a reference while
> > holding task_lock() for the task policy fallback case.
> >
> > Comments on this approach?
>
> Seems to be surviving my testing at least..
>

Sounds good. Is it possible to verify that policy_cache isn't getting
larger than normal in /proc/slabinfo, i.e. when all processes with a
task mempolicy or shared vma policy have exited, are there still a
significant number of active objects?

2012-10-17 19:32:45

by Dave Jones

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
> > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
> > > On Wed, 17 Oct 2012, Dave Jones wrote:
> > >
> > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
> > >
> > > Hmm, looks like we need to change the refcount semantics entirely. We'll
> > > need to make get_vma_policy() always take a reference and then drop it
> > > accordingly. This work sif get_vma_policy() can grab a reference while
> > > holding task_lock() for the task policy fallback case.
> > >
> > > Comments on this approach?
> >
> > Seems to be surviving my testing at least..
> >
>
> Sounds good. Is it possible to verify that policy_cache isn't getting
> larger than normal in /proc/slabinfo, i.e. when all processes with a
> task mempolicy or shared vma policy have exited, are there still a
> significant number of active objects?

Killing the fuzzer caused it to drop dramatically.

Before:
(15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy
shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0
numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0

After:
(15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy
shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0
numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0

2012-10-17 19:38:59

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 17 Oct 2012, Dave Jones wrote:

> > Sounds good. Is it possible to verify that policy_cache isn't getting
> > larger than normal in /proc/slabinfo, i.e. when all processes with a
> > task mempolicy or shared vma policy have exited, are there still a
> > significant number of active objects?
>
> Killing the fuzzer caused it to drop dramatically.
>
> Before:
> (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy
> shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0
> numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0
>
> After:
> (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy
> shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0
> numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0
>

Excellent, thanks. This shows that the refcounting is working properly
and we're not leaking any references as a result of this change causing
the mempolicies to never be freed. ("numa_policy" turns out to be
policy_cache in the code, so thanks for checking both of them.)

Could I add your tested-by?

2012-10-17 19:45:20

by Dave Jones

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote:

> > > Sounds good. Is it possible to verify that policy_cache isn't getting
> > > larger than normal in /proc/slabinfo, i.e. when all processes with a
> > > task mempolicy or shared vma policy have exited, are there still a
> > > significant number of active objects?
> >
> > Killing the fuzzer caused it to drop dramatically.
> >
> Excellent, thanks. This shows that the refcounting is working properly
> and we're not leaking any references as a result of this change causing
> the mempolicies to never be freed. ("numa_policy" turns out to be
> policy_cache in the code, so thanks for checking both of them.)
>
> Could I add your tested-by?

Sure. Here's a fresh one I just baked.

Tested-by: Dave Jones <[email protected]>

Dave

2012-10-17 19:50:24

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > I think this refcounting is better than using task_lock().
>
> I don't think so. get_vma_policy() is used from fast path. In other
> words, number of
> atomic ops is sensible for allocation performance.

There are enhancements that we can make with refcounting: for instance, we
may want to avoid doing it in the super-fast path when the policy is
default_policy and then just do

if (mpol != &default_policy)
mpol_put(mpol);

> Instead, I'd like
> to use spinlock
> for shared mempolicy instead of mutex.
>

Um, this was just changed to a mutex last week in commit b22d127a39dd
("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
Are you nacking that patch, which you acked, now?

2012-10-17 20:28:53

by David Rientjes

[permalink] [raw]
Subject: [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
contents in numa_maps"), the mutex protecting a shared policy can be
inadvertently taken while holding task_lock(task).

Recently, commit b22d127a39dd ("mempolicy: fix a race in
shared_policy_replace()") switched the spinlock within a shared policy to
a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on
all mempolicies returned by get_vma_policy() so it isn't freed while being
passed to mpol_to_str() when reading /proc/pid/numa_maps.

This patch only takes task_lock() while dereferencing task->mempolicy in
get_vma_policy() to increment its refcount. This ensures it will remain
in memory until dropped by __mpol_put() after mpol_to_str() is called.

Refcounts of shared policies are grabbed by the ->get_policy() function of
the vma, all others will be grabbed directly in get_vma_policy(). Now
that this is done, all callers now unconditionally drop the refcount.

Tested-by: Dave Jones <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
fs/proc/task_mmu.c | 4 +--
include/linux/mempolicy.h | 12 +------
mm/hugetlb.c | 4 +--
mm/mempolicy.c | 79 +++++++++++++++++++--------------------------
4 files changed, 38 insertions(+), 61 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
- mpol_cond_put(pol);
- task_unlock(task);
+ __mpol_put(pol);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol)
*/
static inline int mpol_needs_cond_ref(struct mempolicy *pol)
{
- return (pol && (pol->flags & MPOL_F_SHARED));
-}
-
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
- if (mpol_needs_cond_ref(pol))
- __mpol_put(pol);
+ return pol->flags & MPOL_F_SHARED;
}

extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
@@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p)
{
}

-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-}
-
static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
struct mempolicy *from)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
}
}

- mpol_cond_put(mpol);
+ __mpol_put(mpol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

err:
- mpol_cond_put(mpol);
+ __mpol_put(mpol);
return NULL;
}

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
}

out:
- mpol_cond_put(pol);
+ if (mpol_needs_cond_ref(pol))
+ __mpol_put(pol);
if (vma)
up_read(&current->mm->mmap_sem);
return err;
@@ -1527,48 +1528,52 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
}

#endif
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma - virtual memory area whose policy is sought
- * @addr - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
*
- * Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task. It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
*/
struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
- struct mempolicy *pol = task->mempolicy;
+ struct mempolicy *pol;
+
+ /*
+ * Grab a reference before task has the potential to exit and free its
+ * mempolicy.
+ */
+ task_lock(task);
+ pol = task->mempolicy;
+ mpol_get(pol);
+ task_unlock(task);

if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
addr);
- if (vpol)
+ if (vpol) {
+ mpol_put(pol);
pol = vpol;
+ if (!mpol_needs_cond_ref(pol))
+ mpol_get(pol);
+ }
} else if (vma->vm_policy) {
+ mpol_put(pol);
pol = vma->vm_policy;
-
- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
+ mpol_get(pol);
}
}
- if (!pol)
+ if (!pol) {
pol = &default_policy;
+ mpol_get(pol);
+ }
return pol;
}

@@ -1919,30 +1924,14 @@ retry_cpuset:
unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
page = alloc_page_interleave(gfp, order, nid);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
- goto retry_cpuset;
-
- return page;
+ goto out;
}
zl = policy_zonelist(gfp, pol, node);
- if (unlikely(mpol_needs_cond_ref(pol))) {
- /*
- * slow path: ref counted shared policy
- */
- struct page *page = __alloc_pages_nodemask(gfp, order,
- zl, policy_nodemask(gfp, pol));
- __mpol_put(pol);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
- goto retry_cpuset;
- return page;
- }
- /*
- * fast path: default or task policy
- */
page = __alloc_pages_nodemask(gfp, order, zl,
policy_nodemask(gfp, pol));
+out:
+ __mpol_put(pol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

2012-10-17 21:05:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes <[email protected]> wrote:
> On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:
>
>> > I think this refcounting is better than using task_lock().
>>
>> I don't think so. get_vma_policy() is used from fast path. In other
>> words, number of
>> atomic ops is sensible for allocation performance.
>
> There are enhancements that we can make with refcounting: for instance, we
> may want to avoid doing it in the super-fast path when the policy is
> default_policy and then just do
>
> if (mpol != &default_policy)
> mpol_put(mpol);
>
>> Instead, I'd like
>> to use spinlock
>> for shared mempolicy instead of mutex.
>>
>
> Um, this was just changed to a mutex last week in commit b22d127a39dd
> ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> Are you nacking that patch, which you acked, now?

Yes, sadly. /proc usage is a corner case issue. It's not worth to
strike main path.
see commit 52cd3b0740 and around patches. That explain why we avoided your
approach.

2012-10-17 21:27:39

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > Um, this was just changed to a mutex last week in commit b22d127a39dd
> > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> > can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> > Are you nacking that patch, which you acked, now?
>
> Yes, sadly. /proc usage is a corner case issue. It's not worth to
> strike main path.

It also simplifies the fastpath since we can now unconditionally drop the
reference.

2012-10-17 21:31:13

by David Rientjes

[permalink] [raw]
Subject: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
contents in numa_maps"), the mutex protecting a shared policy can be
inadvertently taken while holding task_lock(task).

Recently, commit b22d127a39dd ("mempolicy: fix a race in
shared_policy_replace()") switched the spinlock within a shared policy to
a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on
all mempolicies returned by get_vma_policy() so it isn't freed while being
passed to mpol_to_str() when reading /proc/pid/numa_maps.

This patch only takes task_lock() while dereferencing task->mempolicy in
get_vma_policy() if it's non-NULL in the lockess check to increment its
refcount. This ensures it will remain in memory until dropped by
__mpol_put() after mpol_to_str() is called.

Refcounts of shared policies are grabbed by the ->get_policy() function of
the vma, all others will be grabbed directly in get_vma_policy(). Now
that this is done, all callers now unconditionally drop the refcount.

Tested-by: Dave Jones <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
v2: optimized task_lock() in get_vma_policy(): test for a non-NULL
task->mempolicy before taking task_lock() and grabbing the reference
so we don't take the lock unnecessarily.

fs/proc/task_mmu.c | 4 +--
include/linux/mempolicy.h | 12 +------
mm/hugetlb.c | 4 +--
mm/mempolicy.c | 79 ++++++++++++++++++++-------------------------
4 files changed, 39 insertions(+), 60 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..5709e70 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
- mpol_cond_put(pol);
- task_unlock(task);
+ __mpol_put(pol);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f76f7e0 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol)
*/
static inline int mpol_needs_cond_ref(struct mempolicy *pol)
{
- return (pol && (pol->flags & MPOL_F_SHARED));
-}
-
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
- if (mpol_needs_cond_ref(pol))
- __mpol_put(pol);
+ return pol->flags & MPOL_F_SHARED;
}

extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
@@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p)
{
}

-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-}
-
static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
struct mempolicy *from)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..5080808 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
}
}

- mpol_cond_put(mpol);
+ __mpol_put(mpol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

err:
- mpol_cond_put(mpol);
+ __mpol_put(mpol);
return NULL;
}

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..a0bb463 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
}

out:
- mpol_cond_put(pol);
+ if (mpol_needs_cond_ref(pol))
+ __mpol_put(pol);
if (vma)
up_read(&current->mm->mmap_sem);
return err;
@@ -1527,48 +1528,54 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
}

#endif
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma - virtual memory area whose policy is sought
- * @addr - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
*
- * Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task. It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
*/
struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;

+ /*
+ * Grab a reference before task has the potential to exit and free its
+ * mempolicy.
+ */
+ if (pol) {
+ task_lock(task);
+ pol = task->mempolicy;
+ mpol_get(pol);
+ task_unlock(task);
+ }
+
if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
addr);
- if (vpol)
+ if (vpol) {
+ mpol_put(pol);
pol = vpol;
+ if (!mpol_needs_cond_ref(pol))
+ mpol_get(pol);
+ }
} else if (vma->vm_policy) {
+ mpol_put(pol);
pol = vma->vm_policy;
-
- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
+ mpol_get(pol);
}
}
- if (!pol)
+ if (!pol) {
pol = &default_policy;
+ mpol_get(pol);
+ }
return pol;
}

@@ -1919,30 +1926,14 @@ retry_cpuset:
unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
- mpol_cond_put(pol);
page = alloc_page_interleave(gfp, order, nid);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
- goto retry_cpuset;
-
- return page;
+ goto out;
}
zl = policy_zonelist(gfp, pol, node);
- if (unlikely(mpol_needs_cond_ref(pol))) {
- /*
- * slow path: ref counted shared policy
- */
- struct page *page = __alloc_pages_nodemask(gfp, order,
- zl, policy_nodemask(gfp, pol));
- __mpol_put(pol);
- if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
- goto retry_cpuset;
- return page;
- }
- /*
- * fast path: default or task policy
- */
page = __alloc_pages_nodemask(gfp, order, zl,
policy_nodemask(gfp, pol));
+out:
+ __mpol_put(pol);
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
return page;

2012-10-18 04:07:14

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

(2012/10/18 6:31), David Rientjes wrote:
> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
> contents in numa_maps"), the mutex protecting a shared policy can be
> inadvertently taken while holding task_lock(task).
>
> Recently, commit b22d127a39dd ("mempolicy: fix a race in
> shared_policy_replace()") switched the spinlock within a shared policy to
> a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on
> all mempolicies returned by get_vma_policy() so it isn't freed while being
> passed to mpol_to_str() when reading /proc/pid/numa_maps.
>
> This patch only takes task_lock() while dereferencing task->mempolicy in
> get_vma_policy() if it's non-NULL in the lockess check to increment its
> refcount. This ensures it will remain in memory until dropped by
> __mpol_put() after mpol_to_str() is called.
>
> Refcounts of shared policies are grabbed by the ->get_policy() function of
> the vma, all others will be grabbed directly in get_vma_policy(). Now
> that this is done, all callers now unconditionally drop the refcount.
>

please add original problem description....

from your 1st patch.
> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error the
> stack-allocated buffer is then printed without ever being stored.
.....

Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something.

I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy()
because it's called every time alloc_pages_vma() is called, at every page fault.
So, it seems he doesn't agree this fix because of performance concern on big NUMA,


Can't we have another way to fix ? like this ? too ugly ?
Again, I'm sorry if I misunderstand the points.

==

From bfe7e2ab1c1375b134ec12efce6517149318f75d Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 18 Oct 2012 13:17:25 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc/<pid>/numa_maps scans vma and show mempolicy under
mmap_sem. It sometimes accesses task->mempolicy which can
be freed without mmap_sem and numa_maps can show some
garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..d92e868 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
{
if (vma && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
+#ifdef CONFIG_NUMA
+ task_lock(priv->task);
+ __mpol_put(priv->task->mempolicy);
+ task_unlock(priv->task);
+#endif
up_read(&mm->mmap_sem);
mmput(mm);
}
@@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return mm;
down_read(&mm->mmap_sem);

+ /*
+ * task->mempolicy can be freed even if mmap_sem is down (see kernel/exit.c)
+ * We grab refcount for stable access.
+ * repleacement of task->mmpolicy is guarded by mmap_sem.
+ */
+#ifdef CONFIG_NUMA
+ task_lock(priv->task);
+ mpol_get(priv->task->mempolicy);
+ task_unlock(priv->task);
+#endif
tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;

@@ -161,6 +176,11 @@ out:

/* End of vmas has been reached */
m->version = (tail_vma != NULL)? 0: -1UL;
+#ifdef CONFIG_NUMA
+ task_lock(priv->task);
+ __mpol_put(priv->task->mempolicy);
+ task_unlock(priv->task);
+#endif
up_read(&mm->mmap_sem);
mmput(mm);
return tail_vma;
--
1.7.10.2












2012-10-18 04:15:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki
<[email protected]> wrote:
> if (vma && vma != priv->tail_vma) {
> struct mm_struct *mm = vma->vm_mm;
> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + __mpol_put(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif
> up_read(&mm->mmap_sem);
> mmput(mm);

Please don't put #ifdef's inside code. It makes things really ugly and
hard to read.

And that is *especially* true in this case, since there's a pattern to
all these things:

> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + mpol_get(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif

> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + __mpol_put(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif

it really sounds like what you want to do is to just abstract a
"numa_policy_get/put(priv)" operation.

So you could make it be something like

#ifdef CONFIG_NUMA
static inline numa_policy_get(struct proc_maps_private *priv)
{
task_lock(priv->task);
mpol_get(priv->task->mempolicy);
task_unlock(priv->task);
}
.. same for the "put" function ..
#else
#define numa_policy_get(priv) do { } while (0)
#define numa_policy_put(priv) do { } while (0)
#endif

and then you wouldn't have to have the #ifdef's in the middle of code,
and I think it will be more readable in general.

Sure, it is going to be a few more actual lines of patch, but there's
no duplicated code sequence, and the added lines are just the syntax
that makes it look better.

Linus

2012-10-18 04:34:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

(2012/10/18 13:06), Kamezawa Hiroyuki wrote:
> (2012/10/18 6:31), David Rientjes wrote:
>> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
>> contents in numa_maps"), the mutex protecting a shared policy can be
>> inadvertently taken while holding task_lock(task).
>>
>> Recently, commit b22d127a39dd ("mempolicy: fix a race in
>> shared_policy_replace()") switched the spinlock within a shared policy to
>> a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on
>> all mempolicies returned by get_vma_policy() so it isn't freed while being
>> passed to mpol_to_str() when reading /proc/pid/numa_maps.
>>
>> This patch only takes task_lock() while dereferencing task->mempolicy in
>> get_vma_policy() if it's non-NULL in the lockess check to increment its
>> refcount. This ensures it will remain in memory until dropped by
>> __mpol_put() after mpol_to_str() is called.
>>
>> Refcounts of shared policies are grabbed by the ->get_policy() function of
>> the vma, all others will be grabbed directly in get_vma_policy(). Now
>> that this is done, all callers now unconditionally drop the refcount.
>>
>
> please add original problem description....
>
> from your 1st patch.
>> When reading /proc/pid/numa_maps, it's possible to return the contents of
>> the stack where the mempolicy string should be printed if the policy gets
>> freed from beneath us.
>>
>> This happens because mpol_to_str() may return an error the
>> stack-allocated buffer is then printed without ever being stored.
> .....
>
> Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something.
>
> I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy()
> because it's called every time alloc_pages_vma() is called, at every page fault.
> So, it seems he doesn't agree this fix because of performance concern on big NUMA,
>
>
> Can't we have another way to fix ? like this ? too ugly ?
> Again, I'm sorry if I misunderstand the points.
>
Sorry this patch itself may be buggy. please don't test..
I missed that kernel/exit.c sets task->mempolicy to be NULL.
fixed one here.

--
From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 18 Oct 2012 13:50:29 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc/<pid>/numa_maps scans vma and show mempolicy under
mmap_sem. It sometimes accesses task->mempolicy which can
be freed without mmap_sem and numa_maps can show some
garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

V1->V2
- access task->mempolicy only once and remember it. Becase kernel/exit.c
can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/internal.h | 4 ++++
fs/proc/task_mmu.c | 33 ++++++++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/proc_fs.h>
struct ctl_table_header;
+struct mempolicy;

extern struct proc_dir_entry proc_root;
#ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
+#ifdef CONFIG_NUMA
+ struct mempolicy *task_mempolicy;
+#endif
};

void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..624927d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
len = 1;
seq_printf(m, "%*c", len, ' ');
}
+#ifdef CONFIG_NUMA
+/*
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed
+ * under task_lock() (see kernel/exit.c) replacement of it is guarded by
+ * mmap_sem. So, take referenceount under task_lock() before we start
+ * scanning and drop it when numa_maps reaches the end.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+ struct task_struct *task = priv->task;
+
+ task_lock(task);
+ priv->task_mempolicy = task->mempolicy;
+ mpol_get(priv->task_mempolicy);
+ task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+ mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif

static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
{
if (vma && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
+ release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
mmput(mm);
}
@@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)

tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;
-
+ hold_task_mempolicy(priv);
/* Start with last addr hint */
vma = find_vma(mm, last_addr);
if (last_addr && vma) {
@@ -159,6 +189,7 @@ out:
if (vma)
return vma;

+ release_task_mempolicy(priv);
/* End of vmas has been reached */
m->version = (tail_vma != NULL)? 0: -1UL;
up_read(&mm->mmap_sem);
--
1.7.10.2


2012-10-18 04:35:57

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:

> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 14df880..d92e868 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct
> vm_area_struct *vma)
> {
> if (vma && vma != priv->tail_vma) {
> struct mm_struct *mm = vma->vm_mm;
> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + __mpol_put(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif
> up_read(&mm->mmap_sem);
> mmput(mm);
> }
> @@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
> return mm;
> down_read(&mm->mmap_sem);
> + /*
> + * task->mempolicy can be freed even if mmap_sem is down (see
> kernel/exit.c)
> + * We grab refcount for stable access.
> + * repleacement of task->mmpolicy is guarded by mmap_sem.
> + */
> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + mpol_get(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif
> tail_vma = get_gate_vma(priv->task->mm);
> priv->tail_vma = tail_vma;
> @@ -161,6 +176,11 @@ out:
> /* End of vmas has been reached */
> m->version = (tail_vma != NULL)? 0: -1UL;
> +#ifdef CONFIG_NUMA
> + task_lock(priv->task);
> + __mpol_put(priv->task->mempolicy);
> + task_unlock(priv->task);
> +#endif
> up_read(&mm->mmap_sem);
> mmput(mm);
> return tail_vma;

Yes, I must admit that this is better than my version and it looks like
all the ->show() functions that use these start, next, stop functions
don't take task_lock() and this would generally be useful: we already hold
current->mm->mmap_sem so there is little harm in holding
task_lock(current) when reading these files as long as we're not touching
the fastpath.

These routines seem like it would nicely be added to mempolicy.h since we
depend on CONFIG_NUMA there already.

Please fix up the mess I made in show_numa_map() in 32f8516a8c73 ("mm,
mempolicy: fix printing stack contents in numa_maps") by simply removing
the task_lock() and task_unlock() as part of your patch.

Thanks Kame!

2012-10-18 04:41:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

(2012/10/18 13:14), Linus Torvalds wrote:
> On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki
> <[email protected]> wrote:
>> if (vma && vma != priv->tail_vma) {
>> struct mm_struct *mm = vma->vm_mm;
>> +#ifdef CONFIG_NUMA
>> + task_lock(priv->task);
>> + __mpol_put(priv->task->mempolicy);
>> + task_unlock(priv->task);
>> +#endif
>> up_read(&mm->mmap_sem);
>> mmput(mm);
>
> Please don't put #ifdef's inside code. It makes things really ugly and
> hard to read.
>
> And that is *especially* true in this case, since there's a pattern to
> all these things:
>
>> +#ifdef CONFIG_NUMA
>> + task_lock(priv->task);
>> + mpol_get(priv->task->mempolicy);
>> + task_unlock(priv->task);
>> +#endif
>
>> +#ifdef CONFIG_NUMA
>> + task_lock(priv->task);
>> + __mpol_put(priv->task->mempolicy);
>> + task_unlock(priv->task);
>> +#endif
>
> it really sounds like what you want to do is to just abstract a
> "numa_policy_get/put(priv)" operation.
>
> So you could make it be something like
>
> #ifdef CONFIG_NUMA
> static inline numa_policy_get(struct proc_maps_private *priv)
> {
> task_lock(priv->task);
> mpol_get(priv->task->mempolicy);
> task_unlock(priv->task);
> }
> .. same for the "put" function ..
> #else
> #define numa_policy_get(priv) do { } while (0)
> #define numa_policy_put(priv) do { } while (0)
> #endif
>
> and then you wouldn't have to have the #ifdef's in the middle of code,
> and I think it will be more readable in general.
>
> Sure, it is going to be a few more actual lines of patch, but there's
> no duplicated code sequence, and the added lines are just the syntax
> that makes it look better.
>

you're right, I shouldn't send an ugly patch. I'm sorry.
V2 uses suggested style, I think.

Regards,
-Kame

2012-10-18 20:03:44

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:

> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cceaab0..43973b0 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -12,6 +12,7 @@
> #include <linux/sched.h>
> #include <linux/proc_fs.h>
> struct ctl_table_header;
> +struct mempolicy;
> extern struct proc_dir_entry proc_root;
> #ifdef CONFIG_PROC_SYSCTL
> @@ -74,6 +75,9 @@ struct proc_maps_private {
> #ifdef CONFIG_MMU
> struct vm_area_struct *tail_vma;
> #endif
> +#ifdef CONFIG_NUMA
> + struct mempolicy *task_mempolicy;
> +#endif
> };
> void proc_init_inodecache(void);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 14df880..624927d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
> len = 1;
> seq_printf(m, "%*c", len, ' ');
> }
> +#ifdef CONFIG_NUMA
> +/*
> + * numa_maps scans all vmas under mmap_sem and checks their mempolicy.

Doesn't only affect numa_maps, it also affects maps and smaps although
they don't need the refcounts.

> + * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed
> + * under task_lock() (see kernel/exit.c) replacement of it is guarded by
> + * mmap_sem.

I think this should be a little more verbose making it clear that
task->mempolicy can be cleared and freed if its refcount drops to 0 and is
only protected by task_lock() and that we're safe from task->mempolicy
changing between ->start(), ->next(), and ->stop() because
task->mm->mmap_sem is held for the duration.

> So, take referenceount under task_lock() before we start
> + * scanning and drop it when numa_maps reaches the end.
> + */
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> + struct task_struct *task = priv->task;
> +
> + task_lock(task);
> + priv->task_mempolicy = task->mempolicy;
> + mpol_get(priv->task_mempolicy);
> + task_unlock(task);
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> + mpol_put(priv->task_mempolicy);
> +}
> +#else
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +#endif
> static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct
> *vma)
> {
> if (vma && vma != priv->tail_vma) {
> struct mm_struct *mm = vma->vm_mm;
> + release_task_mempolicy(priv);
> up_read(&mm->mmap_sem);
> mmput(mm);
> }
> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
> tail_vma = get_gate_vma(priv->task->mm);
> priv->tail_vma = tail_vma;
> -
> + hold_task_mempolicy(priv);
> /* Start with last addr hint */
> vma = find_vma(mm, last_addr);
> if (last_addr && vma) {
> @@ -159,6 +189,7 @@ out:
> if (vma)
> return vma;
> + release_task_mempolicy(priv);
> /* End of vmas has been reached */
> m->version = (tail_vma != NULL)? 0: -1UL;
> up_read(&mm->mmap_sem);

Otherwise looks good, but please remove the two task_lock()'s in
show_numa_map() that I added as part of this since you're replacing the
need for locking.

2012-10-19 06:51:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps

>> Can't we have another way to fix ? like this ? too ugly ?
>> Again, I'm sorry if I misunderstand the points.
>>
> Sorry this patch itself may be buggy. please don't test..
> I missed that kernel/exit.c sets task->mempolicy to be NULL.
> fixed one here.
>
> --
> From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 18 Oct 2012 13:50:29 +0900
>
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
>
> /proc/<pid>/numa_maps scans vma and show mempolicy under
> mmap_sem. It sometimes accesses task->mempolicy which can
> be freed without mmap_sem and numa_maps can show some
> garbage while scanning.
>
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> V1->V2
> - access task->mempolicy only once and remember it. Becase kernel/exit.c
> can overwrite it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Ok, this is acceptable to me. go ahead.

2012-10-19 08:36:05

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

(2012/10/19 5:03), David Rientjes wrote:
> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:
>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>> tail_vma = get_gate_vma(priv->task->mm);
>> priv->tail_vma = tail_vma;
>> -
>> + hold_task_mempolicy(priv);
>> /* Start with last addr hint */
>> vma = find_vma(mm, last_addr);
>> if (last_addr && vma) {
>> @@ -159,6 +189,7 @@ out:
>> if (vma)
>> return vma;
>> + release_task_mempolicy(priv);
>> /* End of vmas has been reached */
>> m->version = (tail_vma != NULL)? 0: -1UL;
>> up_read(&mm->mmap_sem);
>
> Otherwise looks good, but please remove the two task_lock()'s in
> show_numa_map() that I added as part of this since you're replacing the
> need for locking.
>
Thank you for your review.
How about this ?

==
From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Fri, 19 Oct 2012 17:00:55 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc/<pid>/numa_maps scans vma and show mempolicy under
mmap_sem. It sometimes accesses task->mempolicy which can
be freed without mmap_sem and numa_maps can show some
garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

V2->v3
- updated comments to be more verbose.
- removed task_lock() in numa_maps code.
V1->V2
- access task->mempolicy only once and remember it. Becase kernel/exit.c
can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/internal.h | 4 ++++
fs/proc/task_mmu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/proc_fs.h>
struct ctl_table_header;
+struct mempolicy;

extern struct proc_dir_entry proc_root;
#ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
+#ifdef CONFIG_NUMA
+ struct mempolicy *task_mempolicy;
+#endif
};

void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..2371fea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len)
len = 1;
seq_printf(m, "%*c", len, ' ');
}
+#ifdef CONFIG_NUMA
+/*
+ * These functions are for numa_maps but called in generic **maps seq_file
+ * ->start(), ->stop() ops.
+ *
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * Each mempolicy object is controlled by reference counting. The problem here
+ * is how to avoid accessing dead mempolicy object.
+ *
+ * Because we're holding mmap_sem while reading seq_file, it's safe to access
+ * each vma's mempolicy, no vma objects will never drop refs to mempolicy.
+ *
+ * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
+ * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
+ * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
+ * gurantee the task never exits under us. But taking task_lock() around
+ * get_vma_plicy() causes lock order problem.
+ *
+ * To access task->mempolicy without lock, we hold a reference count of an
+ * object pointed by task->mempolicy and remember it. This will guarantee
+ * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+ struct task_struct *task = priv->task;
+
+ task_lock(task);
+ priv->task_mempolicy = task->mempolicy;
+ mpol_get(priv->task_mempolicy);
+ task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+ mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif

static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
{
if (vma && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
+ release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
mmput(mm);
}
@@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)

tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;
-
+ hold_task_mempolicy(priv);
/* Start with last addr hint */
vma = find_vma(mm, last_addr);
if (last_addr && vma) {
@@ -159,6 +203,7 @@ out:
if (vma)
return vma;

+ release_task_mempolicy(priv);
/* End of vmas has been reached */
m->version = (tail_vma != NULL)? 0: -1UL;
up_read(&mm->mmap_sem);
@@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
- task_unlock(task);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

--
1.7.10.2


2012-10-19 09:28:48

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

On Fri, 19 Oct 2012, Kamezawa Hiroyuki wrote:

> From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Fri, 19 Oct 2012 17:00:55 +0900
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
>
> /proc/<pid>/numa_maps scans vma and show mempolicy under
> mmap_sem. It sometimes accesses task->mempolicy which can
> be freed without mmap_sem and numa_maps can show some
> garbage while scanning.
>
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Looks good, but the patch is whitespace damaged so it doesn't apply. When
that's fixed:

Acked-by: David Rientjes <[email protected]>

Thanks for following through on this!

2012-10-19 19:15:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

On Fri, Oct 19, 2012 at 4:35 AM, Kamezawa Hiroyuki
<[email protected]> wrote:
> (2012/10/19 5:03), David Rientjes wrote:
>>
>> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:
>>>
>>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>>> tail_vma = get_gate_vma(priv->task->mm);
>>> priv->tail_vma = tail_vma;
>>> -
>>> + hold_task_mempolicy(priv);
>>> /* Start with last addr hint */
>>> vma = find_vma(mm, last_addr);
>>> if (last_addr && vma) {
>>> @@ -159,6 +189,7 @@ out:
>>> if (vma)
>>> return vma;
>>> + release_task_mempolicy(priv);
>>> /* End of vmas has been reached */
>>> m->version = (tail_vma != NULL)? 0: -1UL;
>>> up_read(&mm->mmap_sem);
>>
>>
>> Otherwise looks good, but please remove the two task_lock()'s in
>> show_numa_map() that I added as part of this since you're replacing the
>> need for locking.
>>
> Thank you for your review.
> How about this ?
>
> ==
> From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Fri, 19 Oct 2012 17:00:55 +0900
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
>
> /proc/<pid>/numa_maps scans vma and show mempolicy under
> mmap_sem. It sometimes accesses task->mempolicy which can
> be freed without mmap_sem and numa_maps can show some
> garbage while scanning.
>
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> V2->v3
> - updated comments to be more verbose.
> - removed task_lock() in numa_maps code.
> V1->V2
> - access task->mempolicy only once and remember it. Becase kernel/exit.c
> can overwrite it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: KOSAKI Motohiro <[email protected]>

2012-10-22 02:47:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

(2012/10/19 18:28), David Rientjes wrote:

> Looks good, but the patch is whitespace damaged so it doesn't apply. When
> that's fixed:
>
> Acked-by: David Rientjes <[email protected]>

Sorry, I hope this one is not broken...
==
From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Fri, 19 Oct 2012 17:00:55 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc/<pid>/numa_maps scans vma and show mempolicy under
mmap_sem. It sometimes accesses task->mempolicy which can
be freed without mmap_sem and numa_maps can show some
garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Acked-by: David Rientjes <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

V2->v3
- updated comments to be more verbose.
- removed task_lock() in numa_maps code.
V1->V2
- access task->mempolicy only once and remember it. Becase kernel/exit.c
can overwrite it.

---
fs/proc/internal.h | 4 ++++
fs/proc/task_mmu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/proc_fs.h>
struct ctl_table_header;
+struct mempolicy;

extern struct proc_dir_entry proc_root;
#ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
+#ifdef CONFIG_NUMA
+ struct mempolicy *task_mempolicy;
+#endif
};

void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..2371fea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len)
len = 1;
seq_printf(m, "%*c", len, ' ');
}
+#ifdef CONFIG_NUMA
+/*
+ * These functions are for numa_maps but called in generic **maps seq_file
+ * ->start(), ->stop() ops.
+ *
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * Each mempolicy object is controlled by reference counting. The problem here
+ * is how to avoid accessing dead mempolicy object.
+ *
+ * Because we're holding mmap_sem while reading seq_file, it's safe to access
+ * each vma's mempolicy, no vma objects will never drop refs to mempolicy.
+ *
+ * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
+ * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
+ * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
+ * gurantee the task never exits under us. But taking task_lock() around
+ * get_vma_plicy() causes lock order problem.
+ *
+ * To access task->mempolicy without lock, we hold a reference count of an
+ * object pointed by task->mempolicy and remember it. This will guarantee
+ * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+ struct task_struct *task = priv->task;
+
+ task_lock(task);
+ priv->task_mempolicy = task->mempolicy;
+ mpol_get(priv->task_mempolicy);
+ task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+ mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif

static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
{
if (vma && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
+ release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
mmput(mm);
}
@@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)

tail_vma = get_gate_vma(priv->task->mm);
priv->tail_vma = tail_vma;
-
+ hold_task_mempolicy(priv);
/* Start with last addr hint */
vma = find_vma(mm, last_addr);
if (last_addr && vma) {
@@ -159,6 +203,7 @@ out:
if (vma)
return vma;

+ release_task_mempolicy(priv);
/* End of vmas has been reached */
m->version = (tail_vma != NULL)? 0: -1UL;
up_read(&mm->mmap_sem);
@@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
walk.private = md;
walk.mm = mm;

- task_lock(task);
pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
- task_unlock(task);

seq_printf(m, "%08lx %s", vma->vm_start, buffer);

--
1.7.10.2



2012-10-22 20:56:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

On Mon, 22 Oct 2012 11:47:31 +0900
Kamezawa Hiroyuki <[email protected]> wrote:

> (2012/10/19 18:28), David Rientjes wrote:
>
> > Looks good, but the patch is whitespace damaged so it doesn't apply. When
> > that's fixed:
> >
> > Acked-by: David Rientjes <[email protected]>
>
> Sorry, I hope this one is not broken...
>
> ...
>
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -12,6 +12,7 @@
> #include <linux/sched.h>
> #include <linux/proc_fs.h>
> struct ctl_table_header;
> +struct mempolicy;
>
> extern struct proc_dir_entry proc_root;
> #ifdef CONFIG_PROC_SYSCTL
> @@ -74,6 +75,9 @@ struct proc_maps_private {
> #ifdef CONFIG_MMU
> struct vm_area_struct *tail_vma;
> #endif
> +#ifdef CONFIG_NUMA
> + struct mempolicy *task_mempolicy;
> +#endif
> };

The mail client space-stuffed it.

We merged this three days ago, in 9e7814404b77c3e8920b. Please check
that it landed OK - there's a newline fixup in there but it looks good
to me.

2012-10-22 20:57:05

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.

On Mon, 22 Oct 2012, Kamezawa Hiroyuki wrote:

> > Looks good, but the patch is whitespace damaged so it doesn't apply. When
> > that's fixed:
> >
> > Acked-by: David Rientjes <[email protected]>
>
> Sorry, I hope this one is not broken...

Looks like Linus picked this up directly, thanks Kame!

2012-10-24 23:30:51

by Sasha Levin

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes <[email protected]> wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>> 3 locks on stack by trinity-child2/8558:
>> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
>> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>> Call Trace:
>> [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>> [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>> [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>> [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>> [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>> [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>> [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>> [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>> [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>> [<ffffffff8120c902>] traverse+0xf2/0x230
>> [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>> [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>> [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely. We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly. This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?
> ---
[snip]

I'm not sure about the status of the patch, but it doesn't apply on
top of -next, and I still
see the warnings when fuzzing on -next.


Thanks,
Sasha

2012-10-24 23:34:55

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 24 Oct 2012, Sasha Levin wrote:

> I'm not sure about the status of the patch, but it doesn't apply on
> top of -next, and I still
> see the warnings when fuzzing on -next.
>

This should be fixed by 9e7814404b77 ("hold task->mempolicy while
numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
/proc/pid/numa_maps on that kernel?

2012-10-24 23:45:30

by Sasha Levin

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes <[email protected]> wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> I'm not sure about the status of the patch, but it doesn't apply on
>> top of -next, and I still
>> see the warnings when fuzzing on -next.
>>
>
> This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> /proc/pid/numa_maps on that kernel?

I was actually referring to the warnings Dave Jones saw when fuzzing
with trinity after the
original patch was applied.

I still see the following when fuzzing:

[ 338.467156] BUG: sleeping function called from invalid context at
kernel/mutex.c:269
[ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
[ 338.481199] 2 locks held by trinity-main/6361:
[ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
__do_page_fault+0x1e4/0x4f0
[ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at:
[<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
[ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W
3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
[ 338.530318] Call Trace:
[ 338.534088] [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
[ 338.539358] [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
[ 338.545253] [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
[ 338.545258] [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
[ 338.545264] [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
[ 338.545267] [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
[ 338.545272] [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
[ 338.545278] [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
[ 338.545282] [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
[ 338.545286] [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
[ 338.545293] [<ffffffff8111a706>] ? del_timer+0x26/0x80
[ 338.545298] [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
[ 338.545302] [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
[ 338.545305] [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
[ 338.545309] [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
[ 338.545312] [<ffffffff810aa666>] do_page_fault+0x26/0x40
[ 338.545317] [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
[ 338.545321] [<ffffffff83ae9268>] async_page_fault+0x28/0x30

2012-10-25 00:08:16

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 24 Oct 2012, Sasha Levin wrote:

> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> > /proc/pid/numa_maps on that kernel?
>
> I was actually referring to the warnings Dave Jones saw when fuzzing
> with trinity after the
> original patch was applied.
>
> I still see the following when fuzzing:
>
> [ 338.467156] BUG: sleeping function called from invalid context at
> kernel/mutex.c:269
> [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
> [ 338.481199] 2 locks held by trinity-main/6361:
> [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
> __do_page_fault+0x1e4/0x4f0
> [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at:
> [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
> [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W
> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
> [ 338.530318] Call Trace:
> [ 338.534088] [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
> [ 338.539358] [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
> [ 338.545253] [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
> [ 338.545258] [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
> [ 338.545264] [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
> [ 338.545267] [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
> [ 338.545272] [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
> [ 338.545278] [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
> [ 338.545282] [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
> [ 338.545286] [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
> [ 338.545293] [<ffffffff8111a706>] ? del_timer+0x26/0x80
> [ 338.545298] [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
> [ 338.545302] [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
> [ 338.545305] [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
> [ 338.545309] [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
> [ 338.545312] [<ffffffff810aa666>] do_page_fault+0x26/0x40
> [ 338.545317] [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
> [ 338.545321] [<ffffffff83ae9268>] async_page_fault+0x28/0x30
>

Ok, this looks the same but it's actually a different issue:
mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
calls get_vma_policy() which may take the shared policy mutex. This
happens while holding page_table_lock from do_huge_pmd_numa_page() but
also from do_numa_page() while holding a spinlock on the ptl, which is
coming from the sched/numa branch.

Is there anyway that we can avoid changing the shared policy mutex back
into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
in shared_policy_replace()"])?

Adding Peter, Rik, and Mel to the cc.

2012-10-25 00:54:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes <[email protected]> wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
>> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
>> > /proc/pid/numa_maps on that kernel?
>>
>> I was actually referring to the warnings Dave Jones saw when fuzzing
>> with trinity after the
>> original patch was applied.
>>
>> I still see the following when fuzzing:
>>
>> [ 338.467156] BUG: sleeping function called from invalid context at
>> kernel/mutex.c:269
>> [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
>> [ 338.481199] 2 locks held by trinity-main/6361:
>> [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
>> __do_page_fault+0x1e4/0x4f0
>> [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at:
>> [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
>> [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W
>> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
>> [ 338.530318] Call Trace:
>> [ 338.534088] [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
>> [ 338.539358] [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
>> [ 338.545253] [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
>> [ 338.545258] [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
>> [ 338.545264] [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
>> [ 338.545267] [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
>> [ 338.545272] [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
>> [ 338.545278] [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
>> [ 338.545282] [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
>> [ 338.545286] [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
>> [ 338.545293] [<ffffffff8111a706>] ? del_timer+0x26/0x80
>> [ 338.545298] [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
>> [ 338.545302] [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
>> [ 338.545305] [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
>> [ 338.545309] [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
>> [ 338.545312] [<ffffffff810aa666>] do_page_fault+0x26/0x40
>> [ 338.545317] [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
>> [ 338.545321] [<ffffffff83ae9268>] async_page_fault+0x28/0x30
>>
>
> Ok, this looks the same but it's actually a different issue:
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
> calls get_vma_policy() which may take the shared policy mutex. This
> happens while holding page_table_lock from do_huge_pmd_numa_page() but
> also from do_numa_page() while holding a spinlock on the ptl, which is
> coming from the sched/numa branch.
>
> Is there anyway that we can avoid changing the shared policy mutex back
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
> in shared_policy_replace()"])?
>
> Adding Peter, Rik, and Mel to the cc.

Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
I guess you commited it, right? If so, may I review your mempolicy
changes? Now mempolicy has a lot of horrible buggy code and I hope to
maintain carefully. Which tree should i see?

2012-10-25 01:15:17

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 24 Oct 2012, KOSAKI Motohiro wrote:

> Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
> I guess you commited it, right? If so, may I review your mempolicy
> changes? Now mempolicy has a lot of horrible buggy code and I hope to
> maintain carefully. Which tree should i see?
>

Check out sched/numa from
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

$ git diff v3.7-rc2.. mm/mempolicy.c | diffstat
mempolicy.c | 444 +++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 277 insertions(+), 167 deletions(-)

2012-10-25 12:20:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> Ok, this looks the same but it's actually a different issue:
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
> calls get_vma_policy() which may take the shared policy mutex. This
> happens while holding page_table_lock from do_huge_pmd_numa_page() but
> also from do_numa_page() while holding a spinlock on the ptl, which is
> coming from the sched/numa branch.
>
> Is there anyway that we can avoid changing the shared policy mutex back
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
> in shared_policy_replace()"])?
>
> Adding Peter, Rik, and Mel to the cc.

Urgh, crud I totally missed that.

So the problem is that we need to compute if the current page is placed
'right' while holding pte_lock in order to avoid multiple pte_lock
acquisitions on the 'fast' path.

I'll look into this in a bit, but one thing that comes to mind is having
both a spnilock and a mutex and require holding both for modification
while either one is sufficient for read.

That would allow sp_lookup() to use the spinlock, while insert and
replace can hold both.

Not sure it will work for this, need to stare at this code a little
more.

2012-10-25 14:40:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> > Ok, this looks the same but it's actually a different issue:
> > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
> > calls get_vma_policy() which may take the shared policy mutex. This
> > happens while holding page_table_lock from do_huge_pmd_numa_page() but
> > also from do_numa_page() while holding a spinlock on the ptl, which is
> > coming from the sched/numa branch.
> >
> > Is there anyway that we can avoid changing the shared policy mutex back
> > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
> > in shared_policy_replace()"])?
> >
> > Adding Peter, Rik, and Mel to the cc.
>
> Urgh, crud I totally missed that.
>
> So the problem is that we need to compute if the current page is placed
> 'right' while holding pte_lock in order to avoid multiple pte_lock
> acquisitions on the 'fast' path.
>
> I'll look into this in a bit, but one thing that comes to mind is having
> both a spnilock and a mutex and require holding both for modification
> while either one is sufficient for read.
>
> That would allow sp_lookup() to use the spinlock, while insert and
> replace can hold both.
>
> Not sure it will work for this, need to stare at this code a little
> more.

So I think the below should work, we hold the spinlock over both rb-tree
modification as sp free, this makes mpol_shared_policy_lookup() which
returns the policy with an incremented refcount work with just the
spinlock.

Comments?

---
include/linux/mempolicy.h | 1 +
mm/mempolicy.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {

struct shared_policy {
struct rb_root root;
+ spinlock_t lock;
struct mutex mutex;
};

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
*
* Remember policies even when nobody has shared memory mapped.
* The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
*/

/* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds either sp->lock and/or sp->mutex */
static struct sp_node *
sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
{
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
struct rb_node *parent = NULL;
struct sp_node *nd;

+ spin_lock(&sp->lock);
while (*p) {
parent = *p;
nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
}
rb_link_node(&new->nd, parent, p);
rb_insert_color(&new->nd, &sp->root);
+ spin_unlock(&sp->lock);
pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
new->policy ? new->policy->mode : 0);
}
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_

if (!sp->root.rb_node)
return NULL;
- mutex_lock(&sp->mutex);
+ spin_lock(&sp->lock);
sn = sp_lookup(sp, idx, idx+1);
if (sn) {
mpol_get(sn->policy);
pol = sn->policy;
}
- mutex_unlock(&sp->mutex);
+ spin_unlock(&sp->lock);
return pol;
}

@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+ spin_lock(&sp->lock);
rb_erase(&n->nd, &sp->root);
sp_free(n);
+ spin_unlock(&sp->lock);
}

static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
int ret;

sp->root = RB_ROOT; /* empty tree == default mempolicy */
+ spin_lock_init(&sp->lock);
mutex_init(&sp->mutex);

if (mpol) {

2012-10-25 17:24:06

by Sasha Levin

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On 10/25/2012 10:39 AM, Peter Zijlstra wrote:
> On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
>> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
>>> Ok, this looks the same but it's actually a different issue:
>>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
>>> calls get_vma_policy() which may take the shared policy mutex. This
>>> happens while holding page_table_lock from do_huge_pmd_numa_page() but
>>> also from do_numa_page() while holding a spinlock on the ptl, which is
>>> coming from the sched/numa branch.
>>>
>>> Is there anyway that we can avoid changing the shared policy mutex back
>>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
>>> in shared_policy_replace()"])?
>>>
>>> Adding Peter, Rik, and Mel to the cc.
>>
>> Urgh, crud I totally missed that.
>>
>> So the problem is that we need to compute if the current page is placed
>> 'right' while holding pte_lock in order to avoid multiple pte_lock
>> acquisitions on the 'fast' path.
>>
>> I'll look into this in a bit, but one thing that comes to mind is having
>> both a spnilock and a mutex and require holding both for modification
>> while either one is sufficient for read.
>>
>> That would allow sp_lookup() to use the spinlock, while insert and
>> replace can hold both.
>>
>> Not sure it will work for this, need to stare at this code a little
>> more.
>
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
>
> Comments?
>
> ---

It made the warnings I've reported go away.


Thanks,
Sasha

2012-10-25 20:22:06

by David Rientjes

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Thu, 25 Oct 2012, Peter Zijlstra wrote:

> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
>
> Comments?
>

It's rather unfortunate that we need to protect modification with a
spinlock and a mutex but since sharing was removed in commit 869833f2c5c6
("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is
blockable to do the whole mpol_new() and rebind if necessary, which could
require mm->mmap_sem; it's not as simple as just converting all the
allocations to GFP_ATOMIC.

It looks as though there is no other alternative other than protecting
modification with both the spinlock and mutex, which is a clever
solution, so it looks good to me, thanks!

2012-10-25 23:10:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <[email protected]> wrote:
>
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
>
> Comments?

Looks reasonable, if annoyingly complex for something that shouldn't
be important enough for this. Oh well.

However, please check me on this: the need for this is only for
linux-next right now, correct? All the current users in my tree are ok
with just the mutex, no?

Linus

2012-10-26 08:49:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > So I think the below should work, we hold the spinlock over both rb-tree
> > modification as sp free, this makes mpol_shared_policy_lookup() which
> > returns the policy with an incremented refcount work with just the
> > spinlock.
> >
> > Comments?
>
> Looks reasonable, if annoyingly complex for something that shouldn't
> be important enough for this. Oh well.

I agree with that.. Its just that when doing numa placement one needs to
respect the pre-existing placement constraints. I've not seen a way
around this.

> However, please check me on this: the need for this is only for
> linux-next right now, correct? All the current users in my tree are ok
> with just the mutex, no?

Yes, the need comes from the numa stuff and I'll stick this patch in
there.

I completely missed Mel's patch turning it into a mutex, but I guess
that's what -next is for :-).

2012-10-31 18:30:22

by Sasha Levin

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > So I think the below should work, we hold the spinlock over both rb-tree
>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>> > returns the policy with an incremented refcount work with just the
>> > spinlock.
>> >
>> > Comments?
>>
>> Looks reasonable, if annoyingly complex for something that shouldn't
>> be important enough for this. Oh well.
>
> I agree with that.. Its just that when doing numa placement one needs to
> respect the pre-existing placement constraints. I've not seen a way
> around this.
>
>> However, please check me on this: the need for this is only for
>> linux-next right now, correct? All the current users in my tree are ok
>> with just the mutex, no?
>
> Yes, the need comes from the numa stuff and I'll stick this patch in
> there.
>
> I completely missed Mel's patch turning it into a mutex, but I guess
> that's what -next is for :-).

So I've been fuzzing with it for the past couple of days and it's been
looking fine with it. Can someone grab it into his tree please?


Thanks,
Sasha

2012-11-21 01:00:24

by Sasha Levin

[permalink] [raw]
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

Ping? Can someone take it before it's lost?

On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin <[email protected]> wrote:
> On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <[email protected]> wrote:
>>> >
>>> > So I think the below should work, we hold the spinlock over both rb-tree
>>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>>> > returns the policy with an incremented refcount work with just the
>>> > spinlock.
>>> >
>>> > Comments?
>>>
>>> Looks reasonable, if annoyingly complex for something that shouldn't
>>> be important enough for this. Oh well.
>>
>> I agree with that.. Its just that when doing numa placement one needs to
>> respect the pre-existing placement constraints. I've not seen a way
>> around this.
>>
>>> However, please check me on this: the need for this is only for
>>> linux-next right now, correct? All the current users in my tree are ok
>>> with just the mutex, no?
>>
>> Yes, the need comes from the numa stuff and I'll stick this patch in
>> there.
>>
>> I completely missed Mel's patch turning it into a mutex, but I guess
>> that's what -next is for :-).
>
> So I've been fuzzing with it for the past couple of days and it's been
> looking fine with it. Can someone grab it into his tree please?
>
>
> Thanks,
> Sasha