sys_move_pages() and sys_migrate_pages() are a pretty nice copy
and paste job of each other. They both take a pid, find the task
struct, and then grab a ref on the mm. They both also do an
rcu_read_unlock() after they've taken the mm and then proceed to
access 'task'. I think this is a bug in both cases.
I haven't been able to get it to trigger without adding some
healthy spinning in migrate_pages():
for (x = 0; x < 1ULL<<28; x++)
barrier();
Granted, that's a horribly silly thing to do, but I think it's
"valid" in that it does not sleep; it just just widens the race
window. After adding the loop, if I just run a kernel compile
and then a bunch of copies of "migratepages `pidof -s gcc` 0 0"
it'll oops pretty fast dereferencing cred:
SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
...
rcu_read_lock();
tcred = __task_cred(task);
---> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
rcu_read_unlock();
err = -EPERM;
goto out;
}
rcu_read_unlock();
I think I got lucky that my task_struct was bogus in the oops
below. It's probably quite feasible that a task_struct could get
freed back in to the slab, reallocated as another task_struct,
and then we do these cred checks against a valid, but basically
random task.
This patch takes the pid-to-task code along with the credential
and security checks in sys_move_pages() and sys_migrate_pages()
and consolidates them. It now takes a task reference in
the new function and requires the caller to drop it. I
believe this resolves the race.
Sample oops below:
BUG: unable to handle kernel paging request at ffff880072ab8ce0
IP: [<ffffffff810c3b1c>] sys_migrate_pages+0xc0/0x180
PGD 1606063 PUD 1fdfd067 PMD 1ff93067 PTE 8000000072ab8160
Oops: 0000 [#17] PREEMPT SMP DEBUG_PAGEALLOC
CPU 14
Modules linked in:
Pid: 12880, comm: migratepages Tainted: G D 3.2.0-rc2-qemubigsmp-00110-gb6955fa-dirty #505 Bochs Bochs
RIP: 0010:[<ffffffff810c3b1c>] [<ffffffff810c3b1c>] sys_migrate_pages+0xc0/0x180
RSP: 0018:ffff88005f54df28 EFLAGS: 00010202
RAX: ffff8800724d59d0 RBX: 00000000ffffffea RCX: 0000000000000000
RDX: 0000000000000014 RSI: 0000000000000000 RDI: ffff880072ab8e60
RBP: ffff88005f54df78 R08: 1fffffffffffffff R09: 00000000000031ed
R10: 0000000000603130 R11: 0000000000000246 R12: ffff880072ab89d0
R13: ffff880060c0bca0 R14: ffff8800724d59d0 R15: 0000000000603130
FS: 00007fb05c7ad720(0000) GS:ffff88007fdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880072ab8ce0 CR3: 000000006e47f000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migratepages (pid: 12880, threadinfo ffff88005f54c000, task ffff8800724d59d0)
Stack:
000031ed00400b60 ffff880072a08818 0000000000000001 0000000000000001
00007fff66bbae50 00000000000031ed 0000000000603110 00007fff66bbae40
0000000000000000 0000000000000000 00000000006030d0 ffffffff813ebe3b
Call Trace:
[<ffffffff813ebe3b>] system_call_fastpath+0x16/0x1b
Code: 83 f6 ff 49 89 c5 e8 90 c3 fa ff 31 c0 48 ff c0 48 3d 00 00 00 10 75 f5 4d 85 ed bb ea ff ff ff 0f 84 b3 00 00 00 e8 f5 bc fa ff
8b 84 24 10 03 00 00 48 8b 4d b8 8b 70 0c 8b 51 14 39 f2 74
RIP [<ffffffff810c3b1c>] sys_migrate_pages+0xc0/0x180
RSP <ffff88005f54df28>
CR2: ffff880072ab8ce0
---[ end trace 942060673021a7ae ]---
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/include/linux/migrate.h | 1
linux-2.6.git-dave/mm/mempolicy.c | 44 +++-------------
linux-2.6.git-dave/mm/migrate.c | 76 +++++++++++++++++------------
3 files changed, 57 insertions(+), 64 deletions(-)
diff -puN include/linux/migrate.h~movememory-helper include/linux/migrate.h
--- linux-2.6.git/include/linux/migrate.h~movememory-helper 2012-02-16 09:59:17.270207242 -0800
+++ linux-2.6.git-dave/include/linux/migrate.h 2012-02-16 09:59:17.286207438 -0800
@@ -31,6 +31,7 @@ extern int migrate_vmas(struct mm_struct
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+struct task_struct *can_migrate_get_task(pid_t pid);
#else
#define PAGE_MIGRATION 0
diff -puN mm/mempolicy.c~movememory-helper mm/mempolicy.c
--- linux-2.6.git/mm/mempolicy.c~movememory-helper 2012-02-16 09:59:17.274207291 -0800
+++ linux-2.6.git-dave/mm/mempolicy.c 2012-02-16 09:59:17.286207438 -0800
@@ -1314,59 +1314,33 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
if (err)
goto out;
- /* Find the mm_struct */
- rcu_read_lock();
- task = pid ? find_task_by_vpid(pid) : current;
- if (!task) {
- rcu_read_unlock();
- err = -ESRCH;
- goto out;
- }
+ task = can_migrate_get_task(pid);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
mm = get_task_mm(task);
- rcu_read_unlock();
-
err = -EINVAL;
if (!mm)
- goto out;
-
- /*
- * Check if this process has the right to modify the specified
- * process. The right exists if the process has administrative
- * capabilities, superuser privileges or the same
- * userid as the target process.
- */
- rcu_read_lock();
- tcred = __task_cred(task);
- if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
- cred->uid != tcred->suid && cred->uid != tcred->uid &&
- !capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
- err = -EPERM;
- goto out;
- }
- rcu_read_unlock();
+ goto out_put_task;
task_nodes = cpuset_mems_allowed(task);
/* Is the user allowed to access the target nodes? */
if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
err = -EPERM;
- goto out;
+ goto out_put_task;
}
if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
err = -EINVAL;
- goto out;
+ goto out_put_task;
}
- err = security_task_movememory(task);
- if (err)
- goto out;
-
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
-out:
+out_put_task:
if (mm)
mmput(mm);
+ put_task_struct(task);
+out:
NODEMASK_SCRATCH_FREE(scratch);
return err;
diff -puN mm/migrate.c~movememory-helper mm/migrate.c
--- linux-2.6.git/mm/migrate.c~movememory-helper 2012-02-16 09:59:17.278207340 -0800
+++ linux-2.6.git-dave/mm/migrate.c 2012-02-16 09:59:17.286207438 -0800
@@ -1339,38 +1339,22 @@ static int do_pages_stat(struct mm_struc
}
/*
- * Move a list of pages in the address space of the currently executing
- * process.
+ * If successful, takes a task_struct reference that
+ * the caller is responsible for releasing.
*/
-SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
- const void __user * __user *, pages,
- const int __user *, nodes,
- int __user *, status, int, flags)
+struct task_struct *can_migrate_get_task(pid_t pid)
{
- const struct cred *cred = current_cred(), *tcred;
struct task_struct *task;
- struct mm_struct *mm;
- int err;
-
- /* Check flags */
- if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
- return -EINVAL;
-
- if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
- return -EPERM;
+ const struct cred *cred = current_cred(), *tcred;
+ int err = 0;
- /* Find the mm_struct */
rcu_read_lock();
task = pid ? find_task_by_vpid(pid) : current;
if (!task) {
rcu_read_unlock();
- return -ESRCH;
+ return ERR_PTR(-ESRCH);
}
- mm = get_task_mm(task);
- rcu_read_unlock();
-
- if (!mm)
- return -EINVAL;
+ get_task_struct(task);
/*
* Check if this process has the right to modify the specified
@@ -1378,20 +1362,53 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
err = -EPERM;
goto out;
}
- rcu_read_unlock();
err = security_task_movememory(task);
- if (err)
- goto out;
+
+out:
+ rcu_read_unlock();
+ if (err) {
+ put_task_struct(task);
+ return ERR_PTR(err);
+ }
+ return task;
+}
+
+/*
+ * Move a list of pages in the address space of the currently executing
+ * process.
+ */
+SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
+ const void __user * __user *, pages,
+ const int __user *, nodes,
+ int __user *, status, int, flags)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+ int err;
+
+ /* Check flags */
+ if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+ return -EINVAL;
+
+ if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+ return -EPERM;
+
+ task = can_migrate_get_task(pid);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+ mm = get_task_mm(task);
+ if (!mm) {
+ err = -EINVAL;
+ goto out_put_task;
+ }
if (nodes) {
err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
@@ -1400,8 +1417,9 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
err = do_pages_stat(mm, nr_pages, pages, status);
}
-out:
mmput(mm);
+out_put_task:
+ put_task_struct(task);
return err;
}
_
Dave Hansen <[email protected]> writes:
> sys_move_pages() and sys_migrate_pages() are a pretty nice copy
> and paste job of each other. They both take a pid, find the task
> struct, and then grab a ref on the mm. They both also do an
> rcu_read_unlock() after they've taken the mm and then proceed to
> access 'task'. I think this is a bug in both cases.
Can we share code?
>
> This patch takes the pid-to-task code along with the credential
> and security checks in sys_move_pages() and sys_migrate_pages()
> and consolidates them. It now takes a task reference in
> the new function and requires the caller to drop it. I
> believe this resolves the race.
Looks good to me.
Reviewed-by: Andi Kleen <[email protected]>
BTW looks like we really need a better stress test for these
syscalls.
-Andi
--
[email protected] -- Speaking for myself only
On Thu, 23 Feb 2012, Dave Hansen wrote:
> I think I got lucky that my task_struct was bogus in the oops
> below. It's probably quite feasible that a task_struct could get
> freed back in to the slab, reallocated as another task_struct,
> and then we do these cred checks against a valid, but basically
> random task.
Ok I buy that.
> This patch takes the pid-to-task code along with the credential
> and security checks in sys_move_pages() and sys_migrate_pages()
> and consolidates them. It now takes a task reference in
> the new function and requires the caller to drop it. I
> believe this resolves the race.
And this way its safer?
> diff -puN include/linux/migrate.h~movememory-helper include/linux/migrate.h
> --- linux-2.6.git/include/linux/migrate.h~movememory-helper 2012-02-16 09:59:17.270207242 -0800
> +++ linux-2.6.git-dave/include/linux/migrate.h 2012-02-16 09:59:17.286207438 -0800
> @@ -31,6 +31,7 @@ extern int migrate_vmas(struct mm_struct
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page);
> +struct task_struct *can_migrate_get_task(pid_t pid);
Could we use something easier to understand? try_get_task()?
> +++ linux-2.6.git-dave/mm/mempolicy.c 2012-02-16 09:59:17.286207438 -0800
> diff -puN mm/migrate.c~movememory-helper mm/migrate.c
> --- linux-2.6.git/mm/migrate.c~movememory-helper 2012-02-16 09:59:17.278207340 -0800
> +++ linux-2.6.git-dave/mm/migrate.c 2012-02-16 09:59:17.286207438 -0800
> @@ -1339,38 +1339,22 @@ static int do_pages_stat(struct mm_struc
> }
>
> /*
> - * Move a list of pages in the address space of the currently executing
> - * process.
> + * If successful, takes a task_struct reference that
> + * the caller is responsible for releasing.
> */
> -SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
> - const void __user * __user *, pages,
> - const int __user *, nodes,
> - int __user *, status, int, flags)
> +struct task_struct *can_migrate_get_task(pid_t pid)
> {
> - const struct cred *cred = current_cred(), *tcred;
> struct task_struct *task;
> - struct mm_struct *mm;
> - int err;
> -
> - /* Check flags */
> - if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> - return -EINVAL;
> -
> - if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> - return -EPERM;
> + const struct cred *cred = current_cred(), *tcred;
> + int err = 0;
>
> - /* Find the mm_struct */
> rcu_read_lock();
> task = pid ? find_task_by_vpid(pid) : current;
> if (!task) {
> rcu_read_unlock();
> - return -ESRCH;
> + return ERR_PTR(-ESRCH);
> }
> - mm = get_task_mm(task);
> - rcu_read_unlock();
> -
> - if (!mm)
> - return -EINVAL;
> + get_task_struct(task);
Hmmm isnt the race still there between the determination of the task and
the get_task_struct()? You would have to verify after the get_task_struct
that this is really the task we wanted to avoid the race.
On 02/23/2012 10:45 AM, Christoph Lameter wrote:
> On Thu, 23 Feb 2012, Dave Hansen wrote:
>> This patch takes the pid-to-task code along with the credential
>> and security checks in sys_move_pages() and sys_migrate_pages()
>> and consolidates them. It now takes a task reference in
>> the new function and requires the caller to drop it. I
>> believe this resolves the race.
>
> And this way its safer?
I think so... I'll talk about it below.
>> diff -puN include/linux/migrate.h~movememory-helper include/linux/migrate.h
>> --- linux-2.6.git/include/linux/migrate.h~movememory-helper 2012-02-16 09:59:17.270207242 -0800
>> +++ linux-2.6.git-dave/include/linux/migrate.h 2012-02-16 09:59:17.286207438 -0800
>> @@ -31,6 +31,7 @@ extern int migrate_vmas(struct mm_struct
>> extern void migrate_page_copy(struct page *newpage, struct page *page);
>> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
>> struct page *newpage, struct page *page);
>> +struct task_struct *can_migrate_get_task(pid_t pid);
>
> Could we use something easier to understand? try_get_task()?
It's hard to see in the patch context, but can_migrate_get_task() does
two migration-specific operations:
> tcred = __task_cred(task);
> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
> cred->uid != tcred->suid && cred->uid != tcred->uid &&
> !capable(CAP_SYS_NICE)) {
> err = -EPERM;
> goto out;
> }
>
> err = security_task_movememory(task);
So, I was trying to relate that it's checking the current's permissions
to _do_ migration on task. try_get_task() wouldn't really say much
about that part of its job.
>> +struct task_struct *can_migrate_get_task(pid_t pid)
>> {
>> - const struct cred *cred = current_cred(), *tcred;
>> struct task_struct *task;
>> - struct mm_struct *mm;
>> - int err;
>> -
>> - /* Check flags */
>> - if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
>> - return -EINVAL;
>> -
>> - if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
>> - return -EPERM;
>> + const struct cred *cred = current_cred(), *tcred;
>> + int err = 0;
>>
>> - /* Find the mm_struct */
>> rcu_read_lock();
>> task = pid ? find_task_by_vpid(pid) : current;
>> if (!task) {
>> rcu_read_unlock();
>> - return -ESRCH;
>> + return ERR_PTR(-ESRCH);
>> }
>> - mm = get_task_mm(task);
>> - rcu_read_unlock();
>> -
>> - if (!mm)
>> - return -EINVAL;
>> + get_task_struct(task);
>
> Hmmm isnt the race still there between the determination of the task and
> the get_task_struct()? You would have to verify after the get_task_struct
> that this is really the task we wanted to avoid the race.
It's true that selecting a task by pid is inherently racy. What that
code does is ensure that the task you've got current has 'pid', but not
ensure that 'pid' has never represented another task. But, that's what
we do everywhere else in the kernel; there's not much better that we can do.
Maybe "race" is the wrong word for what we've got here. It's a lack of
a refcount being taken.
On Thu, 23 Feb 2012, Dave Hansen wrote:
> > Hmmm isnt the race still there between the determination of the task and
> > the get_task_struct()? You would have to verify after the get_task_struct
> > that this is really the task we wanted to avoid the race.
>
> It's true that selecting a task by pid is inherently racy. What that
> code does is ensure that the task you've got current has 'pid', but not
> ensure that 'pid' has never represented another task. But, that's what
> we do everywhere else in the kernel; there's not much better that we can do.
We may at this point be getting a reference to a task struct from another
process not only from the current process (where the above procedure is
valid). You rightly pointed out that the slab rcu free mechanism allows a
free and a reallocation within the RCU period. The effect is that the task
struct could be pointing to a task with another pid that what we were
looking for and therefore migrate_pages could subsequently be operating on
a totally different process.
The patch does not fix that race so far.
I think you have to verify that the pid of the task matches after you took
the refcount in order to be safe. If it does not match then abort.
> Maybe "race" is the wrong word for what we've got here. It's a lack of
> a refcount being taken.
Is that a real difference or are you just playing with words?
On 02/23/2012 11:40 AM, Christoph Lameter wrote:
> On Thu, 23 Feb 2012, Dave Hansen wrote:
>>> Hmmm isnt the race still there between the determination of the task and
>>> the get_task_struct()? You would have to verify after the get_task_struct
>>> that this is really the task we wanted to avoid the race.
>>
>> It's true that selecting a task by pid is inherently racy. What that
>> code does is ensure that the task you've got current has 'pid', but not
>> ensure that 'pid' has never represented another task. But, that's what
>> we do everywhere else in the kernel; there's not much better that we can do.
>
> We may at this point be getting a reference to a task struct from another
> process not only from the current process (where the above procedure is
> valid). You rightly pointed out that the slab rcu free mechanism allows a
> free and a reallocation within the RCU period.
I didn't _mean_ to point that out, but I think I realize what you're
talking about. What we have before this patch is this:
rcu_read_lock();
task = pid ? find_task_by_vpid(pid) : current;
rcu_read_unlock();
task->foo;
So, the task at task->foo time is neither RCU-protected nor protected by
having a reference. I changed it to:
rcu_read_lock();
task = pid ? find_task_by_vpid(pid) : current;
get_task_struct(task);
rcu_read_unlock();
task->foo;
That keeps task from being freed. But, as you point out
> The effect is that the task
> struct could be pointing to a task with another pid that what we were
> looking for and therefore migrate_pages could subsequently be operating on
> a totally different process.
>
> The patch does not fix that race so far.
Agreed, this patch would not fix such an issue.
I think this also implies that stuff like get_task_pid() is broken,
along with virtually all of the users of find_task_by_vpid(). Eric, any
thoughts on this?
> I think you have to verify that the pid of the task matches after you took
> the refcount in order to be safe. If it does not match then abort.
>
>> Maybe "race" is the wrong word for what we've got here. It's a lack of
>> a refcount being taken.
>
> Is that a real difference or are you just playing with words?
I think we're talking about two different things:
1. does RCU protect the pid->task lookup sufficiently?
2. Can the task simply go away in the move/migrate_pages() calls?
I think we're on the same page now.
On Thu, 23 Feb 2012, Dave Hansen wrote:
> > We may at this point be getting a reference to a task struct from another
> > process not only from the current process (where the above procedure is
> > valid). You rightly pointed out that the slab rcu free mechanism allows a
> > free and a reallocation within the RCU period.
>
> I didn't _mean_ to point that out, but I think I realize what you're
> talking about. What we have before this patch is this:
>
> rcu_read_lock();
> task = pid ? find_task_by_vpid(pid) : current;
We take a refcount here on the mm ... See the code. We could simply take a
refcount on the task as well if this is considered safe enough. If we have
a refcount on the task then we do not need the refcount on the mm. Thats
was your approach...
> rcu_read_unlock();
> > Is that a real difference or are you just playing with words?
>
> I think we're talking about two different things:
> 1. does RCU protect the pid->task lookup sufficiently?
I dont know
> 2. Can the task simply go away in the move/migrate_pages() calls?
The task may go away but we need the mm to stay for migration.
That is why a refcount is taken on the mm.
The bug in migrate_pages() is that we do a rcu_unlock and a rcu_lock. If
we drop those then we should be safe if the use of a task pointer within a
rcu section is safe without taking a refcount.
Christoph Lameter <[email protected]> writes:
> On Thu, 23 Feb 2012, Dave Hansen wrote:
>
>> > We may at this point be getting a reference to a task struct from another
>> > process not only from the current process (where the above procedure is
>> > valid). You rightly pointed out that the slab rcu free mechanism allows a
>> > free and a reallocation within the RCU period.
>>
>> I didn't _mean_ to point that out, but I think I realize what you're
>> talking about. What we have before this patch is this:
>>
>> rcu_read_lock();
>> task = pid ? find_task_by_vpid(pid) : current;
>
> We take a refcount here on the mm ... See the code. We could simply take a
> refcount on the task as well if this is considered safe enough. If we have
> a refcount on the task then we do not need the refcount on the mm. Thats
> was your approach...
>
>> rcu_read_unlock();
>
>> > Is that a real difference or are you just playing with words?
>>
>> I think we're talking about two different things:
>> 1. does RCU protect the pid->task lookup sufficiently?
>
> I dont know
Yes. See below.
>> 2. Can the task simply go away in the move/migrate_pages() calls?
>
> The task may go away but we need the mm to stay for migration.
> That is why a refcount is taken on the mm.
>
> The bug in migrate_pages() is that we do a rcu_unlock and a rcu_lock. If
> we drop those then we should be safe if the use of a task pointer within a
> rcu section is safe without taking a refcount.
Yes the user of a task_struct pointer found via a userspace pid is valid
for the life of an rcu critical section, and the bug is indeed that we
drop the rcu_lock and somehow expect the task to remain valid.
The guarantee comes from release_task. In release_task we call
__exit_signal which calls __unhash_process, and then we call
delayed_put_task to guarantee that the task lives until the end
of the rcu interval.
In migrate_pages we have a lot of task accesses outside of the
rcu critical section, and without a reference count on task.
I tell you the truth trying to figure out what that code needs to be
correct if task != current makes my head hurt.
I think we need to grab a reference on task_struct, to stop the task
from going away, and in addition we need to hold task_lock. To keep
task->mm from changing (see exec_mmap). But we can't do that and sleep
so I think the entire function needs to be rewritten, and the need for
task deep in the migrate_pages path needs to be removed as even with the
reference count held we can race with someone calling exec.
The only easy fix I see is to add:
if (pid)
return -EINVAL;
Then we are working with current and only current change it's mm making
things much, much, much simpler.
Eric
On Thu, 23 Feb 2012, Eric W. Biederman wrote:
> > The bug in migrate_pages() is that we do a rcu_unlock and a rcu_lock. If
> > we drop those then we should be safe if the use of a task pointer within a
> > rcu section is safe without taking a refcount.
>
> Yes the user of a task_struct pointer found via a userspace pid is valid
> for the life of an rcu critical section, and the bug is indeed that we
> drop the rcu_lock and somehow expect the task to remain valid.
>
> The guarantee comes from release_task. In release_task we call
> __exit_signal which calls __unhash_process, and then we call
> delayed_put_task to guarantee that the task lives until the end of the
> rcu interval.
Ah. Ok. Great.
> In migrate_pages we have a lot of task accesses outside of the rcu
> critical section, and without a reference count on task.
Yes but that is only of interesting for setup and verification of
permissions. What matters during migration is that the mm_struct does not
go away and we take a refcount on that one.
> I tell you the truth trying to figure out what that code needs to be
> correct if task != current makes my head hurt.
Hmm...
> I think we need to grab a reference on task_struct, to stop the task
> from going away, and in addition we need to hold task_lock. To keep
> task->mm from changing (see exec_mmap). But we can't do that and sleep
> so I think the entire function needs to be rewritten, and the need for
> task deep in the migrate_pages path needs to be removed as even with the
> reference count held we can race with someone calling exec.
We dont need the task during migration. We only need the mm. The task
is safe until rcu_read_unlock therefore maybe the following should fix
migrate pages:
Subject: migration: Do not do rcu_read_unlock until the last time we need the task_struct pointer
Migration functions perform the rcu_read_unlock too early. As a result the
task pointed to may change. Bugs were introduced when adding security checks
because rcu_unlock/lock sequences were inserted. Plus the security checks
and do_move_pages used the task_struct pointer after rcu_unlock.
Fix those issues by removing the unlock/lock sequences and moving the
rcu_read_unlock after the last use of the task struct pointer.
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 22 +++++++++++-----------
mm/migrate.c | 28 +++++++++++++++-------------
2 files changed, 26 insertions(+), 24 deletions(-)
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2012-01-13 04:04:36.229807226 -0600
+++ linux-2.6/mm/mempolicy.c 2012-02-24 03:11:44.913710625 -0600
@@ -1318,16 +1318,14 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
rcu_read_lock();
task = pid ? find_task_by_vpid(pid) : current;
if (!task) {
- rcu_read_unlock();
err = -ESRCH;
- goto out;
+ goto unlock_out;
}
mm = get_task_mm(task);
- rcu_read_unlock();
err = -EINVAL;
if (!mm)
- goto out;
+ goto unlock_out;
/*
* Check if this process has the right to modify the specified
@@ -1335,33 +1333,31 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
err = -EPERM;
- goto out;
+ goto unlock_out;
}
- rcu_read_unlock();
task_nodes = cpuset_mems_allowed(task);
/* Is the user allowed to access the target nodes? */
if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
err = -EPERM;
- goto out;
+ goto unlock_out;
}
if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
err = -EINVAL;
- goto out;
+ goto unlock_out;
}
err = security_task_movememory(task);
if (err)
- goto out;
+ goto unlock_out;
+ rcu_read_unlock();
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
out:
@@ -1370,6 +1366,10 @@ out:
NODEMASK_SCRATCH_FREE(scratch);
return err;
+
+unlock_out:
+ rcu_read_unlock();
+ goto out;
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2012-02-06 04:25:35.857094372 -0600
+++ linux-2.6/mm/migrate.c 2012-02-24 03:18:55.569698851 -0600
@@ -1176,20 +1176,17 @@ set_status:
* Migrate an array of page address onto an array of nodes and fill
* the corresponding array of status.
*/
-static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
+static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
unsigned long nr_pages,
const void __user * __user *pages,
const int __user *nodes,
int __user *status, int flags)
{
struct page_to_node *pm;
- nodemask_t task_nodes;
unsigned long chunk_nr_pages;
unsigned long chunk_start;
int err;
- task_nodes = cpuset_mems_allowed(task);
-
err = -ENOMEM;
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
if (!pm)
@@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
struct task_struct *task;
struct mm_struct *mm;
int err;
+ nodemask_t task_nodes;
/* Check flags */
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
@@ -1367,10 +1365,11 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
return -ESRCH;
}
mm = get_task_mm(task);
- rcu_read_unlock();
- if (!mm)
+ if (!mm) {
+ rcu_read_unlock();
return -EINVAL;
+ }
/*
* Check if this process has the right to modify the specified
@@ -1378,24 +1377,23 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
err = -EPERM;
- goto out;
+ goto unlock_out;
}
- rcu_read_unlock();
err = security_task_movememory(task);
if (err)
- goto out;
+ goto unlock_out;
+ task_nodes = cpuset_mems_allowed(task);
+ rcu_read_unlock();
if (nodes) {
- err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
- flags);
+ err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
+ status, flags);
} else {
err = do_pages_stat(mm, nr_pages, pages, status);
}
@@ -1403,6 +1401,10 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
out:
mmput(mm);
return err;
+
+unlock_out:
+ rcu_read_unlock();
+ goto out;
}
/*
Christoph Lameter <[email protected]> writes:
> On Thu, 23 Feb 2012, Eric W. Biederman wrote:
>
>> > The bug in migrate_pages() is that we do a rcu_unlock and a rcu_lock. If
>> > we drop those then we should be safe if the use of a task pointer within a
>> > rcu section is safe without taking a refcount.
>>
>> Yes the user of a task_struct pointer found via a userspace pid is valid
>> for the life of an rcu critical section, and the bug is indeed that we
>> drop the rcu_lock and somehow expect the task to remain valid.
>>
>> The guarantee comes from release_task. In release_task we call
>> __exit_signal which calls __unhash_process, and then we call
>> delayed_put_task to guarantee that the task lives until the end of the
>> rcu interval.
>
> Ah. Ok. Great.
>
>> In migrate_pages we have a lot of task accesses outside of the rcu
>> critical section, and without a reference count on task.
>
> Yes but that is only of interesting for setup and verification of
> permissions. What matters during migration is that the mm_struct does not
> go away and we take a refcount on that one.
>
>> I tell you the truth trying to figure out what that code needs to be
>> correct if task != current makes my head hurt.
>
> Hmm...
Especially because I was looking at move_pages...
>> I think we need to grab a reference on task_struct, to stop the task
>> from going away, and in addition we need to hold task_lock. To keep
>> task->mm from changing (see exec_mmap). But we can't do that and sleep
>> so I think the entire function needs to be rewritten, and the need for
>> task deep in the migrate_pages path needs to be removed as even with the
>> reference count held we can race with someone calling exec.
>
> We dont need the task during migration. We only need the mm. The task
> is safe until rcu_read_unlock therefore maybe the following should fix
> migrate pages:
>
>
> Subject: migration: Do not do rcu_read_unlock until the last time we need the task_struct pointer
>
> Migration functions perform the rcu_read_unlock too early. As a result the
> task pointed to may change. Bugs were introduced when adding security checks
> because rcu_unlock/lock sequences were inserted. Plus the security checks
> and do_move_pages used the task_struct pointer after rcu_unlock.
>
> Fix those issues by removing the unlock/lock sequences and moving the
> rcu_read_unlock after the last use of the task struct pointer.
>
> Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Somehow in my skimming through the code earlier I thought the
situation was worse. On the assumption there are not any
sleeping functions along the path this looks like a good fix.
> ---
> mm/mempolicy.c | 22 +++++++++++-----------
> mm/migrate.c | 28 +++++++++++++++-------------
> 2 files changed, 26 insertions(+), 24 deletions(-)
>
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c 2012-01-13 04:04:36.229807226 -0600
> +++ linux-2.6/mm/mempolicy.c 2012-02-24 03:11:44.913710625 -0600
> @@ -1318,16 +1318,14 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> rcu_read_lock();
> task = pid ? find_task_by_vpid(pid) : current;
> if (!task) {
> - rcu_read_unlock();
> err = -ESRCH;
> - goto out;
> + goto unlock_out;
> }
> mm = get_task_mm(task);
> - rcu_read_unlock();
>
> err = -EINVAL;
> if (!mm)
> - goto out;
> + goto unlock_out;
>
> /*
> * Check if this process has the right to modify the specified
> @@ -1335,33 +1333,31 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> * capabilities, superuser privileges or the same
> * userid as the target process.
> */
> - rcu_read_lock();
> tcred = __task_cred(task);
> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
> cred->uid != tcred->suid && cred->uid != tcred->uid &&
> !capable(CAP_SYS_NICE)) {
> - rcu_read_unlock();
> err = -EPERM;
> - goto out;
> + goto unlock_out;
> }
> - rcu_read_unlock();
>
> task_nodes = cpuset_mems_allowed(task);
> /* Is the user allowed to access the target nodes? */
> if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
> err = -EPERM;
> - goto out;
> + goto unlock_out;
> }
>
> if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
> err = -EINVAL;
> - goto out;
> + goto unlock_out;
> }
>
> err = security_task_movememory(task);
> if (err)
> - goto out;
> + goto unlock_out;
>
> + rcu_read_unlock();
> err = do_migrate_pages(mm, old, new,
> capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
> out:
> @@ -1370,6 +1366,10 @@ out:
> NODEMASK_SCRATCH_FREE(scratch);
>
> return err;
> +
> +unlock_out:
> + rcu_read_unlock();
> + goto out;
> }
>
>
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c 2012-02-06 04:25:35.857094372 -0600
> +++ linux-2.6/mm/migrate.c 2012-02-24 03:18:55.569698851 -0600
> @@ -1176,20 +1176,17 @@ set_status:
> * Migrate an array of page address onto an array of nodes and fill
> * the corresponding array of status.
> */
> -static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
> +static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> unsigned long nr_pages,
> const void __user * __user *pages,
> const int __user *nodes,
> int __user *status, int flags)
> {
> struct page_to_node *pm;
> - nodemask_t task_nodes;
> unsigned long chunk_nr_pages;
> unsigned long chunk_start;
> int err;
>
> - task_nodes = cpuset_mems_allowed(task);
> -
> err = -ENOMEM;
> pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> if (!pm)
> @@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> struct task_struct *task;
> struct mm_struct *mm;
> int err;
> + nodemask_t task_nodes;
>
> /* Check flags */
> if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> @@ -1367,10 +1365,11 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> return -ESRCH;
> }
> mm = get_task_mm(task);
> - rcu_read_unlock();
>
> - if (!mm)
> + if (!mm) {
> + rcu_read_unlock();
> return -EINVAL;
> + }
>
> /*
> * Check if this process has the right to modify the specified
> @@ -1378,24 +1377,23 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> * capabilities, superuser privileges or the same
> * userid as the target process.
> */
> - rcu_read_lock();
> tcred = __task_cred(task);
> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
> cred->uid != tcred->suid && cred->uid != tcred->uid &&
> !capable(CAP_SYS_NICE)) {
> - rcu_read_unlock();
> err = -EPERM;
> - goto out;
> + goto unlock_out;
> }
> - rcu_read_unlock();
>
> err = security_task_movememory(task);
> if (err)
> - goto out;
> + goto unlock_out;
>
> + task_nodes = cpuset_mems_allowed(task);
> + rcu_read_unlock();
> if (nodes) {
> - err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
> - flags);
> + err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
> + status, flags);
> } else {
> err = do_pages_stat(mm, nr_pages, pages, status);
> }
> @@ -1403,6 +1401,10 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> out:
> mmput(mm);
> return err;
> +
> +unlock_out:
> + rcu_read_unlock();
> + goto out;
> }
>
> /*
On 02/24/2012 07:20 AM, Christoph Lameter wrote:
> Subject: migration: Do not do rcu_read_unlock until the last time we need the task_struct pointer
>
> Migration functions perform the rcu_read_unlock too early. As a result the
> task pointed to may change. Bugs were introduced when adding security checks
> because rcu_unlock/lock sequences were inserted. Plus the security checks
> and do_move_pages used the task_struct pointer after rcu_unlock.
>
> Fix those issues by removing the unlock/lock sequences and moving the
> rcu_read_unlock after the last use of the task struct pointer.
>
> Signed-off-by: Christoph Lameter <[email protected]>
It doesn't fix the code duplication, but it definitely does fix the bug
I was originally trying to address.
Acked-by: Dave Hansen <[email protected]>
On Fri, 24 Feb 2012, Dave Hansen wrote:
> On 02/24/2012 07:20 AM, Christoph Lameter wrote:
> > Subject: migration: Do not do rcu_read_unlock until the last time we need the task_struct pointer
> >
> > Migration functions perform the rcu_read_unlock too early. As a result the
> > task pointed to may change. Bugs were introduced when adding security checks
> > because rcu_unlock/lock sequences were inserted. Plus the security checks
> > and do_move_pages used the task_struct pointer after rcu_unlock.
> >
> > Fix those issues by removing the unlock/lock sequences and moving the
> > rcu_read_unlock after the last use of the task struct pointer.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
>
> It doesn't fix the code duplication, but it definitely does fix the bug
Yes I did not want to override all your good work.
Could you do another patch that removed the duplication?
(2/24/12 10:20 AM), Christoph Lameter wrote:
> On Thu, 23 Feb 2012, Eric W. Biederman wrote:
>
>>> The bug in migrate_pages() is that we do a rcu_unlock and a rcu_lock. If
>>> we drop those then we should be safe if the use of a task pointer within a
>>> rcu section is safe without taking a refcount.
>>
>> Yes the user of a task_struct pointer found via a userspace pid is valid
>> for the life of an rcu critical section, and the bug is indeed that we
>> drop the rcu_lock and somehow expect the task to remain valid.
>>
>> The guarantee comes from release_task. In release_task we call
>> __exit_signal which calls __unhash_process, and then we call
>> delayed_put_task to guarantee that the task lives until the end of the
>> rcu interval.
>
> Ah. Ok. Great.
>
>> In migrate_pages we have a lot of task accesses outside of the rcu
>> critical section, and without a reference count on task.
>
> Yes but that is only of interesting for setup and verification of
> permissions. What matters during migration is that the mm_struct does not
> go away and we take a refcount on that one.
>
>> I tell you the truth trying to figure out what that code needs to be
>> correct if task != current makes my head hurt.
>
> Hmm...
>
>> I think we need to grab a reference on task_struct, to stop the task
>> from going away, and in addition we need to hold task_lock. To keep
>> task->mm from changing (see exec_mmap). But we can't do that and sleep
>> so I think the entire function needs to be rewritten, and the need for
>> task deep in the migrate_pages path needs to be removed as even with the
>> reference count held we can race with someone calling exec.
>
> We dont need the task during migration. We only need the mm. The task
> is safe until rcu_read_unlock therefore maybe the following should fix
> migrate pages:
>
>
> Subject: migration: Do not do rcu_read_unlock until the last time we need the task_struct pointer
>
> Migration functions perform the rcu_read_unlock too early. As a result the
> task pointed to may change. Bugs were introduced when adding security checks
> because rcu_unlock/lock sequences were inserted. Plus the security checks
> and do_move_pages used the task_struct pointer after rcu_unlock.
>
> Fix those issues by removing the unlock/lock sequences and moving the
> rcu_read_unlock after the last use of the task struct pointer.
>
> Signed-off-by: Christoph Lameter<[email protected]>
>
>
Acked-by: KOSAKI Motohiro <[email protected]>
Oh and another thing. There are certain functions that are now called
under rcu read lock()
security_task_movememory
and
cpuset_mems_allowed
cpuset_mems_allowed takes a mutex. Hmmm... Under rcu?
security_task_movememory does some kind of security hook.
Is that all safe? If not then we need to take a refcount on the task
struct after all.
On 02/24/2012 08:54 AM, Christoph Lameter wrote:
> Could you do another patch that removed the duplication?
Yup, working on it already.
On 02/24/2012 09:08 AM, Christoph Lameter wrote:
> cpuset_mems_allowed takes a mutex. Hmmm... Under rcu?
>
> security_task_movememory does some kind of security hook.
>
> Is that all safe? If not then we need to take a refcount on the task
> struct after all.
Urg, no we can't sleep under an rcu_read_lock().
On Fri, 24 Feb 2012, Dave Hansen wrote:
> > Is that all safe? If not then we need to take a refcount on the task
> > struct after all.
>
> Urg, no we can't sleep under an rcu_read_lock().
Ok so take a count and drop it before entering the main migration
function?
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 12 +++++++-----
mm/migrate.c | 20 +++++++++++---------
2 files changed, 18 insertions(+), 14 deletions(-)
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2012-02-24 04:10:01.621614996 -0600
+++ linux-2.6/mm/mempolicy.c 2012-02-24 05:01:43.621530156 -0600
@@ -1293,7 +1293,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
{
const struct cred *cred = current_cred(), *tcred;
struct mm_struct *mm = NULL;
- struct task_struct *task;
+ struct task_struct *task = NULL;
nodemask_t task_nodes;
int err;
nodemask_t *old;
@@ -1318,10 +1318,10 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
rcu_read_lock();
task = pid ? find_task_by_vpid(pid) : current;
if (!task) {
- rcu_read_unlock();
err = -ESRCH;
goto out;
}
+ get_task_struct(task);
mm = get_task_mm(task);
rcu_read_unlock();
@@ -1335,16 +1335,13 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
err = -EPERM;
goto out;
}
- rcu_read_unlock();
task_nodes = cpuset_mems_allowed(task);
/* Is the user allowed to access the target nodes? */
@@ -1362,9 +1359,14 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
if (err)
goto out;
+ put_task_struct(task);
+ task = NULL;
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
out:
+ if (task)
+ put_task_struct(task);
+
if (mm)
mmput(mm);
NODEMASK_SCRATCH_FREE(scratch);
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2012-02-24 04:10:01.609614993 -0600
+++ linux-2.6/mm/migrate.c 2012-02-24 05:07:39.493520424 -0600
@@ -1176,20 +1176,17 @@ set_status:
* Migrate an array of page address onto an array of nodes and fill
* the corresponding array of status.
*/
-static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
+static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
unsigned long nr_pages,
const void __user * __user *pages,
const int __user *nodes,
int __user *status, int flags)
{
struct page_to_node *pm;
- nodemask_t task_nodes;
unsigned long chunk_nr_pages;
unsigned long chunk_start;
int err;
- task_nodes = cpuset_mems_allowed(task);
-
err = -ENOMEM;
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
if (!pm)
@@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
struct task_struct *task;
struct mm_struct *mm;
int err;
+ nodemask_t task_nodes;
/* Check flags */
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
@@ -1366,6 +1364,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
rcu_read_unlock();
return -ESRCH;
}
+ get_task_struct(task);
mm = get_task_mm(task);
rcu_read_unlock();
@@ -1378,30 +1377,33 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
- rcu_read_unlock();
err = -EPERM;
goto out;
}
- rcu_read_unlock();
err = security_task_movememory(task);
if (err)
goto out;
+ task_nodes = cpuset_mems_allowed(task);
+ put_task_struct(task);
+ task = NULL;
+
if (nodes) {
- err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
- flags);
+ err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
+ status, flags);
} else {
err = do_pages_stat(mm, nr_pages, pages, status);
}
out:
mmput(mm);
+ if (task)
+ put_task_struct(task);
return err;
}
On 02/24/2012 09:32 AM, Christoph Lameter wrote:
> @@ -1318,10 +1318,10 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> rcu_read_lock();
> task = pid ? find_task_by_vpid(pid) : current;
> if (!task) {
> - rcu_read_unlock();
> err = -ESRCH;
> goto out;
> }
...
> + put_task_struct(task);
> + task = NULL;
> err = do_migrate_pages(mm, old, new,
> capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
> out:
> + if (task)
> + put_task_struct(task);
> +
> if (mm)
> mmput(mm);
> NODEMASK_SCRATCH_FREE(scratch);
Man, patch did not like this for some reason. I kept throwing most of
the mempolicy.c hunks away. I've never seen anything like it.
Anyway... This looks fine except I think that rcu_read_unlock() need to
stay. There's currently no release of it after out:.
Christoph Lameter <[email protected]> writes:
> On Fri, 24 Feb 2012, Dave Hansen wrote:
>
>> > Is that all safe? If not then we need to take a refcount on the task
>> > struct after all.
>>
>> Urg, no we can't sleep under an rcu_read_lock().
>
> Ok so take a count and drop it before entering the main migration
> function?
As an alternative way at looking at things.
Taking a quick look it does appear that in cpuset_mems_allowed and it's
cousins we never sleep under "callback_mutex" so that lock looks like it
could become a spinlock.
But I have to say something just bothers me about the permissions for
modifying an mm living in the task. We can have different rules
for modifying an mm depending on the path to tme mm?
Especially in things like which numa nodes we can put pages in?
So by specifying a different pid to access them mm through the call can
either work or succeed? Are these checks really sane?
Eric
> ---
> mm/mempolicy.c | 12 +++++++-----
> mm/migrate.c | 20 +++++++++++---------
> 2 files changed, 18 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c 2012-02-24 04:10:01.621614996 -0600
> +++ linux-2.6/mm/mempolicy.c 2012-02-24 05:01:43.621530156 -0600
> @@ -1293,7 +1293,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> {
>
> const struct cred *cred = current_cred(), *tcred;
> struct mm_struct *mm = NULL;
> - struct task_struct *task;
> + struct task_struct *task = NULL;
> nodemask_t task_nodes;
> int err;
> nodemask_t *old;
> @@ -1318,10 +1318,10 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> rcu_read_lock();
> task = pid ? find_task_by_vpid(pid) : current;
> if (!task) {
> - rcu_read_unlock();
> err = -ESRCH;
> goto out;
> }
> + get_task_struct(task);
> mm = get_task_mm(task);
> rcu_read_unlock();
>
> @@ -1335,16 +1335,13 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> * capabilities, superuser privileges or the same
> * userid as the target process.
> */
> - rcu_read_lock();
> tcred = __task_cred(task);
> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
> cred->uid != tcred->suid && cred->uid != tcred->uid &&
> !capable(CAP_SYS_NICE)) {
> - rcu_read_unlock();
> err = -EPERM;
> goto out;
> }
> - rcu_read_unlock();
>
> task_nodes = cpuset_mems_allowed(task);
> /* Is the user allowed to access the target nodes? */
> @@ -1362,9 +1359,14 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
> if (err)
> goto out;
>
> + put_task_struct(task);
> + task = NULL;
> err = do_migrate_pages(mm, old, new,
> capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
> out:
> + if (task)
> + put_task_struct(task);
> +
> if (mm)
> mmput(mm);
> NODEMASK_SCRATCH_FREE(scratch);
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c 2012-02-24 04:10:01.609614993 -0600
> +++ linux-2.6/mm/migrate.c 2012-02-24 05:07:39.493520424 -0600
> @@ -1176,20 +1176,17 @@ set_status:
> * Migrate an array of page address onto an array of nodes and fill
> * the corresponding array of status.
> */
> -static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
> +static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> unsigned long nr_pages,
> const void __user * __user *pages,
> const int __user *nodes,
> int __user *status, int flags)
> {
> struct page_to_node *pm;
> - nodemask_t task_nodes;
> unsigned long chunk_nr_pages;
> unsigned long chunk_start;
> int err;
>
> - task_nodes = cpuset_mems_allowed(task);
> -
> err = -ENOMEM;
> pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> if (!pm)
> @@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> struct task_struct *task;
> struct mm_struct *mm;
> int err;
> + nodemask_t task_nodes;
>
> /* Check flags */
> if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> @@ -1366,6 +1364,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> rcu_read_unlock();
> return -ESRCH;
> }
> + get_task_struct(task);
> mm = get_task_mm(task);
> rcu_read_unlock();
>
> @@ -1378,30 +1377,33 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
> * capabilities, superuser privileges or the same
> * userid as the target process.
> */
> - rcu_read_lock();
> tcred = __task_cred(task);
> if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
> cred->uid != tcred->suid && cred->uid != tcred->uid &&
> !capable(CAP_SYS_NICE)) {
> - rcu_read_unlock();
> err = -EPERM;
> goto out;
> }
> - rcu_read_unlock();
>
> err = security_task_movememory(task);
> if (err)
> goto out;
>
> + task_nodes = cpuset_mems_allowed(task);
> + put_task_struct(task);
> + task = NULL;
> +
> if (nodes) {
> - err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
> - flags);
> + err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
> + status, flags);
> } else {
> err = do_pages_stat(mm, nr_pages, pages, status);
> }
>
> out:
> mmput(mm);
> + if (task)
> + put_task_struct(task);
> return err;
> }
Christoph Lameter <[email protected]> writes:
> On Fri, 24 Feb 2012, Dave Hansen wrote:
>
>> > Is that all safe? If not then we need to take a refcount on the task
>> > struct after all.
>>
>> Urg, no we can't sleep under an rcu_read_lock().
>
> Ok so take a count and drop it before entering the main migration
> function?
For correct operation of kernel code a count sounds fine.
If you are going to allow sleeping how do you ensure that an exec that
happens between the taking of the reference count and checking the
permissions does not mess things up.
At the very least the patch description needs an explanation of what
the thinking will be in that case.
At the moment I suspect the permissions checks are not safe unless
performed under both rcu_read_lock and task_lock to ensure that
the task<->mm association does not change on us while we are
working. Even with that the cred can change under us but at least
we know the cred will be valid until rcu_read_unlock happens.
This entire thinhg of modifying another process is a pain.
Perhaps you can play with task->self_exec_id to detect an exec and fail
the system call if there was an exec in between when we find the task
and when we drop the task reference.
Eric
On Fri, 24 Feb 2012, Eric W. Biederman wrote:
> Taking a quick look it does appear that in cpuset_mems_allowed and it's
> cousins we never sleep under "callback_mutex" so that lock looks like it
> could become a spinlock.
>
> But I have to say something just bothers me about the permissions for
> modifying an mm living in the task. We can have different rules
> for modifying an mm depending on the path to tme mm?
Yes. Permissions are associated with pids which refer to tasks. Tasks have
address spaces and tasks may share address spaces.
> Especially in things like which numa nodes we can put pages in?
Things = address spaces? The page migration functionality is about moving
the location of physical memory from one numa node to the other. It does
not affect the execution just the latencies experienced by the processes.
> So by specifying a different pid to access them mm through the call can
> either work or succeed? Are these checks really sane?
Yes if you can create two pids with the same address space and give
those those pids to different owners then the permission checks on one
may fail and succeed on the other. We have no way to refer to address
spaces from user space outside of a pid.
On Sat, 25 Feb 2012, Eric W. Biederman wrote:
> > Ok so take a count and drop it before entering the main migration
> > function?
>
> For correct operation of kernel code a count sounds fine.
>
> If you are going to allow sleeping how do you ensure that an exec that
> happens between the taking of the reference count and checking the
> permissions does not mess things up.
Ok in that case there is a race between which of the two address space
structures (mm structs) are used. But that is up to the user to resolve if
he wants to.
> At the moment I suspect the permissions checks are not safe unless
> performed under both rcu_read_lock and task_lock to ensure that
> the task<->mm association does not change on us while we are
> working. Even with that the cred can change under us but at least
> we know the cred will be valid until rcu_read_unlock happens.
The permissions check only refer to the task struct.
> This entire thinhg of modifying another process is a pain.
>
> Perhaps you can play with task->self_exec_id to detect an exec and fail
> the system call if there was an exec in between when we find the task
> and when we drop the task reference.
I am not sure why there would be an issue. We have to operate on one mm
the pid refers to. If it changes then we may either operate on the old
one or the new one.
We can move the determination of the mm to the last point possible to show
that it is not used earlier?
Subject: migration: Fix rcu and task refcounting
Migration functions perform the rcu_read_unlock too early. As a result the
task pointed to may change from under us.
The following patch extend the period of the rcu_read_lock until after the
permissions checks are done. We also take a refcount so that the task
reference is stable when calling security check functions and performing
cpuset node validation (which takes a mutex).
The refcount is dropped before actual page migration occurs so there is no
change to the refcounts held during page migration.
Also move the determination of the mm of the task struct to immediately
before the do_migrate*() calls so that it is clear that we switch from
handling the task during permission checks to the mm for the actual
migration.
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 25 +++++++++++++++----------
mm/migrate.c | 37 ++++++++++++++++++++-----------------
2 files changed, 35 insertions(+), 27 deletions(-)
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2012-02-27 06:27:37.322300127 -0600
+++ linux-2.6/mm/mempolicy.c 2012-02-27 06:57:56.606250374 -0600
@@ -1293,7 +1293,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
{
const struct cred *cred = current_cred(), *tcred;
struct mm_struct *mm = NULL;
- struct task_struct *task;
+ struct task_struct *task = NULL;
nodemask_t task_nodes;
int err;
nodemask_t *old;
@@ -1322,12 +1322,9 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
err = -ESRCH;
goto out;
}
- mm = get_task_mm(task);
- rcu_read_unlock();
+ get_task_struct(task);
err = -EINVAL;
- if (!mm)
- goto out;
/*
* Check if this process has the right to modify the specified
@@ -1335,7 +1332,6 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
@@ -1362,11 +1358,20 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
if (err)
goto out;
- err = do_migrate_pages(mm, old, new,
- capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
-out:
+ mm = get_task_mm(task);
+ put_task_struct(task);
+ task = NULL;
if (mm)
- mmput(mm);
+ err = do_migrate_pages(mm, old, new,
+ capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+ else
+ err = -EINVAL;
+
+ mmput(mm);
+out:
+ if (task)
+ put_task_struct(task);
+
NODEMASK_SCRATCH_FREE(scratch);
return err;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2012-02-27 06:27:37.314300125 -0600
+++ linux-2.6/mm/migrate.c 2012-02-27 06:56:50.654252173 -0600
@@ -1176,20 +1176,17 @@ set_status:
* Migrate an array of page address onto an array of nodes and fill
* the corresponding array of status.
*/
-static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
+static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
unsigned long nr_pages,
const void __user * __user *pages,
const int __user *nodes,
int __user *status, int flags)
{
struct page_to_node *pm;
- nodemask_t task_nodes;
unsigned long chunk_nr_pages;
unsigned long chunk_start;
int err;
- task_nodes = cpuset_mems_allowed(task);
-
err = -ENOMEM;
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
if (!pm)
@@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
struct task_struct *task;
struct mm_struct *mm;
int err;
+ nodemask_t task_nodes;
/* Check flags */
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
@@ -1366,11 +1364,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
rcu_read_unlock();
return -ESRCH;
}
- mm = get_task_mm(task);
- rcu_read_unlock();
-
- if (!mm)
- return -EINVAL;
+ get_task_struct(task);
/*
* Check if this process has the right to modify the specified
@@ -1378,7 +1372,6 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
@@ -1393,15 +1386,25 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
if (err)
goto out;
- if (nodes) {
- err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
- flags);
- } else {
- err = do_pages_stat(mm, nr_pages, pages, status);
- }
+ task_nodes = cpuset_mems_allowed(task);
+ mm = get_task_mm(task);
+ put_task_struct(task);
+ task = NULL;
+
+ if (mm) {
+ if (nodes)
+ err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
+ status, flags);
+ else
+ err = do_pages_stat(mm, nr_pages, pages, status);
+ } else
+ err = -EINVAL;
-out:
mmput(mm);
+
+out:
+ if (task)
+ put_task_struct(task);
return err;
}
Christoph Lameter <[email protected]> writes:
> On Sat, 25 Feb 2012, Eric W. Biederman wrote:
>
>> > Ok so take a count and drop it before entering the main migration
>> > function?
>>
>> For correct operation of kernel code a count sounds fine.
>>
>> If you are going to allow sleeping how do you ensure that an exec that
>> happens between the taking of the reference count and checking the
>> permissions does not mess things up.
>
> Ok in that case there is a race between which of the two address space
> structures (mm structs) are used. But that is up to the user to resolve if
> he wants to.
>
>> At the moment I suspect the permissions checks are not safe unless
>> performed under both rcu_read_lock and task_lock to ensure that
>> the task<->mm association does not change on us while we are
>> working. Even with that the cred can change under us but at least
>> we know the cred will be valid until rcu_read_unlock happens.
>
> The permissions check only refer to the task struct.
>
>> This entire thinhg of modifying another process is a pain.
>>
>> Perhaps you can play with task->self_exec_id to detect an exec and fail
>> the system call if there was an exec in between when we find the task
>> and when we drop the task reference.
>
> I am not sure why there would be an issue. We have to operate on one mm
> the pid refers to. If it changes then we may either operate on the old
> one or the new one.
>
> We can move the determination of the mm to the last point possible to show
> that it is not used earlier?
If we are just changing the numa node on which the pages reside it isn't
too bad of a problem. Somehow from the names I thought we were moving
pages from one task to another.
The problem that I see is that we may race with a suid exec in which
case the permissions checks might pass for the pre-exec state and then
we get the post exec mm that we don't actually have permissions for,
but we manipulate it anyway.
Another possibility is that half the permission checks could be
performed on the pre-exec state and another half the permission checks
on the post-exec state, and we happen to pass as a fluke in a situation
where neither the pre nor the post exec state would be allowed (for
different reasons) but looking at the inconsistent allowed us to pass.
So we really need to do something silly like get task and
task->self_exec_id. Then perform the permission checks and get the mm.
Then if just before we perform the operation task->self_exec_id is
different restart the system call, or fail with something like -EAGAIN.
Eric
On Mon, 27 Feb 2012, Eric W. Biederman wrote:
> The problem that I see is that we may race with a suid exec in which
> case the permissions checks might pass for the pre-exec state and then
> we get the post exec mm that we don't actually have permissions for,
> but we manipulate it anyway.
So what? Page migration does not change the behavior of the code. It only
changes the latencies seen. The hacker can mess up the code so that the
suid exec runs slower?
> So we really need to do something silly like get task and
> task->self_exec_id. Then perform the permission checks and get the mm.
> Then if just before we perform the operation task->self_exec_id is
> different restart the system call, or fail with something like -EAGAIN.
I am still not convinced as to why we would do this.
Here is a cleaned up version of the patch. No longer rely on the
task_struct pointer to be NULL for release of the refcount.
Subject: migration: Fix rcu and task refcounting
Migration functions perform the rcu_read_unlock too early. As a result the
task pointed to may change from under us.
The following patch extend the period of the rcu_read_lock until after the
permissions checks are done. We also take a refcount so that the task
reference is stable when calling security check functions and performing
cpuset node validation (which takes a mutex).
The refcount is dropped before actual page migration occurs so there is no
change to the refcounts held during page migration.
Also move the determination of the mm of the task struct to immediately
before the do_migrate*() calls so that it is clear that we switch from
handling the task during permission checks to the mm for the actual
migration. Since the determination is only done once and we then no longer
use the task_struct we can be sure that we operate on a specific address
space that will not change from under us.
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 32 +++++++++++++++++++-------------
mm/migrate.c | 36 +++++++++++++++++++-----------------
2 files changed, 38 insertions(+), 30 deletions(-)
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2012-02-28 03:56:41.236184783 -0600
+++ linux-2.6/mm/mempolicy.c 2012-02-28 07:22:35.245552282 -0600
@@ -1322,12 +1322,9 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
err = -ESRCH;
goto out;
}
- mm = get_task_mm(task);
- rcu_read_unlock();
+ get_task_struct(task);
err = -EINVAL;
- if (!mm)
- goto out;
/*
* Check if this process has the right to modify the specified
@@ -1335,14 +1332,13 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
!capable(CAP_SYS_NICE)) {
rcu_read_unlock();
err = -EPERM;
- goto out;
+ goto out_put;
}
rcu_read_unlock();
@@ -1350,26 +1346,36 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
/* Is the user allowed to access the target nodes? */
if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
err = -EPERM;
- goto out;
+ goto out_put;
}
if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
err = -EINVAL;
- goto out;
+ goto out_put;
}
err = security_task_movememory(task);
if (err)
- goto out;
+ goto out_put;
- err = do_migrate_pages(mm, old, new,
- capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
-out:
+ mm = get_task_mm(task);
+ put_task_struct(task);
if (mm)
- mmput(mm);
+ err = do_migrate_pages(mm, old, new,
+ capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+ else
+ err = -EINVAL;
+
+ mmput(mm);
+out:
NODEMASK_SCRATCH_FREE(scratch);
return err;
+
+out_put:
+ put_task_struct(task);
+ goto out;
+
}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2012-02-28 06:15:38.239956766 -0600
+++ linux-2.6/mm/migrate.c 2012-02-28 07:18:08.237559577 -0600
@@ -1176,20 +1176,17 @@ set_status:
* Migrate an array of page address onto an array of nodes and fill
* the corresponding array of status.
*/
-static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
+static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
unsigned long nr_pages,
const void __user * __user *pages,
const int __user *nodes,
int __user *status, int flags)
{
struct page_to_node *pm;
- nodemask_t task_nodes;
unsigned long chunk_nr_pages;
unsigned long chunk_start;
int err;
- task_nodes = cpuset_mems_allowed(task);
-
err = -ENOMEM;
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
if (!pm)
@@ -1351,6 +1348,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
struct task_struct *task;
struct mm_struct *mm;
int err;
+ nodemask_t task_nodes;
/* Check flags */
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
@@ -1366,11 +1364,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
rcu_read_unlock();
return -ESRCH;
}
- mm = get_task_mm(task);
- rcu_read_unlock();
-
- if (!mm)
- return -EINVAL;
+ get_task_struct(task);
/*
* Check if this process has the right to modify the specified
@@ -1378,7 +1372,6 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
* capabilities, superuser privileges or the same
* userid as the target process.
*/
- rcu_read_lock();
tcred = __task_cred(task);
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
cred->uid != tcred->suid && cred->uid != tcred->uid &&
@@ -1393,16 +1386,25 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid,
if (err)
goto out;
- if (nodes) {
- err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
- flags);
- } else {
- err = do_pages_stat(mm, nr_pages, pages, status);
- }
+ task_nodes = cpuset_mems_allowed(task);
+ mm = get_task_mm(task);
+ put_task_struct(task);
+
+ if (mm) {
+ if (nodes)
+ err = do_pages_move(mm, task_nodes, nr_pages, pages, nodes,
+ status, flags);
+ else
+ err = do_pages_stat(mm, nr_pages, pages, status);
+ } else
+ err = -EINVAL;
-out:
mmput(mm);
return err;
+
+out:
+ put_task_struct(task);
+ return err;
}
/*
On Tue, 28 Feb 2012 13:30:19 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> Migration functions perform the rcu_read_unlock too early. As a result the
> task pointed to may change from under us.
>
> The following patch extend the period of the rcu_read_lock until after the
> permissions checks are done. We also take a refcount so that the task
> reference is stable when calling security check functions and performing
> cpuset node validation (which takes a mutex).
>
> The refcount is dropped before actual page migration occurs so there is no
> change to the refcounts held during page migration.
>
> Also move the determination of the mm of the task struct to immediately
> before the do_migrate*() calls so that it is clear that we switch from
> handling the task during permission checks to the mm for the actual
> migration. Since the determination is only done once and we then no longer
> use the task_struct we can be sure that we operate on a specific address
> space that will not change from under us.
What was the user-visible impact of the bug?
Please always include info this in bug fix changelogs - it helps me and
others to decide which kernel version(s) the patch should be merged
into.
On Wed, 29 Feb 2012, Andrew Morton wrote:
> What was the user-visible impact of the bug?
THis was an oops reported by Dave Hansen after he inserted some loops in
the code to trigger a race condition.
On 02/29/2012 12:31 PM, Andrew Morton wrote:
> What was the user-visible impact of the bug?
It'll probably oops dereferencing task->cred:
https://lkml.org/lkml/2012/2/23/302
Although I was never actually able to get it to trigger without some
code to enlarge the race window.