2014-10-15 12:31:47

by Kirill Tkhai

Subject: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free


This WARN_ON_ONCE() placed into __schedule() triggers the following warning:

@@ -2852,6 +2852,7 @@ static void __sched __schedule(void)

if (likely(prev != next)) {
rq->nr_switches++;
+ WARN_ON_ONCE(atomic_read(&prev->usage) == 1);
rq->curr = next;
++*switch_count;

WARNING: CPU: 2 PID: 6497 at kernel/sched/core.c:2855 __schedule+0x656/0x8a0()
Modules linked in:
CPU: 2 PID: 6497 Comm: cat Not tainted 3.17.0+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000009 ffff88022f50bdd8 ffffffff81518c78 0000000000000004
0000000000000000 ffff88022f50be18 ffffffff8104b1ac ffff88022f50be18
ffff880239912b40 ffff88022e5720d0 0000000000000002 0000000000000000
Call Trace:
[<ffffffff81518c78>] dump_stack+0x4f/0x7c
[<ffffffff8104b1ac>] warn_slowpath_common+0x7c/0xa0
[<ffffffff8104b275>] warn_slowpath_null+0x15/0x20
[<ffffffff8151bad6>] __schedule+0x656/0x8a0
[<ffffffff8151bd44>] schedule+0x24/0x70
[<ffffffff8104c7aa>] do_exit+0x72a/0xb40
[<ffffffff81071b31>] ? get_parent_ip+0x11/0x50
[<ffffffff8104da6a>] do_group_exit+0x3a/0xa0
[<ffffffff8104dadf>] SyS_exit_group+0xf/0x10
[<ffffffff8151fe92>] system_call_fastpath+0x12/0x17
---[ end trace d07155396c4faa0c ]---

This means the final put_task_struct() does not follow the RCU rules.
From the scheduler's point of view this can lead to a use-after-free:

task_numa_compare()                      schedule()
  rcu_read_lock()                        ...
  cur = ACCESS_ONCE(dst_rq->curr)        ...
  ...                                    rq->curr = next;
  ...                                    context_switch()
  ...                                      finish_task_switch()
  ...                                        put_task_struct()
  ...                                          __put_task_struct()
  ...                                            free_task_struct()
  task_numa_assign()                     ...
    get_task_struct()                    ...

If other subsystems hold a similar reference to a task, the same problem
is possible there too.

Delayed put_task_struct() was introduced in commit 8c7904a00b06
("task: RCU protect task->usage") from Fri Mar 31 02:31:37 2006.

It looks like it was safe to use it that way back then, but now it's not.
Something has changed (preemptible RCU?). Welcome to the analysis!

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/sched.h | 3 ++-
kernel/exit.c | 8 ++++----
kernel/fork.c | 1 -
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..6bfc041 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)

extern void __put_task_struct(struct task_struct *t);
+extern void __put_task_struct_cb(struct rcu_head *rhp);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ call_rcu(&t->rcu, __put_task_struct_cb);
}

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..326eae7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
}
}

-static void delayed_put_task_struct(struct rcu_head *rhp)
+void __put_task_struct_cb(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

perf_event_delayed_put(tsk);
trace_sched_process_free(tsk);
- put_task_struct(tsk);
+ __put_task_struct(tsk);
}
-
+EXPORT_SYMBOL_GPL(__put_task_struct_cb);

void release_task(struct task_struct *p)
{
@@ -207,7 +207,7 @@ void release_task(struct task_struct *p)

write_unlock_irq(&tasklist_lock);
release_thread(p);
- call_rcu(&p->rcu, delayed_put_task_struct);
+ put_task_struct(p);

p = leader;
if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..4d3ac3c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
}
-EXPORT_SYMBOL_GPL(__put_task_struct);

void __init __weak arch_task_cache_init(void) { }




2014-10-15 15:10:13

by Oleg Nesterov

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/15, Kirill Tkhai wrote:
>
> This WARN_ON_ONCE() placed into __schedule() triggers warning:
>
> @@ -2852,6 +2852,7 @@ static void __sched __schedule(void)
>
> if (likely(prev != next)) {
> rq->nr_switches++;
> + WARN_ON_ONCE(atomic_read(&prev->usage) == 1);

I think you know this, but let me clarify just in case: this WARN() is
wrong; prev->usage == 1 is fine if the task does its last schedule()
and was already (auto)reaped.

> This means the final put_task_struct() happens against RCU rules.

Well, yes, it doesn't go through delayed_put_task_struct(). But this
should be fine; this drops the extra reference created by dup_task_struct().

However,

> Regarding to scheduler this may be a reason of use-after-free.
>
> task_numa_compare()                      schedule()
>   rcu_read_lock()                        ...
>   cur = ACCESS_ONCE(dst_rq->curr)        ...
>   ...                                    rq->curr = next;
>   ...                                    context_switch()
>   ...                                      finish_task_switch()
>   ...                                        put_task_struct()
>   ...                                          __put_task_struct()
>   ...                                            free_task_struct()
>   task_numa_assign()                     ...
>     get_task_struct()                    ...

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.

At least the code like

	rcu_read_lock();
	get_task_struct(foreign_rq->curr);
	rcu_read_unlock();

is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
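To illustrate the semantics of the suggested helper, here is a minimal userspace sketch using C11 atomics. `struct task`, `get_task()` and `try_to_get_task()` are illustrative stand-ins for `task_struct`, `get_task_struct()` and the proposed `try_to_get_task_struct()`, not kernel API; the point is the atomic_inc_not_zero() pattern, which refuses to take a reference once the count has already reached zero:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for task_struct's ->usage refcount. */
struct task { atomic_int usage; };

/* Blind increment, as get_task_struct() does: it "succeeds" even if
 * the count already dropped to zero, which is exactly the bug. */
static void get_task(struct task *t)
{
	atomic_fetch_add(&t->usage, 1);
}

/* Sketch of the proposed try_to_get_task_struct(): take a reference
 * only if the count is still nonzero (atomic_inc_not_zero()). */
static bool try_to_get_task(struct task *t)
{
	int old = atomic_load(&t->usage);

	do {
		if (old == 0)
			return false;	/* final put already happened */
	} while (!atomic_compare_exchange_weak(&t->usage, &old, old + 1));
	return true;
}
```

A caller racing with the final put then gets a clean failure instead of resurrecting a count that a concurrent free path already observed as zero.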

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
> #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
>
> extern void __put_task_struct(struct task_struct *t);
> +extern void __put_task_struct_cb(struct rcu_head *rhp);
>
> static inline void put_task_struct(struct task_struct *t)
> {
> if (atomic_dec_and_test(&t->usage))
> - __put_task_struct(t);
> + call_rcu(&t->rcu, __put_task_struct_cb);
> }
>
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 5d30019..326eae7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
> }
> }
>
> -static void delayed_put_task_struct(struct rcu_head *rhp)
> +void __put_task_struct_cb(struct rcu_head *rhp)
> {
> struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
>
> perf_event_delayed_put(tsk);
> trace_sched_process_free(tsk);
> - put_task_struct(tsk);
> + __put_task_struct(tsk);
> }
> -
> +EXPORT_SYMBOL_GPL(__put_task_struct_cb);
>
> void release_task(struct task_struct *p)
> {
> @@ -207,7 +207,7 @@ void release_task(struct task_struct *p)
>
> write_unlock_irq(&tasklist_lock);
> release_thread(p);
> - call_rcu(&p->rcu, delayed_put_task_struct);
> + put_task_struct(p);
>
> p = leader;
> if (unlikely(zap_leader))
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..4d3ac3c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
> if (!profile_handoff_task(tsk))
> free_task(tsk);
> }
> -EXPORT_SYMBOL_GPL(__put_task_struct);
>
> void __init __weak arch_task_cache_init(void) { }

Hmm. I am not sure I understand how this patch can actually fix the problem.
It seems that it is still possible for get_task_struct() to be called after
call_rcu(__put_task_struct_cb)? But perhaps I misread this patch.

And I think it adds another problem. Suppose we have a zombie which has
already called schedule() in TASK_DEAD state. IOW, its ->usage == 1, and
its parent will free this task when it calls wait().

With this patch the code like

	rcu_read_lock();
	for_each_process(p) {
		if (pred(p)) {
			get_task_struct(p);
			return p;
		}
	}
	rcu_read_unlock();

becomes unsafe: we can race with release_task(p), get_task_struct() can
be called when p->usage is already 0, and this task_struct can be freed
once you drop rcu_read_lock().

Oleg.

2014-10-15 19:44:19

by Oleg Nesterov

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/15, Oleg Nesterov wrote:
>
> On 10/15, Kirill Tkhai wrote:
> >
> > Regarding to scheduler this may be a reason of use-after-free.
> >
> > task_numa_compare()                      schedule()
> >   rcu_read_lock()                        ...
> >   cur = ACCESS_ONCE(dst_rq->curr)        ...
> >   ...                                    rq->curr = next;
> >   ...                                    context_switch()
> >   ...                                      finish_task_switch()
> >   ...                                        put_task_struct()
> >   ...                                          __put_task_struct()
> >   ...                                            free_task_struct()
> >   task_numa_assign()                     ...
> >     get_task_struct()                    ...
>
> Agreed. I don't understand this code (will try to take another look later),
> but at first glance this looks wrong.
>
> At least the code like
>
> rcu_read_lock();
> get_task_struct(foreign_rq->curr);
> rcu_read_unlock();
>
> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Yes, but perhaps in this particular case another simple fix makes more
sense. The patch below needs a comment to explain that we check PF_EXITING
because:

1. It doesn't make sense to migrate the exiting task. Although perhaps
we could check ->mm == NULL instead.

But let me repeat that I do not understand this code, I am not sure
we can equally treat is_idle_task() and PF_EXITING here...

2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
won't be called until we drop rcu_read_lock(), and thus get_task_struct()
is safe.

And, it seems there is another problem: can't task_h_load(cur) race
with itself if 2 CPUs call task_numa_migrate() and inspect the same rq
in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
doesn't look "atomic". In fact I am not even sure about task_h_load(env->p);
p == current, but we do not disable preemption.

What do you think?

Oleg.

--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas

 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
 		cur = NULL;

 	/*

2014-10-15 21:46:15

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Yeah, you're right about the initial patch. Thanks for the exit-signal explanation.

On 15.10.2014 23:40, Oleg Nesterov wrote:
> On 10/15, Oleg Nesterov wrote:
>>
>> On 10/15, Kirill Tkhai wrote:
>>>
>>> Regarding to scheduler this may be a reason of use-after-free.
>>>
>>> task_numa_compare()                      schedule()
>>>   rcu_read_lock()                        ...
>>>   cur = ACCESS_ONCE(dst_rq->curr)        ...
>>>   ...                                    rq->curr = next;
>>>   ...                                    context_switch()
>>>   ...                                      finish_task_switch()
>>>   ...                                        put_task_struct()
>>>   ...                                          __put_task_struct()
>>>   ...                                            free_task_struct()
>>>   task_numa_assign()                     ...
>>>     get_task_struct()                    ...
>>
>> Agreed. I don't understand this code (will try to take another look later),
>> but at first glance this looks wrong.
>>
>> At least the code like
>>
>> rcu_read_lock();
>> get_task_struct(foreign_rq->curr);
>> rcu_read_unlock();
>>
>> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
>> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
>
> Yes, but perhaps in this particular case another simple fix makes more
> sense. The patch below needs a comment to explain that we check PF_EXITING
> because:
>
> 1. It doesn't make sense to migrate the exiting task. Although perhaps
> we could check ->mm == NULL instead.
>
> But let me repeat that I do not understand this code, I am not sure
> we can equally treat is_idle_task() and PF_EXITING here...
>
> 2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
> won't be called until we drop rcu_read_lock(), and thus get_task_struct()
> is safe.
>

Cool! Elegant fix. We set PF_EXITING in exit_signals(), which happens
before release_task() is called.

Shouldn't we use smp_rmb/smp_wmb here?

> And. it seems that there is another problem? Can't task_h_load(cur) race
> with itself if 2 CPU's call task_numa_migrate() and inspect the same rq
> in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
> doesn't look "atomic". In fact I am not even sure about task_h_load(env->p),
> p == current but we do not disable preemption.
>
> What do you think?

We use it completely unlocked, so nothing good can come of that. Also,
we work with pointers.

As I understand it, in update_cfs_rq_h_load() we go from bottom to top,
and then from top to bottom. We set cfs_rq::h_load_next to be able
to do the top-to-bottom pass (the top is the root of the "tree").

Yeah, this "path" may be overwritten by a concurrent caller. Also, the
task may change its cfs_rq.
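The two passes described above can be sketched in userspace with a toy hierarchy. This is a grossly simplified model of what update_cfs_rq_h_load() does, not the kernel's real shares arithmetic; the struct fields and the weight math below are illustrative assumptions. It does show why the shared scratch pointer `h_load_next` is a problem: a concurrent caller working on a different leaf overwrites the same breadcrumbs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a cfs_rq hierarchy node (fields are assumptions). */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL at the root */
	struct toy_cfs_rq *h_load_next;	/* scratch: next hop downward */
	unsigned long weight;		/* this group's weight in its parent */
	unsigned long parent_total;	/* total weight at the parent level */
	unsigned long h_load;		/* computed hierarchical load */
};

static void toy_update_h_load(struct toy_cfs_rq *leaf, unsigned long root_load)
{
	struct toy_cfs_rq *rq;

	/*
	 * Pass 1 (bottom-up): leave breadcrumbs so the root can find its
	 * way back down to this particular leaf. A concurrent caller for
	 * a different leaf overwrites these same ->h_load_next fields --
	 * the race discussed above.
	 */
	leaf->h_load_next = NULL;
	for (rq = leaf; rq->parent; rq = rq->parent)
		rq->parent->h_load_next = rq;

	/* Pass 2 (top-down): rq is now the root; scale load per level. */
	rq->h_load = root_load;
	while ((rq = rq->h_load_next) != NULL)
		rq->h_load = rq->parent->h_load * rq->weight / rq->parent_total;
}
```

With a root load of 1024, a middle group holding 1 of 2 shares, and a leaf holding 1 of 4 shares, the pass computes 512 and 128 respectively; interleave a second caller between the passes and the downward walk follows the wrong breadcrumbs.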

> --- x/kernel/sched/fair.c
> +++ x/kernel/sched/fair.c
> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> cur = NULL;
>
> /*
>

Looks like we have to use the same fix for task_numa_group().

	grp = rcu_dereference(tsk->numa_group);

Below that we dereference grp->nr_tasks.

Also, the same applies to rt.c and deadline.c, but we do not take a
second reference there. A wrong pointer dereference is not possible
there, so it's not as bad.

Kirill

2014-10-15 22:02:44

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 16.10.2014 01:46, Kirill Tkhai wrote:
> Yeah, you're right about the initial patch. Thanks for the exit-signal explanation.
>
> On 15.10.2014 23:40, Oleg Nesterov wrote:
>> On 10/15, Oleg Nesterov wrote:
>>>
>>> On 10/15, Kirill Tkhai wrote:
>>>>
>>>> Regarding to scheduler this may be a reason of use-after-free.
>>>>
>>>> task_numa_compare()                      schedule()
>>>>   rcu_read_lock()                        ...
>>>>   cur = ACCESS_ONCE(dst_rq->curr)        ...
>>>>   ...                                    rq->curr = next;
>>>>   ...                                    context_switch()
>>>>   ...                                      finish_task_switch()
>>>>   ...                                        put_task_struct()
>>>>   ...                                          __put_task_struct()
>>>>   ...                                            free_task_struct()
>>>>   task_numa_assign()                     ...
>>>>     get_task_struct()                    ...
>>>
>>> Agreed. I don't understand this code (will try to take another look later),
>>> but at first glance this looks wrong.
>>>
>>> At least the code like
>>>
>>> rcu_read_lock();
>>> get_task_struct(foreign_rq->curr);
>>> rcu_read_unlock();
>>>
>>> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
>>> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
>>
>> Yes, but perhaps in this particular case another simple fix makes more
>> sense. The patch below needs a comment to explain that we check PF_EXITING
>> because:
>>
>> 1. It doesn't make sense to migrate the exiting task. Although perhaps
>> we could check ->mm == NULL instead.
>>
>> But let me repeat that I do not understand this code, I am not sure
>> we can equally treat is_idle_task() and PF_EXITING here...
>>
>> 2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
>> won't be called until we drop rcu_read_lock(), and thus get_task_struct()
>> is safe.
>>
>
> Cool! Elegant fix. We set PF_EXITING in exit_signals(), which is earlier
> than release_task() is called.
>
> Shouldn't we use smp_rmb/smp_wmb here?
>
>> And. it seems that there is another problem? Can't task_h_load(cur) race
>> with itself if 2 CPU's call task_numa_migrate() and inspect the same rq
>> in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
>> doesn't look "atomic". In fact I am not even sure about task_h_load(env->p),
>> p == current but we do not disable preemption.
>>
>> What do you think?
>
> We use it completely unlocked, so nothing good is here. Also we work
> with pointers.
>
> As I understand in update_cfs_rq_h_load() we go from bottom to top,
> and then from top to bottom. We set cfs_rq::h_load_next to be able
> to do top-bottom passage (top is a root of "tree").

> Yeah, this "way" may be overwritten by competitor. Also, task may change
> its cfs_rq.

Wrong, it's not a task... My brain is sleepy; better to continue tomorrow.

>
>> --- x/kernel/sched/fair.c
>> +++ x/kernel/sched/fair.c
>> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>>
>> rcu_read_lock();
>> cur = ACCESS_ONCE(dst_rq->curr);
>> - if (cur->pid == 0) /* idle */
>> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>> cur = NULL;
>>
>> /*
>>
>
> Looks like, we have to use the same fix for task_numa_group().
>
> grp = rcu_dereference(tsk->numa_group);
>
> Below we dereference grp->nr_tasks.
>
> Also, the same in rt.c and deadline.c, but we do no take second
> reference there. Wrong pointer dereference is not possible there,
> not so bad.
>
> Kirill
>

2014-10-16 07:57:02

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Wed, Oct 15, 2014 at 09:40:44PM +0200, Oleg Nesterov wrote:
> What do you think?
>
> Oleg.
>
> --- x/kernel/sched/fair.c
> +++ x/kernel/sched/fair.c
> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> cur = NULL;
>
> /*

That makes sense, is_idle_task() is indeed the right function there, and
PF_EXITING avoids doing work where it doesn't make sense.

2014-10-16 07:59:55

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > --- x/kernel/sched/fair.c
> > +++ x/kernel/sched/fair.c
> > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> >
> > rcu_read_lock();
> > cur = ACCESS_ONCE(dst_rq->curr);
> > - if (cur->pid == 0) /* idle */
> > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > cur = NULL;
> >
> > /*
> >
>
> Looks like, we have to use the same fix for task_numa_group().

Don't think so, task_numa_group() is only called from task_numa_fault(),
which operates on 'current', and neither idle nor PF_EXITING tasks should
be faulting.

2014-10-16 08:01:21

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Wed, Oct 15, 2014 at 05:06:41PM +0200, Oleg Nesterov wrote:
>
> At least the code like
>
> rcu_read_lock();
> get_task_struct(foreign_rq->curr);
> rcu_read_unlock();
>
> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

There is an rcu_read_lock() around it through task_numa_compare().

2014-10-16 08:16:55

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, 16/10/2014 at 09:59 +0200, Peter Zijlstra wrote:
> On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > --- x/kernel/sched/fair.c
> > > +++ x/kernel/sched/fair.c
> > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > >
> > > rcu_read_lock();
> > > cur = ACCESS_ONCE(dst_rq->curr);
> > > - if (cur->pid == 0) /* idle */
> > > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > cur = NULL;
> > >
> > > /*
> > >
> >
> > Looks like, we have to use the same fix for task_numa_group().
>
> Don't think so, task_numa_group() is only called from task_numa_fault()
> which is on 'current' and neither idle and PF_EXITING should be
> faulting.

Isn't task_numa_group() fully preemptible?

It seems cpu_rq(cpu)->curr is not always equal to p.

2014-10-16 09:43:38

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> On Thu, 16/10/2014 at 09:59 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > --- x/kernel/sched/fair.c
> > > > +++ x/kernel/sched/fair.c
> > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > >
> > > > rcu_read_lock();
> > > > cur = ACCESS_ONCE(dst_rq->curr);
> > > > - if (cur->pid == 0) /* idle */
> > > > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > cur = NULL;
> > > >
> > > > /*
> > > >
> > >
> > > Looks like, we have to use the same fix for task_numa_group().
> >
> > Don't think so, task_numa_group() is only called from task_numa_fault()
> > which is on 'current' and neither idle and PF_EXITING should be
> > faulting.
>
> Isn't task_numa_group() fully preemptible?

Not seeing how that is relevant.

> It seems cpu_rq(cpu)->curr is not always equal to p.

It should be, afaict:

	task_numa_fault()
		p = current;
		...
		task_numa_group(p, ..);

And like I said, idle tasks and PF_EXITING tasks should never get (NUMA)
faults, for they should never be touching userspace.

2014-10-16 09:50:07

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, 16/10/2014 at 11:43 +0200, Peter Zijlstra wrote:
> On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > On Thu, 16/10/2014 at 09:59 +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > --- x/kernel/sched/fair.c
> > > > > +++ x/kernel/sched/fair.c
> > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > >
> > > > > rcu_read_lock();
> > > > > cur = ACCESS_ONCE(dst_rq->curr);
> > > > > - if (cur->pid == 0) /* idle */
> > > > > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > > cur = NULL;
> > > > >
> > > > > /*
> > > > >
> > > >
> > > > Looks like, we have to use the same fix for task_numa_group().
> > >
> > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > which is on 'current' and neither idle and PF_EXITING should be
> > > faulting.
> >
> > Isn't task_numa_group() fully preemptible?
>
> Not seeing how that is relevant.
>
> > It seems cpu_rq(cpu)->curr is not always equal to p.
>
> It should be afaict:
>
> task_numa_fault()
> p = current;
>
> task_numa_group(p, ..);
>
> And like said, idle tasks and PF_EXITING task should never get (numa)
> faults for they should never be touching userspace.

I mean p can be moved to another CPU.

	tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);

tsk is not p (i.e., current) here.

2014-10-16 09:51:58

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, 16/10/2014 at 13:50 +0400, Kirill Tkhai wrote:
> On Thu, 16/10/2014 at 11:43 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > > On Thu, 16/10/2014 at 09:59 +0200, Peter Zijlstra wrote:
> > > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > > --- x/kernel/sched/fair.c
> > > > > > +++ x/kernel/sched/fair.c
> > > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > > >
> > > > > > rcu_read_lock();
> > > > > > cur = ACCESS_ONCE(dst_rq->curr);
> > > > > > - if (cur->pid == 0) /* idle */
> > > > > > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > > > cur = NULL;
> > > > > >
> > > > > > /*
> > > > > >
> > > > >
> > > > > Looks like, we have to use the same fix for task_numa_group().
> > > >
> > > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > > which is on 'current' and neither idle and PF_EXITING should be
> > > > faulting.
> > >
> > > Isn't task_numa_group() fully preemptible?
> >
> > Not seeing how that is relevant.
> >
> > > It seems cpu_rq(cpu)->curr is not always equal to p.
> >
> > It should be afaict:
> >
> > task_numa_fault()
> > p = current;
> >
> > task_numa_group(p, ..);
> >
> > And like said, idle tasks and PF_EXITING task should never get (numa)
> > faults for they should never be touching userspace.
>
> I mean p can be moved to other cpu.
>
> tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
>
> tsk is not p, (i.e current) here.

Maybe I understand it wrong, and preemption is disabled during a memory fault?

2014-10-16 10:04:58

by Kirill Tkhai

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On Thu, 16/10/2014 at 13:51 +0400, Kirill Tkhai wrote:
> On Thu, 16/10/2014 at 13:50 +0400, Kirill Tkhai wrote:
> > On Thu, 16/10/2014 at 11:43 +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > > > On Thu, 16/10/2014 at 09:59 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > > > --- x/kernel/sched/fair.c
> > > > > > > +++ x/kernel/sched/fair.c
> > > > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > > > >
> > > > > > > rcu_read_lock();
> > > > > > > cur = ACCESS_ONCE(dst_rq->curr);
> > > > > > > - if (cur->pid == 0) /* idle */
> > > > > > > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > > > > cur = NULL;
> > > > > > >
> > > > > > > /*
> > > > > > >
> > > > > >
> > > > > > Looks like, we have to use the same fix for task_numa_group().
> > > > >
> > > > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > > > which is on 'current' and neither idle and PF_EXITING should be
> > > > > faulting.
> > > >
> > > > Isn't task_numa_group() fully preemptible?
> > >
> > > Not seeing how that is relevant.
> > >
> > > > It seems cpu_rq(cpu)->curr is not always equal to p.
> > >
> > > It should be afaict:
> > >
> > > task_numa_fault()
> > > p = current;
> > >
> > > task_numa_group(p, ..);
> > >
> > > And like said, idle tasks and PF_EXITING task should never get (numa)
> > > faults for they should never be touching userspace.
> >
> > I mean p can be moved to other cpu.
> >
> > tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
> >
> > tsk is not p, (i.e current) here.
>
> Maybe I understand it wrong, and preemption is disabled during a memory fault?

Ah, I found pagefault_disable(). No questions.

2014-10-16 22:09:16

by Oleg Nesterov

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/16, Peter Zijlstra wrote:
>
> On Wed, Oct 15, 2014 at 05:06:41PM +0200, Oleg Nesterov wrote:
> >
> > At least the code like
> >
> > rcu_read_lock();
> > get_task_struct(foreign_rq->curr);
> > rcu_read_unlock();
> >
> > is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> > we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
>
> There is an rcu_read_lock() around it through task_numa_compare().

Yes, and the code above has rcu_read_lock() too. But it doesn't help
as Kirill pointed out.

Sorry, didn't have time today to read other emails in this thread,
will do tomorrow and (probably) send the patch which adds PF_EXITING
check.

Oleg.

2014-10-17 21:38:14

by Oleg Nesterov

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/16, Kirill Tkhai wrote:
>
> Cool! Elegant fix. We set PF_EXITING in exit_signals(), which is earlier
> than release_task() is called.

OK, thanks, I am sending the patch...

> Shouldn't we use smp_rmb/smp_wmb here?

No, we do not. call_rcu(delayed_put_task_struct) itself implies a barrier
on all CPUs. IOW, by the time RCU actually calls delayed_put_task_struct(),
every CPU must see all memory changes which were done before call_rcu()
was called. And OTOH, all rcu-read-lock critical sections which could miss
PF_EXITING should already be finished.

Oleg.

2014-10-17 21:40:12

by Oleg Nesterov

Subject: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on an RCU
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(); in this case we can rely on the fact that
delayed_put_task_struct() can not drop the (potentially) last
reference until rcu_read_unlock().

And as Kirill pointed out, the task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr), and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without an RCU grace period.

Reported-by: Kirill Tkhai <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0090e8c..52049b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (is_idle_task(cur) || (cur->flags & PF_EXITING))
cur = NULL;

/*
--
1.5.5.1

2014-10-18 08:15:14

by Kirill Tkhai

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

18.10.2014, 01:40, "Oleg Nesterov" <[email protected]>:
> The lockless get_task_struct(tsk) is only safe if tsk == current
> and didn't pass exit_notify(), or if this tsk was found on a rcu
> protected list (say, for_each_process() or find_task_by_vpid()).
> IOW, it is only safe if release_task() was not called before we
> take rcu_read_lock(), in this case we can rely on the fact that
> delayed_put_pid() can not drop the (potentially) last reference
> until rcu_read_unlock().
>
> And as Kirill pointed out task_numa_compare()->task_numa_assign()
> path does get_task_struct(dst_rq->curr) and this is not safe. The
> task_struct itself can't go away, but rcu_read_lock() can't save
> us from the final put_task_struct() in finish_task_switch(); this
> reference goes away without rcu gp.
>
> Reported-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
>  kernel/sched/fair.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0090e8c..52049b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>
>  	rcu_read_lock();
>  	cur = ACCESS_ONCE(dst_rq->curr);
> -	if (cur->pid == 0) /* idle */
> +	/*
> +	 * No need to move the exiting task, and this ensures that ->curr
> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> +	 * is safe; note that rcu_read_lock() can't protect from the final
> +	 * put_task_struct() after the last schedule().
> +	 */
> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>  		cur = NULL;
>
>  	/*

Oleg, I've looked once again, and now it doesn't look good to me.
Where is the guarantee this memory hasn't been allocated again?
If it has, PF_EXITING is not a flag of the task we are interested
in; it may not even belong to a task anymore.

rcu_read_lock()                   ...                        ...
cur = ACCESS_ONCE(dst_rq->curr);  ...                        ...
<interrupt>                       rq->curr = next;           ...
<interrupt>                       context_switch()           ...
<interrupt>                         finish_task_switch()     ...
<interrupt>                           put_task_struct()      ...
<interrupt>                             __put_task_struct()  ...
<interrupt>                               kmem_cache_free()  ...
<interrupt>                       ...                        <allocated again>
<interrupt>                       ...                        memset(, 0, )
<interrupt>                       ...                        ...
if (cur->flags & PF_EXITING)      ...                        ...
    <no>                          ...                        ...
get_task_struct()                 ...                        ...
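The reuse concern in the diagram above can be demonstrated with a toy one-slot "slab cache": once an object goes back to the cache, the next allocation hands out the same memory, so a stale pointer silently observes the new object's fields. All names here (toy_task, TOY_PF_EXITING, toy_alloc/toy_free) are made up for the illustration; this is not kernel code.

```c
#include <assert.h>
#include <string.h>

#define TOY_PF_EXITING 0x4

struct toy_task { unsigned int flags; };

static struct toy_task slot;	/* the cache's single object */
static int slot_free = 1;

/* Hand out the slot, zeroed like a freshly constructed object. */
static struct toy_task *toy_alloc(void)
{
	assert(slot_free);		/* toy cache holds only one object */
	slot_free = 0;
	memset(&slot, 0, sizeof(slot));	/* fresh object: flags == 0 */
	return &slot;
}

/* Return the slot to the cache; the memory is now up for reuse. */
static void toy_free(struct toy_task *t)
{
	assert(t == &slot);
	slot_free = 1;
}
```

After a free-then-alloc cycle, the stale `cur` pointer compares equal to the new allocation, and a PF_EXITING-style flag set on the old object has vanished: the check reads the new occupant, which is exactly the hazard if the memory could be recycled under the reader.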

Kirill

2014-10-18 08:33:39

by Kirill Tkhai

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

18.10.2014, 12:15, "Kirill Tkhai" <[email protected]>:
> 18.10.2014, 01:40, "Oleg Nesterov" <[email protected]>:
>> The lockless get_task_struct(tsk) is only safe if tsk == current
>> and didn't pass exit_notify(), or if this tsk was found on a rcu
>> protected list (say, for_each_process() or find_task_by_vpid()).
>> IOW, it is only safe if release_task() was not called before we
>> take rcu_read_lock(), in this case we can rely on the fact that
>> delayed_put_pid() can not drop the (potentially) last reference
>> until rcu_read_unlock().
>>
>> ?And as Kirill pointed out task_numa_compare()->task_numa_assign()
>> ?path does get_task_struct(dst_rq->curr) and this is not safe. The
>> ?task_struct itself can't go away, but rcu_read_lock() can't save
>> ?us from the final put_task_struct() in finish_task_switch(); this
>> ?reference goes away without rcu gp.
>>
>> ?Reported-by: Kirill Tkhai <[email protected]>
>> ?Signed-off-by: Oleg Nesterov <[email protected]>
>> ?---
>>  kernel/sched/fair.c |    8 +++++++-
>>  1 files changed, 7 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0090e8c..52049b9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>
>>          rcu_read_lock();
>>          cur = ACCESS_ONCE(dst_rq->curr);
>> -	if (cur->pid == 0) /* idle */
>> +	/*
>> +	 * No need to move the exiting task, and this ensures that ->curr
>> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
>> +	 * is safe; note that rcu_read_lock() can't protect from the final
>> +	 * put_task_struct() after the last schedule().
>> +	 */
>> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>> 		cur = NULL;
>>
>>          /*
>
> Oleg, I've looked once again, and now it doesn't look good to me.
> Where is the guarantee this memory hasn't been allocated again?
> If it has, PF_EXITING is not a flag of the task we are interested
> in; it may not even belong to a task_struct anymore.
>
> rcu_read_lock()                  ...                      ...
> cur = ACCESS_ONCE(dst_rq->curr); ...                      ...
> <interrupt>                      rq->curr = next;         ...
> <interrupt>                        put_task_struct()      ...
> <interrupt>                          __put_task_struct()  ...
> <interrupt>                            kmem_cache_free()  ...
> <interrupt>                            ...                <allocated again>
> <interrupt>                            ...                memset(, 0, )
> <interrupt>                            ...                ...
> if (cur->flags & PF_EXITING)           ...                ...
>     <no>                               ...                ...
> get_task_struct()                      ...                ...

How about this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..d46427e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (is_idle_task(cur) || (cur->flags & PF_EXITING))
+ cur = NULL;
+ /*
+ * Check once again to be sure curr is still on dst_rq. Even if
+ * it points on a new task, which is using the memory of freed
+ * cur, it's OK, because we've locked RCU before
+ * delayed_put_task_struct() callback is called to put its struct.
+ */
+ if (cur != ACCESS_ONCE(dst_rq->curr))
cur = NULL;

/*

2014-10-18 09:16:50

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

And an smp_rmb() between the ifs, which pairs with the rq unlocking

> 18.10.2014, 12:15, "Kirill Tkhai" <[email protected]>:
>
>> 18.10.2014, 01:40, "Oleg Nesterov" <[email protected]>:
>>
>>> The lockless get_task_struct(tsk) is only safe if tsk == current
>>> and didn't pass exit_notify(), or if this tsk was found on a rcu
>>> protected list (say, for_each_process() or find_task_by_vpid()).
>>> IOW, it is only safe if release_task() was not called before we
>>> take rcu_read_lock(), in this case we can rely on the fact that
>>> delayed_put_pid() can not drop the (potentially) last reference
>>> until rcu_read_unlock().
>>>
>>> And as Kirill pointed out task_numa_compare()->task_numa_assign()
>>> path does get_task_struct(dst_rq->curr) and this is not safe. The
>>> task_struct itself can't go away, but rcu_read_lock() can't save
>>> us from the final put_task_struct() in finish_task_switch(); this
>>> reference goes away without rcu gp.
>>>
>>> Reported-by: Kirill Tkhai <[email protected]>
>>> Signed-off-by: Oleg Nesterov <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 8 +++++++-
>>> 1 files changed, 7 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 0090e8c..52049b9 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>>
>>> rcu_read_lock();
>>> cur = ACCESS_ONCE(dst_rq->curr);
>>> - if (cur->pid == 0) /* idle */
>>> + /*
>>> + * No need to move the exiting task, and this ensures that ->curr
>>> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
>>> + * is safe; note that rcu_read_lock() can't protect from the final
>>> + * put_task_struct() after the last schedule().
>>> + */
>>> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>> cur = NULL;
>>>
>>> /*
>>
>> Oleg, I've looked once again, and now it doesn't look good to me.
>> Where is the guarantee this memory hasn't been allocated again?
>> If it has, PF_EXITING is not a flag of the task we are interested
>> in; it may not even belong to a task_struct anymore.
>>
>> rcu_read_lock()                  ...                      ...
>> cur = ACCESS_ONCE(dst_rq->curr); ...                      ...
>> <interrupt>                      rq->curr = next;         ...
>> <interrupt>                        put_task_struct()      ...
>> <interrupt>                          __put_task_struct()  ...
>> <interrupt>                            kmem_cache_free()  ...
>> <interrupt>                            ...                <allocated again>
>> <interrupt>                            ...                memset(, 0, )
>> <interrupt>                            ...                ...
>> if (cur->flags & PF_EXITING)           ...                ...
>>     <no>                               ...                ...
>> get_task_struct()                      ...                ...
>
> How about this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b78280c..d46427e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + /*
> + * No need to move the exiting task, and this ensures that ->curr
> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> + * is safe; note that rcu_read_lock() can't protect from the final
> + * put_task_struct() after the last schedule().
> + */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> + cur = NULL;
> + /*
> + * Check once again to be sure curr is still on dst_rq. Even if
> + * it points on a new task, which is using the memory of freed
> + * cur, it's OK, because we've locked RCU before
> + * delayed_put_task_struct() callback is called to put its struct.
> + */
> + if (cur != ACCESS_ONCE(dst_rq->curr))
> cur = NULL;
>
> /*

2014-10-18 19:36:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Sat, Oct 18, 2014 at 12:33:27PM +0400, Kirill Tkhai wrote:
> How about this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b78280c..d46427e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + /*
> + * No need to move the exiting task, and this ensures that ->curr
> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> + * is safe; note that rcu_read_lock() can't protect from the final
> + * put_task_struct() after the last schedule().
> + */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> + cur = NULL;
> + /*
> + * Check once again to be sure curr is still on dst_rq. Even if
> + * it points on a new task, which is using the memory of freed
> + * cur, it's OK, because we've locked RCU before
> + * delayed_put_task_struct() callback is called to put its struct.
> + */
> + if (cur != ACCESS_ONCE(dst_rq->curr))
> cur = NULL;
>
> /*

So you worry about the refcount doing 0->1 ? In which case the above is
still wrong and we should be using atomic_inc_not_zero() in order to
acquire the reference count.

2014-10-18 21:00:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 10/18, Kirill Tkhai wrote:
>
> 18.10.2014, 01:40, "Oleg Nesterov" <[email protected]>:
> > ...
> > The
> > task_struct itself can't go away,
> > ...
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> >         rcu_read_lock();
> >         cur = ACCESS_ONCE(dst_rq->curr);
> > -	if (cur->pid == 0) /* idle */
> > +	/*
> > +	 * No need to move the exiting task, and this ensures that ->curr
> > +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > +	 * is safe; note that rcu_read_lock() can't protect from the final
> > +	 * put_task_struct() after the last schedule().
> > +	 */
> > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > 		cur = NULL;
> >
> >         /*
>
> Oleg, I've looked once again, and now it doesn't look good to me.

Ah. Thanks a lot Kirill for correcting me!

I was looking at this rcu_read_lock() and I didn't even try to think
what it can actually protect. Nothing.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + /*
> + * No need to move the exiting task, and this ensures that ->curr
> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> + * is safe; note that rcu_read_lock() can't protect from the final
> + * put_task_struct() after the last schedule().
> + */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> + cur = NULL;
> + /*
> + * Check once again to be sure curr is still on dst_rq. Even if
> + * it points on a new task, which is using the memory of freed
> + * cur, it's OK, because we've locked RCU before
> + * delayed_put_task_struct() callback is called to put its struct.
> + */
> + if (cur != ACCESS_ONCE(dst_rq->curr))

No, I don't think this can work. Let's look at the current code:

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
if (cur->pid == 0) /* idle */

And any dereference, even reading ->pid is not safe. This memory can be
freed, unmapped, reused, etc.

Looks like, task_numa_compare() needs to take dst_rq->lock and get the
refernce first.

Or, perhaps, we need to change the rules to ensure that any "task_struct *"
pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
avoid this if possible.

Hmm. I'll try to think more.

Thanks!

Oleg.

2014-10-18 21:22:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 10/18, Peter Zijlstra wrote:
>
> So you worry about the refcount doing 0->1 ? In which case the above is
> still wrong and we should be using atomic_inc_not_zero() in order to
> acquire the reference count.

It is actually worse, please see my reply to Kirill. We simply can't
dereference foreign_rq->curr lockless.

Again, task_struct is only protected by RCU if it was found on a RCU
protected list. rq->curr is not protected by rcu. Perhaps we have to
change this... but this will be a bit unfortunate.

Oleg.

2014-10-18 23:13:38

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

19.10.2014, 00:59, "Oleg Nesterov" <[email protected]>:
> On 10/18, Kirill Tkhai wrote:
>> 18.10.2014, 01:40, "Oleg Nesterov" <[email protected]>:
>>> ...
>>> The
>>> task_struct itself can't go away,
>>> ...
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>>
>>>         rcu_read_lock();
>>>         cur = ACCESS_ONCE(dst_rq->curr);
>>> -	if (cur->pid == 0) /* idle */
>>> +	/*
>>> +	 * No need to move the exiting task, and this ensures that ->curr
>>> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
>>> +	 * is safe; note that rcu_read_lock() can't protect from the final
>>> +	 * put_task_struct() after the last schedule().
>>> +	 */
>>> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>> 		cur = NULL;
>>>
>>>         /*
>> Oleg, I've looked once again, and now it doesn't look good to me.
> Ah. Thanks a lot Kirill for correcting me!
>
> I was looking at this rcu_read_lock() and I didn't even try to think
> what it can actually protect. Nothing.

<snip>

> No, I don't think this can work. Let's look at the current code:
>
>         rcu_read_lock();
>         cur = ACCESS_ONCE(dst_rq->curr);
>         if (cur->pid == 0) /* idle */
>
> And any dereference, even reading ->pid, is not safe. This memory can be
> freed, unmapped, reused, etc.
>
> Looks like task_numa_compare() needs to take dst_rq->lock and get the
> reference first.

Yeah, the idle detection is not safe. If we reorder the checks, almost all
problems go away; all except unmapping. JFI, is that even possible for
kernel structures like task_struct? I.e., do the memory caches ever use
highmem for them?
Thanks!

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..114ec33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (cur->flags & PF_EXITING)
+ cur = NULL;
+ smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */
+ /*
+ * Check once again to be sure curr is still on dst_rq. Three situations
+ * are possible here:
+ * 1)cur has gone and been freed, and dst_rq->curr is pointing on other
+ * memory. In this case the check will fail;
+ * 2)cur is pointing to a new task, which is using the memory of just
+ * freed cur (and it is new dst_rq->curr). It's OK, because we've
+ * locked RCU before the new task has been even created
+ * (so delayed_put_task_struct() hasn't been called);
+ * 3)we've taken not exiting task (likely case). No need to worry.
+ */
+ if (cur != ACCESS_ONCE(dst_rq->curr))
+ cur = NULL;
+
+ if (is_idle_task(cur))
cur = NULL;

/*


> ?Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> ?pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> ?avoid this if possible.

RT tree has:

https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/tree/patches/sched-delay-put-task.patch

But a different problem was being solved there...

> ?Hmm. I'll try to think more.
>
> ?Thanks!

Kirill

2014-10-19 08:21:15

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 18.10.2014 23:36, Peter Zijlstra wrote:
> On Sat, Oct 18, 2014 at 12:33:27PM +0400, Kirill Tkhai wrote:
>> How about this?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b78280c..d46427e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>>
>> rcu_read_lock();
>> cur = ACCESS_ONCE(dst_rq->curr);
>> - if (cur->pid == 0) /* idle */
>> + /*
>> + * No need to move the exiting task, and this ensures that ->curr
>> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
>> + * is safe; note that rcu_read_lock() can't protect from the final
>> + * put_task_struct() after the last schedule().
>> + */
>> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>> + cur = NULL;
>> + /*
>> + * Check once again to be sure curr is still on dst_rq. Even if
>> + * it points on a new task, which is using the memory of freed
>> + * cur, it's OK, because we've locked RCU before
>> + * delayed_put_task_struct() callback is called to put its struct.
>> + */
>> + if (cur != ACCESS_ONCE(dst_rq->curr))
>> cur = NULL;
>>
>> /*
>
> So you worry about the refcount doing 0->1 ? In which case the above is
> still wrong and we should be using atomic_inc_not_zero() in order to
> acquire the reference count.
>

We can't use atomic_inc_not_zero(). The problem is that cur may be pointing
at memory which is not even a task_struct anymore. No guarantees at all.

2014-10-19 19:28:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 10/19, Kirill Tkhai wrote:
>
> 19.10.2014, 00:59, "Oleg Nesterov" <[email protected]>:
>
> > No, I don't think this can work. Let's look at the current code:
> >
> >         rcu_read_lock();
> >         cur = ACCESS_ONCE(dst_rq->curr);
> >         if (cur->pid == 0) /* idle */
> >
> > And any dereference, even reading ->pid, is not safe. This memory can be
> > freed, unmapped, reused, etc.
> >
> > Looks like task_numa_compare() needs to take dst_rq->lock and get the
> > reference first.
>
> Yeah, the idle detection is not safe. If we reorder the checks, almost all
> problems go away; all except unmapping. JFI, is that even possible for
> kernel structures like task_struct?

Yes, if DEBUG_PAGEALLOC. See kernel_map_pages() in arch/x86/mm/pageattr.c
kernel_map_pages(enable => false) clears PAGE_PRESENT if slab returns the
pages to system.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
>
> rcu_read_lock();
> cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + /*
> + * No need to move the exiting task, and this ensures that ->curr
> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> + * is safe; note that rcu_read_lock() can't protect from the final
> + * put_task_struct() after the last schedule().
> + */
> + if (cur->flags & PF_EXITING)
> + cur = NULL;

so this needs probe_kernel_read(&cur->flags).

> + if (cur != ACCESS_ONCE(dst_rq->curr))
> + cur = NULL;

Yes, if this task_struct was freed in between we do not care if this memory
was reused (except PF_EXITING can be false positive). If it was freed and
now the same memory is ->curr again we know that delayed_put_task_struct()
can't be called until we drop rcu lock, even if PF_EXITING is already set
again.

I won't argue, but you need to convince Peter to accept this hack ;)

> > Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> > pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> > avoid this if possible.
>
> RT tree has:
>
> https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> tree/patches/sched-delay-put-task.patch

Yes, and this obviously implies more rcu callbacks in flight, and another
> gp before __put_task_struct(). But maybe we will need to do this anyway...

Oleg.

2014-10-19 19:41:19

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 10/19, Oleg Nesterov wrote:
>
> On 10/19, Kirill Tkhai wrote:
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> > rcu_read_lock();
> > cur = ACCESS_ONCE(dst_rq->curr);
> > - if (cur->pid == 0) /* idle */
> > + /*
> > + * No need to move the exiting task, and this ensures that ->curr
> > + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > + * is safe; note that rcu_read_lock() can't protect from the final
> > + * put_task_struct() after the last schedule().
> > + */
> > + if (cur->flags & PF_EXITING)
> > + cur = NULL;
>
> so this needs probe_kernel_read(&cur->flags).
>
> > + if (cur != ACCESS_ONCE(dst_rq->curr))
> > + cur = NULL;
>
> Yes, if this task_struct was freed in between we do not care if this memory
> was reused (except PF_EXITING can be false positive). If it was freed and
> now the same memory is ->curr again we know that delayed_put_task_struct()
> can't be called until we drop rcu lock, even if PF_EXITING is already set
> again.
>
> I won't argue, but you need to convince Peter to accept this hack ;)
>
> > > Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> > > pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> > > avoid this if possible.
> >
> > RT tree has:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > tree/patches/sched-delay-put-task.patch
>
> Yes, and this obviously implies more rcu callbacks in flight, and another
> gp before __put_task_struct(). but may be we will need to do this anyway...

Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> in this case ->curr (or any other "task_struct *" pointer) can not go away
under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
but we do not need to recheck ->curr or probe_kernel_read().

Oleg.

2014-10-19 19:46:49

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On 10/19, Oleg Nesterov wrote:
>
> Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> > in this case ->curr (or any other "task_struct *" pointer) can not go away
> under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> but we do not need to recheck ->curr or probe_kernel_read().

Damn, please ignore ;) we still need to recheck ->curr.

Oleg.

2014-10-19 21:38:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Sun, Oct 19, 2014 at 03:13:31AM +0400, Kirill Tkhai wrote:

I'm too tired for all this, but:

> + smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */

RELEASE does not imply a WMB.

2014-10-20 08:56:41

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Sun, 19/10/2014 at 23:38 +0200, Peter Zijlstra wrote:
> On Sun, Oct 19, 2014 at 03:13:31AM +0400, Kirill Tkhai wrote:
>
> I'm too tired for all this, but:
>
> > + smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */
>
> RELEASE does not imply a WMB.

Thanks, please see, I've sent new version.

2014-10-20 09:00:24

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Sun, 19/10/2014 at 21:24 +0200, Oleg Nesterov wrote:
> On 10/19, Kirill Tkhai wrote:
> >
> > 19.10.2014, 00:59, "Oleg Nesterov" <[email protected]>:
> >
> > > No, I don't think this can work. Let's look at the current code:
> > >
> > > rcu_read_lock();
> > > cur = ACCESS_ONCE(dst_rq->curr);
> > > if (cur->pid == 0) /* idle */
> > >
> > > And any dereference, even reading ->pid is not safe. This memory can be
> > > freed, unmapped, reused, etc.
> > >
> > > Looks like, task_numa_compare() needs to take dst_rq->lock and get the
> > > refernce first.
> >
> > Yeah, the idle detection is not safe. If we reorder the checks, almost all
> > problems go away; all except unmapping. JFI, is that even possible for
> > kernel structures like task_struct?
>
> Yes, if DEBUG_PAGEALLOC. See kernel_map_pages() in arch/x86/mm/pageattr.c
> kernel_map_pages(enable => false) clears PAGE_PRESENT if slab returns the
> pages to system.

Thanks, Oleg!

>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> > rcu_read_lock();
> > cur = ACCESS_ONCE(dst_rq->curr);
> > - if (cur->pid == 0) /* idle */
> > + /*
> > + * No need to move the exiting task, and this ensures that ->curr
> > + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > + * is safe; note that rcu_read_lock() can't protect from the final
> > + * put_task_struct() after the last schedule().
> > + */
> > + if (cur->flags & PF_EXITING)
> > + cur = NULL;
>
> so this needs probe_kernel_read(&cur->flags).
>
> > + if (cur != ACCESS_ONCE(dst_rq->curr))
> > + cur = NULL;
>
> Yes, if this task_struct was freed in between we do not care if this memory
> was reused (except PF_EXITING can be false positive). If it was freed and
> now the same memory is ->curr again we know that delayed_put_task_struct()
> can't be called until we drop rcu lock, even if PF_EXITING is already set
> again.
>
> I won't argue, but you need to convince Peter to accept this hack ;)

Just sent a new version with all of you suggestions :) Thanks!

>
> > > Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> > > pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> > > avoid this if possible.
> >
> > RT tree has:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > tree/patches/sched-delay-put-task.patch
>
> Yes, and this obviously implies more rcu callbacks in flight, and another
> gp before __put_task_struct(). but may be we will need to do this anyway...

Kirill

2014-10-20 09:03:30

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Sun, 19/10/2014 at 21:43 +0200, Oleg Nesterov wrote:
> On 10/19, Oleg Nesterov wrote:
> >
> > Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> > in this case ->curr (or any other "task_struct *" pointer) can not go away
> > under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> > but we do not need to recheck ->curr or probe_kernel_read().
>
> Damn, please ignore ;) we still need to recheck ->curr.

Yeah, fixing this bug is like assembling a puzzle :)

2014-10-20 09:13:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()


OK, I think I'm finally awake enough to see what you're all talking
about :-)

On Sun, Oct 19, 2014 at 09:37:44PM +0200, Oleg Nesterov wrote:
> > > RT tree has:
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > > tree/patches/sched-delay-put-task.patch

(answering the other email asking about this)

RT does this because we call put_task_struct() with preempt disabled and
on RT the memory allocator has sleeping locks.

> > Yes, and this obviously implies more rcu callbacks in flight, and another
> > gp before __put_task_struct(). but may be we will need to do this anyway...
>
> Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> in this case ->curr (or any other "task_struct *" pointer) can not go away
> under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> but we do not need to recheck ->curr or probe_kernel_read().

I think I would prefer SLAB_DESTROY_BY_RCU for this, because as you
pointed out, I'm not sure mainline would like the extra callbacks.

2014-10-20 10:36:19

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

On Mon, 20/10/2014 at 11:13 +0200, Peter Zijlstra wrote:
> OK, I think I'm finally awake enough to see what you're all talking
> about :-)
>
> On Sun, Oct 19, 2014 at 09:37:44PM +0200, Oleg Nesterov wrote:
> > > > RT tree has:
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > > > tree/patches/sched-delay-put-task.patch
>
> (answering the other email asking about this)
>
> RT does this because we call put_task_struct() with preempt disabled and
> on RT the memory allocator has sleeping locks.

Now it's clear to me. I thought it was because task_struct freeing is slow.
Thanks!

> > > Yes, and this obviously implies more rcu callbacks in flight, and another
> > > gp before __put_task_struct(). but may be we will need to do this anyway...
> >
> > Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> > in this case ->curr (or any other "task_struct *" pointer) can not go away
> > under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> > but we do not need to recheck ->curr or probe_kernel_read().
>
> I think I would prefer SLAB_DESTROY_BY_RCU for this, because as you
> pointed out, I'm not sure mainline would like the extra callbacks.

I've sent one more patch with this:

"[PATCH v3] sched/numa: fix unsafe get_task_struct() in
task_numa_assign()"

Kirill