LinuxLists.cc - [PATCH RFC] sched: Revert delayed_put_task

2014-10-15 12:31:47

Subject: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

This WARN_ON_ONCE() placed into __schedule() triggers warning:

@@ -2852,6 +2852,7 @@ static void __sched __schedule(void)

if (likely(prev != next)) {
rq->nr_switches++;
+ WARN_ON_ONCE(atomic_read(&prev->usage) == 1);
rq->curr = next;
++*switch_count;

WARNING: CPU: 2 PID: 6497 at kernel/sched/core.c:2855 __schedule+0x656/0x8a0()
Modules linked in:
CPU: 2 PID: 6497 Comm: cat Not tainted 3.17.0+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000009 ffff88022f50bdd8 ffffffff81518c78 0000000000000004
0000000000000000 ffff88022f50be18 ffffffff8104b1ac ffff88022f50be18
ffff880239912b40 ffff88022e5720d0 0000000000000002 0000000000000000
Call Trace:
[<ffffffff81518c78>] dump_stack+0x4f/0x7c
[<ffffffff8104b1ac>] warn_slowpath_common+0x7c/0xa0
[<ffffffff8104b275>] warn_slowpath_null+0x15/0x20
[<ffffffff8151bad6>] __schedule+0x656/0x8a0
[<ffffffff8151bd44>] schedule+0x24/0x70
[<ffffffff8104c7aa>] do_exit+0x72a/0xb40
[<ffffffff81071b31>] ? get_parent_ip+0x11/0x50
[<ffffffff8104da6a>] do_group_exit+0x3a/0xa0
[<ffffffff8104dadf>] SyS_exit_group+0xf/0x10
[<ffffffff8151fe92>] system_call_fastpath+0x12/0x17
---[ end trace d07155396c4faa0c ]---

This means the final put_task_struct() happens against RCU rules.
Regarding to scheduler this may be a reason of use-after-free.

task_numa_compare() schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...

If other subsystems have a similar link to task, then the problem is also possible
there.

Delayed put_task_struct() was introduced in 8c7904a00b06:
"task: RCU protect task->usage" at "Fri Mar 31 02:31:37 2006".

It looks like it was safe to use that way, but now it's not. Something has changed
(preempt RCU?). Welcome to the analysis!

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/sched.h | 3 ++-
kernel/exit.c | 8 ++++----
kernel/fork.c | 1 -
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..6bfc041 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)

extern void __put_task_struct(struct task_struct *t);
+extern void __put_task_struct_cb(struct rcu_head *rhp);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ call_rcu(&t->rcu, __put_task_struct_cb);
}

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..326eae7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
}
}

-static void delayed_put_task_struct(struct rcu_head *rhp)
+void __put_task_struct_cb(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

perf_event_delayed_put(tsk);
trace_sched_process_free(tsk);
- put_task_struct(tsk);
+ __put_task_struct(tsk);
}
-
+EXPORT_SYMBOL_GPL(__put_task_struct_cb);

void release_task(struct task_struct *p)
{
@@ -207,7 +207,7 @@ void release_task(struct task_struct *p)

write_unlock_irq(&tasklist_lock);
release_thread(p);
- call_rcu(&p->rcu, delayed_put_task_struct);
+ put_task_struct(p);

p = leader;
if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..4d3ac3c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
}
-EXPORT_SYMBOL_GPL(__put_task_struct);

void __init __weak arch_task_cache_init(void) { }

2014-10-15 15:10:13

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/15, Kirill Tkhai wrote:
>
> This WARN_ON_ONCE() placed into __schedule() triggers warning:
>
> @@ -2852,6 +2852,7 @@ static void __sched __schedule(void)
>
> if (likely(prev != next)) {
> rq->nr_switches++;
> + WARN_ON_ONCE(atomic_read(&prev->usage) == 1);

I think you know this, but let me clarify just in case that this WARN()
is wrong, prev->usage == 1 is fine if the task does its last schedule()
and it was already (auto)reaped.

> This means the final put_task_struct() happens against RCU rules.

Well, yes, it doesn't use delayed_put_pid(). But this should be fine,
this drops the extra reference created by dup_task_struct().

However,

> Regarding to scheduler this may be a reason of use-after-free.
>
> task_numa_compare() schedule()
> rcu_read_lock() ...
> cur = ACCESS_ONCE(dst_rq->curr) ...
> ... rq->curr = next;
> ... context_switch()
> ... finish_task_switch()
> ... put_task_struct()
> ... __put_task_struct()
> ... free_task_struct()
> task_numa_assign() ...
> get_task_struct() ...

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.

At least the code like

rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();

is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
> #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
>
> extern void __put_task_struct(struct task_struct *t);
> +extern void __put_task_struct_cb(struct rcu_head *rhp);
>
> static inline void put_task_struct(struct task_struct *t)
> {
> if (atomic_dec_and_test(&t->usage))
> - __put_task_struct(t);
> + call_rcu(&t->rcu, __put_task_struct_cb);
> }
>
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 5d30019..326eae7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
> }
> }
>
> -static void delayed_put_task_struct(struct rcu_head *rhp)
> +void __put_task_struct_cb(struct rcu_head *rhp)
> {
> struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
>
> perf_event_delayed_put(tsk);
> trace_sched_process_free(tsk);
> - put_task_struct(tsk);
> + __put_task_struct(tsk);
> }
> -
> +EXPORT_SYMBOL_GPL(__put_task_struct_cb);
>
> void release_task(struct task_struct *p)
> {
> @@ -207,7 +207,7 @@ void release_task(struct task_struct *p)
>
> write_unlock_irq(&tasklist_lock);
> release_thread(p);
> - call_rcu(&p->rcu, delayed_put_task_struct);
> + put_task_struct(p);
>
> p = leader;
> if (unlikely(zap_leader))
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..4d3ac3c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
> if (!profile_handoff_task(tsk))
> free_task(tsk);
> }
> -EXPORT_SYMBOL_GPL(__put_task_struct);
>
> void __init __weak arch_task_cache_init(void) { }

Hmm. I am not sure I understand how this patch can actually fix this problem.
It seems that it is still possible that get_task_struct() can be called after
call_rcu(__put_task_struct_cb) ? But perhaps I misread this patch.

And I think it adds another problem. Suppose we have a zombie which already
called schedule() in TASK_DEAD state. IOW, its ->usage == 1, its parent will
free this task when it calls sys_wait().

With this patch the code like

rcu_read_lock();
for_each_process(p) {
if (pred(p) {
get_task_struct(p);
return p;
}
}
rcu_read_unlock();

becomes unsafe: we can race with release_task(p) and get_task_struct() can
can be called when prev->usage is already 0 and this task_struct can be freed
omce you drop rcu_read_lock().

Oleg.

2014-10-15 19:44:19

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

On 10/15, Oleg Nesterov wrote:
>
> On 10/15, Kirill Tkhai wrote:
> >
> > Regarding to scheduler this may be a reason of use-after-free.
> >
> > task_numa_compare() schedule()
> > rcu_read_lock() ...
> > cur = ACCESS_ONCE(dst_rq->curr) ...
> > ... rq->curr = next;
> > ... context_switch()
> > ... finish_task_switch()
> > ... put_task_struct()
> > ... __put_task_struct()
> > ... free_task_struct()
> > task_numa_assign() ...
> > get_task_struct() ...
>
> Agreed. I don't understand this code (will try to take another look later),
> but at first glance this looks wrong.
>
> At least the code like
>
> rcu_read_lock();
> get_task_struct(foreign_rq->curr);
> rcu_read_unlock();
>
> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Yes, but perhaps in this particular case another simple fix makes more
sense. The patch below needs a comment to explain that we check PF_EXITING
because:

1. It doesn't make sense to migrate the exiting task. Although perhaps
we could check ->mm == NULL instead.

But let me repeat that I do not understand this code, I am not sure
we can equally treat is_idle_task() and PF_EXITING here...

2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
won't be called until we drop rcu_read_lock(), and thus get_task_struct()
is safe.

And. it seems that there is another problem? Can't task_h_load(cur) race
with itself if 2 CPU's call task_numa_migrate() and inspect the same rq
in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
doesn't look "atomic". In fact I am not even sure about task_h_load(env->p),
p == current but we do not disable preemption.

What do you think?

Oleg.

--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur = NULL;

/*

2014-10-15 21:46:15

Subject: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Subject: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re:[PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

Subject: Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()