2013-04-12 19:29:56

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH] kernel: move exit_task_work() past exit_notify()

exit_task_work() must be called after exit_notify, because
exit_task_namespaces() may release a file and fput() enqueues a work.

exit_notify
exit_task_namespaces
free_ipc_ns
shm_destroy
fput
task_work_add

so if task works don't run after exit_notify(), a few files may leak.

Cc: Andrew Morton <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/exit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 60bc027..1d1129b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -795,7 +795,6 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
- exit_task_work(tsk);
check_stack_usage();
exit_thread();

@@ -822,6 +821,7 @@ void do_exit(long code)
ptrace_put_breakpoints(tsk);

exit_notify(tsk, group_dead);
+ exit_task_work(tsk);
#ifdef CONFIG_NUMA
task_lock(tsk);
mpol_put(tsk->mempolicy);
--
1.8.1.4


2013-04-13 14:25:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] kernel: move exit_task_work() past exit_notify()

On 04/12, Andrey Vagin wrote:
> exit_task_work() must be called after exit_notify, because
> exit_task_namespaces() may release a file and fput() enqueues a work.
>
> exit_notify
> exit_task_namespaces
> free_ipc_ns
> shm_destroy
> fput
> task_work_add
>
> so if task works don't run after exit_notify(), a few files may leak.
>
> Cc: Andrew Morton <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: David Howells <[email protected]>
> Signed-off-by: Andrey Vagin <[email protected]>
> ---
> kernel/exit.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 60bc027..1d1129b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -795,7 +795,6 @@ void do_exit(long code)
> exit_shm(tsk);
> exit_files(tsk);
> exit_fs(tsk);
> - exit_task_work(tsk);
> check_stack_usage();
> exit_thread();
>
> @@ -822,6 +821,7 @@ void do_exit(long code)
> ptrace_put_breakpoints(tsk);
>
> exit_notify(tsk, group_dead);
> + exit_task_work(tsk);

I am not comfortable with this change...

The task is "really dead" after exit_notify(), even release_task(current)
can be called.

Let me think a bit... It seems that we have the alternative.

Oleg.

2013-04-13 15:57:39

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 0/1] (Was: kernel: move exit_task_work() past exit_notify())

On 04/13, Oleg Nesterov wrote:
>
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -795,7 +795,6 @@ void do_exit(long code)
> > exit_shm(tsk);
> > exit_files(tsk);
> > exit_fs(tsk);
> > - exit_task_work(tsk);
> > check_stack_usage();
> > exit_thread();
> >
> > @@ -822,6 +821,7 @@ void do_exit(long code)
> > ptrace_put_breakpoints(tsk);
> >
> > exit_notify(tsk, group_dead);
> > + exit_task_work(tsk);
>
> I am not comfortable with this change...
>
> The task is "really dead" after exit_notify(), even release_task(current)
> can be called.
>
> Let me think a bit... It seems that we have the alternative.

Andrey, Eric, how about this patch?

COMPLETELY UNTESTED and I need to recheck, but perhaps you can review?

Oleg.

2013-04-13 15:58:01

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().

This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add(). And this allows us do simplify
exit_notify(), we can avoid unlock/lock(tasklist) and we can change
->exit_state instead of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
* jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
*/
forget_original_parent(tsk);
- exit_task_namespaces(tsk);

write_lock_irq(&tasklist_lock);
if (group_dead)
@@ -795,6 +794,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
+ exit_task_namespaces(tsk);
exit_task_work(tsk);
check_stack_usage();
exit_thread();

2013-04-14 01:53:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

Oleg Nesterov <[email protected]> writes:

> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
>
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
>
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
>
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().

It feels like this ought to work, certainly the pid namespace should not
need this, and the pid namespace was the motivating case for most of the
movement. However we haven't called exit_task_namespaces this early
since 2006.

Ugh. I goofed and used that field in scm.c. Sigh. I will push a patch
to rename that field nsproxy->childrens_pid_ns so it is harder to
make the mistake I just made.

None of the uses of nsproxy->net_ns look like they will be used on the
exit path.

The /proc/<pid>/ns/{uts,ipc,net,mnt,pid} files are fine as nsproxy
itself is what becomes NULL and they test for that. Well except the pid
file uses task_active_pid_ns.

nsproxy->ipc_ns is isolated to files under ipc so it is probably fine.

Likewise the nsproxy->uts_ns uses look like they will be fine.

Likewise the nsproxy->mnt_ns uses look like they will be fine.

So in a quick skim through the uses no problem cases stick out, nor
can I think of anything that would cause trouble. This looks like a
good patch.

Acked-by: "Eric W. Biederman" <[email protected]>

> Reported-by: Andrey Vagin <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
> * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
> */
> forget_original_parent(tsk);
> - exit_task_namespaces(tsk);
>
> write_lock_irq(&tasklist_lock);
> if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
> exit_shm(tsk);
> exit_files(tsk);
> exit_fs(tsk);
> + exit_task_namespaces(tsk);
> exit_task_work(tsk);
> check_stack_usage();
> exit_thread();

2013-04-15 09:57:28

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

2013/4/13 Oleg Nesterov <[email protected]>
>
> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
>
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
>
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
>
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().
>

It looks good for me. I have tested it a bit and don't find any problem.
Oleg, thank you.

Acked-by: Andrew Vagin <[email protected]>

> Reported-by: Andrey Vagin <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
> * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
> */
> forget_original_parent(tsk);
> - exit_task_namespaces(tsk);
>
> write_lock_irq(&tasklist_lock);
> if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
> exit_shm(tsk);
> exit_files(tsk);
> exit_fs(tsk);
> + exit_task_namespaces(tsk);
> exit_task_work(tsk);
> check_stack_usage();
> exit_thread();
>

2013-04-15 15:35:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

On 04/15, Andrey Wagin wrote:
>
> It looks good for me. I have tested it a bit and don't find any problem.
> Oleg, thank you.
>
> Acked-by: Andrew Vagin <[email protected]>

Thanks Andrey and Eric.

> > --- x/kernel/exit.c
> > +++ x/kernel/exit.c
> > @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
> > * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
> > */
> > forget_original_parent(tsk);
> > - exit_task_namespaces(tsk);
> >
> > write_lock_irq(&tasklist_lock);
> > if (group_dead)
> > @@ -795,6 +794,7 @@ void do_exit(long code)
> > exit_shm(tsk);
> > exit_files(tsk);
> > exit_fs(tsk);
> > + exit_task_namespaces(tsk);
> > exit_task_work(tsk);

I do not see any problems with this patch too... but still I am worried.

Even if fput() can work correctly after exit_task_namespaces(), this limits
the usage of task_work_add(). Probably this is fine, but can't we at least
discuss another change?

We can change fput() so that it can always work, even after exit_task_work(),

void fput(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
struct task_struct *task = current;
unsigned long flags;

file_sb_list_del(file);
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
init_task_work(&file->f_u.fu_rcuhead, ____fput);
if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
return;
}

spin_lock_irqsave(&delayed_fput_lock, flags);
list_add(&file->f_u.fu_list, &delayed_fput_list);
schedule_work(&delayed_fput_work);
spin_unlock_irqrestore(&delayed_fput_lock, flags);
}
}

Al, what do you think?

Untested patch below.

Oleg.

--- x/fs/file_table.c
+++ x/fs/file_table.c
@@ -306,17 +306,19 @@ void fput(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
struct task_struct *task = current;
+ unsigned long flags;
+
file_sb_list_del(file);
- if (unlikely(in_interrupt() || task->flags & PF_KTHREAD)) {
- unsigned long flags;
- spin_lock_irqsave(&delayed_fput_lock, flags);
- list_add(&file->f_u.fu_list, &delayed_fput_list);
- schedule_work(&delayed_fput_work);
- spin_unlock_irqrestore(&delayed_fput_lock, flags);
- return;
+ if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+ init_task_work(&file->f_u.fu_rcuhead, ____fput);
+ if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+ return;
}
- init_task_work(&file->f_u.fu_rcuhead, ____fput);
- task_work_add(task, &file->f_u.fu_rcuhead, true);
+
+ spin_lock_irqsave(&delayed_fput_lock, flags);
+ list_add(&file->f_u.fu_list, &delayed_fput_list);
+ schedule_work(&delayed_fput_work);
+ spin_unlock_irqrestore(&delayed_fput_lock, flags);
}
}

2013-06-13 09:26:58

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

On Sat, Apr 13, 2013 at 05:55:21PM +0200, Oleg Nesterov wrote:
> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
>
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
>
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
>
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().
>

It looks good for me. kmemleak doesn't report any leaks. CRIU test
cases, which use namespaces, work without any errors.

Thanks.

Acked-by: Andrey Vagin <[email protected]>

> Reported-by: Andrey Vagin <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
> * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
> */
> forget_original_parent(tsk);
> - exit_task_namespaces(tsk);
>
> write_lock_irq(&tasklist_lock);
> if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
> exit_shm(tsk);
> exit_files(tsk);
> exit_fs(tsk);
> + exit_task_namespaces(tsk);
> exit_task_work(tsk);
> check_stack_usage();
> exit_thread();
>

2013-06-14 13:23:58

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()

On 06/13, Andrew Vagin wrote:
>
> On Sat, Apr 13, 2013 at 05:55:21PM +0200, Oleg Nesterov wrote:
> > exit_notify() does exit_task_namespaces() after
> > forget_original_parent(). This was needed to ensure that ->nsproxy
> > can't be cleared prematurely, an exiting child we are going to
> > reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
> >
> > However, after 32084504 "pidns: use task_active_pid_ns in
> > do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> > on task_active_pid_ns().
> >
> > Move exit_task_namespaces() from exit_notify() to do_exit(), after
> > exit_fs() and before exit_task_work().
> >
> > This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> > does fput() which needs task_work_add(). And this allows us do simplify
> > exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> > ->exit_state instead of PF_EXITING in forget_original_parent().
> >
>
> It looks good for me. kmemleak doesn't report any leaks. CRIU test
> cases, which use namespaces, work without any errors.

OK, thanks.

I guess I need to resend this patch with your and Eric's acks.

But in fact I am going to send the 2nd change as well, it should
fix the same problem too and imho it makes sense anyway. And to
me the 2nd one looks like 3.10 material, it is much more simple
and straightforward.

Oleg.