2018-11-27 22:58:12

by Jürg Billeter

[permalink] [raw]
Subject: [PATCH 0/1] Add prctl to kill descendants on exit

This patch adds a new prctl to kill all descendant processes on exit.
See commit message for details of the prctl.

This is a replacement of PR_SET_PDEATHSIG_PROC I proposed last year [1].
In the following discussion, Oleg suggested this approach.

The motivation for this is to provide a lightweight mechanism to prevent
stray processes. There is also a related Bugzilla entry [2].

PID namespaces can also be used to prevent stray processes, of course.
However, they are not quite as lightweight as they typically also
require a new mount namespace to be able to mount a new /proc. And they
require CAP_SYS_ADMIN. User namespaces can help to gain CAP_SYS_ADMIN,
however, that further increases the overhead and the other effects of
the user namespace may not be desired.

PID 1 in PID namespaces also exhibits non-standard signal behavior
(SIGNAL_UNKILLABLE) [3].

[1] https://lkml.kernel.org/lkml/[email protected]/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=43300
[3] https://lkml.kernel.org/lkml/[email protected]/

Jürg Billeter (1):
prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

--
2.19.2



2018-11-27 22:56:41

by Jürg Billeter

[permalink] [raw]
Subject: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

This introduces a new thread group flag that can be set by calling

prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)

When a thread group exits with this flag set, it will send SIGKILL to
all descendant processes. This can be used to prevent stray child
processes.

This flag is cleared on privilege gaining execve(2) to ensure an
unprivileged process cannot get a privileged process to send SIGKILL.

Descendants that are orphaned and reparented to an ancestor of the
current process before the current process exits, will not be killed.
PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.

If a descendant gained privileges, the current process may not be
allowed to kill it, and the descendant process will survive.
PR_SET_NO_NEW_PRIVS can be used to prevent descendant processes from
gaining privileges.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Jürg Billeter <[email protected]>
---
fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 1ebf6e5a521d..f48ff4333393 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1339,6 +1339,12 @@ void setup_new_exec(struct linux_binprm * bprm)
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;

+ /*
+ * Do not send SIGKILL from privileged process as it may
+ * have been requested by an unprivileged process.
+ */
+ current->signal->kill_descendants_on_exit = 0;
+
/*
* For secureexec, reset the stack limit to sane default to
* avoid bad behavior from the prior rlimits. This has to
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1be35729c2c5..3bfb71701488 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -124,6 +124,9 @@ struct signal_struct {
unsigned int is_child_subreaper:1;
unsigned int has_child_subreaper:1;

+ /* Send SIGKILL to descendant processes on exit */
+ unsigned int kill_descendants_on_exit:1;
+
#ifdef CONFIG_POSIX_TIMERS

/* POSIX.1b Interval Timers */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0bf5b6..2ac4da1f282b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,10 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Send SIGKILL to descendant processes on exit */
+#define PR_SET_KILL_DESCENDANTS_ON_EXIT 48
+#define PR_GET_KILL_DESCENDANTS_ON_EXIT 49
+
/* arm64 Scalable Vector Extension controls */
/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
#define PR_SVE_SET_VL 50 /* set task vector length */
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..7fe0c694685a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -694,6 +694,15 @@ static void forget_original_parent(struct task_struct *father,
list_splice_tail_init(&father->children, &reaper->children);
}

+static int kill_descendant_visitor(struct task_struct *p, void *data)
+{
+ /* This may fail, e.g., when a descendant process gained privileges. */
+ group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, p, PIDTYPE_TGID);
+
+ /* Always continue walking the process tree. */
+ return 1;
+}
+
/*
* Send signals to all our closest relatives so that they know
* to properly mourn us..
@@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
struct task_struct *p, *n;
LIST_HEAD(dead);

+ if (group_dead && tsk->signal->kill_descendants_on_exit)
+ walk_process_tree(tsk, kill_descendant_visitor, NULL);
+
write_lock_irq(&tasklist_lock);
forget_original_parent(tsk, &dead);

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73046ec..8d9af81da093 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ me->signal->kill_descendants_on_exit = !!arg2;
+ break;
+ case PR_GET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->kill_descendants_on_exit,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 8b8b70620bbe..d0d8f88130fb 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -695,6 +695,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
+ current->signal->kill_descendants_on_exit = 0;

/* reset soft limits and set hard limits for the new label */
__aa_transition_rlimits(label, new_label);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ad9a9b8e9979..313a7be43a98 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2653,6 +2653,9 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;

+ /* Disable SIGKILL requested from before the SID transition. */
+ current->signal->kill_descendants_on_exit = 0;
+
/* Check whether the new SID can inherit resource limits from the old
* SID. If not, reset all soft limits to the lower of the current
* task's hard limit and the init task's soft limit.
--
2.19.2


2018-11-28 14:45:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On 11/27, J?rg Billeter wrote:
>
> @@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> struct task_struct *p, *n;
> LIST_HEAD(dead);
>
> + if (group_dead && tsk->signal->kill_descendants_on_exit)
> + walk_process_tree(tsk, kill_descendant_visitor, NULL);

Well, this is not exactly right, at least this is suboptimal in that
other sub-threads can too call walk_process_tree(kill_descendant_visitor)
later for no reason.



2018-11-28 15:25:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

Oleg Nesterov <[email protected]> writes:

> On 11/27, Jürg Billeter wrote:
>>
>> @@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>> struct task_struct *p, *n;
>> LIST_HEAD(dead);
>>
>> + if (group_dead && tsk->signal->kill_descendants_on_exit)
>> + walk_process_tree(tsk, kill_descendant_visitor, NULL);
>
> Well, this is not exactly right, at least this is suboptimal in that
> other sub-threads can too call walk_process_tree(kill_descendant_visitor)
> later for no reason.

Oleg I think I am missing something.

Reading kernel/exit.c I see "group_dead = atomic_dec_and_test(&tsk->signal->live)".
Which seems like enough to ensure exactly one task/thread calls walk_process_tree.

Can you explain what I am missing?

Eric


2018-11-29 12:35:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On 11/28, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > On 11/27, J?rg Billeter wrote:
> >>
> >> @@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >> struct task_struct *p, *n;
> >> LIST_HEAD(dead);
> >>
> >> + if (group_dead && tsk->signal->kill_descendants_on_exit)
> >> + walk_process_tree(tsk, kill_descendant_visitor, NULL);
> >
> > Well, this is not exactly right, at least this is suboptimal in that
> > other sub-threads can too call walk_process_tree(kill_descendant_visitor)
> > later for no reason.
>
> Oleg I think I am missing something.

No, it is stupid me who can't read,

> Reading kernel/exit.c I see "group_dead = atomic_dec_and_test(&tsk->signal->live)".
> Which seems like enough to ensure exactly one task/thread calls walk_process_tree.

Of course you right, sorry for confusion.

To me it would be more clean to call walk_process_tree(kill_descendant_visitor)
unconditionally in find_new_reaper() right before "if (has_child_subreaper)", but
then we will need to shift read_lock(tasklist) from walk_process_tree().

So I think the patch is mostly fine, the only problem I can see is that
PR_SET_KILL_DESCENDANTS_ON_EXIT can race with PR_SET_CHILD_SUBREAPER, they both
need to update the bits in the same word.

Oleg.


2018-11-29 15:42:59

by Jürg Billeter

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

Hi Oleg,

Thanks for the review.

On Thu, 2018-11-29 at 13:34 +0100, Oleg Nesterov wrote:
> To me it would be more clean to call walk_process_tree(kill_descendant_visitor)
> unconditionally in find_new_reaper() right before "if (has_child_subreaper)", but
> then we will need to shift read_lock(tasklist) from walk_process_tree().

Yes, that's the reason why I added the call before the tasklist lock.
Let me know if you want me to move the read lock from
walk_process_tree() to PR_SET_CHILD_SUBREAPER (the only caller)
instead.

> So I think the patch is mostly fine, the only problem I can see is that
> PR_SET_KILL_DESCENDANTS_ON_EXIT can race with PR_SET_CHILD_SUBREAPER, they both
> need to update the bits in the same word.

Good point. I'll make it a regular bool instead of a bitfield for v2,
unless you have another approach in mind to fix this.

Jürg


2018-11-30 08:01:15

by Jürg Billeter

[permalink] [raw]
Subject: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

This introduces a new thread group flag that can be set by calling

prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)

When a thread group exits with this flag set, it will send SIGKILL to
all descendant processes. This can be used to prevent stray child
processes.

This flag is cleared on privilege gaining execve(2) to ensure an
unprivileged process cannot get a privileged process to send SIGKILL.

Descendants that are orphaned and reparented to an ancestor of the
current process before the current process exits, will not be killed.
PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.

If a descendant gained privileges, the current process may not be
allowed to kill it, and the descendant process will survive.
PR_SET_NO_NEW_PRIVS can be used to prevent descendant processes from
gaining privileges.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Jürg Billeter <[email protected]>
---
fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 044e296f2381..1c9520d83d6b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1343,6 +1343,12 @@ void setup_new_exec(struct linux_binprm * bprm)
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;

+ /*
+ * Do not send SIGKILL from privileged process as it may
+ * have been requested by an unprivileged process.
+ */
+ current->signal->kill_descendants_on_exit = false;
+
/*
* For secureexec, reset the stack limit to sane default to
* avoid bad behavior from the prior rlimits. This has to
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0c3e396dca04..91ed7f480b60 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -124,6 +124,9 @@ struct signal_struct {
unsigned int is_child_subreaper:1;
unsigned int has_child_subreaper:1;

+ /* Send SIGKILL to descendant processes on exit */
+ bool kill_descendants_on_exit;
+
#ifdef CONFIG_POSIX_TIMERS

/* POSIX.1b Interval Timers */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b17201edfa09..a31141236064 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,10 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Send SIGKILL to descendant processes on exit */
+#define PR_SET_KILL_DESCENDANTS_ON_EXIT 48
+#define PR_GET_KILL_DESCENDANTS_ON_EXIT 49
+
/* arm64 Scalable Vector Extension controls */
/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
#define PR_SVE_SET_VL 50 /* set task vector length */
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..7fe0c694685a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -694,6 +694,15 @@ static void forget_original_parent(struct task_struct *father,
list_splice_tail_init(&father->children, &reaper->children);
}

+static int kill_descendant_visitor(struct task_struct *p, void *data)
+{
+ /* This may fail, e.g., when a descendant process gained privileges. */
+ group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, p, PIDTYPE_TGID);
+
+ /* Always continue walking the process tree. */
+ return 1;
+}
+
/*
* Send signals to all our closest relatives so that they know
* to properly mourn us..
@@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
struct task_struct *p, *n;
LIST_HEAD(dead);

+ if (group_dead && tsk->signal->kill_descendants_on_exit)
+ walk_process_tree(tsk, kill_descendant_visitor, NULL);
+
write_lock_irq(&tasklist_lock);
forget_original_parent(tsk, &dead);

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73046ec..8d9af81da093 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ me->signal->kill_descendants_on_exit = !!arg2;
+ break;
+ case PR_GET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->kill_descendants_on_exit,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 7027c7aee00d..5de086d7546c 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -798,6 +798,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
+ current->signal->kill_descendants_on_exit = false;

/* reset soft limits and set hard limits for the new label */
__aa_transition_rlimits(label, new_label);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 67f1344e728a..4c041212702c 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2657,6 +2657,9 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;

+ /* Disable SIGKILL requested from before the SID transition. */
+ current->signal->kill_descendants_on_exit = false;
+
/* Check whether the new SID can inherit resource limits from the old
* SID. If not, reset all soft limits to the lower of the current
* task's hard limit and the init task's soft limit.
--
2.19.2


2018-11-30 08:01:26

by Jürg Billeter

[permalink] [raw]
Subject: [PATCH v2 0/1] Add prctl to kill descendants on exit

This patch adds a new prctl to kill all descendant processes on exit.
See commit message for details of the prctl.

This is a replacement of PR_SET_PDEATHSIG_PROC I proposed last year [1].
In the following discussion, Oleg suggested this approach.

The motivation for this is to provide a lightweight mechanism to prevent
stray processes. There is also a related Bugzilla entry [2].

PID namespaces can also be used to prevent stray processes, of course.
However, they are not quite as lightweight as they typically also
require a new mount namespace to be able to mount a new /proc. And they
require CAP_SYS_ADMIN. User namespaces can help to gain CAP_SYS_ADMIN,
however, that further increases the overhead and the other effects of
the user namespace may not be desired.

PID 1 in PID namespaces also exhibits non-standard signal behavior
(SIGNAL_UNKILLABLE) [3].

Changes in v2:
- Use bool instead of bitfield to avoid race with
PR_SET_CHILD_SUBREAPER

[1] https://lkml.kernel.org/lkml/[email protected]/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=43300
[3] https://lkml.kernel.org/lkml/[email protected]/

Jürg Billeter (1):
prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

--
2.19.2


2018-11-30 10:34:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On 11/29, J?rg Billeter wrote:
>
> On Thu, 2018-11-29 at 13:34 +0100, Oleg Nesterov wrote:
> > To me it would be more clean to call walk_process_tree(kill_descendant_visitor)
> > unconditionally in find_new_reaper() right before "if (has_child_subreaper)", but
> > then we will need to shift read_lock(tasklist) from walk_process_tree().
>
> Yes, that's the reason why I added the call before the tasklist lock.
> Let me know if you want me to move the read lock from
> walk_process_tree() to PR_SET_CHILD_SUBREAPER (the only caller)
> instead.

I am fine either way. We can do this later, lets keep your patch simple.

> > So I think the patch is mostly fine, the only problem I can see is that
> > PR_SET_KILL_DESCENDANTS_ON_EXIT can race with PR_SET_CHILD_SUBREAPER, they both
> > need to update the bits in the same word.
>
> Good point. I'll make it a regular bool instead of a bitfield for v2,

Agreed,

> unless you have another approach in mind to fix this.

Well, I think that is_child_subreaper/has_child_subreaper and the new
kill_descendants_on_exit should live in signal->flags, but we need some
cleanups to make this possible, so I agree with the boolean.

Oleg.


2018-11-30 11:24:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On 11/30, J?rg Billeter wrote:
>
> This introduces a new thread group flag that can be set by calling
>
> prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
>
> When a thread group exits with this flag set, it will send SIGKILL to
> all descendant processes. This can be used to prevent stray child
> processes.

Reviewed-by: Oleg Nesterov <[email protected]>


2018-11-30 13:42:39

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

* Jürg Billeter:

> This introduces a new thread group flag that can be set by calling
>
> prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
>
> When a thread group exits with this flag set, it will send SIGKILL to
> all descendant processes. This can be used to prevent stray child
> processes.
>
> This flag is cleared on privilege gaining execve(2) to ensure an
> unprivileged process cannot get a privileged process to send SIGKILL.

So this is inherited across regular execve? I'm not sure that's a good
idea.

> Descendants that are orphaned and reparented to an ancestor of the
> current process before the current process exits, will not be killed.
> PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.

For double- or triple-forking daemons, the reparenting will be racy, if
I understand things correctly.

Thanks,
Florian

2018-12-01 04:31:45

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On Fri, Nov 30, 2018 at 2:33 AM Oleg Nesterov <[email protected]> wrote:
>
> On 11/29, Jürg Billeter wrote:
> >
> > On Thu, 2018-11-29 at 13:34 +0100, Oleg Nesterov wrote:
> > > So I think the patch is mostly fine, the only problem I can see is that
> > > PR_SET_KILL_DESCENDANTS_ON_EXIT can race with PR_SET_CHILD_SUBREAPER, they both
> > > need to update the bits in the same word.
> >
> > Good point. I'll make it a regular bool instead of a bitfield for v2,
>
> Agreed,

Is it worth doing something for singal_struct like we did for
task_struct with the TASK_PFA_* and atomic_flags?

--
Kees Cook

2018-12-01 10:41:52

by Jürg Billeter

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On Fri, 2018-11-30 at 14:40 +0100, Florian Weimer wrote:
> * Jürg Billeter:
>
> > This introduces a new thread group flag that can be set by calling
> >
> > prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
> >
> > When a thread group exits with this flag set, it will send SIGKILL to
> > all descendant processes. This can be used to prevent stray child
> > processes.
> >
> > This flag is cleared on privilege gaining execve(2) to ensure an
> > unprivileged process cannot get a privileged process to send SIGKILL.
>
> So this is inherited across regular execve? I'm not sure that's a good
> idea.

Yes, this matches PR_SET_CHILD_SUBREAPER (and other process
attributes). Besides consistency and allowing a parent to configure the
flag for a spawned process, this is also needed to prevent a process
from clearing the flag (in combination with a seccomp filter).

>
> > Descendants that are orphaned and reparented to an ancestor of the
> > current process before the current process exits, will not be killed.
> > PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.
>
> For double- or triple-forking daemons, the reparenting will be racy, if
> I understand things correctly.

Can you please elaborate, if you're concerned about a particular race?
As the commit message mentions, for containment this flag can be
combined with PR_SET_CHILD_SUBREAPER (and PR_SET_NO_NEW_PRIVS).

Jürg


2018-12-01 12:34:46

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

* Jürg Billeter:

> On Fri, 2018-11-30 at 14:40 +0100, Florian Weimer wrote:
>> * Jürg Billeter:
>>
>> > This introduces a new thread group flag that can be set by calling
>> >
>> > prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
>> >
>> > When a thread group exits with this flag set, it will send SIGKILL to
>> > all descendant processes. This can be used to prevent stray child
>> > processes.
>> >
>> > This flag is cleared on privilege gaining execve(2) to ensure an
>> > unprivileged process cannot get a privileged process to send SIGKILL.
>>
>> So this is inherited across regular execve? I'm not sure that's a good
>> idea.
>
> Yes, this matches PR_SET_CHILD_SUBREAPER (and other process
> attributes). Besides consistency and allowing a parent to configure the
> flag for a spawned process, this is also needed to prevent a process
> from clearing the flag (in combination with a seccomp filter).

I think the semantics of PR_SET_CHILD_SUBREAPER are different, and the
behavior makes more sense there.

>> > Descendants that are orphaned and reparented to an ancestor of the
>> > current process before the current process exits, will not be killed.
>> > PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.
>>
>> For double- or triple-forking daemons, the reparenting will be racy, if
>> I understand things correctly.
>
> Can you please elaborate, if you're concerned about a particular race?
> As the commit message mentions, for containment this flag can be
> combined with PR_SET_CHILD_SUBREAPER (and PR_SET_NO_NEW_PRIVS).

Without PR_SET_CHILD_SUBREAPER, if a newly execve'ed daemon performs
double/triple forking to disentangle itself from the parent process
session, and the parent process which set
PR_SET_KILL_DESCENDANTS_ON_EXIT terminates, behavior depends on when
exactly the parent process terminates. The daemon process will leak if
it has completed its reparenting.

I think this could be sufficiently common that solution is needed here.

Thanks,
Florian

2018-12-01 13:59:02

by Jürg Billeter

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On Sat, 2018-12-01 at 13:28 +0100, Florian Weimer wrote:
> * Jürg Billeter:
>
> > On Fri, 2018-11-30 at 14:40 +0100, Florian Weimer wrote:
> > > * Jürg Billeter:
> > >
> > > > This introduces a new thread group flag that can be set by calling
> > > >
> > > > prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
> > > >
> > > > When a thread group exits with this flag set, it will send SIGKILL to
> > > > all descendant processes. This can be used to prevent stray child
> > > > processes.
> > > >
> > > > This flag is cleared on privilege gaining execve(2) to ensure an
> > > > unprivileged process cannot get a privileged process to send SIGKILL.
> > >
> > > So this is inherited across regular execve? I'm not sure that's a good
> > > idea.
> >
> > Yes, this matches PR_SET_CHILD_SUBREAPER (and other process
> > attributes). Besides consistency and allowing a parent to configure the
> > flag for a spawned process, this is also needed to prevent a process
> > from clearing the flag (in combination with a seccomp filter).
>
> I think the semantics of PR_SET_CHILD_SUBREAPER are different, and the
> behavior makes more sense there.

In my opinion, introducing inconsistency by deviating from the common
behavior of retaining process attributes across execve would be more
confusing/surprising to users. I don't see why it makes sense for
PR_SET_CHILD_SUBREAPER but not for PR_SET_KILL_DESCENDANTS_ON_EXIT.

Also, the main motivation is to provide a subset of PID namespace
features to unprivileged processes with a lightweight mechanism.
Retaining kill_descendants_on_exit across execve allows very similar
usage to PID namespaces: E.g., the parent can set
PR_SET_KILL_DESCENDANTS_ON_EXIT and PR_SET_CHILD_SUBREAPER in the child
before execve and the spawned init-like executable doesn't need to know
about this flag itself, i.e., the same init-like program can function
as a leader of a PID namespace or as a subreaper with this extra flag
set without code changes.

If the flag was cleared by execve, the program would need to know about
this flag and it would be impossible for the parent to lock this down
using seccomp.

>
> > > > Descendants that are orphaned and reparented to an ancestor of the
> > > > current process before the current process exits, will not be killed.
> > > > PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.
> > >
> > > For double- or triple-forking daemons, the reparenting will be racy, if
> > > I understand things correctly.
> >
> > Can you please elaborate, if you're concerned about a particular race?
> > As the commit message mentions, for containment this flag can be
> > combined with PR_SET_CHILD_SUBREAPER (and PR_SET_NO_NEW_PRIVS).
>
> Without PR_SET_CHILD_SUBREAPER, if a newly execve'ed daemon performs
> double/triple forking to disentangle itself from the parent process
> session, and the parent process which set
> PR_SET_KILL_DESCENDANTS_ON_EXIT terminates, behavior depends on when
> exactly the parent process terminates. The daemon process will leak if
> it has completed its reparenting.
>
> I think this could be sufficiently common that solution is needed here.

I expect the common case to be that PR_SET_KILL_DESCENDANTS_ON_EXIT
will be used together with PR_SET_CHILD_SUBREAPER (and possibly
PR_SET_NO_NEW_PRIVS) to prevent stray children. And I don't see a race
condition in that case.

PR_SET_KILL_DESCENDANTS_ON_EXIT can be used for non-subreapers but I
expect this to be used in more specialized scenarios where the program
is designed/known to avoid such race conditions. We could theoretically
restrict PR_SET_KILL_DESCENDANTS_ON_EXIT to subreapers but I currently
don't see a strong enough reason for this.

Jürg


2018-12-06 15:55:37

by Jürg Billeter

[permalink] [raw]
Subject: Re: [PATCH v2 0/1] Add prctl to kill descendants on exit

On Fri, 2018-11-30 at 08:00 +0000, Jürg Billeter wrote:
> This patch adds a new prctl to kill all descendant processes on exit.
> See commit message for details of the prctl.
>
> This is a replacement of PR_SET_PDEATHSIG_PROC I proposed last year [1].
> In the following discussion, Oleg suggested this approach.
>
> The motivation for this is to provide a lightweight mechanism to prevent
> stray processes. There is also a related Bugzilla entry [2].

Andrew, Eric, does this look good to you as well?

Jürg


2019-01-18 13:21:00

by Jürg Billeter

[permalink] [raw]
Subject: [RESEND PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

This introduces a new thread group flag that can be set by calling

prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)

When a thread group exits with this flag set, it will send SIGKILL to
all descendant processes. This can be used to prevent stray child
processes.

This flag is cleared on privilege gaining execve(2) to ensure an
unprivileged process cannot get a privileged process to send SIGKILL.

Descendants that are orphaned and reparented to an ancestor of the
current process before the current process exits, will not be killed.
PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.

If a descendant gained privileges, the current process may not be
allowed to kill it, and the descendant process will survive.
PR_SET_NO_NEW_PRIVS can be used to prevent descendant processes from
gaining privileges.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Jürg Billeter <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
---
fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index fb72d36f7823..bbb5a0718223 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1342,6 +1342,12 @@ void setup_new_exec(struct linux_binprm * bprm)
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;

+ /*
+ * Do not send SIGKILL from privileged process as it may
+ * have been requested by an unprivileged process.
+ */
+ current->signal->kill_descendants_on_exit = false;
+
/*
* For secureexec, reset the stack limit to sane default to
* avoid bad behavior from the prior rlimits. This has to
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d10a50e..2acf481951f6 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -124,6 +124,9 @@ struct signal_struct {
unsigned int is_child_subreaper:1;
unsigned int has_child_subreaper:1;

+ /* Send SIGKILL to descendant processes on exit */
+ bool kill_descendants_on_exit;
+
#ifdef CONFIG_POSIX_TIMERS

/* POSIX.1b Interval Timers */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b4875a93363a..d5483ca63c2d 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,10 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Send SIGKILL to descendant processes on exit */
+#define PR_SET_KILL_DESCENDANTS_ON_EXIT 48
+#define PR_GET_KILL_DESCENDANTS_ON_EXIT 49
+
/* arm64 Scalable Vector Extension controls */
/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
#define PR_SVE_SET_VL 50 /* set task vector length */
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d14979577ee..93a812c1b670 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -694,6 +694,15 @@ static void forget_original_parent(struct task_struct *father,
list_splice_tail_init(&father->children, &reaper->children);
}

+static int kill_descendant_visitor(struct task_struct *p, void *data)
+{
+ /* This may fail, e.g., when a descendant process gained privileges. */
+ group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, p, PIDTYPE_TGID);
+
+ /* Always continue walking the process tree. */
+ return 1;
+}
+
/*
* Send signals to all our closest relatives so that they know
* to properly mourn us..
@@ -704,6 +713,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
struct task_struct *p, *n;
LIST_HEAD(dead);

+ if (group_dead && tsk->signal->kill_descendants_on_exit)
+ walk_process_tree(tsk, kill_descendant_visitor, NULL);
+
write_lock_irq(&tasklist_lock);
forget_original_parent(tsk, &dead);

diff --git a/kernel/sys.c b/kernel/sys.c
index f7eb62eceb24..f6dba0ba9b77 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2485,6 +2485,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+ case PR_SET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ me->signal->kill_descendants_on_exit = !!arg2;
+ break;
+ case PR_GET_KILL_DESCENDANTS_ON_EXIT:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->kill_descendants_on_exit,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 2c010874329f..b09c6e0a25ff 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -698,6 +698,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
+ current->signal->kill_descendants_on_exit = false;

/* reset soft limits and set hard limits for the new label */
__aa_transition_rlimits(label, new_label);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f0e36c3492ba..3bac775634c8 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2510,6 +2510,9 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;

+ /* Disable SIGKILL requested from before the SID transition. */
+ current->signal->kill_descendants_on_exit = false;
+
/* Check whether the new SID can inherit resource limits from the old
* SID. If not, reset all soft limits to the lower of the current
* task's hard limit and the init task's soft limit.
--
2.20.1


2019-01-18 13:22:01

by Jürg Billeter

[permalink] [raw]
Subject: [RESEND PATCH v2 0/1] Add prctl to kill descendants on exit

This patch adds a new prctl to kill all descendant processes on exit.
See commit message for details of the prctl.

This is a replacement of PR_SET_PDEATHSIG_PROC I proposed last year [1].
In the following discussion, Oleg suggested this approach.

The motivation for this is to provide a lightweight mechanism to prevent
stray processes. There is also a related Bugzilla entry [2].

PID namespaces can also be used to prevent stray processes, of course.
However, they are not quite as lightweight as they typically also
require a new mount namespace to be able to mount a new /proc. And they
require CAP_SYS_ADMIN. User namespaces can help to gain CAP_SYS_ADMIN,
however, that further increases the overhead and the other effects of
the user namespace may not be desired.

PID 1 in PID namespaces also exhibits non-standard signal behavior
(SIGNAL_UNKILLABLE) [3].

Changes in v2:
- Use bool instead of bitfield to avoid race with
PR_SET_CHILD_SUBREAPER

[1] https://lkml.kernel.org/lkml/[email protected]/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=43300
[3] https://lkml.kernel.org/lkml/[email protected]/

Jürg Billeter (1):
prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

fs/exec.c | 6 ++++++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/exit.c | 12 ++++++++++++
kernel/sys.c | 11 +++++++++++
security/apparmor/lsm.c | 1 +
security/selinux/hooks.c | 3 +++
7 files changed, 40 insertions(+)

--
2.20.1


2019-01-29 01:25:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [RESEND PATCH v2 1/1] prctl: add PR_{GET,SET}_KILL_DESCENDANTS_ON_EXIT

On Fri, 18 Jan 2019 14:11:30 +0100 J?rg Billeter <[email protected]> wrote:

> This introduces a new thread group flag that can be set by calling
>
> prctl(PR_SET_KILL_DESCENDANTS_ON_EXIT, 1, 0, 0, 0)
>
> When a thread group exits with this flag set, it will send SIGKILL to
> all descendant processes. This can be used to prevent stray child
> processes.
>
> This flag is cleared on privilege gaining execve(2) to ensure an
> unprivileged process cannot get a privileged process to send SIGKILL.
>
> Descendants that are orphaned and reparented to an ancestor of the
> current process before the current process exits, will not be killed.
> PR_SET_CHILD_SUBREAPER can be used to contain orphaned processes.
>
> If a descendant gained privileges, the current process may not be
> allowed to kill it, and the descendant process will survive.
> PR_SET_NO_NEW_PRIVS can be used to prevent descendant processes from
> gaining privileges.

I don't feel that I'm able to judge the usefulness of this. It would
help to have a lot more words right here in this changelog which
communicate the value of this change to our users. References are
useful, but please don't send people off to chase down mailing list and
bugzilla discussions as a substitute for properly describing the feature
and its justification.

Some test code in tools/testing/selftests/ would be helpful.

We'll need to update the prctl(2) manpage if we proceed with this.