2008-11-26 03:43:12

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 0/5] Container init signal semantics


Container-init must behave like global-init to processes within the
container and hence it must be immune to unhandled fatal signals from
within the container (i.e SIG_DFL signals that terminate the process).

But the same container-init must behave like a normal process to
processes in ancestor namespaces and so if it receives the same fatal
signal from a process in ancestor namespace, the signal must be
processed.

Further, since processes don't have a valid pid numbers in a descendant
pid namespaces, the siginfo->si_pid field must be set to 0.

Implementing these semantics requires that send_signal() determine pid
namespace of the sender but since signals can originate from workqueues/
interrupt-handlers, determining pid namespace of sender may not always
be possible or safe.

This patchset implements the design/simplified semantics suggested by
Oleg Nesterov. These semantics are:

- container-init must never be terminated by a signal from a
descendant process.

- container-init must never be immune to SIGKILL from an ancestor
namespace (so a process in parent namespace must always be able
to terminate a descendant container).

- container-init may be immune to unhandled fatal signals (like
SIGUSR1) even if they are from ancestor namespace (SIGKILL is
the only reliable signal from ancestor namespace).

Patches in this set:

[PATCH 1/5] pid: Implement ns_of_pid
[PATCH 2/5] pid: Generalize task_active_pid_ns
[PATCH 3/5] Determine if sender is from ancestor ns
[PATCH 4/5] Protect cinit from fatal signals
[PATCH 5/5] Clear si_pid for signal from ancestor ns

TODO:
- SIGSTOP and ptrace functionality to be reviewed/fixed.

- siginfo->si_pid may need to be cleared in a few more places
(eg; __do_notify(), F_SETSIG ?).

Limitations/side-effects of current design

- Container-init is immune to suicide - kill(getpid(), SIGKILL) is
ignored. Use exit() :-)

- rt_sigqueueinfo(): siginfo->si_pid value is unreliable/undefined
when rt_sigqueueinfo() is used to signal a process in a descendant
namespace

Signed-off-by: Sukadev Bhattiprolu <[email protected]>


2008-11-26 03:45:08

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 1/5] pid: Implement ns_of_pid

>From f3dd53544a8a7a1c43009f686cbadec561e5a489 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[email protected]>
Date: Mon, 10 Nov 2008 19:03:56 -0800
Subject: [PATCH 1/5] pid: Implement ns_of_pid

A current problem with the pid namespace is that it is
easy to do pid related work after exit_task_namespaces which
drops the nsproxy pointer.

However if we are doing pid namespace related work we are
always operating on some struct pid which retains the pid_namespace
pointer of the pid namespace it was allocated in.

So provide ns_of_pid which allows us to find the pid
namespace a pid was allocated in.

Using this we have the needed infrastructure to do pid
namespace related work at anytime we have a struct pid,
removing the chance of accidentally having a NULL
pointer dereference when accessing current->nsproxy.

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/pid.h | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index d7e98ff..e9aec85 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -122,6 +122,17 @@ int next_pidmap(struct pid_namespace *pid_ns, int last);
extern struct pid *alloc_pid(struct pid_namespace *ns);
extern void free_pid(struct pid *pid);

+/* ns_of_pid returns the pid namespace in which the specified
+ * pid was allocated.
+ */
+static inline struct pid_namespace *ns_of_pid(struct pid *pid)
+{
+ struct pid_namespace *ns = NULL;
+ if (pid)
+ ns = pid->numbers[pid->level].ns;
+ return ns;
+}
+
/*
* the helpers to get the pid's id seen from different namespaces
*
--
1.5.2.5

2008-11-26 03:45:51

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns


>From 7f7caaa9d9014d7230dc0b1e0f75536f0b6ccdbf Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <[email protected]>
Date: Mon, 10 Nov 2008 19:12:02 -0800
Subject: [PATCH 2/5] pid: Generalize task_active_pid_ns

Currently task_active_pid_ns is not safe to call after a
task becomes a zombie and exit_task_namespaces is called,
as nsproxy becomes NULL. By reading the pid namespace from
the pid of the task we can trivially solve this problem at
the cost of one extra memory read in what should be the
same cacheline as we read the namespace from.

When moving things around I have made task_active_pid_ns
out of line because keeping it in pid_namespace.h would
require adding includes of pid.h and sched.h that I
don't think we want.

This change does make task_active_pid_ns unsafe to call during
copy_process until we attach a pid on the task_struct which
seems to be a reasonable trade off.

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/pid_namespace.h | 6 +-----
kernel/fork.c | 4 ++--
kernel/pid.c | 6 ++++++
3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index d82fe82..38d1032 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -79,11 +79,7 @@ static inline void zap_pid_ns_processes(struct pid_namespace *ns)
}
#endif /* CONFIG_PID_NS */

-static inline struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
-{
- return tsk->nsproxy->pid_ns;
-}
-
+extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
void pidhash_init(void);
void pidmap_init(void);

diff --git a/kernel/fork.c b/kernel/fork.c
index f608356..28be39a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1111,12 +1111,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,

if (pid != &init_struct_pid) {
retval = -ENOMEM;
- pid = alloc_pid(task_active_pid_ns(p));
+ pid = alloc_pid(p->nsproxy->pid_ns);
if (!pid)
goto bad_fork_cleanup_io;

if (clone_flags & CLONE_NEWPID) {
- retval = pid_ns_prepare_proc(task_active_pid_ns(p));
+ retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
if (retval < 0)
goto bad_fork_free_pid;
}
diff --git a/kernel/pid.c b/kernel/pid.c
index 064e76a..c5513fe 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -474,6 +474,12 @@ pid_t task_session_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
}
EXPORT_SYMBOL(task_session_nr_ns);

+struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
+{
+ return ns_of_pid(task_pid(tsk));
+}
+EXPORT_SYMBOL_GPL(task_active_pid_ns);
+
/*
* Used by proc to find the first pid that is greater then or equal to nr.
*
--
1.5.2.5

2008-11-26 03:46:38

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 3/5] Determine if sender is from ancestor ns


>From 95ae5f7dfaa77158b07d2cbdc8e5df0a81c93194 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[email protected]>
Date: Tue, 18 Nov 2008 16:55:06 -0800
Subject: [PATCH 3/5] Determine if sender is from ancestor ns

To implement container-init semantics, send_signal() must compute the pid
namespace of the sender, but since signals may originate in workqueues/
interrupt handlers, computing the namespace of sender is not always
possible/safe.

Define a flag, SIG_FROM_USER and set this flag when a signal originates
from user-space (i.e in kill(), tkill(), rt_sigqueueinfo()). When this
flag is set, send_signal() can safely compute the pid namespace of the
sender.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 35 ++++++++++++++++++++++++++++++++---
1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index d8d20d6..45aebf0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -793,14 +793,42 @@ static inline int legacy_queue(struct sigpending *signals, int sig)
return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
}

+/*
+ * Return 1 if this signal originated directly from a user process (i.e via
+ * kill(), tkill(), sigqueue()) that is in an ancestor pid namespace of @t.
+ * Return 0 otherwise.
+ */
+#ifdef CONFIG_PID_NS
+#define SIG_FROM_USER INT_MIN /* MSB */
+static inline int siginfo_from_ancestor_ns(struct task_struct *t,
+ siginfo_t *info)
+{
+ if (!is_si_special(info) && (info->si_signo & SIG_FROM_USER)) {
+ /* if t can't see us we are from parent ns */
+ if (task_pid_nr_ns(current, task_active_pid_ns(t)) <= 0)
+ return 1;
+ }
+ return 0;
+}
+#else
+static inline int siginfo_from_ancestor_ns(struct task_struct *t,
+ siginfo_t *info)
+{
+ return 0;
+}
+#endif
+
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
int group)
{
struct sigpending *pending;
struct sigqueue *q;
+ int from_ancestor_ns;

trace_sched_signal_send(sig, t);

+ from_ancestor_ns = siginfo_from_ancestor_ns(t, info);
+
assert_spin_locked(&t->sighand->siglock);
if (!prepare_signal(sig, t))
return 0;
@@ -850,6 +878,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
break;
default:
copy_siginfo(&q->info, info);
+ q->info.si_signo &= ~SIG_FROM_USER;
break;
}
} else if (!is_si_special(info)) {
@@ -2202,7 +2231,7 @@ sys_kill(pid_t pid, int sig)
{
struct siginfo info;

- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;
info.si_errno = 0;
info.si_code = SI_USER;
info.si_pid = task_tgid_vnr(current);
@@ -2219,7 +2248,7 @@ static int do_tkill(pid_t tgid, pid_t pid, int sig)
unsigned long flags;

error = -ESRCH;
- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;
info.si_errno = 0;
info.si_code = SI_TKILL;
info.si_pid = task_tgid_vnr(current);
@@ -2291,7 +2320,7 @@ sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo)
Nor can they impersonate a kill(), which adds source info. */
if (info.si_code >= 0)
return -EPERM;
- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;

/* POSIX.1b doesn't mention process groups. */
return kill_proc_info(sig, &info, pid);
--
1.5.2.5

2008-11-26 03:46:58

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 4/5] Protect cinit from fatal signals


>From 4ea8f0b4ae48da5f18d44b68ce3634408c89f230 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[email protected]>
Date: Tue, 25 Nov 2008 10:29:10 -0800
Subject: [PATCH 4/5] Protect cinit from fatal signals

To protect container-init from fatal signals, set SIGNAL_UNKILLABLE but
clear it if it receives SIGKILL from parent namespace - so it is still
killable from ancestor namespace.

Note that container-init is still somewhat special compared to 'normal
processes' - unhandled fatal signals like SIGUSR1 to a container-init
are dropped even if they are from ancestor namespace. SIGKILL from an
ancestor namespace is the only reliable way to kill a container-init.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/fork.c | 2 ++
kernel/signal.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 28be39a..368f25c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -814,6 +814,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
atomic_set(&sig->live, 1);
init_waitqueue_head(&sig->wait_chldexit);
sig->flags = 0;
+ if (clone_flags & CLONE_NEWPID)
+ sig->flags |= SIGNAL_UNKILLABLE;
sig->group_exit_code = 0;
sig->group_exit_task = NULL;
sig->group_stop_count = 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index 45aebf0..8c294c1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -828,6 +828,8 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
trace_sched_signal_send(sig, t);

from_ancestor_ns = siginfo_from_ancestor_ns(t, info);
+ if (from_ancestor_ns && sig == SIGKILL)
+ t->signal->flags &= ~SIGNAL_UNKILLABLE;

assert_spin_locked(&t->sighand->siglock);
if (!prepare_signal(sig, t))
--
1.5.2.5

2008-11-26 03:47:26

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][PATCH 5/5] Clear si_pid for signal from ancestor ns


>From 9885a182bd9feeeb85abf7b2e09340e392ebf43d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[email protected]>
Date: Tue, 25 Nov 2008 11:04:34 -0800
Subject: [PATCH 5/5] Clear si_pid for signal from ancestor ns

Since processes don't have valid pid numbers in a descendant pid namespaces,
the siginfo->si_pid field must be set to 0 when signal is sent from an
ancestor namespace.

TODO:
- siginfo->si_pid may need to be cleared in other locations (eg:
__do_notify(), F_SETSIG) and will be addressed in separate patches.

- rt_sigqueueinfo(): siginfo->si_pid value is unreliable/undefined
when rt_sigqueueinfo() is used to signal a process in a descendant
namespace

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8c294c1..14eb986 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -881,6 +881,8 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
default:
copy_siginfo(&q->info, info);
q->info.si_signo &= ~SIG_FROM_USER;
+ if (from_ancestor_ns)
+ q->info.si_pid = 0;
break;
}
} else if (!is_si_special(info)) {
--
1.5.2.5

2008-11-27 01:04:27

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

On Tue, Nov 25, 2008 at 07:46:34PM -0800, Sukadev Bhattiprolu wrote:
> To protect container-init from fatal signals, set SIGNAL_UNKILLABLE but
> clear it if it receives SIGKILL from parent namespace - so it is still
> killable from ancestor namespace.

This sounds like a workaround.

> Note that container-init is still somewhat special compared to 'normal
> processes' - unhandled fatal signals like SIGUSR1 to a container-init
> are dropped even if they are from ancestor namespace. SIGKILL from an
> ancestor namespace is the only reliable way to kill a container-init.

It sounds not right to make this special case for a "normal" process.

However, no idea how to do this better.

Bastian

--
The heart is not a logical organ.
-- Dr. Janet Wallace, "The Deadly Years", stardate 3479.4

2008-11-27 01:14:18

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
> Currently task_active_pid_ns is not safe to call after a
> task becomes a zombie and exit_task_namespaces is called,
> as nsproxy becomes NULL.

Why do you need to be able to get the pid namespace from zombie
processes? Also according to nsproxy.h this access variant is only
allowed for the current task, anything else needs to take a rcu lock.

> By reading the pid namespace from
> the pid of the task we can trivially solve this problem at
> the cost of one extra memory read in what should be the
> same cacheline as we read the namespace from.

The pid namespace may not be destructed yet?

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1111,12 +1111,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>
> if (pid != &init_struct_pid) {
> retval = -ENOMEM;
> - pid = alloc_pid(task_active_pid_ns(p));
> + pid = alloc_pid(p->nsproxy->pid_ns);

Why do you expand the old version of the function?

Bastian

--
Women professionals do tend to over-compensate.
-- Dr. Elizabeth Dehaver, "Where No Man Has Gone Before",
stardate 1312.9.

2008-11-27 01:15:43

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] pid: Implement ns_of_pid

On Tue, Nov 25, 2008 at 07:44:42PM -0800, Sukadev Bhattiprolu wrote:
> +/* ns_of_pid returns the pid namespace in which the specified
> + * pid was allocated.
> + */
> +static inline struct pid_namespace *ns_of_pid(struct pid *pid)
> +{
> + struct pid_namespace *ns = NULL;
> + if (pid)
> + ns = pid->numbers[pid->level].ns;
> + return ns;
> +}
> +

When can the pid argument be null?

Bastian

--
Another Armenia, Belgium ... the weak innocents who always seem to be
located on a natural invasion route.
-- Kirk, "Errand of Mercy", stardate 3198.4

2008-11-27 01:22:14

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

On Tue, Nov 25, 2008 at 07:46:11PM -0800, Sukadev Bhattiprolu wrote:
> +#ifdef CONFIG_PID_NS
> +#define SIG_FROM_USER INT_MIN /* MSB */

Is it really a wise idea to mix this into the signal number? Also this
definition looks odd. If you want the highest bit, you should mention
this explicitely with 1U<<31.

If I see this correctly this information is already covered in si_code
with SI_USER and SI_TKILL. SI_KERNEL is used for explicit kernel
generated signals.

> +static inline int siginfo_from_ancestor_ns(struct task_struct *t,
> + siginfo_t *info)
> +{
> + if (!is_si_special(info) && (info->si_signo & SIG_FROM_USER)) {
> + /* if t can't see us we are from parent ns */

What?

> static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
> int group)
> {
> struct sigpending *pending;
> struct sigqueue *q;
> + int from_ancestor_ns;
>
> trace_sched_signal_send(sig, t);
>
> + from_ancestor_ns = siginfo_from_ancestor_ns(t, info);
> +

This is not used at all here?

Bastian

--
Those who hate and fight must stop themselves -- otherwise it is not stopped.
-- Spock, "Day of the Dove", stardate unknown

2008-11-27 13:11:33

by Nadia Derbey

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

On Tue, 2008-11-25 at 19:45 -0800, Sukadev Bhattiprolu wrote:
> >From 7f7caaa9d9014d7230dc0b1e0f75536f0b6ccdbf Mon Sep 17 00:00:00 2001
> From: Eric W. Biederman <[email protected]>
> Date: Mon, 10 Nov 2008 19:12:02 -0800
> Subject: [PATCH 2/5] pid: Generalize task_active_pid_ns
>
> Currently task_active_pid_ns is not safe to call after a
> task becomes a zombie and exit_task_namespaces is called,
> as nsproxy becomes NULL. By reading the pid namespace from
> the pid of the task we can trivially solve this problem at
> the cost of one extra memory read in what should be the
> same cacheline as we read the namespace from.
>
> When moving things around I have made task_active_pid_ns
> out of line because keeping it in pid_namespace.h would
> require adding includes of pid.h and sched.h that I
> don't think we want.
>
> This change does make task_active_pid_ns unsafe to call during
> copy_process until we attach a pid on the task_struct which
> seems to be a reasonable trade off.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> include/linux/pid_namespace.h | 6 +-----
> kernel/fork.c | 4 ++--
> kernel/pid.c | 6 ++++++
> 3 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index d82fe82..38d1032 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -79,11 +79,7 @@ static inline void zap_pid_ns_processes(struct pid_namespace *ns)
> }
> #endif /* CONFIG_PID_NS */
>
> -static inline struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
> -{
> - return tsk->nsproxy->pid_ns;
> -}
> -
> +extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> void pidhash_init(void);
> void pidmap_init(void);
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f608356..28be39a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1111,12 +1111,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>

Suka,

I'm wondering if it is still safe to keep the call to
task_active_pid_ns() in create_new_namespaces(): copy_namespaces() is
called a couple of lines above this sequence and it calls
create_new_namespaces(). So I don't see why you're now referencing
p->nsproxy->pid_ns here and not in create_new_namespaces()?

Regards,
Nadia

> if (pid != &init_struct_pid) {
> retval = -ENOMEM;
> - pid = alloc_pid(task_active_pid_ns(p));
> + pid = alloc_pid(p->nsproxy->pid_ns);
> if (!pid)
> goto bad_fork_cleanup_io;
>
> if (clone_flags & CLONE_NEWPID) {
> - retval = pid_ns_prepare_proc(task_active_pid_ns(p));
> + retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
> if (retval < 0)
> goto bad_fork_free_pid;
> }
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 064e76a..c5513fe 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -474,6 +474,12 @@ pid_t task_session_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
> }
> EXPORT_SYMBOL(task_session_nr_ns);
>
> +struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
> +{
> + return ns_of_pid(task_pid(tsk));
> +}
> +EXPORT_SYMBOL_GPL(task_active_pid_ns);
> +
> /*
> * Used by proc to find the first pid that is greater then or equal to nr.
> *
--
Nadia Derbey <[email protected]>

2008-11-27 21:19:45

by Greg Kurz

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

On Thu, 2008-11-27 at 02:17 +0100, Bastian Blank wrote:
> On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
> > Currently task_active_pid_ns is not safe to call after a
> > task becomes a zombie and exit_task_namespaces is called,
> > as nsproxy becomes NULL.
>
> Why do you need to be able to get the pid namespace from zombie
> processes? Also according to nsproxy.h this access variant is only
> allowed for the current task, anything else needs to take a rcu lock.
>

That doesn't save ->nsproxy to be NULL... as shown in the very same
example you're talking about.

I agree with Eric and Sukadev that task_active_pid_ns() is unsafe. There
isn't even a /* don't use with zombies */ in pid_namespace.h...

> > By reading the pid namespace from
> > the pid of the task we can trivially solve this problem at
> > the cost of one extra memory read in what should be the
> > same cacheline as we read the namespace from.
>
> The pid namespace may not be destructed yet?
>

Yes as long as a pid from that namespace is still in use...

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2008-12-01 20:15:55

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

Bastian Blank [[email protected]] wrote:
| On Tue, Nov 25, 2008 at 07:46:11PM -0800, Sukadev Bhattiprolu wrote:
| > +#ifdef CONFIG_PID_NS
| > +#define SIG_FROM_USER INT_MIN /* MSB */
|
| Is it really a wise idea to mix this into the signal number?

We have been trying to make this least intrusive and iiuc, extending
siginfo_t is not easy.

| Also this definition looks odd. If you want the highest bit, you should
| mention this explicitely with 1U<<31.

I think that should be fine.

|
| If I see this correctly this information is already covered in si_code
| with SI_USER and SI_TKILL. SI_KERNEL is used for explicit kernel
| generated signals.

Yes, but si_code from sys_rt_sigqueueinfo() cannot be trusted. To be sure
we would have to check for "si_code < 0", but that would include cases like
SI_ASYNCIO which can come from interrupt/work-queues. which would require
more complex checks to safely compute namespace of sender.

IOW, we need to find the namespace of the sender only if the sender is
a user process. If signal is originating from kernel, safely checking
namespace becomes more complex.

Yes, current approach is somewhat hacky. We tried other approaches
before and they were either intrusive or required non-trivial changes
to semantics of signals to global init or both.

|
| > +static inline int siginfo_from_ancestor_ns(struct task_struct *t,
| > + siginfo_t *info)
| > +{
| > + if (!is_si_special(info) && (info->si_signo & SIG_FROM_USER)) {
| > + /* if t can't see us we are from parent ns */
|
| What?

I assume your question is about the comment :-)

Yes, a process can see all its descendants and processes in descendant
namespaces. But it can only see its ancestors upto the most recent
CLONE_NEWPID. (kind of like chroot in filesystems). So if receiver
can't see sender, sender must be an ancestor.

|
| > static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
| > int group)
| > {
| > struct sigpending *pending;
| > struct sigqueue *q;
| > + int from_ancestor_ns;
| >
| > trace_sched_signal_send(sig, t);
| >
| > + from_ancestor_ns = siginfo_from_ancestor_ns(t, info);
| > +
|
| This is not used at all here?

Yes, its used in next patch.

2008-12-01 20:21:50

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

Bastian Blank [[email protected]] wrote:
| On Tue, Nov 25, 2008 at 07:46:34PM -0800, Sukadev Bhattiprolu wrote:
| > To protect container-init from fatal signals, set SIGNAL_UNKILLABLE but
| > clear it if it receives SIGKILL from parent namespace - so it is still
| > killable from ancestor namespace.
|
| This sounds like a workaround.

yes...
|
| > Note that container-init is still somewhat special compared to 'normal
| > processes' - unhandled fatal signals like SIGUSR1 to a container-init
| > are dropped even if they are from ancestor namespace. SIGKILL from an
| > ancestor namespace is the only reliable way to kill a container-init.
|
| It sounds not right to make this special case for a "normal" process.
|
| However, no idea how to do this better.

... like I mentioned in the other message, we have tried different
approaches and they were either intrusive or required more drastic
changes in semantics.

Container-inits are special in some ways and this change requires SIGKILL
to terminate them.

2008-12-01 20:25:06

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] pid: Implement ns_of_pid

Bastian Blank [[email protected]] wrote:
| On Tue, Nov 25, 2008 at 07:44:42PM -0800, Sukadev Bhattiprolu wrote:
| > +/* ns_of_pid returns the pid namespace in which the specified
| > + * pid was allocated.
| > + */
| > +static inline struct pid_namespace *ns_of_pid(struct pid *pid)
| > +{
| > + struct pid_namespace *ns = NULL;
| > + if (pid)
| > + ns = pid->numbers[pid->level].ns;
| > + return ns;
| > +}
| > +
|
| When can the pid argument be null?

Soon after creation but more importantly, after detach_pid()
(release_task).

|
| Bastian
|
| --
| Another Armenia, Belgium ... the weak innocents who always seem to be
| located on a natural invasion route.
| -- Kirk, "Errand of Mercy", stardate 3198.4

2008-12-01 20:39:20

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

Nadia Derbey [[email protected]] wrote:
| On Tue, 2008-11-25 at 19:45 -0800, Sukadev Bhattiprolu wrote:
| > >From 7f7caaa9d9014d7230dc0b1e0f75536f0b6ccdbf Mon Sep 17 00:00:00 2001
| > From: Eric W. Biederman <[email protected]>
| > Date: Mon, 10 Nov 2008 19:12:02 -0800
| > Subject: [PATCH 2/5] pid: Generalize task_active_pid_ns
| >
| > Currently task_active_pid_ns is not safe to call after a
| > task becomes a zombie and exit_task_namespaces is called,
| > as nsproxy becomes NULL. By reading the pid namespace from
| > the pid of the task we can trivially solve this problem at
| > the cost of one extra memory read in what should be the
| > same cacheline as we read the namespace from.
| >
| > When moving things around I have made task_active_pid_ns
| > out of line because keeping it in pid_namespace.h would
| > require adding includes of pid.h and sched.h that I
| > don't think we want.
| >
| > This change does make task_active_pid_ns unsafe to call during
| > copy_process until we attach a pid on the task_struct which
| > seems to be a reasonable trade off.
| >
| > Signed-off-by: Eric W. Biederman <[email protected]>
| > ---
| > include/linux/pid_namespace.h | 6 +-----
| > kernel/fork.c | 4 ++--
| > kernel/pid.c | 6 ++++++
| > 3 files changed, 9 insertions(+), 7 deletions(-)
| >
| > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
| > index d82fe82..38d1032 100644
| > --- a/include/linux/pid_namespace.h
| > +++ b/include/linux/pid_namespace.h
| > @@ -79,11 +79,7 @@ static inline void zap_pid_ns_processes(struct pid_namespace *ns)
| > }
| > #endif /* CONFIG_PID_NS */
| >
| > -static inline struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
| > -{
| > - return tsk->nsproxy->pid_ns;
| > -}
| > -
| > +extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
| > void pidhash_init(void);
| > void pidmap_init(void);
| >
| > diff --git a/kernel/fork.c b/kernel/fork.c
| > index f608356..28be39a 100644
| > --- a/kernel/fork.c
| > +++ b/kernel/fork.c
| > @@ -1111,12 +1111,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| >
|
| Suka,
|
| I'm wondering if it is still safe to keep the call to
| task_active_pid_ns() in create_new_namespaces(): copy_namespaces() is
| called a couple of lines above this sequence and it calls
| create_new_namespaces(). So I don't see why you're now referencing
| p->nsproxy->pid_ns here and not in create_new_namespaces()?

It is safe to use the new task_active_pid_ns() when the process has
a valid (fully initialized) 'struct pid'. copy_namespaces() and
create_new_namespaces() operate on the _parent's_ 'struct pid' which
is valid.

|
| Regards,
| Nadia
|
| > if (pid != &init_struct_pid) {
| > retval = -ENOMEM;
| > - pid = alloc_pid(task_active_pid_ns(p));
| > + pid = alloc_pid(p->nsproxy->pid_ns);

Here, at the call to task_active_pid_ns(), child does not have a
'struct pid' - we are just allocating it. So its important that
this use the pid_ns from nsproxy.

| > if (!pid)
| > goto bad_fork_cleanup_io;
| >
| > if (clone_flags & CLONE_NEWPID) {
| > - retval = pid_ns_prepare_proc(task_active_pid_ns(p));
| > + retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
| > if (retval < 0)
| > goto bad_fork_free_pid;
| > }
| > diff --git a/kernel/pid.c b/kernel/pid.c
| > index 064e76a..c5513fe 100644
| > --- a/kernel/pid.c
| > +++ b/kernel/pid.c
| > @@ -474,6 +474,12 @@ pid_t task_session_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
| > }
| > EXPORT_SYMBOL(task_session_nr_ns);
| >
| > +struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
| > +{
| > + return ns_of_pid(task_pid(tsk));
| > +}
| > +EXPORT_SYMBOL_GPL(task_active_pid_ns);
| > +
| > /*
| > * Used by proc to find the first pid that is greater then or equal to nr.
| > *
| --
| Nadia Derbey <[email protected]>

2008-12-01 21:15:57

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

Greg Kurz [[email protected]] wrote:
| On Thu, 2008-11-27 at 02:17 +0100, Bastian Blank wrote:
| > On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
| > > Currently task_active_pid_ns is not safe to call after a
| > > task becomes a zombie and exit_task_namespaces is called,
| > > as nsproxy becomes NULL.
| >
| > Why do you need to be able to get the pid namespace from zombie
| > processes?

After exiting namespaces, the process notifies parent. With new changes
to signals (in this patchset), the signal code may need to determine
the namespace of sender (the exiting child in this case).

| Also according to nsproxy.h this access variant is only
| > allowed for the current task, anything else needs to take a rcu lock.

and sender still is 'current' (the exiting child).

| >
|
| That doesn't save ->nsproxy to be NULL... as shown in the very same
| example you're talking about.
|
| I agree with Eric and Sukadev that task_active_pid_ns() is unsafe. There
| isn't even a /* don't use with zombies */ in pid_namespace.h...

Hmm. Its not unsafe at present. It would become unsafe if the signals code
tries to determine the namespace of sender.

2008-12-02 11:48:46

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

On Mon, Dec 01, 2008 at 12:15:06PM -0800, Sukadev Bhattiprolu wrote:
> Bastian Blank [[email protected]] wrote:
> | If I see this correctly this information is already covered in si_code
> | with SI_USER and SI_TKILL. SI_KERNEL is used for explicit kernel
> | generated signals.
>
> Yes, but si_code from sys_rt_sigqueueinfo() cannot be trusted.

sys_rt_sigqueueinfo disallows setting si_code to any value which
describes kernel signals from userspace. So using SI_FROMUSER should be
sufficient.

> IOW, we need to find the namespace of the sender only if the sender is
> a user process. If signal is originating from kernel, safely checking
> namespace becomes more complex.

Where does this imply checking sender for kernel generated signals?

> Yes, current approach is somewhat hacky. We tried other approaches
> before and they were either intrusive or required non-trivial changes
> to semantics of signals to global init or both.

Message-IDs?

> | > +static inline int siginfo_from_ancestor_ns(struct task_struct *t,
> | > + siginfo_t *info)
> | > +{
> | > + if (!is_si_special(info) && (info->si_signo & SIG_FROM_USER)) {
> | > + /* if t can't see us we are from parent ns */
> | What?
> I assume your question is about the comment :-)

Yes.

> Yes, a process can see all its descendants and processes in descendant
> namespaces. But it can only see its ancestors upto the most recent
> CLONE_NEWPID. (kind of like chroot in filesystems). So if receiver
> can't see sender, sender must be an ancestor.

Please add a complete comment to the function which describes the
function. And don't us "it" for not defined entities.

Bastian

--
I have never understood the female capacity to avoid a direct answer to
any question.
-- Spock, "This Side of Paradise", stardate 3417.3

2008-12-02 11:57:40

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

On Mon, Dec 01, 2008 at 01:15:18PM -0800, Sukadev Bhattiprolu wrote:
> Greg Kurz [[email protected]] wrote:
> | On Thu, 2008-11-27 at 02:17 +0100, Bastian Blank wrote:
> | > On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
> | > > Currently task_active_pid_ns is not safe to call after a
> | > > task becomes a zombie and exit_task_namespaces is called,
> | > > as nsproxy becomes NULL.
> | > Why do you need to be able to get the pid namespace from zombie
> | > processes?
> After exiting namespaces, the process notifies parent. With new changes
> to signals (in this patchset), the signal code may need to determine
> the namespace of sender (the exiting child in this case).

So the parent of a process with a new pid namespace will never get a
SIGCHLD?

What I read in the kernel source (kernel/signal.c:do_notify_parent,
include/asm-generic/siginfo.h:CLD_EXITED) is that the exit signals
(SIGCHLD) are describes as sent by the kernel.

> | I agree with Eric and Sukadev that task_active_pid_ns() is unsafe. There
> | isn't even a /* don't use with zombies */ in pid_namespace.h...
> Hmm. Its not unsafe at present. It would become unsafe if the signals code
> tries to determine the namespace of sender.

Why? Even now it may be used on zombie tasks.

Bastian

--
Vulcans do not approve of violence.
-- Spock, "Journey to Babel", stardate 3842.4

2008-12-02 11:58:52

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] pid: Implement ns_of_pid

On Mon, Dec 01, 2008 at 12:24:22PM -0800, Sukadev Bhattiprolu wrote:
> Bastian Blank [[email protected]] wrote:
> | On Tue, Nov 25, 2008 at 07:44:42PM -0800, Sukadev Bhattiprolu wrote:
> | > +/* ns_of_pid returns the pid namespace in which the specified
> | > + * pid was allocated.
> | > + */
> | > +static inline struct pid_namespace *ns_of_pid(struct pid *pid)
> | > +{
> | > + struct pid_namespace *ns = NULL;
> | > + if (pid)
> | > + ns = pid->numbers[pid->level].ns;
> | > + return ns;
> | > +}
> | > +
> | When can the pid argument be null?
> Soon after creation but more importantly, after detach_pid()
> (release_task).

pid is a function argument and the function does not call detach_pid. So
please try again.

Bastian

--
Spock: The odds of surviving another attack are 13562190123 to 1, Captain.

2008-12-02 12:06:18

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

On Mon, Dec 01, 2008 at 12:21:12PM -0800, Sukadev Bhattiprolu wrote:
> Container-inits are special in some ways and this change requires SIGKILL
> to terminate them.

No. They have are not special from the outside namespace.

Also it was discussed to use pid namespaces to preserve the local pid of
a process during snapshot/restore. This means that every process may get
the state of a container-init. And then it is not longer a wise idea to
make them behave different from the outside.

Bastian

--
I'm a soldier, not a diplomat. I can only tell the truth.
-- Kirk, "Errand of Mercy", stardate 3198.9

2008-12-02 19:59:46

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

Bastian Blank [[email protected]] wrote:
| On Mon, Dec 01, 2008 at 12:15:06PM -0800, Sukadev Bhattiprolu wrote:
| > Bastian Blank [[email protected]] wrote:
| > | If I see this correctly this information is already covered in si_code
| > | with SI_USER and SI_TKILL. SI_KERNEL is used for explicit kernel
| > | generated signals.
| >
| > Yes, but si_code from sys_rt_sigqueueinfo() cannot be trusted.
|
| sys_rt_sigqueueinfo disallows setting si_code to any value which
| describes kernel signals from userspace. So using SI_FROMUSER should be
| sufficient.

Hmm, unless I am missing something, sys_rt_sigqueuinfo() does this:

if (info.si_code >= 0)
return -EPERM;

This does not prevent user from setting si_code to SI_ASYNCIO, which,
from include/asm-generic/siginfo.h is:

#define SI_ASYNCIO -4 /* sent by AIO completion */

Also,

#define SI_FROMUSER(siptr) ((siptr)->si_code <= 0)

SI_ASYNCIO qualifies as SI_FROMUSER() even when it originates from
kernel (usb/core/devio.c async_completed())...

|
| > IOW, we need to find the namespace of the sender only if the sender is
| > a user process. If signal is originating from kernel, safely checking
| > namespace becomes more complex.
|
| Where does this imply checking sender for kernel generated signals?

... so what I meant is that in send_signal(), it will be harder to
determine in the SI_ASYNCIO case whether the signal is from driver or
rt_sigqueueinfo().

If we know that it came from rt_sigqueueinfo(), we can safely check
the namespace. If it came from driver we should skip the ns check.

|
| > Yes, current approach is somewhat hacky. We tried other approaches
| > before and they were either intrusive or required non-trivial changes
| > to semantics of signals to global init or both.
|
| Message-IDs?

Yes, (Eric Biederman, Dec 2007)
https://lists.linux-foundation.org/pipermail/containers/2007-December/009152.html

Oleg Nesterov, Aug 2007:
http://marc.info/?l=linux-kernel&m=118753610515859

I had sent out a summary of the above attempts to Containers list recently:
https://lists.linux-foundation.org/pipermail/containers/2008-November/013991.html



|
| > | > +static inline int siginfo_from_ancestor_ns(struct task_struct *t,
| > | > + siginfo_t *info)
| > | > +{
| > | > + if (!is_si_special(info) && (info->si_signo & SIG_FROM_USER)) {
| > | > + /* if t can't see us we are from parent ns */
| > | What?
| > I assume your question is about the comment :-)
|
| Yes.
|
| > Yes, a process can see all its descendants and processes in descendant
| > namespaces. But it can only see its ancestors upto the most recent
| > CLONE_NEWPID. (kind of like chroot in filesystems). So if receiver
| > can't see sender, sender must be an ancestor.
|
| Please add a complete comment to the function which describes the
| function. And don't us "it" for not defined entities.

Ah, I see the problem now. The 't' refers to the task parameter - how
about changing comment to:

/* If receiver can't see us, we are from parent ns */


|
| Bastian
|
| --
| I have never understood the female capacity to avoid a direct answer to
| any question.
| -- Spock, "This Side of Paradise", stardate 3417.3

2008-12-02 20:53:11

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

First of, thanks for taking the time to review/comment.


Bastian Blank [[email protected]] wrote:
| On Mon, Dec 01, 2008 at 12:21:12PM -0800, Sukadev Bhattiprolu wrote:
| > Container-inits are special in some ways and this change requires SIGKILL
| > to terminate them.
|
| No. They have are not special from the outside namespace.


I agree that they should not be. But they are special today in at least one
respect - terminating a container-init will terminate all processes in the
container even those that are in unrelated process groups.

Secondly, a poorly written container-inits can take the entire container down,
So we expect that container-inits to handle/ignore all signals rather than
SIG_DFL them. Current global inits do that today and container-inits should
too. It does not look like an unreasonable requirement.

If container-inits do not properly handle signals, it is appearing that
we need to make a trade-off in terms of semantics/complexity. See
following URL for the history.

https://lists.linux-foundation.org/pipermail/containers/2008-November/013991.html

So the basic requirements are:

- container-init receives/processes all signals from ancestor namespace.
- container-init ignores fatal signals from own namespace.

We are simplifying the first to say that:

- parent-ns must have a way to terminate container-init
- cinit will ignore SIG_DFL signals that may terminate cinit even if
they come from parent ns

|
| Also it was discussed to use pid namespaces to preserve the local pid of
| a process during snapshot/restore. This means that every process may get
| the state of a container-init. And then it is not longer a wise idea to
| make them behave different from the outside.

The one change in the state of the process I see is if someone relies on
following fields from /proc/<pid>/status

SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000

to decide if they can send, say SIGUSR1, to terminate the process. If
they do, they maybe in for a surprise. But if the container-init properly
handles/ignores signals, this info will be consistent.

Yes its not ideal and yes, the semantic change described above is a trade-off.
We are trying to find out if this change is unreasonable or will break
something really bad way.

2008-12-02 22:13:25

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] pid: Implement ns_of_pid

Bastian Blank [[email protected]] wrote:
| On Mon, Dec 01, 2008 at 12:24:22PM -0800, Sukadev Bhattiprolu wrote:
| > Bastian Blank [[email protected]] wrote:
| > | On Tue, Nov 25, 2008 at 07:44:42PM -0800, Sukadev Bhattiprolu wrote:
| > | > +/* ns_of_pid returns the pid namespace in which the specified
| > | > + * pid was allocated.
| > | > + */
| > | > +static inline struct pid_namespace *ns_of_pid(struct pid *pid)
| > | > +{
| > | > + struct pid_namespace *ns = NULL;
| > | > + if (pid)
| > | > + ns = pid->numbers[pid->level].ns;
| > | > + return ns;
| > | > +}
| > | > +
| > | When can the pid argument be null?
| > Soon after creation but more importantly, after detach_pid()
| > (release_task).
|
| pid is a function argument and the function does not call detach_pid. So
| please try again.

ns_of_pid() like pid_nr(), pid_nr_ns(), pid_task() etc is a low level helper
unction. ns_of_pid(), like the other helpers can potentially be called for a
process that has already called detach_pid() (i.e for a task that is exiting
but not yet waited on). Hence the 'if (pid)' in ns_of_pid().

Sukadev

2008-12-03 00:36:18

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/5] pid: Implement ns_of_pid

On Tue, 02 Dec 2008 12:58:41 +0100, Bastian Blank said:

> pid is a function argument and the function does not call detach_pid. So
> please try again.

What he means is that if some *other* function calls detach_pid() and then
something calls the ns_of_pid() function, it may be getting passed a null.

Note that the detach_pid() and ns_of_pid() may even be racing on two totally
separate CPUs, doing different things...


Attachments:
(No filename) (226.00 B)

2008-12-03 07:41:46

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

Bastian Blank [[email protected]] wrote:
| On Mon, Dec 01, 2008 at 01:15:18PM -0800, Sukadev Bhattiprolu wrote:
| > Greg Kurz [[email protected]] wrote:
| > | On Thu, 2008-11-27 at 02:17 +0100, Bastian Blank wrote:
| > | > On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
| > | > > Currently task_active_pid_ns is not safe to call after a
| > | > > task becomes a zombie and exit_task_namespaces is called,
| > | > > as nsproxy becomes NULL.
| > | > Why do you need to be able to get the pid namespace from zombie
| > | > processes?
| > After exiting namespaces, the process notifies parent. With new changes
| > to signals (in this patchset), the signal code may need to determine
| > the namespace of sender (the exiting child in this case).
|
| So the parent of a process with a new pid namespace will never get a
| SIGCHLD?

I am wondering what I said that leads to that conclusion :-) If parent
has a handler the handler will be called as usual otherwise SIGCHLD
will be ignored.

But anyway, an earlier version of my patches checked the pid namespace
sooner and so I had to generalize task_active_pid_ns().

With the present order of checks in siginfo_from_ancestor_ns(), we don't
need to generalize task_active_pid_ns(). SIG_FROM_USER flag will be clear
when do_notify_parent() calls send_signal().

IOW, while we should eventually generalize task_active_pid_ns(), it is
not required for this signals patchset and we can ignore patches 1 and 2
for now.

|
| What I read in the kernel source (kernel/signal.c:do_notify_parent,
| include/asm-generic/siginfo.h:CLD_EXITED) is that the exit signals
| (SIGCHLD) are describes as sent by the kernel.

Yes. Are you suggesting a check like

if (!is_si_special(info) && !SI_FROMKERNEL(info)) ?
/* must be from user, safe to check ns */

But SI_ASYNCIO comes from the driver - so its not safe to check pid ns.
(sent a separate query on SI_ASYNCIO).

|
| > | I agree with Eric and Sukadev that task_active_pid_ns() is unsafe. There
| > | isn't even a /* don't use with zombies */ in pid_namespace.h...
| > Hmm. Its not unsafe at present. It would become unsafe if the signals code
| > tries to determine the namespace of sender.
|
| Why? Even now it may be used on zombie tasks.

It used to be unsafe, and iirc was fixed a while ago(in part by moving
exit_task_namespaces() into exit_notify()).

Are you saying there is another path (outside these signals patches) where
task_active_pid_ns() is called for zombies ?

2008-12-04 00:54:43

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

I don't quarrel with the intent, but I don't like this approach.
I think we can do it more cleanly. I don't have it all worked out,
but a couple of thoughts come to mind.

Firstly, it's no problem to split SIGNAL_UNKILLABLE out into multiple
flags. It's quite noninvasive, only signal.c looks at those flags. Then
we can implement the signal magic cleanly keyed on a few separate flags,
using different flag combinations for global init and container inits.

I don't mind a feature that lets an unprivileged container init magically
ignore normal exit signals. That just does something it could do itself
with sigaction, except its /proc/pid/status "SigIgn:" line will lie.
That's OK. Key that on its own flag and set that in both inits. Just
check that flag in prepare_signal() and in get_signal_to_deliver().

It's a different story for the signals that cannot be caught, blocked, or
ignored, SIGKILL and SIGSTOP. At least when sent from outside the
container, circumventing either of those would be a privilege escalation
from what's always been understood by admins and so on.

It's convenient for the implementation that we can treat these differently
from other signals, precisely because they can never be caught, blocked, or
ignored. That is, when we decide in prepare_signal() to queue a SIGKILL or
SIGSTOP, it can never turn out that we're later going to drop it because it
went from caught or ignored or blocked to uncaught, unignored, and
unblocked. (It's only because of that possibility that there is any need
to check for a suppressed exit signal in get_signal_to_deliver() rather
than only in prepare_signal().) That means that when the decision hinges
on the namespace correlation of the sender and receiver, you can check that
when it's handy (current vs p in prepare_signal) rather than trying to
reconstruct the answer for a queued signal.

Finally, one more thought. This may be moot for the problem at hand if you
take the approach I just suggested, but probably should be fixed anyway.
It seems to me that the si_pid and si_uid fields of siginfo_t ought to be
translated to the namespaces of the receiver. I think it makes most sense
to do this on the front end, i.e. in the callers that fill in the siginfo_t
in the first place (sys_kill et al, or maybe a few layers down?).
Currently it's inconsistent, but mostly wrong. do_notify_parent() and
do_notify_parent_cldstop() use the receiver's namespace to compute si_pid,
but the rest of the signal.c routines do not. A free side effect of doing
this is that si_pid for a sender whose PID is not visible to the receiver
(i.e. outside its container) would be distinctively 0 or -1 or something.
(-1 might be the best choice, since si_[pu]id=0 already arises now in case
of signal queue exhaustion and the like.) Hence, possibly one could simply
use si_pid>0 as a "sent from inside the container" check on a queued siginfo_t.


Thanks,
Roland

2008-12-04 01:07:24

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

I don't quarrel with the intent, but I don't like this approach.
I think we can do it more cleanly. I don't have it all worked out,
but a couple of thoughts come to mind.

Firstly, it's no problem to split SIGNAL_UNKILLABLE out into multiple
flags. It's quite noninvasive, only signal.c looks at those flags. Then
we can implement the signal magic cleanly keyed on a few separate flags,
using different flag combinations for global init and container inits.

I don't mind a feature that lets an unprivileged container init magically
ignore normal exit signals. That just does something it could do itself
with sigaction, except its /proc/pid/status "SigIgn:" line will lie.
That's OK. Key that on its own flag and set that in both inits. Just
check that flag in prepare_signal() and in get_signal_to_deliver().

It's a different story for the signals that cannot be caught, blocked, or
ignored, SIGKILL and SIGSTOP. At least when sent from outside the
container, circumventing either of those would be a privilege escalation
from what's always been understood by admins and so on.

It's convenient for the implementation that we can treat these differently
from other signals, precisely because they can never be caught, blocked, or
ignored. That is, when we decide in prepare_signal() to queue a SIGKILL or
SIGSTOP, it can never turn out that we're later going to drop it because it
went from caught or ignored or blocked to uncaught, unignored, and
unblocked. (It's only because of that possibility that there is any need
to check for a suppressed exit signal in get_signal_to_deliver() rather
than only in prepare_signal().) That means that when the decision hinges
on the namespace correlation of the sender and receiver, you can check that
when it's handy (current vs p in prepare_signal) rather than trying to
reconstruct the answer for a queued signal.

Finally, one more thought. This may be moot for the problem at hand if you
take the approach I just suggested, but probably should be fixed anyway.
It seems to me that the si_pid and si_uid fields of siginfo_t ought to be
translated to the namespaces of the receiver. I think it makes most sense
to do this on the front end, i.e. in the callers that fill in the siginfo_t
in the first place (sys_kill et al, or maybe a few layers down?).
Currently it's inconsistent, but mostly wrong. do_notify_parent() and
do_notify_parent_cldstop() use the receiver's namespace to compute si_pid,
but the rest of the signal.c routines do not. A free side effect of doing
this is that si_pid for a sender whose PID is not visible to the receiver
(i.e. outside its container) would be distinctively 0 or -1 or something.
(-1 might be the best choice, since si_[pu]id=0 already arises now in case
of signal queue exhaustion and the like.) Hence, possibly one could simply
use si_pid>0 as a "sent from inside the container" check on a queued siginfo_t.


Thanks,
Roland

2008-12-04 12:45:39

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns+

On Tue, Dec 02, 2008 at 11:59:04AM -0800, Sukadev Bhattiprolu wrote:
> Bastian Blank [[email protected]] wrote:
> | sys_rt_sigqueueinfo disallows setting si_code to any value which
> | describes kernel signals from userspace. So using SI_FROMUSER should be
> | sufficient.
> SI_ASYNCIO qualifies as SI_FROMUSER() even when it originates from
> kernel (usb/core/devio.c async_completed())...

SI_ASYNCIO currently qualifies as user signal, it is sent in the context
of the pid issuing the async io request. It is never used as a kernel
originated signal in any way. The code sending it even seems to do a
full permission check.

If you think this is wrong, maybe this should be fixed first.

> If we know that it came from rt_sigqueueinfo(), we can safely check
> the namespace. If it came from driver we should skip the ns check.

If it have a sender pid attached, this should be checked.

> Yes, (Eric Biederman, Dec 2007)
> https://lists.linux-foundation.org/pipermail/containers/2007-December/009152.html
> Oleg Nesterov, Aug 2007:
> http://marc.info/?l=linux-kernel&m=118753610515859
> I had sent out a summary of the above attempts to Containers list recently:
> https://lists.linux-foundation.org/pipermail/containers/2008-November/013991.html

Okay.

> | Please add a complete comment to the function which describes the
> | function. And don't us "it" for not defined entities.
> Ah, I see the problem now. The 't' refers to the task parameter - how
> about changing comment to:

No, I meant a real comment, defining the complete behaviour, each
parameter with constraints and the possible return values.

Bastian

--
Insufficient facts always invite danger.
-- Spock, "Space Seed", stardate 3141.9

2008-12-04 12:52:26

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

On Tue, Dec 02, 2008 at 12:51:30PM -0800, Sukadev Bhattiprolu wrote:
> Bastian Blank [[email protected]] wrote:
> | No. They have are not special from the outside namespace.
> I agree that they should not be. But they are special today in at least one
> respect - terminating a container-init will terminate all processes in the
> container even those that are in unrelated process groups.

This is part of the definition.

> Secondly, a poorly written container-inits can take the entire container down,
> So we expect that container-inits to handle/ignore all signals rather than
> SIG_DFL them. Current global inits do that today and container-inits should
> too. It does not look like an unreasonable requirement.

So you intend to workaround tools which are used as container-init but
does not qualify for this work. Why?

> So the basic requirements are:
>
> - container-init receives/processes all signals from ancestor namespace.
> - container-init ignores fatal signals from own namespace.
>
> We are simplifying the first to say that:
>
> - parent-ns must have a way to terminate container-init
> - cinit will ignore SIG_DFL signals that may terminate cinit even if
> they come from parent ns

This is no simplification. This are more constraints.

Bastian

--
No one can guarantee the actions of another.
-- Spock, "Day of the Dove", stardate unknown

2008-12-04 12:59:04

by Bastian Blank

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/5] pid: Generalize task_active_pid_ns

On Tue, Dec 02, 2008 at 11:41:03PM -0800, Sukadev Bhattiprolu wrote:
> Bastian Blank [[email protected]] wrote:
> | On Mon, Dec 01, 2008 at 01:15:18PM -0800, Sukadev Bhattiprolu wrote:
> | > Greg Kurz [[email protected]] wrote:
> | > | On Thu, 2008-11-27 at 02:17 +0100, Bastian Blank wrote:
> | > | > On Tue, Nov 25, 2008 at 07:45:28PM -0800, Sukadev Bhattiprolu wrote:
> | > | > > Currently task_active_pid_ns is not safe to call after a
> | > | > > task becomes a zombie and exit_task_namespaces is called,
> | > | > > as nsproxy becomes NULL.
> | > | > Why do you need to be able to get the pid namespace from zombie
> | > | > processes?
> | > After exiting namespaces, the process notifies parent. With new changes
> | > to signals (in this patchset), the signal code may need to determine
> | > the namespace of sender (the exiting child in this case).
> | So the parent of a process with a new pid namespace will never get a
> | SIGCHLD?
> I am wondering what I said that leads to that conclusion :-) If parent
> has a handler the handler will be called as usual otherwise SIGCHLD
> will be ignored.

You said you need a way to check the namespace of a zombie process. The
only signal a zombie can issue is the SIGCHLD to the parent.

> | What I read in the kernel source (kernel/signal.c:do_notify_parent,
> | include/asm-generic/siginfo.h:CLD_EXITED) is that the exit signals
> | (SIGCHLD) are describes as sent by the kernel.
>
> Yes. Are you suggesting a check like
>
> if (!is_si_special(info) && !SI_FROMKERNEL(info)) ?
> /* must be from user, safe to check ns */

If "is_si_special(info) && SI_FROMKERNEL(info)" is true, there is no
need to check anything. The double negations makes this not easier to
read.

Bastian

--
Klingon phaser attack from front!!!!!
100% Damage to life support!!!!

2008-12-04 18:58:56

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/5] Protect cinit from fatal signals

Bastian Blank [[email protected]] wrote:
| > Secondly, a poorly written container-inits can take the entire container down,
| > So we expect that container-inits to handle/ignore all signals rather than
| > SIG_DFL them. Current global inits do that today and container-inits should
| > too. It does not look like an unreasonable requirement.
|
| So you intend to workaround tools which are used as container-init but
| does not qualify for this work. Why?

Sorry, but I don't understand the "does not qualify for this work" part.
Can you please rephrase ?

|
| > So the basic requirements are:
| >
| > - container-init receives/processes all signals from ancestor namespace.
| > - container-init ignores fatal signals from own namespace.
| >
| > We are simplifying the first to say that:
| >
| > - parent-ns must have a way to terminate container-init
| > - cinit will ignore SIG_DFL signals that may terminate cinit even if
| > they come from parent ns
|
| This is no simplification. This are more constraints.

Yes cinit ignoring SIG_DFL exit signals from parent-ns is a constraint.
So if we run say sshd as container-init, we can't use SIGINT to
terminate it, but need SIGKILL

The question is whether this constraint makes any serious/real cinits
unusable ?

The behavior at present is that cinits can be terminated from within
and cinits cannot do anything in user-space. With this incremental
step at least user space has an option of ignoring such signals.

Sukadev

2008-12-09 03:22:44

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns

Roland McGrath [[email protected]] wrote:


Thanks for the review and your comments.
>
> I don't quarrel with the intent, but I don't like this approach.
> I think we can do it more cleanly. I don't have it all worked out,
> but a couple of thoughts come to mind.
>
> Firstly, it's no problem to split SIGNAL_UNKILLABLE out into multiple
> flags. It's quite noninvasive, only signal.c looks at those flags. Then
> we can implement the signal magic cleanly keyed on a few separate flags,
> using different flag combinations for global init and container inits.

Ok. Sounds good. Let say for this discussion, container-inits set
SIGNAL_UNKILLABLE_CINIT, global inits continue to set SIGNAL_UNKILLABLE.

>
> I don't mind a feature that lets an unprivileged container init magically
> ignore normal exit signals. That just does something it could do itself
> with sigaction, except its /proc/pid/status "SigIgn:" line will lie.

Yes, but after Bastian mentioned it recently, I was wondering if we can
keep "SigIgn:" honest by tweaking procfs code - i.e have procfs check
the SIGNAL_UNKILLABLE flags and include SIG_DFL exit signals in the
SigIgn: line.

We could include blocked SIG_DFL signals in the SigIgn: line ? Hacky ?

> That's OK. Key that on its own flag and set that in both inits. Just
> check that flag in prepare_signal() and in get_signal_to_deliver().
>
> It's a different story for the signals that cannot be caught, blocked, or
> ignored, SIGKILL and SIGSTOP. At least when sent from outside the
> container, circumventing either of those would be a privilege escalation
> from what's always been understood by admins and so on.

So if SIGKILL or SIGSTOP are sent from inside the container, we drop them
in prepare_signal() if SIGNAL_UNKILLABLE flags are set. If sent from outside
the container, we queue them and get_signal_to_deliver() delivers these
unless SIGNAL_UNKILLABLE is set (i.e the signals are delivered to cinits but
ignored for global init).

>
> It's convenient for the implementation that we can treat these differently
> from other signals, precisely because they can never be caught, blocked, or
> ignored. That is, when we decide in prepare_signal() to queue a SIGKILL or
> SIGSTOP, it can never turn out that we're later going to drop it because it
> went from caught or ignored or blocked to uncaught, unignored, and
> unblocked. (It's only because of that possibility that there is any need
> to check for a suppressed exit signal in get_signal_to_deliver() rather
> than only in prepare_signal().) That means that when the decision hinges
> on the namespace correlation of the sender and receiver, you can check that
> when it's handy (current vs p in prepare_signal) rather than trying to
> reconstruct the answer for a queued signal.

Yes. One thing I am not clear on yet is how we decide in prepare_signal()
if it is safe to check the sender's namespace (since caller of send_signal()
could be an interrupt handler or workqueue).

As I was discussing with Bastian Blank, SI_FROMUSER() is not sufficient
since SI_ASYNCIO appears as an user-space signal.

Now is SI_ASYNCIO really supposed to be a user signal ? Or like SI_MESGQ
and SI_POLL can it be a kernel signal ? If we fix SI_ASYNCIO, can we
safely use SI_FROMUSER() for this or could there be other 'user' signals
from kernel ?

>
> Finally, one more thought. This may be moot for the problem at hand if you
> take the approach I just suggested, but probably should be fixed anyway.
> It seems to me that the si_pid and si_uid fields of siginfo_t ought to be
> translated to the namespaces of the receiver. I think it makes most sense
> to do this on the front end, i.e. in the callers that fill in the siginfo_t
> in the first place (sys_kill et al, or maybe a few layers down?).
> Currently it's inconsistent, but mostly wrong.

Yes, that makes a lot of sense to push that initialization to the callers.
We have been going through and identifying the 'si_pid's that need to be fixed.
Nadia sent out one patch for ipc/mqueue.c

> do_notify_parent() and
> do_notify_parent_cldstop() use the receiver's namespace to compute si_pid,
> but the rest of the signal.c routines do not. A free side effect of doing
> this is that si_pid for a sender whose PID is not visible to the receiver
> (i.e. outside its container) would be distinctively 0 or -1 or something.
> (-1 might be the best choice, since si_[pu]id=0 already arises now in case
> of signal queue exhaustion and the like.) Hence, possibly one could simply
> use si_pid>0 as a "sent from inside the container" check on a queued siginfo_t.

Yes, its probably moot with the other approach, but the only drawback with
relying on the "->si_pid > 0" check in get_signal_to_deliver() was that
allocation of siginfo_t could have failed. In which case it would be
unclear whether signal was from parent or same namespace.

But if we fix si_pid problems and correctly pass in siginfo_t, can we
use "si_pid > 0" check in prepare_signal() to decide whether or not
to queue the signal ?

Sukadev