Container-init must behave like global-init to processes within the
container and hence it must be immune to unhandled fatal signals from
within the container (i.e SIG_DFL signals that terminate the process).
But the same container-init must behave like a normal process to
processes in ancestor namespaces and so if it receives the same fatal
signal from a process in ancestor namespace, the signal must be
processed.
Implementing these semantics requires that send_signal() determine pid
namespace of the sender but since signals can originate from workqueues/
interrupt-handlers, determining pid namespace of sender may not always
be possible or safe.
Changelog[v3]:
Changes based on discussions of previous version:
http://lkml.org/lkml/2008/11/25/458
Major changes:
- Define SIGNAL_UNKILLABLE_FROM_NS and use in container-inits to
skip fatal signals from same namespace but process SIGKILL/SIGSTOP
from ancestor namespace.
- Use SI_FROMUSER() and si_code != SI_ASYNCIO to determine if
it is safe to dereference pid-namespace of caller. Highly
experimental :-)
- Masquerading si_pid when crossing namespace boundary: relevant
patches merged in -mm and dropped from this set.
Minor changes:
- Remove 'handler' parameter to tracehook functions
- Update sig_ignored() to drop SIG_DFL signals to global init early
(tried to address Roland's and Oleg's comments)
- Use 'same_ns' flag to drop SIGKILL/SIGSTOP to cinit from same
namespace
This patchset implements the design/simplified semantics suggested by
Oleg Nesterov. The simplified semantics for container-init are:
- container-init must never be terminated by a signal from a
descendant process.
- container-init must never be immune to SIGKILL from an ancestor
namespace (so a process in parent namespace must always be able
to terminate a descendant container).
- container-init may be immune to unhandled fatal signals (like
SIGUSR1) even if they are from ancestor namespace (SIGKILL is
the only reliable signal from ancestor namespace).
Patches in this set:
[PATCH 1/6] Remove 'handler' parameter to tracehook functions
[PATCH 2/6] Protect init from unwanted signals more
[PATCH 3/6] Define/set SIGNAL_UNKILLABLE_FROM_NS
[PATCH 4/6] Define siginfo_from_ancestor_ns()
[PATCH 5/6] Protect cinit from unblocked SIG_DFL signals
[PATCH 6/6] Protect cinit from blocked fatal signals
TODO:
- Use sig_task_unkillable() in fs/proc/array.c:task_sig() to
correctly report ignored signals for container/global init.
- Make SI_ASYNCIO a kernel signal ?
- Compile/touch tested. Need so real testing ;-)
Limitations/side-effects of current design
- Container-init is immune to suicide - kill(getpid(), SIGKILL) is
ignored. Use exit() :-)
From: Sukadev Bhattiprolu <[email protected]>
Date: Fri, 19 Dec 2008 13:33:46 -0800
Subject:[RFC][PATCH 1/6][v3] Remove 'handler' parameter to tracehook functions
Based on an earlier patch submitted by Oleg Nesterov and comments
from Roland McGrath (http://lkml.org/lkml/2008/11/19/258).
The handler parameter is currently unused in the tracehook functions.
Besides, the tracehook functions are called with siglock held, so the
functions can check the handler if they later need to.
Removing the parameter simiplifies changes to sig_ignored() in a follow-on
patch.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
arch/x86/kernel/ptrace.c | 2 +-
include/linux/tracehook.h | 13 ++++---------
kernel/signal.c | 6 +++---
3 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 0a6d8c1..d6ef716 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1585,6 +1585,6 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
* system call instruction.
*/
if (test_thread_flag(TIF_SINGLESTEP) &&
- tracehook_consider_fatal_signal(current, SIGTRAP, SIG_DFL))
+ tracehook_consider_fatal_signal(current, SIGTRAP))
send_sigtrap(current, regs, 0, TRAP_BRKPT);
}
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 6186a78..eb4c654 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -388,17 +388,14 @@ static inline void tracehook_signal_handler(int sig, siginfo_t *info,
* tracehook_consider_ignored_signal - suppress short-circuit of ignored signal
* @task: task receiving the signal
* @sig: signal number being sent
- * @handler: %SIG_IGN or %SIG_DFL
*
* Return zero iff tracing doesn't care to examine this ignored signal,
* so it can short-circuit normal delivery and never even get queued.
- * Either @handler is %SIG_DFL and @sig's default is ignore, or it's %SIG_IGN.
*
* Called with @task->sighand->siglock held.
*/
static inline int tracehook_consider_ignored_signal(struct task_struct *task,
- int sig,
- void __user *handler)
+ int sig)
{
return (task_ptrace(task) & PT_PTRACED) != 0;
}
@@ -407,19 +404,17 @@ static inline int tracehook_consider_ignored_signal(struct task_struct *task,
* tracehook_consider_fatal_signal - suppress special handling of fatal signal
* @task: task receiving the signal
* @sig: signal number being sent
- * @handler: %SIG_DFL or %SIG_IGN
*
* Return nonzero to prevent special handling of this termination signal.
- * Normally @handler is %SIG_DFL. It can be %SIG_IGN if @sig is ignored,
- * in which case force_sig() is about to reset it to %SIG_DFL.
+ * Normally handler for signal is %SIG_DFL. It can be %SIG_IGN if @sig is
+ * ignored, in which case force_sig() is about to reset it to %SIG_DFL.
* When this returns zero, this signal might cause a quick termination
* that does not give the debugger a chance to intercept the signal.
*
* Called with or without @task->sighand->siglock held.
*/
static inline int tracehook_consider_fatal_signal(struct task_struct *task,
- int sig,
- void __user *handler)
+ int sig)
{
return (task_ptrace(task) & PT_PTRACED) != 0;
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 2a64304..7945e71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -72,7 +72,7 @@ static int sig_ignored(struct task_struct *t, int sig)
/*
* Tracers may want to know about even ignored signals.
*/
- return !tracehook_consider_ignored_signal(t, sig, handler);
+ return !tracehook_consider_ignored_signal(t, sig);
}
/*
@@ -316,7 +316,7 @@ int unhandled_signal(struct task_struct *tsk, int sig)
return 1;
if (handler != SIG_IGN && handler != SIG_DFL)
return 0;
- return !tracehook_consider_fatal_signal(tsk, sig, handler);
+ return !tracehook_consider_fatal_signal(tsk, sig);
}
@@ -775,7 +775,7 @@ static void complete_signal(int sig, struct task_struct *p, int group)
!(signal->flags & (SIGNAL_UNKILLABLE | SIGNAL_GROUP_EXIT)) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL ||
- !tracehook_consider_fatal_signal(t, sig, SIG_DFL))) {
+ !tracehook_consider_fatal_signal(t, sig))) {
/*
* This signal will be fatal to the whole group.
*/
--
1.5.2.5
From: Sukadev Bhattiprolu <[email protected]>
Date: Fri, 19 Dec 2008 13:12:45 -0800
Subject: [RFC][PATCH 2/6][v3] Protect init from unwanted signals more
(This is a modified version of the patch submitted by Oleg Nesterov
http://lkml.org/lkml/2008/11/18/249 and tries to address comments
that came up in that discussion)
init ignores the SIG_DFL signals but we queue them anyway, including
SIGKILL. This is mostly OK, the signal will be dropped silently when
dequeued, but the pending SIGKILL has 2 bad implications:
- it implies fatal_signal_pending(), so we confuse things
like wait_for_completion_killable/lock_page_killable.
- for the sub-namespace inits, the pending SIGKILL can
mask (legacy_queue) the subsequent SIGKILL from the
parent namespace which must kill cinit reliably.
(preparation, cinits don't have SIGNAL_UNKILLABLE yet)
The patch can't help when init is ptraced, but ptracing of init is
not "safe" anyway.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 16 +++++++++++++---
1 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 7945e71..55f41b6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -53,10 +53,21 @@ static int sig_handler_ignored(void __user *handler, int sig)
(handler == SIG_DFL && sig_kernel_ignore(sig));
}
-static int sig_ignored(struct task_struct *t, int sig)
+static int sig_task_ignored(struct task_struct *t, int sig)
{
void __user *handler;
+ handler = sig_handler(t, sig);
+
+ if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
+ (handler == SIG_IGN || handler == SIG_DFL))
+ return 1;
+
+ return sig_handler_ignored(handler, sig);
+}
+
+static int sig_ignored(struct task_struct *t, int sig)
+{
/*
* Blocked signals are never ignored, since the
* signal handler may change by the time it is
@@ -65,8 +76,7 @@ static int sig_ignored(struct task_struct *t, int sig)
if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
return 0;
- handler = sig_handler(t, sig);
- if (!sig_handler_ignored(handler, sig))
+ if (!sig_task_ignored(t, sig))
return 0;
/*
--
1.5.2.5
From: Sukadev Bhattiprolu <[email protected]>
Date: Sat, 20 Dec 2008 12:27:47 -0800
Subject: [RFC][PATCH 3/6][v3] Define/set SIGNAL_UNKILLABLE_FROM_NS
Define and set the SIGNAL_UNKILLABLE_FROM_NS flags for container-inits.
This flag will be used in follow-on patches to ignore/drop fatal sigals
to container-init from within the container but process the signals from
an ancestor container.
Based on discussions on earlier version of this patchset:
http://lkml.org/lkml/2008/11/25/462
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 2 ++
2 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 96c6703..19c4311 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,6 +604,9 @@ struct signal_struct {
#define SIGNAL_UNKILLABLE 0x00000040 /* for init: ignore fatal signals */
+/* for container-init: ignore fatal signals from within container */
+#define SIGNAL_UNKILLABLE_FROM_NS 0x00000080
+
/* If true, all threads except ->group_exit_task have pending SIGKILL */
static inline int signal_group_exit(const struct signal_struct *sig)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index dba2d3f..0a959f5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -812,6 +812,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
atomic_set(&sig->live, 1);
init_waitqueue_head(&sig->wait_chldexit);
sig->flags = 0;
+ if (clone_flags & CLONE_NEWPID)
+ sig->flags |= SIGNAL_UNKILLABLE_FROM_NS;
sig->group_exit_code = 0;
sig->group_exit_task = NULL;
sig->group_stop_count = 0;
--
1.5.2.5
From: Sukadev Bhattiprolu <[email protected]>
Date: Sat, 20 Dec 2008 12:19:29 -0800
Subject: [RFC][PATCH 4/6][v3] Define siginfo_from_ancestor_ns()
Determine if sender of a signal is from an ancestor namespace. This
function will be used in a follow-on patch.
This is an early/lightly tested RFC patch. Would it be safe to implement
siginfo_from_user() as below and then use it dereference the pid
namespace of sender ?
This is based on discussions on the patch from Oleg Nesterov and me
http://lkml.org/lkml/2008/11/25/462.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 55f41b6..058b4c0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -820,6 +820,47 @@ static inline int legacy_queue(struct sigpending *signals, int sig)
{
return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
}
+/*
+ * Return 1 if this signal originated directly from a user process (i.e via
+ * kill(), tkill(), sigqueue()). Return 0 otherwise.
+ *
+ * TODO:
+ * Making SI_ASYNCIO a kernel signal could make this less hacky.
+ */
+#ifdef CONFIG_PID_NS
+static inline int siginfo_from_user(siginfo_t *info)
+{
+ if (!is_si_special(info) && SI_FROMUSER(info) &&
+ info->si_code != SI_ASYNCIO)
+ return 1;
+
+ return 0;
+}
+
+static inline int siginfo_from_ancestor_ns(struct task_struct *t,
+ siginfo_t *info)
+{
+ /*
+ * Ensure signal is from user-space before checking pid namespace
+ */
+ if (siginfo_from_user(info)) {
+ /*
+ * If we do not have a pid in the receiver's namespace,
+ * we must be an ancestor of the receiver.
+ */
+ if (task_pid_nr_ns(current, task_active_pid_ns(t)) <= 0)
+ return 1;
+ }
+
+ return 0;
+}
+#else
+static inline int siginfo_from_ancestor_ns(struct task_struct *t, siginfo_t *info)
+{
+ return 0;
+}
+#endif
+
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
int group)
--
1.5.2.5
From: Sukadev Bhattiprolu <[email protected]>
Date: Sat, 20 Dec 2008 13:25:24 -0800
Subject: [RFC][PATCH 5/6][v3] Protect cinit from unblocked SIG_DFL signals
Drop early any SIG_DFL or SIG_IGN signals to container-init from within
the same container. But queue SIGSTOP and SIGKILL to the container-init
if they are from an acnestor container.
Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the
container can still terminate the container-init. That will be addressed
in the next patch.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 43 ++++++++++++++++++++++++++++++++++---------
1 files changed, 34 insertions(+), 9 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 058b4c0..2dfca62 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -53,20 +53,39 @@ static int sig_handler_ignored(void __user *handler, int sig)
(handler == SIG_DFL && sig_kernel_ignore(sig));
}
-static int sig_task_ignored(struct task_struct *t, int sig)
+/*
+ *
+ * Return 1 if task @t is either the global init or the container-init
+ * of the sender's container. Return 0 otherwise.
+ *
+ * @same_ns = 1 indicates that sender of signal is in same pid namespace
+ * as sender.
+ */
+static int sig_task_unkillable(struct task_struct *t, int same_ns)
+{
+ int flags = t->signal->flags;
+
+ if (unlikely(flags & SIGNAL_UNKILLABLE) ||
+ (same_ns && (flags & SIGNAL_UNKILLABLE_FROM_NS)))
+ return 1;
+
+ return 0;
+}
+
+static int sig_task_ignored(struct task_struct *t, int sig, int same_ns)
{
void __user *handler;
handler = sig_handler(t, sig);
- if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
+ if (sig_task_unkillable(t, same_ns) &&
(handler == SIG_IGN || handler == SIG_DFL))
return 1;
-
+
return sig_handler_ignored(handler, sig);
}
-static int sig_ignored(struct task_struct *t, int sig)
+static int sig_ignored(struct task_struct *t, int sig, int same_ns)
{
/*
* Blocked signals are never ignored, since the
@@ -76,7 +95,7 @@ static int sig_ignored(struct task_struct *t, int sig)
if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
return 0;
- if (!sig_task_ignored(t, sig))
+ if (!sig_task_ignored(t, sig, same_ns))
return 0;
/*
@@ -632,7 +651,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
* Returns true if the signal should be actually delivered, otherwise
* it should be dropped.
*/
-static int prepare_signal(int sig, struct task_struct *p)
+static int prepare_signal(int sig, struct task_struct *p, int same_ns)
{
struct signal_struct *signal = p->signal;
struct task_struct *t;
@@ -716,7 +735,7 @@ static int prepare_signal(int sig, struct task_struct *p)
}
}
- return !sig_ignored(p, sig);
+ return !sig_ignored(p, sig, same_ns);
}
/*
@@ -867,11 +886,17 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
{
struct sigpending *pending;
struct sigqueue *q;
+ int same_ns;
trace_sched_signal_send(sig, t);
assert_spin_locked(&t->sighand->siglock);
- if (!prepare_signal(sig, t))
+
+ same_ns = 1;
+ if (siginfo_from_ancestor_ns(t, info))
+ same_ns = 0;
+
+ if (!prepare_signal(sig, t, same_ns))
return 0;
pending = group ? &t->signal->shared_pending : &t->pending;
@@ -1366,7 +1391,7 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
goto ret;
ret = 1; /* the signal is ignored */
- if (!prepare_signal(sig, t))
+ if (!prepare_signal(sig, t, 1))
goto out;
ret = 0;
--
1.5.2.5
From: Sukadev Bhattiprolu <[email protected]>
Date: Sat, 20 Dec 2008 14:15:41 -0800
Subject: [RFC][PATCH 6/6][v3] Protect cinit from blocked fatal signals
Normally SIG_DFL signals to global and container-init are dropped early.
But if a signal is blocked when it is posted, we cannot drop the signal
since the receiver may install a handler before unblocking the signal.
Once this signal is queued however, the receiver container-init has
no way of knowing if the signal was sent from an ancestor or descendant
namespace. This patch ensures that contianer-init drop all SIG_DFL
signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.
If SIGSTOP/SIGKILL originate from a descendant of container-init they
are never queued (i.e dropped in sig_ignored() in an earler patch).
If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
and container-init processes the signal.
See comments in patch below for details.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/signal.c | 41 +++++++++++++++++++++++++++++++++++++++--
1 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 2dfca62..4abacf4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1816,6 +1816,42 @@ static int ptrace_signal(int signr, siginfo_t *info,
return signr;
}
+/*
+ * Return 1 if the signal @sig should NOT kill the task that owns @signal.
+ * Return 0 otherwise.
+ *
+ * If @signal refers to the global-init, it is unkillable (return 1).
+ *
+ * If @signal refers to a task a that is neither a container-init nor the
+ * global init, the task is killable (return 0).
+ *
+ * If @signal refers to a container-init and @sig is either SIGKILL or
+ * SIGSTOP, the signal must be from an ancestor container. So the task
+ * (container-init) should be killable (return 0).
+ *
+ * If @signal refers to a container-init and @sig is neither SIGKILL nor
+ * SIGSTOP, it was queued because it was blocked when it was posed. The
+ * signal may have come from same container - hence it should not be
+ * killable (return 1).
+ *
+ * Note:
+ * This means that SIGKILL is the only sure way to terminate a
+ * container-init even from ancestor namespace.
+ */
+static int sig_unkillable(struct signal_struct *signal, int sig)
+{
+ if (signal->flags & SIGNAL_UNKILLABLE_FROM_NS)
+ return !sig_kernel_only(sig);
+
+ /*
+ * We must have dropped SIGKILL/SIGSTOP in sig_ignored()
+ * TODO: Remove BUG_ON().
+ */
+ BUG_ON(signal->flags & SIGNAL_UNKILLABLE && sig_kernel_only(sig));
+
+ return (signal->flags & SIGNAL_UNKILLABLE);
+}
+
int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
struct pt_regs *regs, void *cookie)
{
@@ -1907,9 +1943,10 @@ relock:
/*
* Global init gets no signals it doesn't want.
+ * Container-init gets no signals it doesn't want from same
+ * container.
*/
- if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
- !signal_group_exit(signal))
+ if (sig_unkillable(signal, signr) && !signal_group_exit(signal))
continue;
if (sig_kernel_stop(signr)) {
--
1.5.2.5
Sukadev Bhattiprolu <[email protected]> writes:
> This patchset implements the design/simplified semantics suggested by
> Oleg Nesterov. The simplified semantics for container-init are:
>
> - container-init must never be terminated by a signal from a
> descendant process.
>
> - container-init must never be immune to SIGKILL from an ancestor
> namespace (so a process in parent namespace must always be able
> to terminate a descendant container).
>
> - container-init may be immune to unhandled fatal signals (like
> SIGUSR1) even if they are from ancestor namespace (SIGKILL is
> the only reliable signal from ancestor namespace).
It sounds you are still struggling to get something that works and gets
done what needs to be done. So let me suggest a simplified semantic that
should be easier to implement and test, and solves the biggest problem
that we must solve in the kernel.
- container-init ignores SIGKILL and SIGSTOP.
- container-init is responsible for setting the rest of the signals
to SIG_IGN.
If that isn't enough for all of the init's we can go back and
solve more in kernel land. That simplified semantic is certainly
enough for sysvinit.
> Limitations/side-effects of current design
>
> - Container-init is immune to suicide - kill(getpid(), SIGKILL) is
> ignored. Use exit() :-)
That sounds like correct behavior.
Eric
Eric W. Biederman [[email protected]] wrote:
| Sukadev Bhattiprolu <[email protected]> writes:
|
| > This patchset implements the design/simplified semantics suggested by
| > Oleg Nesterov. The simplified semantics for container-init are:
| >
| > - container-init must never be terminated by a signal from a
| > descendant process.
| >
| > - container-init must never be immune to SIGKILL from an ancestor
| > namespace (so a process in parent namespace must always be able
| > to terminate a descendant container).
| >
| > - container-init may be immune to unhandled fatal signals (like
| > SIGUSR1) even if they are from ancestor namespace (SIGKILL is
| > the only reliable signal from ancestor namespace).
|
| It sounds you are still struggling to get something that works and gets
| done what needs to be done. So let me suggest a simplified semantic that
| should be easier to implement and test, and solves the biggest problem
| that we must solve in the kernel.
|
| - container-init ignores SIGKILL and SIGSTOP.
Yes.
|
| - container-init is responsible for setting the rest of the signals
| to SIG_IGN.
Oleg pointed out that we could drop SIG_DFL signals to global init early
to ensure wait_for_completion_killable/lock_page_killable don't incorrectly
believe that a fatal signal is pending. (patch 2/6).
If that patch is valid regardless of containers, it would be a minor
extension to get container-inits to drop SIG_DFL signals too, right ?
So the bigger problem/unknown for me is the sig_from_user() in patch 4/6
(i.e determining if it safe to deref the pid-ns of sender). We went from
!in_interrupt() to the SIG_FROM_USER flag to this.
If that is correct, I am hoping it would come down to opitmizing the code
if possible (eg: can/should we avoid passing same_ns into sig_ignored()
There is probably some ugliness :-) but do you see any other correctness
issues ?
On 12/20, Sukadev Bhattiprolu wrote:
>
> + * TODO:
> + * Making SI_ASYNCIO a kernel signal could make this less hacky.
> + */
> +#ifdef CONFIG_PID_NS
> +static inline int siginfo_from_user(siginfo_t *info)
> +{
> + if (!is_si_special(info) && SI_FROMUSER(info) &&
OK, if we can trust SI_FROMUSER(), then it is better, i agree.
I was worried about in-kernel usage of .si_code <= 0 ...
> + info->si_code != SI_ASYNCIO)
but this is horrible, imho.
OK, if we can't change the ABI, then perhaps we can change
kill_pid_info_as_uid() to not send the fatal signals to UNKILLABLE
task? This helper is strange and ugly anyway,
To clarify, I do not blame the patch itself, and I do not suggest
to do this right now.
Oleg.
On 12/20, Sukadev Bhattiprolu wrote:
>
> +static int sig_task_unkillable(struct task_struct *t, int same_ns)
> +{
> + int flags = t->signal->flags;
> +
> + if (unlikely(flags & SIGNAL_UNKILLABLE) ||
> + (same_ns && (flags & SIGNAL_UNKILLABLE_FROM_NS)))
> + return 1;
Hmm. I do not understand the point of the new flag,
SIGNAL_UNKILLABLE_FROM_NS (patch 3/6).
Actually, "same_ns" is a bad name, imho. It actually means "not from
parent ns", and this is not the same as "from the same ns".
Let's suppose we rename it, then the code becomes
if (unlikely(flags & SIGNAL_UNKILLABLE) ||
(!parent_ns && (flags & SIGNAL_UNKILLABLE_FROM_NS)))
But, parent_ns == T is not possible for the global init, so why
do we need the extra flag? we can just do
if (unlikely(flags & SIGNAL_UNKILLABLE) && !parent_ns)
return 1;
No?
> @@ -867,11 +886,17 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
> {
> struct sigpending *pending;
> struct sigqueue *q;
> + int same_ns;
>
> trace_sched_signal_send(sig, t);
>
> assert_spin_locked(&t->sighand->siglock);
> - if (!prepare_signal(sig, t))
> +
> + same_ns = 1;
> + if (siginfo_from_ancestor_ns(t, info))
> + same_ns = 0;
This looks a bit strang, why not
same_ns = siginfo_from_ancestor_ns(t, info);
?
Oleg.
On 12/20, Sukadev Bhattiprolu wrote:
>
> +static int sig_unkillable(struct signal_struct *signal, int sig)
> +{
> + if (signal->flags & SIGNAL_UNKILLABLE_FROM_NS)
> + return !sig_kernel_only(sig);
> +
> + /*
> + * We must have dropped SIGKILL/SIGSTOP in sig_ignored()
> + * TODO: Remove BUG_ON().
> + */
> + BUG_ON(signal->flags & SIGNAL_UNKILLABLE && sig_kernel_only(sig));
> +
> + return (signal->flags & SIGNAL_UNKILLABLE);
> +}
> +
> int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
> struct pt_regs *regs, void *cookie)
> {
> @@ -1907,9 +1943,10 @@ relock:
>
> /*
> * Global init gets no signals it doesn't want.
> + * Container-init gets no signals it doesn't want from same
> + * container.
> */
> - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> - !signal_group_exit(signal))
> + if (sig_unkillable(signal, signr) && !signal_group_exit(signal))
> continue;
Again, I do not understand why do we need SIGNAL_UNKILLABLE_FROM_NS.
I thought about the change in get_signal_to_deliver() during the
previous discussion, and I think what we need is:
if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
!sig_kernel_only(sig))
continue;
and this was yet another reason for "protect init from unwanted signals more".
Because, if we see SIGKILL/SIGSTOP here, this means that the signal
was sent from the parent ns, or it was generated "internally", for
example by sys_exit_group().
Oleg.
On 12/22, Oleg Nesterov wrote:
>
> On 12/20, Sukadev Bhattiprolu wrote:
> >
> > + * TODO:
> > + * Making SI_ASYNCIO a kernel signal could make this less hacky.
> > + */
> > +#ifdef CONFIG_PID_NS
> > +static inline int siginfo_from_user(siginfo_t *info)
> > +{
> > + if (!is_si_special(info) && SI_FROMUSER(info) &&
>
> OK, if we can trust SI_FROMUSER(), then it is better, i agree.
Aaah, forgot to mention...
But could you explain how are you going to fix another problem,
.si_pid mangling? This was another reason for (yes, ugly, agreed)
SIG_FROM_USER in .si_signo.
Oleg.
Oleg Nesterov [[email protected]] wrote:
| > @@ -1907,9 +1943,10 @@ relock:
| >
| > /*
| > * Global init gets no signals it doesn't want.
| > + * Container-init gets no signals it doesn't want from same
| > + * container.
| > */
| > - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > - !signal_group_exit(signal))
| > + if (sig_unkillable(signal, signr) && !signal_group_exit(signal))
| > continue;
|
| Again, I do not understand why do we need SIGNAL_UNKILLABLE_FROM_NS.
|
| I thought about the change in get_signal_to_deliver() during the
| previous discussion, and I think what we need is:
|
| if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| !sig_kernel_only(sig))
| continue;
|
| and this was yet another reason for "protect init from unwanted signals more".
I was trying to avoid the clearing of the SIGNAL_UNKILLABLE in
send_signal() that we had last time.
But yes, you are right. I even had a BUG_ON() to confirm SIGKILL/SIGSTOP
will never happen for global-init :-). If so, SIGKLL/SIGSTOP to an init
can come only from parent ns.
So, yes, we can drop this flag.
Oleg Nesterov [[email protected]] wrote:
| On 12/20, Sukadev Bhattiprolu wrote:
| >
| > + * TODO:
| > + * Making SI_ASYNCIO a kernel signal could make this less hacky.
| > + */
| > +#ifdef CONFIG_PID_NS
| > +static inline int siginfo_from_user(siginfo_t *info)
| > +{
| > + if (!is_si_special(info) && SI_FROMUSER(info) &&
|
| OK, if we can trust SI_FROMUSER(), then it is better, i agree.
|
| I was worried about in-kernel usage of .si_code <= 0 ...
|
| > + info->si_code != SI_ASYNCIO)
|
| but this is horrible, imho.
I am beginning to accept that some amount of ugliness is inevitable
here :-) I tried to dig through history of SI_ASYNCIO, but did not
find any changes to its definition in siginfo.h in 6 years.
|
| OK, if we can't change the ABI, then perhaps we can change
| kill_pid_info_as_uid() to not send the fatal signals to UNKILLABLE
| task? This helper is strange and ugly anyway,
|
|
| To clarify, I do not blame the patch itself, and I do not suggest
| to do this right now.
By 'to do this' I assume you are referring to the kill_pid_info_as_uid()
change above ?
IOW, ugly as it is, can we go with the siginfo_from_user() as in the patch ?
On 12/22, Sukadev Bhattiprolu wrote:
>
> Oleg Nesterov [[email protected]] wrote:
> | On 12/20, Sukadev Bhattiprolu wrote:
> | >
> | > + * TODO:
> | > + * Making SI_ASYNCIO a kernel signal could make this less hacky.
> | > + */
> | > +#ifdef CONFIG_PID_NS
> | > +static inline int siginfo_from_user(siginfo_t *info)
> | > +{
> | > + if (!is_si_special(info) && SI_FROMUSER(info) &&
> |
> | OK, if we can trust SI_FROMUSER(), then it is better, i agree.
> |
> | I was worried about in-kernel usage of .si_code <= 0 ...
> |
> | > + info->si_code != SI_ASYNCIO)
> |
> | but this is horrible, imho.
>
> I am beginning to accept that some amount of ugliness is inevitable
> here :-)
heh, agreed...
> I tried to dig through history of SI_ASYNCIO, but did not
> find any changes to its definition in siginfo.h in 6 years.
basically, it was needed (afaics) because we didn't have "struct pid"
when the patch was sent. Commit 46113830a18847cff8da73005e57bc49c2f95a56
(but the fact that SI_FROMUSER(SI_ASYNCIO) == T is imho unforgivable ;)
> | OK, if we can't change the ABI, then perhaps we can change
> | kill_pid_info_as_uid() to not send the fatal signals to UNKILLABLE
> | task? This helper is strange and ugly anyway,
> |
> |
> | To clarify, I do not blame the patch itself, and I do not suggest
> | to do this right now.
>
> By 'to do this' I assume you are referring to the kill_pid_info_as_uid()
> change above ?
>
> IOW, ugly as it is, can we go with the siginfo_from_user() as in the patch ?
Yes. Sorry if I was not clear. I think that this part of patch is imho
horrible, but we should blame drivers/usb/core/devio.c. I'd personally
like to move the uglification to kill_pid_info_as_uid(), but we can do
this later.
Oleg.
Oleg Nesterov [[email protected]] wrote:
| On 12/22, Oleg Nesterov wrote:
| >
| > On 12/20, Sukadev Bhattiprolu wrote:
| > >
| > > + * TODO:
| > > + * Making SI_ASYNCIO a kernel signal could make this less hacky.
| > > + */
| > > +#ifdef CONFIG_PID_NS
| > > +static inline int siginfo_from_user(siginfo_t *info)
| > > +{
| > > + if (!is_si_special(info) && SI_FROMUSER(info) &&
| >
| > OK, if we can trust SI_FROMUSER(), then it is better, i agree.
|
| Aaah, forgot to mention...
|
| But could you explain how are you going to fix another problem,
| .si_pid mangling? This was another reason for (yes, ugly, agreed)
| SIG_FROM_USER in .si_signo.
Good point.
I was going through the ->si_pid assignments to try and fix them at
source (like the mqueue patch I sent last week).
The two cases that don't fit the model are sys_kill() and sys_tkill().
For that I was hoping we could use siginfo_from_user() again. i.e
if (siginfo_from_user())
masquerade_si_pid()
in the default: case of send_signal(). To be safe, masquerade_si_pid()
could do it only iff si_code is either SI_USER or SI_TKILL.
IOW, with some tweaks, I am trying to see if we can use siginfo_from_user()
in place of the SIG_FROM_USER.
On 12/22, Sukadev Bhattiprolu wrote:
>
> Oleg Nesterov [[email protected]] wrote:
> | > @@ -1907,9 +1943,10 @@ relock:
> | >
> | > /*
> | > * Global init gets no signals it doesn't want.
> | > + * Container-init gets no signals it doesn't want from same
> | > + * container.
> | > */
> | > - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> | > - !signal_group_exit(signal))
> | > + if (sig_unkillable(signal, signr) && !signal_group_exit(signal))
> | > continue;
> |
> | Again, I do not understand why do we need SIGNAL_UNKILLABLE_FROM_NS.
> |
> | I thought about the change in get_signal_to_deliver() during the
> | previous discussion, and I think what we need is:
> |
> | if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> | !sig_kernel_only(sig))
> | continue;
> |
> | and this was yet another reason for "protect init from unwanted signals more".
>
> I was trying to avoid the clearing of the SIGNAL_UNKILLABLE in
> send_signal() that we had last time.
Well, my plan was to simplify the first series of patches as much
as possible, then I thought we can change get_signal_to_deliver().
But now I tend to agree, we should not clear SIGNAL_UNKILLABLE when
we send the signal, and we should pass same_ns/from_parent_ns to
prepare_signal() from the start. This way is more "clean".
> But yes, you are right. I even had a BUG_ON() to confirm SIGKILL/SIGSTOP
> will never happen for global-init :-). If so, SIGKLL/SIGSTOP to an init
> can come only from parent ns.
>
> So, yes, we can drop this flag.
Great!
Oleg.
On 12/22, Sukadev Bhattiprolu wrote:
>
> Oleg Nesterov [[email protected]] wrote:
> | On 12/22, Oleg Nesterov wrote:
> | >
> | > On 12/20, Sukadev Bhattiprolu wrote:
> | > >
> | > > + * TODO:
> | > > + * Making SI_ASYNCIO a kernel signal could make this less hacky.
> | > > + */
> | > > +#ifdef CONFIG_PID_NS
> | > > +static inline int siginfo_from_user(siginfo_t *info)
> | > > +{
> | > > + if (!is_si_special(info) && SI_FROMUSER(info) &&
> | >
> | > OK, if we can trust SI_FROMUSER(), then it is better, i agree.
> |
> | Aaah, forgot to mention...
> |
> | But could you explain how are you going to fix another problem,
> | .si_pid mangling? This was another reason for (yes, ugly, agreed)
> | SIG_FROM_USER in .si_signo.
>
> Good point.
>
> I was going through the ->si_pid assignments to try and fix them at
> source (like the mqueue patch I sent last week).
OK.
> The two cases that don't fit the model are sys_kill() and sys_tkill().
> For that I was hoping we could use siginfo_from_user() again. i.e
>
> if (siginfo_from_user())
> masquerade_si_pid()
>
> in the default: case of send_signal(). To be safe, masquerade_si_pid()
> could do it only iff si_code is either SI_USER or SI_TKILL.
>
> IOW, with some tweaks, I am trying to see if we can use siginfo_from_user()
> in place of the SIG_FROM_USER.
sys_rt_sigqueueinfo().
But, perhaps we can just ignore the problems with sigqueueinfo() (and
document them). The only thing we must preserve is: we should not
change *info when from_parent_ns == F, but this happens "automatically".
And, the kernel just can not know what "info" means when it is sent
by sigqueueinfo() anyway. So, perhaps we can just do
if (!same_ns)
masquerade_si_pid()
?
Oleg.
Sukadev Bhattiprolu <[email protected]> writes:
> Eric W. Biederman [[email protected]] wrote:
>
> |
> | - container-init is responsible for setting the rest of the signals
> | to SIG_IGN.
>
> Oleg pointed out that we could drop SIG_DFL signals to global init early
> to ensure wait_for_completion_killable/lock_page_killable don't incorrectly
> believe that a fatal signal is pending. (patch 2/6).
>
> If that patch is valid regardless of containers, it would be a minor
> extension to get container-inits to drop SIG_DFL signals too, right ?
Yes.
> So the bigger problem/unknown for me is the sig_from_user() in patch 4/6
> (i.e determining if it safe to deref the pid-ns of sender). We went from
> !in_interrupt() to the SIG_FROM_USER flag to this.
>
> If that is correct, I am hoping it would come down to opitmizing the code
> if possible (eg: can/should we avoid passing same_ns into sig_ignored()
>
> There is probably some ugliness :-) but do you see any other correctness
> issues ?
I haven't dug in too deep but right now my concern are user space semantics,
I don't want to wind up with something ugly there because we can not change
it later.
So if we can write a description of what happens to signals to cinit
that is right 100% of the time. Something we can write a test case
for that tests all of the corner cases and it always get the same
results. I am happy.
I don't mind dropping signals early as an optimization, but if it
is just an optimization we can't count on it in cinit.
So I would rather deliver less and make user space deal with it,
then deliver more cause problems for user space.
Eric
Oleg Nesterov <[email protected]> writes:
>> I was going through the ->si_pid assignments to try and fix them at
>> source (like the mqueue patch I sent last week).
>
> OK.
Note. When a signal goes to a process group (or similar) we can't fix
si_pid at the source. We have to fix it when only a single destination
process is known. It doesn't mean that fixing it at the source
is hopeless but...
>> The two cases that don't fit the model are sys_kill() and sys_tkill().
>> For that I was hoping we could use siginfo_from_user() again. i.e
>>
>> if (siginfo_from_user())
>> masquerade_si_pid()
>>
>> in the default: case of send_signal(). To be safe, masquerade_si_pid()
>> could do it only iff si_code is either SI_USER or SI_TKILL.
>>
>> IOW, with some tweaks, I am trying to see if we can use siginfo_from_user()
>> in place of the SIG_FROM_USER.
>
> sys_rt_sigqueueinfo().
>
> But, perhaps we can just ignore the problems with sigqueueinfo() (and
> document them).
Yes. I don't think si_pid is valid in that case anyway. It is the
kernel signals where si_pid is a reliable field that are important.
Eric
Eric W. Biederman [[email protected]] wrote:
| I haven't dug in too deep but right now my concern are user space semantics,
| I don't want to wind up with something ugly there because we can not change
| it later.
The one restriction we are imposing is that SIGINT, SIGTERM etc will not
currently kill containter-inits. Only SIGKILL will. But that is good point.
Maybe we should document that as a limitation we may remove in the future ?
i.e. Its not a feature that container-inits should rely on. Like sysV init,
container-init should still SIG_IGN all unhandled signals. If they don't,
they may break in the future.
|
| So if we can write a description of what happens to signals to cinit
| that is right 100% of the time. Something we can write a test case
| for that tests all of the corner cases and it always get the same
| results. I am happy.
Yes, I believe we can say that SIGKILL/SIGSTOP from parent are always
delivered and no fatal signal from same ns is.
|
| I don't mind dropping signals early as an optimization, but if it
| is just an optimization we can't count on it in cinit.
Yes, you have a point. It started out as an optimization, but unwanted
signals are either ignored or dropped _always_ (or we have a bug).
|
| So I would rather deliver less and make user space deal with it,
| then deliver more cause problems for user space.
The user-semantics appear to be clean now.
Eric W. Biederman [[email protected]] wrote:
| Oleg Nesterov <[email protected]> writes:
|
| >> I was going through the ->si_pid assignments to try and fix them at
| >> source (like the mqueue patch I sent last week).
| >
| > OK.
|
| Note. When a signal goes to a process group (or similar) we can't fix
| si_pid at the source. We have to fix it when only a single destination
| process is known. It doesn't mean that fixing it at the source
| is hopeless but...
Right. Most calls to kill_pgrp_info() and __kill_pgrp_info() use
SEND_SIG_PRIV or SEND_SIG_NOINFO, so their ->si_pid is set.
The only place seems to be the sys_kill()/sys_tkill().
|
| >> The two cases that don't fit the model are sys_kill() and sys_tkill().
| >> For that I was hoping we could use siginfo_from_user() again. i.e
| >>
| >> if (siginfo_from_user())
| >> masquerade_si_pid()
| >>
| >> in the default: case of send_signal(). To be safe, masquerade_si_pid()
| >> could do it only iff si_code is either SI_USER or SI_TKILL.
| >>
| >> IOW, with some tweaks, I am trying to see if we can use siginfo_from_user()
| >> in place of the SIG_FROM_USER.
| >
| > sys_rt_sigqueueinfo().
| >
| > But, perhaps we can just ignore the problems with sigqueueinfo() (and
| > document them).
Yes, thats one reason I was thinking of checking si_code == SI_USER or
SI_TKILL, but rt_sigqueueinfo() could still use those values too.
In fact it could use SI_ASYNCIO :-) and mess with sig_from_user() (i.e
if rt_sigqueueinfo() sets si_code to SI_ASYNCIO and sends SIGKILL/SIGSTOP
to a descendant cinit, SIGKILL/SIGSTOP will be ignored).
We could/should warn if rt_sigqueueinfo() sets si_code to SI_ASYNCIO.
One more reason to make SI_ASYNCIO a kernel signal I guess.
|
| Yes. I don't think si_pid is valid in that case anyway. It is the
| kernel signals where si_pid is a reliable field that are important.
|
| Eric
Sukadev
Quoting Sukadev Bhattiprolu ([email protected]):
>
> Container-init must behave like global-init to processes within the
> container and hence it must be immune to unhandled fatal signals from
> within the container (i.e SIG_DFL signals that terminate the process).
>
> But the same container-init must behave like a normal process to
> processes in ancestor namespaces and so if it receives the same fatal
> signal from a process in ancestor namespace, the signal must be
> processed.
>
> Implementing these semantics requires that send_signal() determine pid
> namespace of the sender but since signals can originate from workqueues/
> interrupt-handlers, determining pid namespace of sender may not always
> be possible or safe.
Tested-by: Serge Hallyn <[email protected]>
Tested sending signals to a custom container-init.
Are you planning to address Oleg's comments with a new patch-set,
or with patches on top of this set?
thanks,
-serge
Acked-by: Roland McGrath <[email protected]>
Acked-by: Roland McGrath <[email protected]>