(lkml cced because containers list's archive is not useable)
On 11/10, Oleg Nesterov wrote:
>
> On 11/01, [email protected] wrote:
> >
> > Other approaches to try ?
>
> I think we should try to do something simple, even if not perfect. Because
> most users do not care about this problem since they do not use containers
> at all. It would be very sad to add intrusive changes to the code.
>
> I think we should fix another problem first. send_signal()->copy_siginfo()
> path must be changed anyway, when the signal comes from the parent ns we
> report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
> have "bool from_parent_ns" (or whatever) annyway.
>
> Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
> can fail.
>
> I think we should encode this "from_parent_ns" into "struct siginfo". I do
> not think it is good idea to extend this structure, I think we can introduce
> SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
> Or something. yes, sys_rt_sigqueueinfo() is problematic...
>
> Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
> protects cinit from unwanted signals. Then we change get_signal_to_deliver()
>
> - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> + if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
>
> and now we can kill cinit from parent ns. This needs more checks if we want
> to stop/strace it, but perhaps this is enough for the start. Note that we
> do not need to change complete_signal(), at least for now, the code under
> "if (sig_fatal(p, sig)" is just optimization.
>
>
> So, afaics, the only real problem is how we can handle the case when
> __sigqueue_alloc() fails. I think for the start we can just return
> -ENOMEM in this case (when from_parent_ns == T). Then we can improve
> this behaviour. We can change complete_signal() to ensure that the
> fatal signal from the upper ns always kills cinit, and in this case
> we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
> always works.
>
> Yes, this is not perfect, and it is very possible I missed something
> else. But simple.
But how can send_signal() know that the signal comes from the upper ns?
This is not trivial, we can't blindly use current to check. The signal
can be sent from irq/workqueue/etc.
Perhaps we can start with something like the patch below. Not that I like
it very much though. We should really place this code under
CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
Oleg.
--- K-IS/kernel/signal.c~T 2008-11-10 19:21:17.000000000 +0100
+++ K-IS/kernel/signal.c 2008-11-10 20:31:24.000000000 +0100
@@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
}
+#define SIG_FROM_USER INT_MIN /* MSB */
+
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
int group)
{
struct sigpending *pending;
struct sigqueue *q;
+ int from_ancestor_ns;
+
+ from_ancestor_ns = !is_si_special(info) &&
+ (info->si_signo | SIG_FROM_USER) &&
+ /* if t can't see us we are from parent ns */
+ task_pid_nr_ns(current, t->current->nsproxy->pid_ns) <= 0;
trace_sched_signal_send(sig, t);
@@ -855,6 +863,7 @@ static int send_signal(int sig, struct s
break;
default:
copy_siginfo(&q->info, info);
+ q->info.si_signo &= ~SIG_FROM_USER;
break;
}
} else if (!is_si_special(info)) {
@@ -2207,7 +2216,7 @@ sys_kill(pid_t pid, int sig)
{
struct siginfo info;
- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;
info.si_errno = 0;
info.si_code = SI_USER;
info.si_pid = task_tgid_vnr(current);
@@ -2224,7 +2233,7 @@ static int do_tkill(pid_t tgid, pid_t pi
unsigned long flags;
error = -ESRCH;
- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;
info.si_errno = 0;
info.si_code = SI_TKILL;
info.si_pid = task_tgid_vnr(current);
@@ -2296,7 +2305,7 @@ sys_rt_sigqueueinfo(pid_t pid, int sig,
Nor can they impersonate a kill(), which adds source info. */
if (info.si_code >= 0)
return -EPERM;
- info.si_signo = sig;
+ info.si_signo = sig | SIG_FROM_USER;
/* POSIX.1b doesn't mention process groups. */
return kill_proc_info(sig, &info, pid);
Oleg Nesterov [[email protected]] wrote:
| (lkml cced because containers list's archive is not useable)
Hmm. what do you mean by not usable ? I see your email here:
https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html
|
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [email protected] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.
| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
Yes, afaics, we just need to pass one extra bit of information per signal
(whether or not sender is in ancestor-ns) from sender to receiver.
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...
Yes, if user-space sets si_pid to 0.
Can we change sys_rt_sigqueueinfo() to:
if (!info->si_pid)
info->si_pid = getpid();
or would that change semantics adversely ? How about putting this
under CONFIG_PID_NS or your CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
| > protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > + if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.
I agree
|
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.
You mean the in_interrupt() check we had in earlier patchset would
not be enough ?
|
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
CONFIG_PID_NS ?
Oleg Nesterov [[email protected]] wrote:
| (lkml cced because containers list's archive is not useable)
|
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [email protected] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.
Yes, this was in both the patchsets we reviewed last year :-) I can send
this fix out independently.
| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...
Also, what happens if a fatal signal is first received from a descendant
and while that is still pending, the same signal is received from ancestor
ns ? Won't the second one be ignored by legacy_queue() for the non-rt case ?
Of course, this is a new scenario, specific to containers, and we may be
able to define the policy without changing semantics.
On 11/10, [email protected] wrote:
>
> Oleg Nesterov [[email protected]] wrote:
> | (lkml cced because containers list's archive is not useable)
>
> Hmm. what do you mean by not usable ? I see your email here:
> https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html
Yes, but I failed to find our previous discussions via google, and actually
I prefer to see them all on marc.info, so I can quickly find them...
> | > Or something. yes, sys_rt_sigqueueinfo() is problematic...
>
> Yes, if user-space sets si_pid to 0.
>
> Can we change sys_rt_sigqueueinfo() to:
>
> if (!info->si_pid)
> info->si_pid = getpid();
I doubt very much we can do this. This can break the existing applications
which can overload ->si_pid. I think it is better to pass ->si_pid as is.
If user-space sends siginfo_t so sub-namespace, it must know what it does.
I don't think the kernel can help, it just can't know what ->si_pid actually
means. Unless this is documented somewhere, but I don't know.
> | But how can send_signal() know that the signal comes from the upper ns?
> | This is not trivial, we can't blindly use current to check. The signal
> | can be sent from irq/workqueue/etc.
>
> You mean the in_interrupt() check we had in earlier patchset would
> not be enough ?
I don't think we can rely on in_interrupt() check. Thnk about some device
drivers which can send the notification from workqueue, or from kernel
thread... Say, can you see when drivers/usb/core/devio.c does
async_completed() ? And note that SI_ASYNCIO is SI_FROMUSER.
Even _if_ it is safe to use in_interrupt() right now, I don't think we
can rely on this fact.
> | Perhaps we can start with something like the patch below. Not that I like
> | it very much though. We should really place this code under
> | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
>
> CONFIG_PID_NS ?
Ah yes, we have it ;)
Oleg.
On 11/10, [email protected] wrote:
>
> Also, what happens if a fatal signal is first received from a descendant
> and while that is still pending, the same signal is received from ancestor
> ns ? Won't the second one be ignored by legacy_queue() for the non-rt case ?
Please see my another email:
We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
it comes from the same ns. Otherwise, it can mask the next SIGKILL
from the parent ns.
But this perhaps makes sense anyway, even without containers.
Currently, when the global init has the pending SIGKILL, we can't
trust __wait_event_killable/etc, and this is actually wrong.
We can drop other SIG_DFL signals from the same namespace early as well.
I seem to already did something like sig_init_ignored(), but I forgot.
Or, we can just ignore this (imho) minor problem. The ancestor ns
must know it can't reliably kill cinit with (say) SIGTERM. It can
be ignored, or it can have have a handler, and it can be lost because
SIGTERM is already pending. Only SIGKILL is special.
Actually. I personally think that if we manage to achieve that
- the sub-namespace can't kill its init
- the ancestor can always kill cinit with SIGKILL
then imho we should not worry very much about other issues ;)
Oleg.
On 11/12, Oleg Nesterov wrote:
>
> On 11/10, [email protected] wrote:
> >
> > Oleg Nesterov [[email protected]] wrote:
> > | > Or something. yes, sys_rt_sigqueueinfo() is problematic...
> >
> > Yes, if user-space sets si_pid to 0.
> >
> > Can we change sys_rt_sigqueueinfo() to:
> >
> > if (!info->si_pid)
> > info->si_pid = getpid();
>
> I doubt very much we can do this. This can break the existing applications
> which can overload ->si_pid. I think it is better to pass ->si_pid as is.
> If user-space sends siginfo_t so sub-namespace, it must know what it does.
> I don't think the kernel can help, it just can't know what ->si_pid actually
> means. Unless this is documented somewhere, but I don't know.
On the second thought, I think perhaps we should do the following.
if sys_rt_sigqueueinfo() sends the signal to the sub-namespace, then clear
always ->sid_pid. Otherwise do not touch it.
This way we can't break the existing apps, and this simplifies send_signal()
which should take "is_it_from_ancestor_ns" into account.
What do you think?
Oleg.
Quoting Oleg Nesterov ([email protected]):
> > | Perhaps we can start with something like the patch below. Not that I like
> > | it very much though. We should really place this code under
> > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
> >
> > CONFIG_PID_NS ?
>
> Ah yes, we have it ;)
Except I believe all distros at this point enable CONFIG_PID_NS, so
I'm not sure it's the right thing to use.
-serge
Quoting Oleg Nesterov ([email protected]):
> --- K-IS/kernel/signal.c~T 2008-11-10 19:21:17.000000000 +0100
> +++ K-IS/kernel/signal.c 2008-11-10 20:31:24.000000000 +0100
> @@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
> return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
> }
>
> +#define SIG_FROM_USER INT_MIN /* MSB */
> +
> static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
> int group)
> {
> struct sigpending *pending;
> struct sigqueue *q;
> + int from_ancestor_ns;
> +
> + from_ancestor_ns = !is_si_special(info) &&
> + (info->si_signo | SIG_FROM_USER) &&
I assume you mean '&'?
> + /* if t can't see us we are from parent ns */
> + task_pid_nr_ns(current, t->current->nsproxy->pid_ns) <= 0;
This doesn't look so bad... (but still looking through followup emails)
-serge
Serge E. Hallyn [[email protected]] wrote:
| Quoting Oleg Nesterov ([email protected]):
| > > | Perhaps we can start with something like the patch below. Not that I like
| > > | it very much though. We should really place this code under
| > > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
| > >
| > > CONFIG_PID_NS ?
| >
| > Ah yes, we have it ;)
|
| Except I believe all distros at this point enable CONFIG_PID_NS, so
| I'm not sure it's the right thing to use.
But if they do enable CONFIG_PID_NS they would want the signals to
behave correctly ? IIUC, the reason we want to the hide the code
is that it is not clean i.e if its not experimental or error-prone,
are there other reasons someone with CONFIG_PID_NS=y want to hide it ?
Quoting Sukadev Bhattiprolu ([email protected]):
> Serge E. Hallyn [[email protected]] wrote:
> | Quoting Oleg Nesterov ([email protected]):
> | > > | Perhaps we can start with something like the patch below. Not that I like
> | > > | it very much though. We should really place this code under
> | > > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
> | > >
> | > > CONFIG_PID_NS ?
> | >
> | > Ah yes, we have it ;)
> |
> | Except I believe all distros at this point enable CONFIG_PID_NS, so
> | I'm not sure it's the right thing to use.
>
> But if they do enable CONFIG_PID_NS they would want the signals to
> behave correctly ? IIUC, the reason we want to the hide the code
> is that it is not clean i.e if its not experimental or error-prone,
> are there other reasons someone with CONFIG_PID_NS=y want to hide it ?
I was going to argue yes, but again following my reasoning to its
logical conclusion leads us to a config parameter being bad anyway.
So yeah, never mind.
-serge
Oleg Nesterov [[email protected]] wrote:
| On 11/10, [email protected] wrote:
| >
| > Also, what happens if a fatal signal is first received from a descendant
| > and while that is still pending, the same signal is received from ancestor
| > ns ? Won't the second one be ignored by legacy_queue() for the non-rt case ?
On second thoughts, cinit is a normal process in its ancestor ns so it
might very well ignore the second instance of the signal (as long as it
does not ignore SIGKILL/SIGSTOP)
|
| Please see my another email:
|
| We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
| it comes from the same ns. Otherwise, it can mask the next SIGKILL
| from the parent ns.
Ok.
|
| But this perhaps makes sense anyway, even without containers.
| Currently, when the global init has the pending SIGKILL, we can't
| trust __wait_event_killable/etc, and this is actually wrong.
|
| We can drop other SIG_DFL signals from the same namespace early as well.
I think Eric's patchset did this and iirc, we ran into the problem of
blocked SIG_DFL signals ?
| I seem to already did something like sig_init_ignored(), but I forgot.
Yes, I think we had that in the patchset but that was not merged.
|
| Or, we can just ignore this (imho) minor problem.
I think so too.
| The ancestor ns
| must know it can't reliably kill cinit with (say) SIGTERM. It can
| be ignored, or it can have have a handler, and it can be lost because
| SIGTERM is already pending. Only SIGKILL is special.
|
| Actually. I personally think that if we manage to achieve that
|
| - the sub-namespace can't kill its init
|
| - the ancestor can always kill cinit with SIGKILL
Yep.
|
| then imho we should not worry very much about other issues ;)
|
| Oleg.
Oleg Nesterov [[email protected]] wrote:
| (lkml cced because containers list's archive is not useable)
|
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [email protected] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.
| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...
| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
| > protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > - if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > + if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.
|
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.
|
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
|
| Oleg.
|
| --- K-IS/kernel/signal.c~T 2008-11-10 19:21:17.000000000 +0100
| +++ K-IS/kernel/signal.c 2008-11-10 20:31:24.000000000 +0100
| @@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
| return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
| }
|
| +#define SIG_FROM_USER INT_MIN /* MSB */
| +
Not necessarily for the problem at hand, but in the long run, is it
worth isolating kernel's siginfo from user's siginfo_t (which is pretty
much carved in stone).
Like the existing 'struct k_sigaction' and 'struct sigaction' have
a 'struct k_siginfo' that is a superset of 'siginfo_t' ?
That might give us more flexibility in passing any additional flags/
values we need in the kernel ?
On 11/12, Sukadev Bhattiprolu wrote:
>
> Oleg Nesterov [[email protected]] wrote:
>
> | On 11/10, [email protected] wrote:
> | >
> | > Also, what happens if a fatal signal is first received from a descendant
> | > and while that is still pending, the same signal is received from ancestor
> | > ns ? Won't the second one be ignored by legacy_queue() for the non-rt case ?
>
> On second thoughts, cinit is a normal process in its ancestor ns so it
> might very well ignore the second instance of the signal (as long as it
> does not ignore SIGKILL/SIGSTOP)
>
> |
> | Please see my another email:
> |
> | We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
> | it comes from the same ns. Otherwise, it can mask the next SIGKILL
> | from the parent ns.
>
> Ok.
>
> |
> | But this perhaps makes sense anyway, even without containers.
> | Currently, when the global init has the pending SIGKILL, we can't
> | trust __wait_event_killable/etc, and this is actually wrong.
> |
> | We can drop other SIG_DFL signals from the same namespace early as well.
>
> I think Eric's patchset did this and iirc, we ran into the problem of
> blocked SIG_DFL signals ?
Yes sure, I meant unblocked SIG_DFL signals. But SIGKILL can't be
blocked fortunately.
Again, the parent ns can't rely on, say, SIGTERM. It can be missed
if cinit has a handler, we can do nothing in this case. And if it
is blocked, most probably cinit already has a handler, or it will
set it later, say, after exec. Or it can be just ignored.
> | Or, we can just ignore this (imho) minor problem.
>
> I think so too.
Great, so perhaps we can ignore the problem for now, and fix it
later if the need arises.
Oleg.