All processes must have a parent process. When a process dies, the
parent is notified by SIGCHLD and must use the wait() system call to
reap the remaining zombie.
Should the parent process die, its children are reparented to the init
daemon so that they always have a parent process.
The init daemon cannot die.
As well as a parent, processes also have a process group and a session.
This is quite complicated and takes up an entire chapter of Stevens
which few people claim to have even read.
It all comes down to connecting the life of a process to the life of a
terminal. Long-lived daemon processes don't want to be connected to a
terminal at all, so they perform a common pattern:
- the process run by the shell fork()s, creating a child
- the original process (child of the shell) exits and the shell reaps
it, while the child carries on and is now reparented to the init
daemon
- the process calls setsid() to change to a new session and process
group, it's now no longer connected to the terminal
- *but* due to a quirk of POSIX, if it were to now open() a tty device,
it would end up owning it and becoming connected to it! FAIL.
- so the process fork()s again, creating a new child
- the process (child of the child of the shell) exits and the init
daemon reaps it, while the child carries on and is reparented to the
init daemon
This remaining process is the daemon; it's a child of init, and in a
process group and session on its own, but is not the leader of either
and not connected to a terminal.
Win!
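
As a rough illustration of the steps above, here is a minimal sketch of
the double-fork pattern. The function name daemonise() and its return
convention are assumptions made for this example, not taken from any
particular daemon.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Double-fork daemonisation, following the steps listed above.
 * Returns 0 in the surviving daemon process and -1 on error.
 * The two intermediate processes never return; they _exit(0).
 */
int daemonise(void)
{
    pid_t pid;

    /* Step 1: fork; the original process exits and the shell reaps it. */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);

    /* Step 2: new session and process group, no controlling terminal. */
    if (setsid() < 0)
        return -1;

    /* Step 3: fork again so the survivor is not the session leader and
     * cannot acquire a controlling terminal by open()ing a tty. */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);

    /* The daemon is in the new session but does not lead it. */
    assert(getsid(0) != getpid());
    return 0;
}
```

A real daemon would typically also chdir("/"), reset its umask and
redirect stdin/stdout/stderr; those steps are left out here because they
don't affect the parent/child relationships under discussion.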
Well, almost a win. The trouble is that this also happens to completely
disconnect it from any kind of process supervisor. We want to be able
to supervise daemons.
This wouldn't be so bad, except that most daemons don't actually
daemonise until after they've finished initialisation, and are even
listening on their well known port. The daemonisation becomes more than
just the escape from the terminal, it's the notification that they are
running.
We could change the rules, and say that new daemons shouldn't do all
that. They can assume they're running from the init daemon, remain in
the foreground, and notify init that they are running by some other
means (D-Bus maybe?)
That's not really ideal either; for a start, there are a lot of commercial
applications out there we don't have the source to. Then there's all of
those that won't change because they want to be "portable", and aren't
interested in adopting Linux-specific behaviour.
And at the least, it would take a very long time to rewrite all of the
daemon software out there. Even if this were the right long-term
solution.
We want to be able to supervise daemons.
Init has a head-start already. Since it's the eventual parent of daemon
processes, it will always be notified of a daemon's death by SIGCHLD and
receives their exit status information through wait().
So while you can't escape from init, and while init can see the process
death, it has no idea what the process was and what it was supposed to
do about it.
If there's two Apache daemons running (in different chroots, or on
different IPs or ports) it can't know which of the two died because the
process that died is unknown to it.
Likewise it can't provide status information as to whether either is
running or not since the only processes it knew about (those it spawned
directly) exited immediately after it ran them.
This is because this is what happens from init's POV:
- init (1) spawns new apache process (1000)
- apache (1000) fork()s and exits
- init (1) receives SIGCHLD for 1000.
Meanwhile:
- apache (1001) reparented to init (no notification)
- apache (1001) fork()s and exits
- init (1) receives SIGCHLD for 1001
- apache (1002) reparented to init (no notification)
Later on, 1002 will die and init will receive SIGCHLD for it.
Unfortunately neither the 1001 nor the 1002 process is known to init.
They are descendants of the process it spawned (1000), but by the time
init is notified about them that relationship has been forgotten.
The only piece of information missing to make this work is the original
parent process id. If we knew that 1001 was a child of 1000, and 1002
was a child of 1001, we'd be fine.
Can we do this from userspace?
In short, no.
At the end of the day, we need a piece of information that the kernel
doesn't make available to userspace when we need it. (Because the
kernel itself forgets that information).
If init gets the SIGCHLD for 1002, 1002's ppid is 1 not 1001; likewise
for 1001/1000.
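
The loss of the original parent id can be observed from ordinary
userspace. The sketch below is only an illustration: the function name
observe_reparenting() is made up for this example. It forks a
short-lived middle process whose child waits to be orphaned and then
reports what getppid() returns. On a typical system that is 1, although
a subreaper or pid namespace may give a different value; either way it
is no longer the pid of the process that forked it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork a middle process that forks an orphan-to-be and exits at once.
 * Stores the middle process's pid in *old_parent and returns the ppid
 * the orphan sees after reparenting, or -1 on error.
 */
pid_t observe_reparenting(pid_t *old_parent)
{
    int fds[2];
    int status;
    pid_t middle, adopter = -1;

    assert(old_parent != NULL);
    if (pipe(fds) < 0)
        return -1;

    middle = fork();
    if (middle < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (middle == 0) {
        /* Plays the part of "1001": fork, then exit immediately. */
        pid_t self = getpid();
        pid_t orphan = fork();

        if (orphan == 0) {
            /* Plays the part of "1002": wait until reparented. */
            const struct timespec tick = { 0, 1000000 }; /* 1 ms */
            pid_t now;

            close(fds[0]);
            while (getppid() == self)
                nanosleep(&tick, NULL);
            now = getppid();
            if (write(fds[1], &now, sizeof now) != (ssize_t)sizeof now)
                _exit(1);
            _exit(0);
        }
        _exit(orphan < 0 ? 1 : 0);
    }

    close(fds[1]);
    *old_parent = middle;
    if (waitpid(middle, &status, 0) != middle) {
        close(fds[0]);
        return -1;
    }
    if (read(fds[0], &adopter, sizeof adopter) != (ssize_t)sizeof adopter)
        adopter = -1;
    close(fds[0]);
    return adopter;
}
```

Nothing the orphan or its new parent can query afterwards says which
process originally forked it; the orphan only knows because it saved
that pid before its parent exited.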
At the point that 1000 exits, its children have already been reparented
to init. We can't, even with waitid(WNOWAIT), iterate them in any way.
The closest I've come to a race-free way to do this so far is by having
init ptrace() every process it spawns, that at least allows it to follow
fork() and exec().
Even that doesn't work so well, and I'm tiring of the death threats from
security people and people whose software's behaviour is altered under
ptrace() :p
About the patch:
The patch adds a new PR_{GET,SET}_ADOPTSIG prctl(), similar to the
existing PR_{GET,SET}_PDEATHSIG control and with similar semantics. It
sets an adopt_signal member of the task struct.
- When non-zero, the process will receive the given signal if another
process is reparented to it.
- This signal has the form of SIGCHLD, with:
* si_code set to zero
* si_pid set to the process that has been reparented to init
* (most importantly) si_status set to the previous parent process id
- Notification is disabled after exec() or setsid()
This functionality only affects the init daemon, and only if the init
daemon activates the prctl(). It is safe for other processes.
There's already other init-daemon specific code in the kernel, and
already other specialist prctl() signals, so this seemed the appropriate
way to do the patch.
Since the siginfo_t contains the useful information, the signal should
generally be >= SIGRTMIN.
The signal is made pending before the SIGCHLD for the terminating
previous parent. While signal delivery order isn't guaranteed, signal
semantics are.
The kernel may actually deliver the SIGCHLD signal first (especially if
you use the suggested SIGRTMIN signal), but the adoption signal can be
checked with sigpending() and delivered with sigsuspend() while still in
the SIGCHLD signal handler. If using signalfd(), the fd will still poll
for reading if you read SIGCHLD first.
Termination order of processes isn't guaranteed either. In our example
above, 1001 may actually exit before 1000, so init won't initially
receive SIGCHLD. Happily 1000 shouldn't call wait(), so when 1000 exits
later, the 1001 zombie is reparented to init and all is well.
Notes:
I'm stuffing a pid into an int and hoping it fits. There isn't a second
pid_t member of siginfo_t; adding one would require changes to every
arch tree (each implements copy_siginfo_to_user on its own) and
changes to the userspace headers as well.
prctl() is evil, another option would be ioctl() but that's evil too.
I did this as a signal because there needs to be reliability between the
delivery of the adoption notice and the delivery of the child death
notice (SIGCHLD). If the adoption notice were delivered along a
different band (netlink, file descriptor, etc.) then this reliability
would be gone. (I don't think you could absolutely guarantee that the
fd would be read()able in the child signal handler).
An option would be to implement this as a "child events file
descriptor", delivering both adoption notices *and* child death notices,
etc. This would be a much larger patch, and overlap heavily with
signalfd() anyway, which is already notified in the appropriate places.
Scott
--
Scott James Remnant
[email protected]
Allow the init daemon to be notified by signal when processes are
reparented to it. The signal has the same form as SIGCHLD except that
the si_status field contains the original parent process id.
Notification is enabled/disabled by prctl().
Signed-off-by: Scott James Remnant <[email protected]>
---
fs/exec.c | 2 ++
include/linux/prctl.h | 4 ++++
include/linux/sched.h | 2 ++
kernel/exit.c | 3 +++
kernel/signal.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 10 ++++++++++
security/commoncap.c | 1 +
security/selinux/hooks.c | 4 +++-
8 files changed, 62 insertions(+), 1 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index c5f1a92..07a8782 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1011,6 +1011,7 @@ int flush_old_exec(struct linux_binprm * bprm)
suid_keys(current);
set_dumpable(current->mm, suid_dumpable);
current->pdeath_signal = 0;
+ current->adopt_signal = 0;
} else if (file_permission(bprm->file, MAY_READ) ||
(bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) {
suid_keys(current);
@@ -1099,6 +1100,7 @@ void compute_creds(struct linux_binprm *bprm)
if (bprm->e_uid != current->uid) {
suid_keys(current);
current->pdeath_signal = 0;
+ current->adopt_signal = 0;
}
exec_keys(current);
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 48d887e..1fa1b75 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -85,4 +85,8 @@
#define PR_SET_TIMERSLACK 29
#define PR_GET_TIMERSLACK 30
+/* Set/get notification of adoption by signal */
+#define PR_SET_ADOPTSIG 31 /* Second arg is a signal */
+#define PR_GET_ADOPTSIG 32 /* Second arg is a ptr to return the signal */
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..bcd2af3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1133,6 +1133,7 @@ struct task_struct {
int exit_state;
int exit_code, exit_signal;
int pdeath_signal; /* The signal sent when the parent dies */
+ int adopt_signal; /* The signal sent when a process is reparented */
/* ??? */
unsigned int personality;
unsigned did_exec:1;
@@ -1829,6 +1830,7 @@ extern int kill_pgrp(struct pid *pid, int sig, int priv);
extern int kill_pid(struct pid *pid, int sig, int priv);
extern int kill_proc_info(int, struct siginfo *, pid_t);
extern int do_notify_parent(struct task_struct *, int);
+extern void do_notify_parent_adopted(struct task_struct *, struct task_struct *);
extern void force_sig(int, struct task_struct *);
extern void force_sig_specific(int, struct task_struct *);
extern int send_sig(int, struct task_struct *, int);
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d8be7e..813a232 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -813,6 +813,9 @@ static void reparent_thread(struct task_struct *p, struct task_struct *father)
/* We already hold the tasklist_lock here. */
group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
+ if (p->real_parent->adopt_signal)
+ do_notify_parent_adopted(p, father);
+
list_move_tail(&p->sibling, &p->real_parent->children);
/* If this is a threaded reparent there is no need to
diff --git a/kernel/signal.c b/kernel/signal.c
index 4530fc6..40228e2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1474,6 +1474,43 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
spin_unlock_irqrestore(&sighand->siglock, flags);
}
+/* Let init know that it has adopted a new child */
+void do_notify_parent_adopted(struct task_struct *tsk, struct task_struct *father)
+{
+ struct siginfo info;
+ unsigned long flags;
+ struct task_struct *reaper;
+ struct sighand_struct *sighand;
+ int ret;
+
+ reaper = tsk->real_parent;
+
+ memset (&info, 0, sizeof info);
+ info.si_signo = reaper->adopt_signal;
+ /*
+ * set code to the same range as SIGCHLD so the right bits of
+ * siginfo_t get copied, to userspace this will appear as si_code=0
+ */
+ info.si_code = __SI_CHLD;
+ /*
+ * see comment in do_notify_parent() about the following 4 lines
+ */
+ rcu_read_lock();
+ info.si_pid = task_pid_nr_ns(tsk, reaper->nsproxy->pid_ns);
+ info.si_status = task_pid_nr_ns(father, reaper->nsproxy->pid_ns);
+ rcu_read_unlock();
+
+ info.si_uid = tsk->uid;
+
+ info.si_utime = cputime_to_clock_t(tsk->utime);
+ info.si_stime = cputime_to_clock_t(tsk->stime);
+
+ sighand = reaper->sighand;
+ spin_lock_irqsave(&sighand->siglock, flags);
+ __group_send_sig_info(reaper->adopt_signal, &info, reaper);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+}
+
static inline int may_ptrace_stop(void)
{
if (!likely(current->ptrace & PT_PTRACED))
diff --git a/kernel/sys.c b/kernel/sys.c
index 31deba8..1720053 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1726,6 +1726,16 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
else
current->timer_slack_ns = arg2;
break;
+ case PR_SET_ADOPTSIG:
+ if (!valid_signal(arg2)) {
+ error = -EINVAL;
+ break;
+ }
+ current->adopt_signal = arg2;
+ break;
+ case PR_GET_ADOPTSIG:
+ error = put_user(current->adopt_signal, (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/security/commoncap.c b/security/commoncap.c
index 6cbec11..a2da3ab 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -365,6 +365,7 @@ void cap_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
current->cap_permitted)) {
set_dumpable(current->mm, suid_dumpable);
current->pdeath_signal = 0;
+ current->adopt_signal = 0;
if (unsafe & ~LSM_UNSAFE_PTRACE_CAP) {
if (!capable(CAP_SETUID)) {
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 75777cb..8f089c8 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2280,8 +2280,10 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
spin_unlock_irq(&current->sighand->siglock);
}
- /* Always clear parent death signal on SID transitions. */
+ /* Always clear parent death signal and adoption notification
+ * on SID transitions. */
current->pdeath_signal = 0;
+ current->adopt_signal = 0;
/* Check whether the new SID can inherit resource limits
from the old SID. If not, reset all soft limits to
--
1.6.0.5
On 12/27, Scott James Remnant wrote:
>
> Allow the init daemon to be notified by signal when processes are
> reparented to it. The signal has the same form as SIGCHLD except that
> the si_status field contains the original parent process id.
I think the changelog should explain why this change is useful.
I can't judge the usefulness, just a couple of nits about the
code.
> +++ b/include/linux/sched.h
> @@ -1133,6 +1133,7 @@ struct task_struct {
> int exit_state;
> int exit_code, exit_signal;
> int pdeath_signal; /* The signal sent when the parent dies */
> + int adopt_signal; /* The signal sent when a process is reparented */
Should be cleared in copy_process() ?
And what if init is multithreaded? This should be per-process, not
per-thread.
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -813,6 +813,9 @@ static void reparent_thread(struct task_struct *p, struct task_struct *father)
> /* We already hold the tasklist_lock here. */
> group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
>
> + if (p->real_parent->adopt_signal)
> + do_notify_parent_adopted(p, father);
> +
Please note that we notify the new parent even if we re-parent to our
sub-thread, not to the new process. This means that a multithreaded
init can have the false notifications.
And, given that ->adopt_signal is per-thread, a multithreaded init
should be very careful with prctl(PR_SET_ADOPTSIG) if it does not
want to miss the notification.
> +void do_notify_parent_adopted(struct task_struct *tsk, struct task_struct *father)
> +{
> + struct siginfo info;
> + unsigned long flags;
> + struct task_struct *reaper;
> + struct sighand_struct *sighand;
> + int ret;
> +
> + reaper = tsk->real_parent;
> +
> + memset (&info, 0, sizeof info);
Unneeded? This function seems to fill all interesting fields.
> + info.si_signo = reaper->adopt_signal;
> + /*
> + * set code to the same range as SIGCHLD so the right bits of
> + * siginfo_t get copied, to userspace this will appear as si_code=0
> + */
> + info.si_code = __SI_CHLD;
> + /*
> + * see comment in do_notify_parent() about the following 4 lines
> + */
> + rcu_read_lock();
> + info.si_pid = task_pid_nr_ns(tsk, reaper->nsproxy->pid_ns);
> + info.si_status = task_pid_nr_ns(father, reaper->nsproxy->pid_ns);
perhaps it is better to use task_active_pid_ns(reaper).
> + rcu_read_unlock();
> +
> + info.si_uid = tsk->uid;
> +
> + info.si_utime = cputime_to_clock_t(tsk->utime);
> + info.si_stime = cputime_to_clock_t(tsk->stime);
> +
> + sighand = reaper->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + __group_send_sig_info(reaper->adopt_signal, &info, reaper);
This looks unsafe.
At this point reaper can already change its ->adopt_signal. For example
if it was cleared, send_signal() can crash while doing (say) sigismember().
Oleg.
I'm highly skeptical that this is a desirable feature at all, and
certainly I find the abuse of siginfo_t.si_status here extremely
questionable. I think we need a clear explanation of what problems
the feature is intended to address.
Thanks,
Roland
On Sun, 2008-12-28 at 14:01 -0800, Roland McGrath wrote:
> I'm highly skeptical that this is a desirable feature at all, and
> certainly I find the abuse of siginfo_t.si_status here extremely
> questionable. I think we need a clear explanation of what problems
> the feature is intended to address.
>
Did the original e-mail not address this?
Scott
On 12/29, Scott James Remnant wrote:
>
> On Sun, 2008-12-28 at 14:01 -0800, Roland McGrath wrote:
>
> > I'm highly skeptical that this is a desirable feature at all, and
> > certainly I find the abuse of siginfo_t.si_status here extremely
> > questionable. I think we need a clear explanation of what problems
> > the feature is intended to address.
> >
> Did the original e-mail not address this?
Do you mean
[RFC] Notify init when processes are reparented to it
http://marc.info/?l=linux-kernel&m=123038049428388
?
I am not sure I really understand the problem. And thus I can't
understand how this patch can help.
OK,
> We want to be able to supervise daemons.
What do you mean?
> Later on, 1002 will die and init will receive SIGCHLD for it.
>
> Unfortunately neither the 1001 or 1002 processes are known to init, even
> though they are original children of the process it spawned (1000), for
> init to be notified about them - this has been forgotten.
Ok, with this patch /sbin/init knows that 1002 is a descendant
of apache(1000) which was spawned by init. What can init do
with this info?
To clarify, I am not arguing, I am just trying to understand.
Oleg.
On Mon, 2008-12-29 at 14:23 +0100, Oleg Nesterov wrote:
> On 12/29, Scott James Remnant wrote:
> >
> > On Sun, 2008-12-28 at 14:01 -0800, Roland McGrath wrote:
> >
> > > I'm highly skeptical that this is a desirable feature at all, and
> > > certainly I find the abuse of siginfo_t.si_status here extremely
> > > questionable. I think we need a clear explanation of what problems
> > > the feature is intended to address.
> > >
> > Did the original e-mail not address this?
>
> Do you mean
>
> [RFC] Notify init when processes are reparented to it
> http://marc.info/?l=linux-kernel&m=123038049428388
>
> ?
>
> I am not sure I really understand the problem. And thus I can't
> understand how this patch can help.
>
No problem ;) like anything, it's only ever perfectly clear to the guy
who wrote it - and everyone else wonders what he's going on about :p
> > We want to be able to supervise daemons.
>
> What do you mean?
>
> > Later on, 1002 will die and init will receive SIGCHLD for it.
> >
> > Unfortunately neither the 1001 or 1002 processes are known to init, even
> > though they are original children of the process it spawned (1000), for
> > init to be notified about them - this has been forgotten.
>
> Ok, with this patch /sbin/init knows that 1002 is a descendant
> of apache(1000) which was spawned by init. What can init do
> with this info?
>
Fundamentally init would now know that the apache service terminated,
and with what exit code or by what signal.
Right now, all we know is that a process terminated (and why) - we can't
link that back to a service in any kind of foolproof manner.
With the ability to do that, when the apache service dies, we can log
that in a more useful manner (including marking the service as down) -
but most importantly, we can respawn it!
This is something we can't do with processes that daemonise right now.
Scott
On 12/29, Scott James Remnant wrote:
>
> On Mon, 2008-12-29 at 14:23 +0100, Oleg Nesterov wrote:
>
> > > We want to be able to supervise daemons.
> >
> > What do you mean?
> >
> > > Later on, 1002 will die and init will receive SIGCHLD for it.
> > >
> > > Unfortunately neither the 1001 or 1002 processes are known to init, even
> > > though they are original children of the process it spawned (1000), for
> > > init to be notified about them - this has been forgotten.
> >
> > Ok, with this patch /sbin/init knows that 1002 is a descendant
> > of apache(1000) which was spawned by init. What can init do
> > with this info?
> >
> Fundamentally init would now know that the apache service terminated,
> and with what exit code or by what signal.
>
> Right now, all we know is that a process terminated (and why) - we can't
> link that back to a service in any kind of foolproof manner.
>
> With the ability to do that, when the apache service dies, we can log
> that in a more useful manner (including marking the service as down) -
> but most importantly, we can respawn it!
Aha, thanks, I suspected something like this.
But how can we know (in general) that the service has died? We only
know that the process has exited.
IOW, when apache(1001) exits, we don't respawn. How can init know that
the death of apache(1002) means "this is the real exit of service, the
previous exits were due to initialization stage".
The only answer I can see is: because init can figure out that all
descendants of initially spawned apache(1000) have exited. But this
doesn't look very flexible/reliable to me.
And we already have CONFIG_PROC_EVENTS, init can monitor PROC_EVENT_FORK
events, so it can do this without ->adopt_signal ?
Oleg.
On Mon, 29 Dec 2008, Oleg Nesterov wrote:
> On 12/29, Scott James Remnant wrote:
>>
>> On Mon, 2008-12-29 at 14:23 +0100, Oleg Nesterov wrote:
>>
>>>> We want to be able to supervise daemons.
>>>
>>> What do you mean?
>>>
>>>> Later on, 1002 will die and init will receive SIGCHLD for it.
>>>>
>>>> Unfortunately neither the 1001 or 1002 processes are known to init, even
>>>> though they are original children of the process it spawned (1000), for
>>>> init to be notified about them - this has been forgotten.
>>>
>>> Ok, with this patch /sbin/init knows that 1002 is a descendant
>>> of apache(1000) which was spawned by init. What can init do
>>> with this info?
>>>
>> Fundamentally init would now know that the apache service terminated,
>> and with what exit code or by what signal.
>>
>> Right now, all we know is that a process terminated (and why) - we can't
>> link that back to a service in any kind of foolproof manner.
>>
>> With the ability to do that, when the apache service dies, we can log
>> that in a more useful manner (including marking the service as down) -
>> but most importantly, we can respawn it!
>
> Aha, thanks, I suspected something like this.
>
> But how can we know (in general) that the service has died? We only
> know that the process has exited.
>
> IOW. when apache(1001) exits, we don't respawn. How can init know that
> the death of apache(1002) means "this is the real exit of service, the
> previous exits were due to initialization stage".
>
> The only answer I can see is: because init can figure out that all
> descendants of initially spawned apache(1000) have exited. But this
> doesn't look very flexible/reliable to me.
there are a number of options here.
init can look for error exit codes and respawn things that died with
errors.
init can be taught what normal behavior is for different daemons (some
sort of configuration options)
init can look for cases where all children have exited.
init could do monitoring of applications, and if an application is deemed
'sick' can make sure that it kills all processes associated with that
application before trying to respawn it. (this is much more than what init
has done historically, and may not be a good idea for the general case, but
it is still a possibility that will make sense in some cases)
I'm sure that there are other things that can be done with such a
mechanism, this is just what I can think of off the top of my head
> And we already have CONFIG_PROC_EVENTS, init can monitor PROC_EVENT_FORK
> events, so it can do this without ->adopt_signal ?
wouldn't that require init to pay attention to every fork in the system to
find the ones that it cares about?
also, an earlier post gave one reason for wanting to use signals being to
eliminate race conditions.
David Lang