Hi!
This code is absolutely, completely broken. I guess this is because it
wasn't updated when linux got threads.
I don't really understand this magic, and I don't know who maintains it
Please review.
Oleg.
> This code is absolutely, completely broken. I guess this is because it
> wasn't updated when linux got threads.
>
> I don't really understand this magic, and I don't know who maintains it
> Please review.
Noone has ever been officially designated as maintainer for this stuff.
Several people have always wanted to keep their fingers in this core code.
I understand what the job control semantics are supposed to be, if that's
the magic you mean. The implementation of this particular part is from
before my time and I won't claim at the outset to know why it's the way it
is. I began working on the thread (NPTL) semantics in the kernel after
pieces of the implementation had been around for a while. I fixed various
things I knew were wrong, and things I came across. But I didn't scour all
the corners of the existing code for related problems. (The prevailing
tendencies now are rather more receptive to perturbations of the existing
code and substantial cleanups than they were at that time.)
I wonder if at one time the pid/pgrp hashing magic entrained each thread on
a pgrp-members list, and so that code did work. (I think it must have,
since I do recall that various signals changes I made long ago had to do
with a wide variety of MT cases and ^Z.)
These patches all look like steps in the right direction.
About is_global_init, this is mechanically changed from an old ->pid == 1
check that was there from the dawn of time. Not since shortly thereafter
has anyone thought about what it's really for. So let's figure it out now.
The basic rule for an orphaned pgrp is that every member's parent is either
also a member of the pgrp or is not in the same session. In a normal
system, no other processes are actually in init's session at all. So if
init's your parent, your parent isn't in the same session. That makes it
redundant to exclude init, since it never qualifies as a potential
"controller" (not having a controller is what makes a pgrp an orphan).
There are lots of special magical corners checking for init and pid 1, but
AFAIK the fact that noone is in init's session is just because init calls
setsid whenever it forks. So imagine you were in init's session (SID 1).
The idea is you shouldn't stop (and should get SIGHUP's) when your original
controller died and you got reparented to init. So the intent for this
check is that no process be considered to be in the same session as init.
What seems most pedantically right for this check is:
is_my_init(p, p->real_parent)
i.e.
if (task_session(p->real_parent) == task_session(p) &&
task_pgrp(p->real_parent) != pgrp &&
task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
return 0;
It's excluded from counting as in your session if you consider to to be init.
Now let's consider the other problems you've cited. This is used in two
paths: the exit path (kill_orphaned_pgrp), and is_current_pgrp_orphaned.
The is_current_pgrp_orphaned path is called in two places. The tty layer
calls it to decide whether to suppress generating a stop signal. The
signals code calls it after dequeuing a stop signal to decide whether to
discard it rather than stop. In races, these paths can stand a false
negative from this check given some specific ordering constraints.
When we get as far as dequeuing a stop signal, this means
kill_orphaned_pgrp hasn't sent SIGCONT yet. We know since posting a
SIGCONT would have cleared all stop signals from the queues. When we
release the siglock to call is_current_pgrp_orphaned, no SIGCONT has come
yet, and SIGNAL_STOP_DEQUEUED is set in signal_struct.flags. If that
orphaned check said negative, we retake the siglock to do the stop. If a
SIGCONT came along, it cleared SIGNAL_STOP_DEQUEUED and so we don't stop.
If we do stop, then the SIGCONT is coming later and will wake us. So here
we can stand any false negative from is_current_pgrp_orphaned.
It's ok in the tty syscalls to post the signal and return -ERESTARTSYS
while the pgrp becomes orphaned. If the tty code's kill_pgrp comes before
the SIGHUP+SIGCONT, then the SIGCONT generation will clear the pending stop
signal as if it never happened. The syscall will restart and then the
orphaned-pgrp check will fire. If the kill_pgrp comes after the
SIGHUP+SIGCONT, then the stop signal generation will clear that pending
SIGCONT. Ideally we would not permit this ordering, and by the letter of
POSIX it would seem to require that e.g. a SIGCONT handler run as well as
the SIGHUP handler in a just-orphaned process that catches both (and
doesn't exit before seeing all signals). But in practical terms it doesn't
hurt because that missed SIGCONT handler is the only problem. When we
dequeue that stop signal, since it's after that SIGHUP+SIGCONT it's by
definition after when is_current_pgrp_orphaned starts reporting positive,
and so we won't stop.
In the main exit_notify path, we are considering whether our own
departure from the pgrp caused it to become an orphan. In this call,
our group_leader is excluded from consideration, so there is no
concern about any thread in our own group contributing to a false
negative in will_become_orphaned_pgrp. The write lock on
tasklist_lock strictly serializes all exiting process's calls from
exit_notify. This call is after group_dead hits. If some other
process in the pgrp has ->signal->live > 0 then it has not exited yet
and when it does it will do this same check, guaranteed to be after
ours, and after our ->signal->live == 0. Since it's after our own
group_dead hit, the "ignored_task" check for our own group leader is
redundant with that. So perhaps it's:
do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
if (task_session(p->real_parent) == task_session(p) &&
task_pgrp(p->real_parent) != pgrp &&
atomic_read(&p->signal->live) > 0 &&
task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
return 0;
} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
In the reparent_thread path, we are considering for one of our children
whether our death caused the child's pgrp (not ours) to become an orphan.
The same logic above applies about the ordering given the signal->live check.
Importantly all that happens with tasklist_lock write-locked. This orders
any is_current_pgrp_orphaned strictly either before it--so the SIGCONT
will come after the stop and wake it, or after it--so it too will consider
the pgrp orphaned and never try to stop.
Please check me, but I think this covers any race issues in
will_become_orphaned_pgrp.
Now, about has_stopped_jobs. It's a waste for it to be a separate loop
across the pgrp just after we did one in will_become_orphaned_pgrp. They
should be merged together.
static int check_orphaned_pgrp(struct pid *pgrp, int ignore_stopped)
{
do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
if (task_session(p->real_parent) == task_session(p) &&
task_pgrp(p->real_parent) != pgrp &&
atomic_read(&p->signal->live) > 0 &&
task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
return 0;
if (!ignore_stopped)
ignore_stopped = group_is_really_stopped(p);
} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
return ignore_stopped;
}
The ordering that matters is that if we decided the pgrp was orphaned and
had no stopped jobs, then no process in the pgrp should stop later.
I'm not sure there is a way to make the stopped check robust without
taking the siglock. Then the only hole I see is between releasing the
tasklist_lock in is_current_pgrp_orphaned and reacquiring the siglock
(after a negative check). In that interval, we want to consider that
process to be a "stopped job" (because it shortly will be). For that:
static int group_is_really_stopped(struct task_struct *p)
{
int ret = 0;
spin_lock(&p->sighand->siglock);
ret = (p->signal->flags & (SIGNAL_STOP_STOPPING|SIGNAL_STOP_STOPPED)) ||
p->signal->group_stop_count > 0;
spin_unlock(&p->sighand->siglock);
return ret;
}
along with:
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1802,6 +1802,7 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
relock:
spin_lock_irq(¤t->sighand->siglock);
+ current->signal->flags &= ~SIGNAL_STOP_STOPPING;
for (;;) {
struct k_sigaction *ka;
@@ -1883,6 +1884,7 @@ relock:
* We need to check for that and bail out if necessary.
*/
if (signr != SIGSTOP) {
+ current->signal->flags |= SIGNAL_STOP_STOPPING;
spin_unlock_irq(¤t->sighand->siglock);
/* signals can be posted during this window */
@@ -1891,6 +1893,7 @@ relock:
goto relock;
spin_lock_irq(¤t->sighand->siglock);
+ current->signal->flags &= ~SIGNAL_STOP_STOPPING;
}
if (likely(do_signal_stop(signr))) {
What do you think?
Thanks,
Roland
On 03/04, Roland McGrath wrote:
>
> [... big snip...]
Thanks Roland. I need a time to understand your explanation (I even printed
it to read in metro ;). I'll return tomorrow.
A quick note right now:
> static int group_is_really_stopped(struct task_struct *p)
> {
> int ret = 0;
> spin_lock(&p->sighand->siglock);
> ret = (p->signal->flags & (SIGNAL_STOP_STOPPING|SIGNAL_STOP_STOPPED)) ||
> p->signal->group_stop_count > 0;
> spin_unlock(&p->sighand->siglock);
> return ret;
> }
I thought abot something like that too.
However, SIGNAL_STOP_STOPPED doesn't relaible with ptrace. I mean, ptracer
doesn't clear SIGNAL_STOP_STOPPED. This is btw one of the problems which
complicates fixing do_wait(WSTOPPED).
As for SIGNAL_STOP_STOPPING... I am dreaming to find the way to eliminate
this lock-drop in get_signal_to_deliver(). Not sure this is possible, but
it is so nasty. For example, SIGNAL_STOP_DEQUEUED is racy, and I don't know
how to fix this. The patch we discussed some time ago doesn't really work
because dequeue_signal() drops the lock too. The latter is fixable afaics,
but needs very ugly changes.
Oleg.
> However, SIGNAL_STOP_STOPPED doesn't relaible with ptrace. I mean, ptracer
> doesn't clear SIGNAL_STOP_STOPPED. This is btw one of the problems which
> complicates fixing do_wait(WSTOPPED).
We can clean that up too if you'd like to bring up the details.
> As for SIGNAL_STOP_STOPPING... I am dreaming to find the way to eliminate
> this lock-drop in get_signal_to_deliver(). Not sure this is possible, but
> it is so nasty. For example, SIGNAL_STOP_DEQUEUED is racy, and I don't know
> how to fix this. The patch we discussed some time ago doesn't really work
> because dequeue_signal() drops the lock too. The latter is fixable afaics,
> but needs very ugly changes.
It all certainly deserves more careful thought.
Thanks,
Roland
On 03/04, Roland McGrath wrote:
>
> > However, SIGNAL_STOP_STOPPED doesn't relaible with ptrace. I mean, ptracer
> > doesn't clear SIGNAL_STOP_STOPPED. This is btw one of the problems which
> > complicates fixing do_wait(WSTOPPED).
>
> We can clean that up too if you'd like to bring up the details.
Sorry! I was unclear.
Multiple reasons. Just for example,
1:~$ strace -o /dev/null perl -e 'kill SIGSTOP,$$; for (;;) {}' &
[1] 6714
1:~$ kill -9 $!
[1]+ Killed strace -o /dev/null perl -e 'kill SIGSTOP,$$; for (;;) {}'
1:~$ grep State: /proc/`pidof perl`/status
State: R (running)
but its signal->flags = SIGNAL_STOP_STOPPED.
Please note that unless SIGCONT comes it will exit() with SIGNAL_STOP_STOPPED
set too, this complicates fixing do_wait(WSTOPPED). We can shadow this problem
though, do_wait() can first check TASK_ZOMBIE, and only then SIGNAL_STOP_STOPPED.
I am going to try to cook some patches soon.
> > As for SIGNAL_STOP_STOPPING... I am dreaming to find the way to eliminate
> > this lock-drop in get_signal_to_deliver(). Not sure this is possible, but
> > it is so nasty. For example, SIGNAL_STOP_DEQUEUED is racy, and I don't know
> > how to fix this. The patch we discussed some time ago doesn't really work
> > because dequeue_signal() drops the lock too. The latter is fixable afaics,
> > but needs very ugly changes.
>
> It all certainly deserves more careful thought.
Yes. Somehow the conception of orphaned pgrp doesn't fit in my brain, but I'd
like to at least try to think about this...
Oleg.
On 03/04, Roland McGrath wrote:
>
> What seems most pedantically right for this check is:
> is_my_init(p, p->real_parent)
> i.e.
> if (task_session(p->real_parent) == task_session(p) &&
> task_pgrp(p->real_parent) != pgrp &&
> task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
> return 0;
> It's excluded from counting as in your session if you consider to to be init.
Yes. Not that I really understand ;) But Eric also suggested to use
is_container_init().
> The write lock on
> tasklist_lock strictly serializes all exiting process's calls from
> exit_notify. This call is after group_dead hits. If some other
> process in the pgrp has ->signal->live > 0 then it has not exited yet
> and when it does it will do this same check, guaranteed to be after
> ours, and after our ->signal->live == 0.
Yes. I though about that too. But, unlike ->exit_state, signal->live
is not protected by tasklist, please see below.
> Since it's after our own
> group_dead hit, the "ignored_task" check for our own group leader is
> redundant with that.
Ah, good point. I didn't realize this when I was thinking about using
signal->live.
So perhaps it's:
>
> do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
> if (task_session(p->real_parent) == task_session(p) &&
> task_pgrp(p->real_parent) != pgrp &&
> atomic_read(&p->signal->live) > 0 &&
> task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
> return 0;
> } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
I am hopeless, I can't understand orphaned pgrps.
But still. Let's suppose that pgrp should be considered as orphaned when
2 tasks A and B exit. They both exit at the same time and decrement ->live
down to zero. Now, they both can send SIGHUP to the stopped tasks.
No?
That said. Even if I am right, I agree that the signal->live check is
better. I like very much the possibility to kill the ugly "ignored_task".
> Now, about has_stopped_jobs. It's a waste for it to be a separate loop
> across the pgrp just after we did one in will_become_orphaned_pgrp. They
> should be merged together.
Yes sure. Looking at the code I was surprised why we don't do that.
Oleg.
Oleg Nesterov <[email protected]> writes:
> On 03/04, Roland McGrath wrote:
>> Since it's after our own
>> group_dead hit, the "ignored_task" check for our own group leader is
>> redundant with that.
>
> Ah, good point. I didn't realize this when I was thinking about using
> signal->live.
>
> So perhaps it's:
>>
>> do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
>> if (task_session(p->real_parent) == task_session(p) &&
>> task_pgrp(p->real_parent) != pgrp &&
>> atomic_read(&p->signal->live) > 0 &&
>> task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
>> return 0;
>> } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
>
> I am hopeless, I can't understand orphaned pgrps.
I will give it a quick try.
When you login in text mode you get a fresh session (setsid).
If you are using job control in your shell each job is assigned
a separate process group (setpgrp).
The shell and all process groups are in the same session.
Intuitively a process group is considered orphaned when there is
are no processes in the session that know about it and can wake it
up. The goal is to prevent processes that will never wake up if
they are stopped with ^Z.
A process is defined as knowing about a process in a process group
when it is a parent of that process.
The task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) == 1 check is
the proper check, as it handles namespaces properly. If we need to
retain it.
I don't believe we need to retain the check for init at all. sysvinit
doesn't use that feature and I would be surprised if any other init
did. Except for covering programming bugs which make testing harder
I don't see how a version of init could care.
Further as init=/bin/bash is a common idiom for getting into a
simplified system and debugging it, there is a case for job control
to work properly for init. Unless I am misreading things the check
for init prevent us from using job control from our first process.
Which seems like it would make init=/bin/bash painful if job control
was ever enabled.
I believe that the only reason with a weird check for init like we are
performing that we are POSIX compliant is that our init process can
count as a special system process and can escape the definition.
Therefore I think the code would be more maintainable, and the system
would be less surprising, and more useful if we removed this special
case processing of init altogether.
I'm hoping that we can kill this check before pid namespaces are
widely deployed and a much larger number of programs can run as init.
Eric
On 03/05, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > I am hopeless, I can't understand orphaned pgrps.
>
> I will give it a quick try.
> When you login in text mode you get a fresh session (setsid).
> If you are using job control in your shell each job is assigned
> a separate process group (setpgrp).
>
> The shell and all process groups are in the same session.
>
> Intuitively a process group is considered orphaned when there is
> are no processes in the session that know about it and can wake it
> up. The goal is to prevent processes that will never wake up if
> they are stopped with ^Z.
>
> A process is defined as knowing about a process in a process group
> when it is a parent of that process.
Thanks Eric. This does help.
Stupid question, just to be sure. Suppose that SIGTSTP was sent by a
"regular" kill_pid_info() (not by tty or kill_pgrp_info). In that case
get_signal_to_deliver() still must check is_current_pgrp_orphaned(),
yes?
> The task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) == 1 check is
> the proper check, as it handles namespaces properly. If we need to
> retain it.
>
> I don't believe we need to retain the check for init at all. sysvinit
> doesn't use that feature and I would be surprised if any other init
> did. Except for covering programming bugs which make testing harder
> I don't see how a version of init could care.
>
> Further as init=/bin/bash is a common idiom for getting into a
> simplified system and debugging it, there is a case for job control
> to work properly for init. Unless I am misreading things the check
> for init prevent us from using job control from our first process.
> Which seems like it would make init=/bin/bash painful if job control
> was ever enabled.
>
> I believe that the only reason with a weird check for init like we are
> performing that we are POSIX compliant is that our init process can
> count as a special system process and can escape the definition.
>
> Therefore I think the code would be more maintainable, and the system
> would be less surprising, and more useful if we removed this special
> case processing of init altogether.
Looks like a nice changelog for the patch ;)
Oleg.
> > It's excluded from counting as in your session if you consider to to be init.
>
> Yes. Not that I really understand ;) But Eric also suggested to use
> is_container_init().
That uses a slightly different test, which I specifically did not chose.
is_container_init says "your parent thinks its PID is 1". The test I
posted says "you think your parent's PID is 1". (I'm being anal again.)
It's an arcane distinction. But it seems like the correct mapping of the
original pre-pid_ns logic.
I agree with Eric that the init special case ought to just go away.
I'd suggest we do this alone first by removing the existing
is_global_init(p) check in a tiny commit before we get into the rest
of the cleanup. If someone's init ever has a problem, they can bisect
it back to that and we can discuss what they think it should be doing.
> So perhaps it's:
> >
> > do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
> > if (task_session(p->real_parent) == task_session(p) &&
> > task_pgrp(p->real_parent) != pgrp &&
> > atomic_read(&p->signal->live) > 0 &&
> > task_tgid_nr_ns(p->real_parent, p->nsproxy->pid_ns) != 1)
> > return 0;
> > } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
>
> I am hopeless, I can't understand orphaned pgrps.
Eric explained it pretty well. The idea is a pgrp is an orphan if it has
no "controller". That term is used only a couple of times in POSIX. In
real examples, the "controller" is always your shell. If the shell is
gone, stopped jobs could linger around forever. Consider:
$ foo
^Z
[1]+ Stopped foo
$ kill -9 $$
So foo gets the SIGHUP+SIGCONT to make it likely to die when it should.
In practice, you use "logout" not "kill -9 $$", and I suspect nowadays
the shell kills the jobs for you whenever it exits gracefully. But this
is how the behavior arose.
Even more likely, consider:
$ stty tostop
...
$ trap '' SIGHUP # signal(SIGHUP, SIG_IGN)
$ foo &
$ logout
You wanted foo to keep running when you were gone, and forgot about doing
stty tostop. Now it does some output, but the tty is someone else's, so
it gets SIGTTOU. If this made it stop, it would be stopped until morning.
That's why it ignores such stops. (This is why the nohup command
redirects stdout if it's a tty.)
Nowadays there are all manner of things that make these scenarios far less
likely. e.g. revocation of ttys, the fancy nohup command, etc. But POSIX
specifies only a slight refinement (with the "session" idea) from what was
the original status quo of job control in 4.2/4.3 BSD.
> But still. Let's suppose that pgrp should be considered as orphaned when
> 2 tasks A and B exit. They both exit at the same time and decrement ->live
> down to zero. Now, they both can send SIGHUP to the stopped tasks.
>
> No?
Yes, I think. What we want is that the act of orphaning a given pgrp
happens only once. It's that act, the transition from "controlled"
pgrp to "orphaned pgrp" that should generate the signals. Currently
what we really check is whether it is an orphaned pgrp right now, just
after removing the currently-exiting task from the pgrp. That is not
a transition that happens once, it's a state that prevails once the
transition has occurred.
We need to think about this more. I just spent a while hurting my brain.
I didn't come up with anything clever.
> Stupid question, just to be sure. Suppose that SIGTSTP was sent by a
> "regular" kill_pid_info() (not by tty or kill_pgrp_info). In that case
> get_signal_to_deliver() still must check is_current_pgrp_orphaned(),
> yes?
Yes. It is specifically said that SIGTSTP, SIGTTIN, SIGTTOU may not
cause a stop in a member of an orphaned pgrp. It doesn't matter how
it came to be that the signal got delivered.
Thanks,
Roland