All,
I am posting these patches in the hope of getting some review of the
strategy I am taking, and to let the individual patches be reviewed.
The intersection of wait, exit, ptrace, exec, and CLONE_THREAD does not
work correctly in the Linux kernel.
An easy entry point into seeing the problem is that exec can wait
forever holding a mutex while waiting for userspace to call waitpid on a
ptraced thread. This is bad enough that it has been observed to cause
deadlocks in real-world situations today.
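As an illustration, here is a rough, hedged sketch of that deadlock
(not part of this series; it assumes kernel.yama.ptrace_scope permits
the child-to-parent attach, i.e. ptrace_scope=0):

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t thread_tid;

static void *traced_thread(void *arg)
{
        thread_tid = syscall(SYS_gettid);
        for (;;)                /* stays alive until the exec SIGKILLs it */
                pause();
        return NULL;
}

int main(void)
{
        pthread_t thread;

        pthread_create(&thread, NULL, traced_thread, NULL);
        sleep(1);                       /* let thread_tid be published */

        if (fork() == 0) {              /* the tracer */
                ptrace(PTRACE_SEIZE, thread_tid, NULL, NULL);
                for (;;)                /* attached, but never calls waitpid */
                        pause();
        }

        sleep(1);                       /* let the tracer attach */
        /* The exec SIGKILLs the traced thread, which becomes a zombie
         * that only the tracer's waitpid can release; de_thread() then
         * waits forever, holding the mutex. */
        execl("/bin/true", "true", (char *)NULL);
        return 0;
}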
Another easy entry point is to see that a multi-threaded setuid won't
change the credentials on a zombie thread group leader, which can allow
sending signals to a process that the credential change should forbid.
This is in violation of POSIX and the semantics we attempt to enforce in
Linux.
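A hedged sketch of that scenario as well (not a guaranteed reproducer;
it assumes this is built as a setuid-root binary and started by uid
1000):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
        sleep(1);       /* let the leader become a zombie first */
        setuid(0);      /* every *living* thread now has uid 0 */
        printf("as uid 1000 try: kill -TERM %d (should be EPERM)\n",
               (int)getpid());
        for (;;)        /* pre-fix, the leader's stale ruid 1000
                         * credentials can still permit the kill */
                pause();
        return NULL;
}

int main(void)
{
        pthread_t thread;

        pthread_create(&thread, NULL, worker, NULL);
        pthread_exit(NULL);     /* the leader becomes a zombie,
                                 * keeping ruid 1000 */
}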
I have been silent the last couple of weeks reading up on the history
and seeing if I could come up with a solution to the buggy semantics and
the great big piles of technical debt that exist in this area of the
kernel. It unfortunately requires understanding all of the pieces to
come up with a satisfactory long term solution to this problem. My goal
is to fix enough of the issues that further cleanups and bug fixes won't
require such comprehensive knowledge.
In the large, the solution I have found is to:
- Fix the semantics of wait.
This makes it clear what semantics the rest of the fixes need to
maintain.
- To move everything possible from release_task to do_exit.
Moving code from release_task to do_exit means that zombie
threads can no longer be used to access shared thread group state,
removing the problem of stale credentials on zombie threads.
Much of the state that is freed in release_task instead of do_exit
is freed there because originally, during the CLONE_THREAD development,
there was nowhere else to put code that freed thread group state.
My changes largely move the state freeing back to where it used to
be in 2.4 and earlier.
- To remove the concept of thread group leader (and allowing the first
thread to exit).
The concept of thread group leader keeps leading people into false
mental models of what the kernel actually does.
There is a lot of code that makes the mistake of assuming the thread
group leader is always alive. When the code doesn't make that mistake
it requires additional complexity to deal with zombie thread group
leaders, as seen in the signal code, which still doesn't handle
zombie thread group leaders correctly from a permission-checking
perspective.
- To promote tgid to a first class concept in the kernel.
This is necessary if there isn't a thread group leader, and a
significant part of the functional changes I am making.
- To only allow sending signals through living threads.
This is a subset of removing the concept of thread group leader. But
worth calling out because signal handling is the most significant
offender.
My first batch of changes includes some trivial cleanups and then the
small semantic (AKA user-visible) changes I noticed are necessary to
give userspace consistent semantics.
The full set of changes that I have come up with is at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git exit-cleanups-for-testing
Unless someone has a better suggestion I intend to collect this up as
a topic branch and merge this through my tree.
Review on the details and of the concept would be very much appreciated.
Eric W. Biederman (26):
alpha: Remove unused TASK_GROUP_LEADER
cgroup: Don't open code tasklist_empty()
signal: Do not perform permission checks when sending pdeath_signal
signal: Make group_send_sig_info static
exit: Remove the pointless clearing of SIGPENDING in __exit_signal
rlimit: Remove unnecessary grab of tasklist_lock
pidns: Improve the error handling in alloc_pid
exit: Make the runqueue rcu safe
signal: Don't allow sending SIGKILL or SIGSTOP to init
ptrace: Simplify ptrace_detach & exit_ptrace
wait: Properly implement __WCLONE handling in the presence of exec and ptrace
wait: Directly test for the two cases where wait_task_zombie is called
wait: Remove unused delay_group_leader
wait: Move changing of ptrace from wait_consider_task into wait_task_stopped
wait: Don't delay !ptrace_reparented leaders
exit: Fix reporting a ptraced !reparented leader has exited
exit: Rework the exit states for ptracees
wait: Fix WSTOPPED on a ptraced child
wait: Simpler code for clearing notask_error in wait_consider_task
wait: Don't pass the list to wait_consider_task
wait: Optimize waitpid
exit: Fix auto-wait of ptraced children
signal: Fix SIGCONT before group stop completes.
signal: In ptrace_stop improve identical signal detection
signal: In ptrace_stop use CLD_TRAPPED in all ptrace signals
pidns: Ensure zap_pid_ns_processes always terminates
arch/alpha/kernel/asm-offsets.c | 1 -
include/linux/sched.h | 7 +-
include/linux/sched/signal.h | 29 ++--
include/linux/sched/task.h | 4 +-
include/linux/signal.h | 1 -
kernel/cgroup/cgroup.c | 2 +-
kernel/exit.c | 375 ++++++++++++++++------------------------
kernel/fork.c | 3 +-
kernel/pid.c | 8 +-
kernel/pid_namespace.c | 2 +-
kernel/ptrace.c | 44 ++---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/signal.c | 122 +++++++------
kernel/sys.c | 13 +-
15 files changed, 267 insertions(+), 348 deletions(-)
Eric
This definition is for assembly code and no assembly code uses it.
Remove the definition so that when making future changes we don't
have to worry about whether alpha assembly code uses task->group_leader.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/asm-offsets.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/alpha/kernel/asm-offsets.c b/arch/alpha/kernel/asm-offsets.c
index 6ff8886e7e22..5297a95f491d 100644
--- a/arch/alpha/kernel/asm-offsets.c
+++ b/arch/alpha/kernel/asm-offsets.c
@@ -21,7 +21,6 @@ void foo(void)
DEFINE(TASK_BLOCKED, offsetof(struct task_struct, blocked));
DEFINE(TASK_CRED, offsetof(struct task_struct, cred));
DEFINE(TASK_REAL_PARENT, offsetof(struct task_struct, real_parent));
- DEFINE(TASK_GROUP_LEADER, offsetof(struct task_struct, group_leader));
DEFINE(TASK_TGID, offsetof(struct task_struct, tgid));
BLANK();
--
2.10.1
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/cgroup/cgroup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3c9a0e1b3c9..4b7d66cf8914 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4440,7 +4440,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
/* At system boot, before all subsystems have been
* registered, no tasks have been forked, so we don't
* need to invoke fork callbacks here. */
- BUG_ON(!list_empty(&init_task.tasks));
+ BUG_ON(!tasklist_empty());
BUG_ON(online_css(css));
--
2.10.1
It has no users outside of kernel/signal.c.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/signal.h | 1 -
kernel/signal.c | 1 +
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 1f5a16620693..f1b296a5e537 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -244,7 +244,6 @@ struct pt_regs;
extern int next_signal(struct sigpending *pending, sigset_t *mask);
extern int do_send_sig_info(int sig, struct siginfo *info,
struct task_struct *p, bool group);
-extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
extern int do_sigtimedwait(const sigset_t *, siginfo_t *,
const struct timespec *);
diff --git a/kernel/signal.c b/kernel/signal.c
index ca92bcfeb322..5eff2f9f8c42 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1262,6 +1262,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
/*
* send signal info to all the members of a group
*/
+static
int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
{
int ret;
--
2.10.1
This fixes an old, old regression. When Roland switched from
sending pdeath_signal with send_sig() to send_group_sig_info()
it gained a permission check, and started taking the tasklist
lock. Roland earlier fixed the double taking of the tasklist
lock in 3f2a0d1df938 ("[PATCH] fix pdeath_signal SMP locking"),
but pdeath_signal still performs an unnecessary permission
check.
Ordinarily I would be hesitant to fix an ancient regression, but a
permission check for our parent sending to us is almost never
likely to fail (so it is unlikely anyone has noticed), and it
is stupid. It makes absolutely no sense to see if our
parent has permission to send us a signal we requested be
sent to us.
As this is more permissive there is no chance anything will break.
The information of whether our parent is living is available elsewhere
(getppid, tkill, and proc) with no special permissions, so this should
not be an information leak.
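For context, the userspace side of pdeath_signal looks something like
this (a minimal sketch); the child itself asks for the signal, which
is what makes the permission check nonsensical:

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

static void on_parent_death(int sig)
{
        write(2, "parent died\n", 12);
        _exit(0);
}

int main(void)
{
        if (fork() == 0) {      /* the child */
                signal(SIGTERM, on_parent_death);
                /* The child requests SIGTERM on parent death; checking
                 * whether the dying parent has permission to send a
                 * signal the child asked for makes no sense. */
                prctl(PR_SET_PDEATHSIG, SIGTERM);
                for (;;)
                        pause();
        }
        sleep(1);               /* parent exits, child gets SIGTERM */
        return 0;
}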
See https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/
for the bitkeeper era history that I refer to.
Fixes: da334d91ff70 ("[PATCH] linux-2.5.66-signal-cleanup.patch")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 516acdb0e0ec..8926cdd132f2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -692,8 +692,8 @@ static void forget_original_parent(struct task_struct *father,
if (likely(!t->ptrace))
t->parent = t->real_parent;
if (t->pdeath_signal)
- group_send_sig_info(t->pdeath_signal,
- SEND_SIG_NOINFO, t);
+ do_send_sig_info(t->pdeath_signal,
+ SEND_SIG_NOINFO, t, true);
}
/*
* If this is a threaded reparent there is no need to
--
2.10.1
There is no reason to take the tasklist lock here. The sighand
structure is never referenced, and tsk->signal is guaranteed
to stick around until tsk is freed. Further, update_rlimit_cpu
does not need the tasklist_lock, and the rlim_lock is used
to guarantee mutual exclusion.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/sys.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 8a94b4eabcaa..705f14b28134 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1380,13 +1380,6 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
return -EPERM;
}
- /* protect tsk->signal and tsk->sighand from disappearing */
- read_lock(&tasklist_lock);
- if (!tsk->sighand) {
- retval = -ESRCH;
- goto out;
- }
-
rlim = tsk->signal->rlim + resource;
task_lock(tsk->group_leader);
if (new_rlim) {
@@ -1425,8 +1418,7 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
new_rlim->rlim_cur != RLIM_INFINITY &&
IS_ENABLED(CONFIG_POSIX_TIMERS))
update_rlimit_cpu(tsk, new_rlim->rlim_cur);
-out:
- read_unlock(&tasklist_lock);
+
return retval;
}
--
2.10.1
Cause any failure to allocate pid 1 to permanently disable pid
allocations for the pid namespace.
Before the pid becomes pid 1, ns->last_pid and other state
remain unchanged, so it is safe to try again...
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/pid.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index fd1cde1e4576..f4fb1a84109b 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -321,10 +321,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
if (unlikely(is_child_reaper(pid))) {
- if (pid_ns_prepare_proc(ns)) {
- disable_pid_allocation(ns);
+ if (pid_ns_prepare_proc(ns))
goto out_free;
- }
}
get_pid_ns(ns);
@@ -350,6 +348,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
put_pid_ns(ns);
out_free:
+ /* Ensure everything stops if allocation of pid 1 failed */
+ if ((i < ns->level) && (pid->numbers[ns->level].nr == 1))
+ disable_pid_allocation(ns);
+
while (++i <= ns->level)
free_pidmap(pid->numbers + i);
--
2.10.1
SIGKILL and SIGSTOP are always delivered if they are sent, even to
init, so don't let tracing an init task allow them through.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 5eff2f9f8c42..627b482fa3f8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -98,8 +98,12 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
/*
* Tracers may want to know about even ignored signals.
+ * We can never safely allow SIGKILL or SIGSTOP to
+ * be sent to init from its children.
*/
- return !t->ptrace;
+ return !t->ptrace ||
+ ((t->signal->flags & SIGNAL_UNKILLABLE) &&
+ sig_kernel_only(sig) && !force);
}
/*
--
2.10.1
In ptrace_detach call __ptrace_unlink instead of __ptrace_detach, as
ptrace_detach is guaranteed never to be called on a process that has
exited or may exit.
Rename __ptrace_detach to __exit_ptrace, as exit_ptrace is now its only
caller and the corrected name is less confusing.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/ptrace.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 266ddcc1d8bb..490333db9e21 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -495,7 +495,7 @@ static int ignoring_children(struct sighand_struct *sigh)
* reap it now, in that case we must also wake up sub-threads sleeping in
* do_wait().
*/
-static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p)
+static bool __exit_ptrace(struct task_struct *tracer, struct task_struct *p)
{
bool dead;
@@ -539,7 +539,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
* the comment in ptrace_resume().
*/
child->exit_code = data;
- __ptrace_detach(current, child);
+ __ptrace_unlink(child);
write_unlock_irq(&tasklist_lock);
proc_ptrace_connector(child, PTRACE_DETACH);
@@ -559,7 +559,7 @@ void exit_ptrace(struct task_struct *tracer, struct list_head *dead)
if (unlikely(p->ptrace & PT_EXITKILL))
send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
- if (__ptrace_detach(tracer, p))
+ if (__exit_ptrace(tracer, p))
list_add(&p->ptrace_entry, dead);
}
}
--
2.10.1
Add an rcu_usage count to task_struct and use it to reuse the delayed
rcu put logic from release_task in finish_task_switch. This guarantees
that there will be an rcu interval before usage drops to zero for any
task on the run queue, making it safe to unconditionally call
get_task_struct in a rcu critical section for any task on the run
queue.
This guarantee in turn allows the fair scheduler to use ordinary rcu
primitives to access tasks on the run queue and makes the magic
functions task_rcu_dereference and try_get_task_struct completely
unnecessary.
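In other words, the guarantee makes the following reader pattern
valid (a sketch of the idea, not a hunk from this patch):

        rcu_read_lock();
        task = rcu_dereference(rq->curr);
        if (task)
                get_task_struct(task); /* usage cannot reach zero before
                                        * an rcu grace period, so this
                                        * is safe */
        rcu_read_unlock();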
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched.h | 1 +
include/linux/sched/task.h | 4 +--
kernel/exit.c | 83 ++++------------------------------------------
kernel/fork.c | 3 +-
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 2 +-
6 files changed, 12 insertions(+), 83 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc650201..461ecd20731c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -492,6 +492,7 @@ struct task_struct {
volatile long state;
void *stack;
atomic_t usage;
+ atomic_t rcu_usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
unsigned int ptrace;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a978d7189cfd..dc4a4f4c566b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,7 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
}
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
-struct task_struct *try_get_task_struct(struct task_struct **ptask);
-
+extern void rcu_put_task_struct(struct task_struct *tsk);
#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index c3de7ace243c..625e57f1bb5c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -179,6 +179,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
}
+void rcu_put_task_struct(struct task_struct *tsk)
+{
+ if (atomic_dec_and_test(&tsk->rcu_usage))
+ call_rcu(&tsk->rcu, delayed_put_task_struct);
+}
void release_task(struct task_struct *p)
{
@@ -218,76 +223,13 @@ void release_task(struct task_struct *p)
write_unlock_irq(&tasklist_lock);
release_thread(p);
- call_rcu(&p->rcu, delayed_put_task_struct);
+ rcu_put_task_struct(p);
p = leader;
if (unlikely(zap_leader))
goto repeat;
}
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
- struct sighand_struct *sighand;
- struct task_struct *task;
-
- /*
- * We need to verify that release_task() was not called and thus
- * delayed_put_task_struct() can't run and drop the last reference
- * before rcu_read_unlock(). We check task->sighand != NULL,
- * but we can read the already freed and reused memory.
- */
-retry:
- task = rcu_dereference(*ptask);
- if (!task)
- return NULL;
-
- probe_kernel_address(&task->sighand, sighand);
-
- /*
- * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
- * was already freed we can not miss the preceding update of this
- * pointer.
- */
- smp_rmb();
- if (unlikely(task != READ_ONCE(*ptask)))
- goto retry;
-
- /*
- * We've re-checked that "task == *ptask", now we have two different
- * cases:
- *
- * 1. This is actually the same task/task_struct. In this case
- * sighand != NULL tells us it is still alive.
- *
- * 2. This is another task which got the same memory for task_struct.
- * We can't know this of course, and we can not trust
- * sighand != NULL.
- *
- * In this case we actually return a random value, but this is
- * correct.
- *
- * If we return NULL - we can pretend that we actually noticed that
- * *ptask was updated when the previous task has exited. Or pretend
- * that probe_slab_address(&sighand) reads NULL.
- *
- * If we return the new task (because sighand is not NULL for any
- * reason) - this is fine too. This (new) task can't go away before
- * another gp pass.
- *
- * And note: We could even eliminate the false positive if re-read
- * task->sighand once again to avoid the falsely NULL. But this case
- * is very unlikely so we don't care.
- */
- if (!sighand)
- return NULL;
-
- return task;
-}
-
void rcuwait_wake_up(struct rcuwait *w)
{
struct task_struct *task;
@@ -317,19 +259,6 @@ void rcuwait_wake_up(struct rcuwait *w)
rcu_read_unlock();
}
-struct task_struct *try_get_task_struct(struct task_struct **ptask)
-{
- struct task_struct *task;
-
- rcu_read_lock();
- task = task_rcu_dereference(ptask);
- if (task)
- get_task_struct(task);
- rcu_read_unlock();
-
- return task;
-}
-
/*
* Determine if a process group is "orphaned", according to the POSIX
* definition in 2.2.2.52. Orphaned process groups are not to be affected
diff --git a/kernel/fork.c b/kernel/fork.c
index aa1076c5e4a9..1fe837e8c38e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -567,7 +567,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
* One for us, one for whoever does the "release_task()" (usually
* parent)
*/
- atomic_set(&tsk->usage, 2);
+ atomic_set(&tsk->rcu_usage, 2);
+ atomic_set(&tsk->usage, 1); /* For rcu_usage */
#ifdef CONFIG_BLK_DEV_IO_TRACE
tsk->btrace_seq = 0;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803c3bc274c4..1fccfd397cab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2762,7 +2762,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
/* Task is done with its stack. */
put_task_stack(prev);
- put_task_struct(prev);
+ rcu_put_task_struct(prev);
}
tick_nohz_task_switch();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d71109321841..5c0a1e1cc0f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1527,7 +1527,7 @@ static void task_numa_compare(struct task_numa_env *env,
int dist = env->dist;
rcu_read_lock();
- cur = task_rcu_dereference(&dst_rq->curr);
+ cur = rcu_dereference(dst_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
cur = NULL;
--
2.10.1
Handling of signals does not happen once do_exit is called, as a
process never again exits from the kernel, so remove this ancient
holdover.
Furthermore this is TIF_SIGPENDING on a zombie, which will never
be scheduled on a cpu again.
Setting sigpending=0 was silly in 2.4 when exit_sighand did it and was
called from do_exit. When exit_sighand started being called from
release_task the code went from silly to completely pointless.
History tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/
Fixes: 6dfc88977e42 ("[PATCH] shared thread signals")
Signed-off-by: Eric W. Biederman <[email protected]>
---
kernel/exit.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 8926cdd132f2..c3de7ace243c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -164,7 +164,6 @@ static void __exit_signal(struct task_struct *tsk)
spin_unlock(&sighand->siglock);
__cleanup_sighand(sighand);
- clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
if (group_dead) {
flush_sigqueue(&sig->shared_pending);
tty_kref_put(tty);
--
2.10.1
When a task exits and uses do_notify_parent to send tsk->exit_signal
or SIGCHLD this has one of two possible meanings.
- The ptraced task has exited
- The thread group has exited
Linux resolves this ambiguity by preferring the thread group exit
interpretation when it is possible, as the exit of a thread group
containing a task is a superset of the exit of that task.
The code attempts to implement this in its selection of signal
before calling do_notify_parent for a ptraced task. Unfortunately
it fails to properly handle the case when the thread group is still
running (but its leader has exited).
Fix this for a ptraced child leader by skipping the notification
instead of changing the exit signal. If we skip sending the signal in
exit_notify, release_task will send the signal for us when the last
of the other threads is reaped.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 85b34eff8807..72591eb5e361 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -639,7 +639,7 @@ static void forget_original_parent(struct task_struct *father,
*/
static void exit_notify(struct task_struct *tsk, int group_dead)
{
- bool autoreap;
+ bool autoreap = true;
struct task_struct *p, *n;
LIST_HEAD(dead);
@@ -649,17 +649,12 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);
- if (unlikely(tsk->ptrace)) {
- int sig = thread_group_leader(tsk) &&
- thread_group_empty(tsk) &&
- !ptrace_reparented(tsk) ?
- tsk->exit_signal : SIGCHLD;
- autoreap = do_notify_parent(tsk, sig);
- } else if (thread_group_leader(tsk)) {
+ if (thread_group_leader(tsk) && !ptrace_reparented(tsk)) {
autoreap = thread_group_empty(tsk) &&
do_notify_parent(tsk, tsk->exit_signal);
- } else {
- autoreap = true;
+ }
+ else if (unlikely(tsk->ptrace)) {
+ autoreap = do_notify_parent(tsk, SIGCHLD);
}
tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
--
2.10.1
When ptracing, waitpid(pid) has two possible meanings.
- Wait for the task with tid == pid to exit
- Wait for the thread group with tgid == pid to exit
Linux resolves this ambiguity by preferring the thread group
interpretation when it is possible, as the exit of a thread group
containing a task is a superset of the exit of that task.
Delaying the reaping of thread group leaders for ptraced children
is how the code implements this resolution.
For ptraced tasks that are not also children this ambiguity does
not exist. The only possible valid meaning of waitpid(pid) is
to wait for the task with tid == pid to exit. As such there is
no need to delay reaping ptraced tasks when the tasks are
not also children of the tracer.
Change this carefully as the real parent does need to delay reaping
of these tasks.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index dbf3fce00a1f..85b34eff8807 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1144,7 +1144,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
/* If parent wants a zombie, don't release it now */
state = EXIT_ZOMBIE;
- if (do_notify_parent(p, p->exit_signal))
+ if (thread_group_empty(p) &&
+ do_notify_parent(p, p->exit_signal))
state = EXIT_DEAD;
p->exit_state = state;
write_unlock_irq(&tasklist_lock);
@@ -1366,8 +1367,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
* ptracer detaches.
*/
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
- (!thread_group_leader(p) ||
- (ptrace_reparented(p) && thread_group_empty(p))))
+ (!thread_group_leader(p) || ptrace_reparented(p)))
return wait_task_zombie(wo, p);
if (unlikely(exit_state == EXIT_TRACE)) {
--
2.10.1
Rewrite the condition for what constitutes a clone child, AKA a child
that reports to its parent using something other than SIGCHLD.
If the parent has called exec since the child was started, that child
will always report to its parent with SIGCHLD.
If the parent is only ptracing the child, the child will always
report to the parent with SIGCHLD.
This implements the documented semantics and subsumes the fix Oleg
made in 4.7 to make __WCLONE unnecessary when ptracing a child. It was
just a bug in the check for __WCLONE support that made that
``semantic'' change necessary.
Around v2.3.23 notify_parent was fixed to send SIGCHLD if the parent
had exec'd, fixing half of the bug, but wait was not fixed to see
such children.
Not handling either the ptrace or the parent exec case appears to
go all of the way back to 1.0.
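For reference, this is what a ``clone'' child looks like from
userspace (a minimal sketch; __WCLONE is defined by hand in case libc
does not expose it):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __WCLONE
#define __WCLONE 0x80000000
#endif

static int child_fn(void *arg)
{
        _exit(0);
}

int main(void)
{
        char *stack = malloc(64 * 1024);
        pid_t pid;

        signal(SIGUSR1, SIG_IGN);       /* the child's exit signal */
        /* exit_signal = SIGUSR1, so this is a "clone" child */
        pid = clone(child_fn, stack + 64 * 1024, SIGUSR1, NULL);

        /* waitpid(pid, ..., 0) does not see this child;
         * __WCLONE (or __WALL) is required */
        if (waitpid(pid, NULL, __WCLONE) == pid)
                printf("reaped clone child %d\n", (int)pid);
        return 0;
}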
Fixes: bf959931ddb8 ("wait/ptrace: assume __WALL if the child is traced")
Fixes: v1.0
Fixes: v2.3.23
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 625e57f1bb5c..306e526f4c5e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -950,31 +950,26 @@ static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
task_pid_type(p, wo->wo_type) == wo->wo_pid;
}
-static int
-eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
+static int eligible_child(struct wait_opts *wo, struct task_struct *p)
{
if (!eligible_pid(wo, p))
return 0;
/*
- * Wait for all children (clone and not) if __WALL is set or
- * if it is traced by us.
+ * Wait for all children (clone and not) if __WALL is set.
*/
- if (ptrace || (wo->wo_flags & __WALL))
+ if (wo->wo_flags & __WALL)
return 1;
/*
- * Otherwise, wait for clone children *only* if __WCLONE is set;
- * otherwise, wait for non-clone children *only*.
- *
- * Note: a "clone" child here is one that reports to its parent
- * using a signal other than SIGCHLD, or a non-leader thread which
- * we can only see if it is traced by us.
+ * Otherwise wait for either children that report to their
+ * parent via SIGCHLD (when __WCLONE is not set) or use
+ * another signal (when __WCLONE is set).
*/
- if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
- return 0;
-
- return 1;
+ return (((p->exit_signal == SIGCHLD) ||
+ (p->parent_exec_id == p->real_parent->self_exec_id))
+ && thread_group_leader(p) && !ptrace_reparented(p)) ^
+ !!(wo->wo_flags & __WCLONE);
}
static int wait_noreap_copyout(struct wait_opts *wo, struct task_struct *p,
@@ -1343,7 +1338,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
if (unlikely(exit_state == EXIT_DEAD))
return 0;
- ret = eligible_child(wo, ptrace, p);
+ ret = eligible_child(wo, p);
if (!ret)
return ret;
--
2.10.1
When two signals are generated, make it the thread-level signal that
is possibly suppressed. The reasoning is that the group-level signal
always needs to be generated to report a group stop when that happens,
and it is possibly doing double duty as both the group-level and the
thread-level signal. So if the thread-level signal is identical to
the group-level signal it does not need to be sent (or even attempted
to be sent).
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 0ec90689039a..ff41e6ee2b5e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1813,6 +1813,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
__acquires(¤t->sighand->siglock)
{
bool gstop_done = false;
+ bool identical = false;
if (arch_ptrace_stop_needed(exit_code, info)) {
/*
@@ -1852,8 +1853,16 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
* could be clear now. We act as if SIGCONT is received after
* TASK_TRACED is entered - ignore it.
*/
- if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING))
- gstop_done = task_participate_group_stop(current);
+ if (why == CLD_STOPPED) {
+ if (current->jobctl & JOBCTL_STOP_PENDING)
+ gstop_done = task_participate_group_stop(current);
+
+ /* Will the two generated signals be identical? */
+ identical = gstop_done &&
+ !(current->jobctl & JOBCTL_TRAP_NOTIFY) &&
+ !ptrace_reparented(current) &&
+ (task_pid(current) == task_tgid(current));
+ }
/* any trap clears pending STOP trap, STOP trap clears NOTIFY */
task_clear_jobctl_pending(current, JOBCTL_TRAP_STOP);
@@ -1874,10 +1883,11 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
* know about every stop while the real parent is only
* interested in the completion of group stop. The states
* for the two don't interact with each other. Notify
- * separately unless they're gonna be duplicates.
+ * separately unless they are going to be identical.
*/
- do_notify_parent_cldstop(current, true, why);
- if (gstop_done && ptrace_reparented(current))
+ if (!identical)
+ do_notify_parent_cldstop(current, true, why);
+ if (gstop_done)
do_notify_parent_cldstop(current, false, why);
/*
--
2.10.1
The list a task is found on is not really relevant to wait, so stop
passing the list in, and write the conditions in such a way as to not
take the list into consideration. This is technically user-visible due
to __WNOTHREAD but is very unlikely to matter in practice.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 8f3825b22de5..c783d5fb5ab3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1329,8 +1329,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
* then ->notask_error is 0 if @p is an eligible child,
* or still -ECHILD.
*/
-static int wait_consider_task(struct wait_opts *wo, int ptrace,
- struct task_struct *p)
+static int wait_consider_task(struct wait_opts *wo, struct task_struct *p)
{
/*
* We can race with wait_task_zombie() from another thread.
@@ -1356,11 +1355,13 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
* and reaping will be cascaded to the real parent when the
* ptracer detaches.
*/
- if ((exit_state == EXIT_TRACEE) && ptrace)
+ if ((exit_state == EXIT_TRACEE) && same_thread_group(current, p->parent))
return wait_task_zombie(wo, exit_state, p);
/* Is this task past the point where ptrace cares? */
- if (unlikely((exit_state == EXIT_TRACED) && ptrace))
+ if (unlikely((exit_state == EXIT_TRACED) &&
+ !(same_thread_group(current, p->real_parent) &&
+ thread_group_leader(p))))
return 0;
/*
@@ -1414,7 +1415,7 @@ static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk)
struct task_struct *p;
list_for_each_entry(p, &tsk->children, sibling) {
- int ret = wait_consider_task(wo, 0, p);
+ int ret = wait_consider_task(wo, p);
if (ret)
return ret;
@@ -1428,7 +1429,7 @@ static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk)
struct task_struct *p;
list_for_each_entry(p, &tsk->ptraced, ptrace_entry) {
- int ret = wait_consider_task(wo, 1, p);
+ int ret = wait_consider_task(wo, p);
if (ret)
return ret;
--
2.10.1
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 3 ---
1 file changed, 3 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index c06d63b3a583..0cc626dd79a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -565,9 +565,6 @@ static inline int thread_group_empty(struct task_struct *p)
return list_empty(&p->thread_group);
}
-#define delay_group_leader(p) \
- (thread_group_leader(p) && !thread_group_empty(p))
-
extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
unsigned long *flags);
--
2.10.1
Now that which list a task is found on does not matter, there is no
reason to walk the child lists for waitpid when pid_task can directly
find the child.
Add a new helper do_wait_pid that finds the target task via pid_task
and verifies it is on one of the lists for the thread we are
reaping.
This is more efficient in two ways. It skips the list traversal
so uninteresting tasks don't slow things down. It guarantees
a task will only be visited once if p->parent == p->real_parent.
Apart from the increase in efficiency, this results in no user-visible
behavioral differences.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index c783d5fb5ab3..2f01b75e3b2e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1438,6 +1438,18 @@ static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk)
return 0;
}
+static int do_wait_pid(struct wait_opts *wo, struct task_struct *tsk)
+{
+ struct task_struct *p = pid_task(wo->wo_pid, wo->wo_type);
+
+ /* Not on one of this tasks child lists? */
+ if ((tsk != p->parent) &&
+ ((tsk != p->real_parent) || !thread_group_leader(p)))
+ return 0;
+
+ return wait_consider_task(wo, p);
+}
+
static int child_wait_callback(wait_queue_t *wait, unsigned mode,
int sync, void *key)
{
@@ -1486,11 +1498,15 @@ static long do_wait(struct wait_opts *wo)
read_lock(&tasklist_lock);
tsk = current;
do {
- retval = do_wait_thread(wo, tsk);
- if (retval)
- goto end;
+ if (wo->wo_type == PIDTYPE_PID) {
+ retval = do_wait_pid(wo, tsk);
+ } else {
+ retval = do_wait_thread(wo, tsk);
+ if (retval)
+ goto end;
- retval = ptrace_do_wait(wo, tsk);
+ retval = ptrace_do_wait(wo, tsk);
+ }
if (retval)
goto end;
--
2.10.1
Start with reaping pending zombies and then return if there is nothing
wait_consider_task can ever do with the task.
What remains are living tasks and delayed child zombie thread group
leaders waiting for their thread group to exit, leaving something for
WEXITED to find later.
As long as WEXITED can happen the code is guaranteed not to block
indefinitely, as at least the exit event will wake it up. The only
reason not to clear notask_error is if it is impossible for wait
to find something to report. For zombie thread group leaders it
is possible that living threads will report group wide continued
or stopped states.
This means we can now safely and practically always clear notask_error.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 62 ++++++++++++++++++++++-------------------------------------
1 file changed, 23 insertions(+), 39 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 4e2d2b6f5581..8f3825b22de5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1359,47 +1359,31 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
if ((exit_state == EXIT_TRACEE) && ptrace)
return wait_task_zombie(wo, exit_state, p);
- if (unlikely(exit_state == EXIT_TRACED)) {
- /*
- * ptrace == 0 means we are the natural parent. In this case
- * we should clear notask_error, debugger will notify us.
- */
- if (likely(!ptrace))
- wo->notask_error = 0;
+ /* Is this task past the point where ptrace cares? */
+ if (unlikely((exit_state == EXIT_TRACED) && ptrace))
return 0;
- }
- if (exit_state == EXIT_ZOMBIE) {
- /*
- * Allow access to stopped/continued state via zombie by
- * falling through. Clearing of notask_error is complex.
- *
- * When !@ptrace:
- *
- * If WEXITED is set, notask_error should naturally be
- * cleared. If not, subset of WSTOPPED|WCONTINUED is set,
- * so, if there are live subthreads, there are events to
- * wait for. If all subthreads are dead, it's still safe
- * to clear - this function will be called again in finite
- * amount time once all the subthreads are released and
- * will then return without clearing.
- *
- * When @ptrace:
- *
- * Stopped state is per-task and thus can't change once the
- * target task dies. Only continued and exited can happen.
- * Clear notask_error if WCONTINUED | WEXITED.
- */
- if ((!ptrace && (!p->ptrace || ptrace_reparented(p))) ||
- (wo->wo_flags & (WCONTINUED | WEXITED)))
- wo->notask_error = 0;
- } else {
- /*
- * @p is alive and it's gonna stop, continue or exit, so
- * there always is something to wait for.
- */
- wo->notask_error = 0;
- }
+ /*
+ * At this point @p is alive or the zombie of a delayed
+ * child thread group leader that has not been reaped yet.
+ *
+ * Allow access to stopped/continued state via zombie by
+ * falling through. Clearing of notask_error is logically complex.
+ *
+ * When @p is alive it's gonna stop, continue or exit,
+ * so there always is something to wait for.
+ *
+ * When @p is a zombie
+ *
+ * If WEXITED is set, notask_error should naturally be
+ * cleared. If not, subset of WSTOPPED|WCONTINUED is set,
+ * so, if there are live subthreads, there are events to
+ * wait for. If all subthreads are dead, it's still safe
+ * to clear - this function will be called again in finite
+ * amount time once all the subthreads are released and
+ * will then return without clearing.
+ */
+ wo->notask_error = 0;
/*
* Wait for stopped.
--
2.10.1
When ptracing, waitpid(pid, WUNTRACED) has two possible meanings.
- Wait for ptrace stops from the task with tid == pid
- Wait for ordinary stops from the process with tgid == pid
The only sensible behavior, and the Linux behavior in 2.2 and
2.4, has been to consume both ptrace stops and group stops
in this case. It looks like when Oleg disentangled thread
stops and group stops in 2.6.30, fixing a lot of other issues,
the case where we want to reap both was overlooked.
Consume both the group and the ptrace stop state when
waitpid(pid, WUNTRACED) could be asking for both and
the wait status for both is identical. This keeps
us from double reporting the stop and causing confusion.
This is a very slight user-visible change and is only visible
in the unlikely case that a ptracer specifies WUNTRACED aka
WSTOPPED.
Write this code in such a way that it doesn't matter which
list we are traversing when we find a child whose stop states
we care about.
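Concretely, the ambiguous case is a real parent that is also the
tracer (a hedged sketch):

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int status;
        pid_t pid = fork();

        if (pid == 0) {                 /* child: tracee */
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                raise(SIGSTOP);         /* a ptrace stop and a group stop */
                _exit(0);
        }
        /* We are both real parent and tracer, so this wait could
         * legitimately report either kind of stop; with this patch both
         * stop states are consumed when their wait status is identical. */
        waitpid(pid, &status, WUNTRACED);
        printf("stopped by signal %d\n", WSTOPSIG(status));
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return 0;
}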
Fixes: 90bc8d8b1a38 ("do_wait: fix waiting for the group stop with the dead leader")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 82 +++++++++++++++++++++++++++++------------------------------
1 file changed, 40 insertions(+), 42 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index ff2ed1d60a8c..4e2d2b6f5581 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1151,22 +1151,23 @@ static int wait_task_zombie(struct wait_opts *wo, int old_state, struct task_str
return retval;
}
-static int *task_stopped_code(struct task_struct *p, bool ptrace)
+static int *task_trace_stopped_code(struct task_struct *p)
{
- if (ptrace) {
- if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
- return &p->exit_code;
- } else {
- if (p->signal->flags & SIGNAL_STOP_STOPPED)
- return &p->signal->group_exit_code;
- }
+ if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
+ return &p->exit_code;
+ return NULL;
+}
+
+static int *task_group_stopped_code(struct task_struct *p)
+{
+ if (p->signal->flags & SIGNAL_STOP_STOPPED)
+ return &p->signal->group_exit_code;
return NULL;
}
/**
* wait_task_stopped - Wait for %TASK_STOPPED or %TASK_TRACED
* @wo: wait options
- * @ptrace: is the wait for ptrace
* @p: task to wait for
*
* Handle sys_wait4() work for %p in state %TASK_STOPPED or %TASK_TRACED.
@@ -1181,49 +1182,47 @@ static int *task_stopped_code(struct task_struct *p, bool ptrace)
* success, implies that tasklist_lock is released and wait condition
* search should terminate.
*/
-static int wait_task_stopped(struct wait_opts *wo,
- int ptrace, struct task_struct *p)
+static int wait_task_stopped(struct wait_opts *wo, struct task_struct *p)
{
struct siginfo __user *infop;
- int retval, exit_code, *p_code, why;
- uid_t uid = 0; /* unneeded, required by compiler */
+ int retval, exit_code, *p_code, *g_code, why;
+ bool group, gstop, pstop;
+ uid_t uid;
pid_t pid;
/*
- * Hide group stop state from real parent; otherwise a single
- * stop can be reported twice as group and ptrace stop. If a
- * ptracer wants to distinguish these two events for its own
- * children it should create a separate process which takes the
- * role of real parent.
- */
- if (!ptrace && p->ptrace && !ptrace_reparented(p))
- ptrace = 1;
-
- /*
* Traditionally we see ptrace'd stopped tasks regardless of options.
*/
- if (!ptrace && !(wo->wo_flags & WUNTRACED))
+ group = thread_group_leader(p) && !ptrace_reparented(p);
+ pstop = same_thread_group(current, p->parent);
+ gstop = group && (wo->wo_flags & WUNTRACED);
+ if (!pstop && !gstop)
return 0;
- if (!task_stopped_code(p, ptrace))
+ if ((!pstop || !task_trace_stopped_code(p)) &&
+ (!gstop || !task_group_stopped_code(p)))
return 0;
exit_code = 0;
spin_lock_irq(&p->sighand->siglock);
- p_code = task_stopped_code(p, ptrace);
- if (unlikely(!p_code))
- goto unlock_sig;
-
- exit_code = *p_code;
- if (!exit_code)
- goto unlock_sig;
-
- if (!unlikely(wo->wo_flags & WNOWAIT))
- *p_code = 0;
-
- uid = from_kuid_munged(current_user_ns(), task_uid(p));
-unlock_sig:
+ p_code = g_code = NULL;
+ if (pstop)
+ p_code = task_trace_stopped_code(p);
+ if (gstop)
+ g_code = task_group_stopped_code(p);
+ if (p_code) {
+ exit_code = *p_code;
+ why = CLD_TRAPPED;
+ if (!(wo->wo_flags & WNOWAIT))
+ *p_code = 0;
+ }
+ if (g_code && (!exit_code || (*g_code == exit_code))) {
+ exit_code = *g_code;
+ why = CLD_STOPPED;
+ if (!(wo->wo_flags & WNOWAIT))
+ *g_code = 0;
+ }
spin_unlock_irq(&p->sighand->siglock);
if (!exit_code)
return 0;
@@ -1236,8 +1235,8 @@ static int wait_task_stopped(struct wait_opts *wo,
* possibly take page faults for user memory.
*/
get_task_struct(p);
+ uid = from_kuid_munged(current_user_ns(), task_uid(p));
pid = task_pid_vnr(p);
- why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
read_unlock(&tasklist_lock);
sched_annotate_sleep();
@@ -1403,10 +1402,9 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
}
/*
- * Wait for stopped. Depending on @ptrace, different stopped state
- * is used and the two don't interact with each other.
+ * Wait for stopped.
*/
- ret = wait_task_stopped(wo, ptrace, p);
+ ret = wait_task_stopped(wo, p);
if (ret)
return ret;
--
2.10.1
With the freeing and slaying of zombies moved earlier in
wait_consider_task, changing the ptrace value only affects the clearing
of notask_error and wait_task_stopped. Move the changing of ptrace into
wait_task_stopped.
The value of ptrace coming into the code clearing notask_error is left
at its original value.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 9b70e21c960d..dbf3fce00a1f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1194,6 +1194,16 @@ static int wait_task_stopped(struct wait_opts *wo,
pid_t pid;
/*
+ * Hide group stop state from real parent; otherwise a single
+ * stop can be reported twice as group and ptrace stop. If a
+ * ptracer wants to distinguish these two events for its own
+ * children it should create a separate process which takes the
+ * role of real parent.
+ */
+ if (!ptrace && p->ptrace && !ptrace_reparented(p))
+ ptrace = 1;
+
+ /*
* Traditionally we see ptrace'd stopped tasks regardless of options.
*/
if (!ptrace && !(wo->wo_flags & WUNTRACED))
@@ -1370,18 +1380,6 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
return 0;
}
- if (likely(!ptrace) && unlikely(p->ptrace)) {
- /*
- * Hide group stop state from real parent; otherwise a single
- * stop can be reported twice as group and ptrace stop. If a
- * ptracer wants to distinguish these two events for its own
- * children it should create a separate process which takes the
- * role of real parent.
- */
- if (!ptrace_reparented(p))
- ptrace = 1;
- }
-
if (exit_state == EXIT_ZOMBIE) {
/*
* Allow access to stopped/continued state via zombie by
@@ -1403,7 +1401,8 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
* target task dies. Only continued and exited can happen.
* Clear notask_error if WCONTINUED | WEXITED.
*/
- if (likely(!ptrace) || (wo->wo_flags & (WCONTINUED | WEXITED)))
+ if ((!ptrace && (!p->ptrace || ptrace_reparented(p))) ||
+ (wo->wo_flags & (WCONTINUED | WEXITED)))
wo->notask_error = 0;
} else {
/*
--
2.10.1
In November of 2005 Oleg fixed a kernel crash with commit 7ed0175a462c
("[PATCH] Don't auto-reap traced children"). Oleg's patch was the fix
for CVE-2005-3784 where one description says:
The auto-reap of child processes in Linux kernel 2.6 before 2.6.15
includes processes with ptrace attached, which leads to a dangling
ptrace reference and allows local users to cause a denial of
service (crash) and gain root privileges.
Not reaping the zombies resulted in zombies on the ptrace list when
threads that ignored them exited, which led to Roland authoring
666f164f4fbf ("fix dangling zombie when new parent ignores children").
That change winds up auto-waiting for those zombies not when the tasks
exit and become zombies but when the ptracer exits.
As the kernel is already auto-waiting zombies for ptraced children
rewrite the code to use the same code paths for auto-waiting as we use
for all other children.
This is a user-visible change so something might care, but as the
auto-wait at exit semantics are not documented anywhere, are in direct
violation of what SIG_IGN and SA_NOCLDWAIT are documented by POSIX to
do, and were added to avoid a kernel crash, I don't expect there will
be problems.
Fixes: 7ed0175a462c ("[PATCH] Don't auto-reap traced children")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 10 +++++++---
kernel/ptrace.c | 23 +----------------------
kernel/signal.c | 2 +-
3 files changed, 9 insertions(+), 26 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2f01b75e3b2e..eaea41c8e646 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -638,7 +638,7 @@ static void forget_original_parent(struct task_struct *father,
*/
static void exit_notify(struct task_struct *tsk, int group_dead)
{
- int state = EXIT_DEAD;
+ int state;
struct task_struct *p, *n;
LIST_HEAD(dead);
@@ -648,6 +648,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);
+renotify:
+ state = EXIT_DEAD;
if (thread_group_leader(tsk) && !ptrace_reparented(tsk)) {
state = EXIT_ZOMBIE;
if (thread_group_empty(tsk) &&
@@ -656,8 +658,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
}
else if (unlikely(tsk->ptrace)) {
state = EXIT_TRACEE;
- if (do_notify_parent(tsk, SIGCHLD))
- state = EXIT_DEAD;
+ if (do_notify_parent(tsk, SIGCHLD)) {
+ __ptrace_unlink(tsk);
+ goto renotify;
+ }
}
tsk->exit_state = state;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 003567a615f9..c52cbbcbe258 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -468,19 +468,6 @@ static int ptrace_traceme(void)
}
/*
- * Called with irqs disabled, returns true if childs should reap themselves.
- */
-static int ignoring_children(struct sighand_struct *sigh)
-{
- int ret;
- spin_lock(&sigh->siglock);
- ret = (sigh->action[SIGCHLD-1].sa.sa_handler == SIG_IGN) ||
- (sigh->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT);
- spin_unlock(&sigh->siglock);
- return ret;
-}
-
-/*
* Called with tasklist_lock held for writing.
* Unlink a traced task, and clean it up if it was a traced zombie.
* Return true if it needs to be reaped with release_task().
@@ -501,15 +488,7 @@ static bool __exit_ptrace(struct task_struct *tracer, struct task_struct *p)
__ptrace_unlink(p);
- if (state == EXIT_ZOMBIE) {
- /* Honor the parents request to autoreap children */
- if (thread_group_empty(p) &&
- ignoring_children(tracer->sighand)) {
- state = EXIT_DEAD;
- __wake_up_parent(p, tracer);
- }
- }
- else if (state == EXIT_TRACEE) {
+ if (state == EXIT_TRACEE) {
state = EXIT_DEAD;
if (thread_group_leader(p)) {
state = EXIT_ZOMBIE;
diff --git a/kernel/signal.c b/kernel/signal.c
index 627b482fa3f8..30d652f86964 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1645,7 +1645,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
psig = tsk->parent->sighand;
spin_lock_irqsave(&psig->siglock, flags);
- if (!tsk->ptrace && sig == SIGCHLD &&
+ if (sig == SIGCHLD &&
(psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
(psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
/*
--
2.10.1
Create two new exit states, EXIT_TRACEE and EXIT_TRACED, replacing
the implicit state "(EXIT_ZOMBIE && (!thread_group_leader(p) ||
ptrace_reparented(p)))" and the state EXIT_TRACE, with plain
EXIT_ZOMBIE now meaning:
"(EXIT_ZOMBIE && thread_group_leader(p) && !ptrace_reparented(p))".
Rework the code to take advantage of the certain knowledge of
exit state progression:
EXIT_TRACEE -> EXIT_TRACED -> EXIT_ZOMBIE -> EXIT_DEAD
This makes the code more readable/maintainable by using simple states
rather than complicated expressions. The values of both of the new
states contain EXIT_ZOMBIE so all of these states appear to userspace
as zombies.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched.h | 6 +++++-
kernel/exit.c | 51 +++++++++++++++++++++++----------------------------
kernel/ptrace.c | 31 +++++++++++++++++--------------
3 files changed, 45 insertions(+), 43 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 461ecd20731c..f2cec7f27e59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -73,7 +73,6 @@ struct task_group;
/* Used in tsk->exit_state: */
#define EXIT_DEAD 16
#define EXIT_ZOMBIE 32
-#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
@@ -82,6 +81,11 @@ struct task_group;
#define TASK_NOLOAD 1024
#define TASK_NEW 2048
#define TASK_STATE_MAX 4096
+/* Used in tsk->exit_state again: */
+#define __EXIT_TRACEE 8192
+#define __EXIT_TRACED 16384
+#define EXIT_TRACEE (EXIT_ZOMBIE | __EXIT_TRACEE)
+#define EXIT_TRACED (EXIT_ZOMBIE | __EXIT_TRACED)
#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"
diff --git a/kernel/exit.c b/kernel/exit.c
index 72591eb5e361..ff2ed1d60a8c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -580,8 +580,7 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
p->exit_signal = SIGCHLD;
/* If it has exited notify the new parent about this child's death. */
- if (!p->ptrace &&
- p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
+ if (p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
if (do_notify_parent(p, p->exit_signal)) {
p->exit_state = EXIT_DEAD;
list_add(&p->ptrace_entry, dead);
@@ -639,7 +638,7 @@ static void forget_original_parent(struct task_struct *father,
*/
static void exit_notify(struct task_struct *tsk, int group_dead)
{
- bool autoreap = true;
+ int state = EXIT_DEAD;
struct task_struct *p, *n;
LIST_HEAD(dead);
@@ -650,14 +649,18 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);
if (thread_group_leader(tsk) && !ptrace_reparented(tsk)) {
- autoreap = thread_group_empty(tsk) &&
- do_notify_parent(tsk, tsk->exit_signal);
+ state = EXIT_ZOMBIE;
+ if (thread_group_empty(tsk) &&
+ do_notify_parent(tsk, tsk->exit_signal))
+ state = EXIT_DEAD;
}
else if (unlikely(tsk->ptrace)) {
- autoreap = do_notify_parent(tsk, SIGCHLD);
+ state = EXIT_TRACEE;
+ if (do_notify_parent(tsk, SIGCHLD))
+ state = EXIT_DEAD;
}
- tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
+ tsk->exit_state = state;
if (tsk->exit_state == EXIT_DEAD)
list_add(&tsk->ptrace_entry, &dead);
@@ -1001,7 +1004,7 @@ static int wait_noreap_copyout(struct wait_opts *wo, struct task_struct *p,
* the lock and this task is uninteresting. If we return nonzero, we have
* released the lock and the system call should return.
*/
-static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
+static int wait_task_zombie(struct wait_opts *wo, int old_state, struct task_struct *p)
{
int state, retval, status;
pid_t pid = task_pid_vnr(p);
@@ -1029,11 +1032,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
return wait_noreap_copyout(wo, p, pid, uid, why, status);
}
/*
- * Move the task's state to DEAD/TRACE, only one thread can do this.
+ * Move the task's state to DEAD/TRACED only one thread can do this.
*/
- state = (ptrace_reparented(p) && thread_group_leader(p)) ?
- EXIT_TRACE : EXIT_DEAD;
- if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
+ state = ((old_state == EXIT_TRACEE) && thread_group_leader(p)) ?
+ EXIT_TRACED : EXIT_DEAD;
+ if (cmpxchg(&p->exit_state, old_state, state) != old_state)
return 0;
/*
* We own this thread, nobody else can reap it.
@@ -1041,10 +1044,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
read_unlock(&tasklist_lock);
sched_annotate_sleep();
- /*
- * Check thread_group_leader() to exclude the traced sub-threads.
- */
- if (state == EXIT_DEAD && thread_group_leader(p)) {
+ if (old_state == EXIT_ZOMBIE) {
struct signal_struct *sig = p->signal;
struct signal_struct *psig = current->signal;
unsigned long maxrss;
@@ -1132,7 +1132,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
if (!retval)
retval = pid;
- if (state == EXIT_TRACE) {
+ if (state == EXIT_TRACED) {
write_lock_irq(&tasklist_lock);
/* We dropped tasklist, ptracer could die and untrace */
ptrace_unlink(p);
@@ -1335,8 +1335,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
{
/*
* We can race with wait_task_zombie() from another thread.
- * Ensure that EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE transition
- * can't confuse the checks below.
+ * Ensure that exit_state transition can't confuse the checks below.
*/
int exit_state = ACCESS_ONCE(p->exit_state);
int ret;
@@ -1349,11 +1348,8 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
return ret;
/* zombie child process? */
- if ((exit_state == EXIT_ZOMBIE) &&
- !ptrace_reparented(p) &&
- thread_group_leader(p) &&
- thread_group_empty(p))
- return wait_task_zombie(wo, p);
+ if ((exit_state == EXIT_ZOMBIE) && thread_group_empty(p))
+ return wait_task_zombie(wo, exit_state, p);
/*
* A zombie ptracee that is not a child of it's ptracer's
@@ -1361,11 +1357,10 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
* and reaping will be cascaded to the real parent when the
* ptracer detaches.
*/
- if ((exit_state == EXIT_ZOMBIE) && ptrace &&
- (!thread_group_leader(p) || ptrace_reparented(p)))
- return wait_task_zombie(wo, p);
+ if ((exit_state == EXIT_TRACEE) && ptrace)
+ return wait_task_zombie(wo, exit_state, p);
- if (unlikely(exit_state == EXIT_TRACE)) {
+ if (unlikely(exit_state == EXIT_TRACED)) {
/*
* ptrace == 0 means we are the natural parent. In this case
* we should clear notask_error, debugger will notify us.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 490333db9e21..003567a615f9 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -497,27 +497,30 @@ static int ignoring_children(struct sighand_struct *sigh)
*/
static bool __exit_ptrace(struct task_struct *tracer, struct task_struct *p)
{
- bool dead;
+ int state = p->exit_state;
__ptrace_unlink(p);
- if (p->exit_state != EXIT_ZOMBIE)
- return false;
-
- dead = !thread_group_leader(p);
-
- if (!dead && thread_group_empty(p)) {
- if (!same_thread_group(p->real_parent, tracer))
- dead = do_notify_parent(p, p->exit_signal);
- else if (ignoring_children(tracer->sighand)) {
+ if (state == EXIT_ZOMBIE) {
+ /* Honor the parents request to autoreap children */
+ if (thread_group_empty(p) &&
+ ignoring_children(tracer->sighand)) {
+ state = EXIT_DEAD;
__wake_up_parent(p, tracer);
- dead = true;
+ }
+ }
+ else if (state == EXIT_TRACEE) {
+ state = EXIT_DEAD;
+ if (thread_group_leader(p)) {
+ state = EXIT_ZOMBIE;
+ if (thread_group_empty(p) &&
+ do_notify_parent(p, p->exit_signal))
+ state = EXIT_DEAD;
}
}
/* Mark it as in the process of being reaped. */
- if (dead)
- p->exit_state = EXIT_DEAD;
- return dead;
+ p->exit_state = state;
+ return state == EXIT_DEAD;
}
static int ptrace_detach(struct task_struct *child, unsigned int data)
--
2.10.1
Today initiating a group stop with SIGSTOP, SIGTSTP, SIGTTIN, or
SIGTTOU and then having it continued with SIGCONT before the group
stop is complete results in SIGCHLD being sent with a si_status of 0.
The field si_status is defined by posix to be the stop signal, which
means we wind up violating posix and any reasonable meaning of the
siginfo delivered with SIGCHLD by reporting this partial group stop.
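For illustration, here is a minimal userspace sketch (not part of the
patch) of where the bad si_status is visible. Actually observing the
0 requires a multi-threaded child and losing the race, so treat this
as a demonstration of the interface rather than a reliable reproducer
(and the printf in the handler is for brevity, not signal safety):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void chld(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;
	/* Posix says si_status is the stop signal for CLD_STOPPED;
	 * the bug described above can report 0 here instead. */
	printf("si_code=%d si_status=%d\n", si->si_code, si->si_status);
}

int main(void)
{
	struct sigaction sa = { 0 };
	pid_t pid;

	sa.sa_sigaction = chld;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGCHLD, &sa, NULL);

	pid = fork();
	if (pid == 0) {
		pause();	/* Stand-in for a multi-threaded child. */
		_exit(0);
	}
	kill(pid, SIGSTOP);	/* Initiate a group stop... */
	kill(pid, SIGCONT);	/* ...and continue it before it is done. */
	sleep(1);		/* Give SIGCHLD time to arrive. */
	kill(pid, SIGKILL);
	return 0;
}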
A partial group stop will result in even stranger things when seen in
the context of ptrace. If all of the threads are ptraced they should
all enter a ptrace stop but the cancelation of the group stop before
the group stop completes results in only some of the threads entering
a ptrace stop.
Fix this by performing a delayed thread group wakeup that will take
effect as the last thread is signaling that the group stop is
complete.
This makes reasoning about the code much simpler, fixes the partial
group stop interactions with ptrace, and fixes the siginfo sent for a
group stop that sees SIGCONT before it completes, all for a similar
cost in code.
I looked into the history to see if I could understand where this
problematic behavior came from, and the source of the behavior goes
back to the original development of thread groups in the kernel. A
bug fix was made to improve the handling of SIGCONT that introduced
group_stop_count and the resuming of partial stops, behavior that
makes no sense today in the context of ptrace SIGSTOP handling.
At the time that was not a problem, because stops while being ptraced
were ordinary TASK_STOPPED stops. It potentially became a problem
shortly thereafter, when ptrace stops became TASK_TRACED stops, which
can not be gotten out of with SIGCONT. However, for it to actually
become a problem the code had to wait until recently, when 5224fa3660ad
("ptrace: Make do_signal_stop() use ptrace_stop() if the task is being ptraced")
was merged and ptraced processes started stopping in TASK_TRACED.
The practical issue with sending an invalid si_status appears to have
been introduced much later, after the locking changed to make it
necessary to send signals to the parent from the destination process
itself. When Oleg unified partial and full stop handling,
group_exit_code wound up being cleared on both code paths, not just
for the full stop case. That in turn introduced the invalid
si_status of 0, as until that point group_exit_code always held the
signal when do_notify_parent_cldstop was called for stop signals.
Historical tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 5224fa3660ad ("ptrace: Make do_signal_stop() use ptrace_stop() if the task is being ptraced")
Fixes: fc321d2e60d6 ("handle_stop_signal: unify partial/full stop handling")
Reference: e442055193e4 ("signals: re-assign CLD_CONTINUED notification from the sender to reciever")
Reference: cece79ae3a39 ("[PATCH] cleanup ptrace stops and remove notify_parent")
Reference: ca3f74aa7baa ("[PATCH] waitid system call")
Reference: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 26 +++++++------
kernel/signal.c | 89 ++++++++++++++++++++++++--------------------
2 files changed, 63 insertions(+), 52 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0cc626dd79a8..dac2d90217c2 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -227,21 +227,17 @@ struct signal_struct {
/*
* Bits in flags field of signal_struct.
*/
-#define SIGNAL_STOP_STOPPED 0x00000001 /* job control stop in effect */
-#define SIGNAL_STOP_CONTINUED 0x00000002 /* SIGCONT since WCONTINUED reap */
-#define SIGNAL_GROUP_EXIT 0x00000004 /* group exit in progress */
-#define SIGNAL_GROUP_COREDUMP 0x00000008 /* coredump in progress */
-/*
- * Pending notifications to parent.
- */
-#define SIGNAL_CLD_STOPPED 0x00000010
-#define SIGNAL_CLD_CONTINUED 0x00000020
-#define SIGNAL_CLD_MASK (SIGNAL_CLD_STOPPED|SIGNAL_CLD_CONTINUED)
+#define SIGNAL_CLD_CONTINUED 0x00000001 /* Need to send SIGCONT to parent */
+#define SIGNAL_STOP_STOPPED 0x00000002 /* job control stop in effect */
+#define SIGNAL_STOP_CONTINUED 0x00000004 /* SIGCONT since WCONTINUED reap */
+#define SIGNAL_WAKEUP_PENDING 0x00000008 /* Wakeup for SIGCONT pending */
+#define SIGNAL_GROUP_EXIT 0x00000010 /* group exit in progress */
+#define SIGNAL_GROUP_COREDUMP 0x00000020 /* coredump in progress */
#define SIGNAL_UNKILLABLE 0x00000040 /* for init: ignore fatal signals */
-#define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
- SIGNAL_STOP_CONTINUED)
+#define SIGNAL_STOP_MASK (SIGNAL_CLD_CONTINUED | SIGNAL_STOP_STOPPED | \
+ SIGNAL_STOP_CONTINUED | SIGNAL_WAKEUP_PENDING)
static inline void signal_set_stop_flags(struct signal_struct *sig,
unsigned int flags)
@@ -250,6 +246,12 @@ static inline void signal_set_stop_flags(struct signal_struct *sig,
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}
+static inline bool signal_delayed_wakeup(struct signal_struct *sig)
+{
+ return (sig->flags & (SIGNAL_STOP_STOPPED | SIGNAL_WAKEUP_PENDING)) ==
+ (SIGNAL_STOP_STOPPED | SIGNAL_WAKEUP_PENDING);
+}
+
/* If true, all threads except ->group_exit_task have pending SIGKILL */
static inline int signal_group_exit(const struct signal_struct *sig)
{
diff --git a/kernel/signal.c b/kernel/signal.c
index 30d652f86964..0ec90689039a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -355,7 +355,8 @@ static bool task_participate_group_stop(struct task_struct *task)
* fresh group stop. Read comment in do_signal_stop() for details.
*/
if (!sig->group_stop_count && !(sig->flags & SIGNAL_STOP_STOPPED)) {
- signal_set_stop_flags(sig, SIGNAL_STOP_STOPPED);
+ int old_flags = (sig->flags & SIGNAL_WAKEUP_PENDING);
+ signal_set_stop_flags(sig, SIGNAL_STOP_STOPPED | old_flags);
return true;
}
return false;
@@ -786,6 +787,14 @@ static void ptrace_trap_notify(struct task_struct *t)
ptrace_signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
}
+static void wake_up_stopped_thread(struct task_struct *t)
+{
+ if (likely(!(t->ptrace & PT_SEIZED)))
+ wake_up_state(t, __TASK_STOPPED);
+ else
+ ptrace_trap_notify(t);
+}
+
/*
* Handle magic process-wide effects of stop/continue signals. Unlike
* the signal actions, these happen immediately at signal-generation
@@ -817,7 +826,13 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
for_each_thread(p, t)
flush_sigqueue_mask(&flush, &t->pending);
} else if (sig == SIGCONT) {
- unsigned int why;
+ if (signal->group_stop_count) {
+ /* Let the stop complete before continuing. This
+ * ensures there are well defined interactions with
+ * ptrace.
+ */
+ signal->flags |= SIGNAL_WAKEUP_PENDING;
+ }
/*
* Remove all stop signals from all queues, wake all threads.
*/
@@ -825,35 +840,21 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
flush_sigqueue_mask(&flush, &signal->shared_pending);
for_each_thread(p, t) {
flush_sigqueue_mask(&flush, &t->pending);
+ if (signal->flags & SIGNAL_WAKEUP_PENDING)
+ continue;
task_clear_jobctl_pending(t, JOBCTL_STOP_PENDING);
- if (likely(!(t->ptrace & PT_SEIZED)))
- wake_up_state(t, __TASK_STOPPED);
- else
- ptrace_trap_notify(t);
+ wake_up_stopped_thread(t);
}
- /*
- * Notify the parent with CLD_CONTINUED if we were stopped.
- *
- * If we were in the middle of a group stop, we pretend it
- * was already finished, and then continued. Since SIGCHLD
- * doesn't queue we report only CLD_STOPPED, as if the next
- * CLD_CONTINUED was dropped.
- */
- why = 0;
- if (signal->flags & SIGNAL_STOP_STOPPED)
- why |= SIGNAL_CLD_CONTINUED;
- else if (signal->group_stop_count)
- why |= SIGNAL_CLD_STOPPED;
-
- if (why) {
+ /* Notify the parent with CLD_CONTINUED if we were stopped. */
+ if (signal->flags & SIGNAL_STOP_STOPPED) {
/*
* The first thread which returns from do_signal_stop()
- * will take ->siglock, notice SIGNAL_CLD_MASK, and
- * notify its parent. See get_signal_to_deliver().
+ * will take ->siglock, notice SIGNAL_CLD_CONTINUED, and
+ * notify its parent. See get_signal().
*/
- signal_set_stop_flags(signal, why | SIGNAL_STOP_CONTINUED);
- signal->group_stop_count = 0;
+ signal_set_stop_flags(signal,
+ SIGNAL_STOP_CONTINUED | SIGNAL_CLD_CONTINUED);
signal->group_exit_code = 0;
}
}
@@ -1691,6 +1692,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
static void do_notify_parent_cldstop(struct task_struct *tsk,
bool for_ptracer, int why)
{
+ struct signal_struct *sig = tsk->signal;
struct siginfo info;
unsigned long flags;
struct task_struct *parent;
@@ -1724,7 +1726,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
info.si_status = SIGCONT;
break;
case CLD_STOPPED:
- info.si_status = tsk->signal->group_exit_code & 0x7f;
+ info.si_status = sig->group_exit_code & 0x7f;
break;
case CLD_TRAPPED:
info.si_status = tsk->exit_code & 0x7f;
@@ -1743,6 +1745,22 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
*/
__wake_up_parent(tsk, parent);
spin_unlock_irqrestore(&sighand->siglock, flags);
+
+ /*
+ * If there was a delayed SIGCONT, process it now.
+ */
+ spin_lock_irqsave(&tsk->sighand->siglock, flags);
+ if ((why == CLD_STOPPED) && signal_delayed_wakeup(sig)) {
+ struct task_struct *t;
+
+ for_each_thread(tsk, t)
+ wake_up_stopped_thread(t);
+
+ signal_set_stop_flags(sig,
+ SIGNAL_CLD_CONTINUED | SIGNAL_STOP_CONTINUED);
+ sig->group_exit_code = 0;
+ }
+ spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
}
static inline int may_ptrace_stop(void)
@@ -2166,19 +2184,10 @@ int get_signal(struct ksignal *ksig)
spin_lock_irq(&sighand->siglock);
/*
* Every stopped thread goes here after wakeup. Check to see if
- * we should notify the parent, prepare_signal(SIGCONT) encodes
- * the CLD_ si_code into SIGNAL_CLD_MASK bits.
+ * we should notify the parent.
*/
- if (unlikely(signal->flags & SIGNAL_CLD_MASK)) {
- int why;
-
- if (signal->flags & SIGNAL_CLD_CONTINUED)
- why = CLD_CONTINUED;
- else
- why = CLD_STOPPED;
-
- signal->flags &= ~SIGNAL_CLD_MASK;
-
+ if (unlikely(signal->flags & SIGNAL_CLD_CONTINUED)) {
+ signal->flags &= ~SIGNAL_CLD_CONTINUED;
spin_unlock_irq(&sighand->siglock);
/*
@@ -2190,11 +2199,11 @@ int get_signal(struct ksignal *ksig)
* a duplicate.
*/
read_lock(&tasklist_lock);
- do_notify_parent_cldstop(current, false, why);
+ do_notify_parent_cldstop(current, false, CLD_CONTINUED);
if (ptrace_reparented(current->group_leader))
do_notify_parent_cldstop(current->group_leader,
- true, why);
+ true, CLD_CONTINUED);
read_unlock(&tasklist_lock);
goto relock;
--
2.10.1
We call wait_task_zombie to reap child processes and to release
zombie ptraced threads. Collect all of the tests for these two
cases into their own individual if statements to make it clear
what is going on.
This results in no change in behavior, just a change in how the code is
written.
Seeing how these two cases that call wait_task_zombie
are equivalent to the original is not trivial, so here
is my reasoning:
/*
* Original test ---------------------------------------------
*/
if ((exit_state == EXIT_ZOMBIE) &&
!delay_group_leader(p) &&
((ptrace ||
(p->ptrace && !ptrace_reparented(p))) ||
!p->ptrace))
return wait_task_zombie(wo, p);
/*
* Expand delay_group_leader --------------------------------
*/
if ((exit_state == EXIT_ZOMBIE) &&
!(thread_group_leader(p) && !thread_group_empty(p)) &&
((ptrace ||
(p->ptrace && !ptrace_reparented(p))) ||
!p->ptrace))
return wait_task_zombie(wo, p);
/*
* Moving ! via DeMorgan's law ------------------------------
*/
if ((exit_state == EXIT_ZOMBIE) &&
(!thread_group_leader(p) || thread_group_empty(p)) &&
((ptrace ||
(p->ptrace && !ptrace_reparented(p))) ||
!p->ptrace))
return wait_task_zombie(wo, p);
/*
* Three basic cases from the last || expression ------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace && !p->ptrace &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/* ptraced zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace && p->ptrace &&
!ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/* ptraced process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Combining the first two cases ----------------------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace &&
!ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/* ptraced process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Exploiting !ptrace implies thread_group_leader -----------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace &&
!ptrace_reparented(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Splitting the second case by ptrace_reparented -----------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace &&
!ptrace_reparented(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced child process or child thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/* ptraced process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Splitting the second case by leadership ------------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) && !ptrace &&
!ptrace_reparented(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced child process? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced child thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!ptrace_reparented(p) &&
!thread_group_leader(p))
return wait_task_zombie(wo, p);
/* ptraced process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Combining the first two cases ----------------------------
* Using the knowledge that !ptrace implies thread_group_leader
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) &&
!ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced child thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!ptrace_reparented(p) &&
!thread_group_leader(p))
return wait_task_zombie(wo, p);
/* ptraced non-child process or thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
(!thread_group_leader(p) || thread_group_empty(p)))
return wait_task_zombie(wo, p);
/*
* Splitting the thread case by leadership ------------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) &&
!ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced child thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!ptrace_reparented(p) &&
!thread_group_leader(p))
return wait_task_zombie(wo, p);
/* ptraced non-child thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
!thread_group_leader(p))
return wait_task_zombie(wo, p);
/* ptraced non-child process? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/*
* Combining the second and third cases ---------------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) &&
!ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced thread? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
!thread_group_leader(p))
return wait_task_zombie(wo, p);
/* ptraced non-child process? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/*
* Combining the second and third cases ---------------------
*/
/* zombie child process? */
if ((exit_state == EXIT_ZOMBIE) &&
!ptrace_reparented(p) &&
thread_group_leader(p) &&
thread_group_empty(p))
return wait_task_zombie(wo, p);
/* ptraced thread or ptraced non-child process? */
if ((exit_state == EXIT_ZOMBIE) && ptrace &&
(!thread_group_leader(p) ||
(ptrace_reparented(p) && thread_group_empty(p))))
return wait_task_zombie(wo, p);
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 44 +++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 306e526f4c5e..9b70e21c960d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1342,6 +1342,24 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
if (!ret)
return ret;
+ /* zombie child process? */
+ if ((exit_state == EXIT_ZOMBIE) &&
+ !ptrace_reparented(p) &&
+ thread_group_leader(p) &&
+ thread_group_empty(p))
+ return wait_task_zombie(wo, p);
+
+ /*
+ * A zombie ptracee that is not a child of its ptracer's
+ * thread group is only visible to its ptracer. Notification
+ * and reaping will be cascaded to the real parent when the
+ * ptracer detaches.
+ */
+ if ((exit_state == EXIT_ZOMBIE) && ptrace &&
+ (!thread_group_leader(p) ||
+ (ptrace_reparented(p) && thread_group_empty(p))))
+ return wait_task_zombie(wo, p);
+
if (unlikely(exit_state == EXIT_TRACE)) {
/*
* ptrace == 0 means we are the natural parent. In this case
@@ -1354,33 +1372,17 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
if (likely(!ptrace) && unlikely(p->ptrace)) {
/*
- * If it is traced by its real parent's group, just pretend
- * the caller is ptrace_do_wait() and reap this child if it
- * is zombie.
- *
- * This also hides group stop state from real parent; otherwise
- * a single stop can be reported twice as group and ptrace stop.
- * If a ptracer wants to distinguish these two events for its
- * own children it should create a separate process which takes
- * the role of real parent.
+ * Hide group stop state from real parent; otherwise a single
+ * stop can be reported twice as group and ptrace stop. If a
+ * ptracer wants to distinguish these two events for its own
+ * children it should create a separate process which takes the
+ * role of real parent.
*/
if (!ptrace_reparented(p))
ptrace = 1;
}
- /* slay zombie? */
if (exit_state == EXIT_ZOMBIE) {
- /* we don't reap group leaders with subthreads */
- if (!delay_group_leader(p)) {
- /*
- * A zombie ptracee is only visible to its ptracer.
- * Notification and reaping will be cascaded to the
- * real parent when the ptracer detaches.
- */
- if (unlikely(ptrace) || likely(!p->ptrace))
- return wait_task_zombie(wo, p);
- }
-
/*
* Allow access to stopped/continued state via zombie by
* falling through. Clearing of notask_error is complex.
--
2.10.1
The function zap_pid_ns_processes terminates when the number of pids
used by processes in a pid namespace drops to just those pids used by
the last thread of the dying thread group.
Don't allow an init process, aka a child_reaper, to call
setpgid(0, some_other_processes_pid). That case is already broken
today, as it would result in a pid namespace that will hang when the
thread group leader dies. Thankfully I have not received that bug
report, so it appears that no one cares about or uses that case.
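As a hedged illustration of the case being forbidden (the helper
names and the sleep-based ordering below are mine, not from the
patch, and creating a pid namespace requires privilege):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t init;

	if (unshare(CLONE_NEWPID) < 0) {
		perror("unshare");
		return 1;
	}
	init = fork();			/* First child is pid 1 in the ns. */
	if (init == 0) {
		pid_t helper = fork();
		if (helper == 0) {
			setpgid(0, 0);	/* Become a process group leader. */
			pause();
			_exit(0);
		}
		sleep(1);		/* Crude: let the helper move first. */
		/* A pid namespace init joining another process's group
		 * is what the patch below rejects. */
		if (setpgid(0, helper) < 0)
			perror("setpgid from the child reaper");
		kill(helper, SIGKILL);
		_exit(0);
	}
	waitpid(init, NULL, 0);
	return 0;
}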
Limiting setpgid ensures that the only two pids in the pid namespace
on the init process that are worth worrying about are the pid and the
tgid. The pgrp will now either match the tgid or it will be from
outside the pid namespace. Likewise the sid will either match the
tgid or be from outside the pid namespace.
To make it clear what is being counted, test whether the task's tgid
is the same as the task's pid. In particular the code does not count the
number of processes in a pid namespace, just the number of pids those
processes use. A subtle but important distinction for understanding
the code.
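For reference, a simplified sketch of the loop being described,
paraphrased from zap_pid_ns_processes() with the surrounding details
elided:

	/* After SIGKILL has been sent to everything in the namespace,
	 * the namespace init sleeps until the only pids still hashed
	 * are its own pid and (if different) its tgid -- hence
	 * init_pids is 1 or 2. */
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (pid_ns->nr_hashed == init_pids)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);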
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/pid_namespace.c | 2 +-
kernel/sys.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 74a5a7255b4d..bdda73768cc0 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -208,7 +208,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
int nr;
int rc;
struct task_struct *task, *me = current;
- int init_pids = thread_group_leader(me) ? 1 : 2;
+ int init_pids = task_pid(me) != task_tgid(me) ? 2 : 1;
/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);
diff --git a/kernel/sys.c b/kernel/sys.c
index 705f14b28134..775dea1e2e06 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -975,7 +975,8 @@ SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
pgrp = find_vpid(pgid);
g = pid_task(pgrp, PIDTYPE_PGID);
- if (!g || task_session(g) != task_session(group_leader))
+ if (!g || task_session(g) != task_session(group_leader) ||
+ is_child_reaper(task_tgid(p)))
goto out;
}
--
2.10.1
If the only job of the signal is to report a ptrace level event, set
si_code to CLD_TRAPPED instead of possibly CLD_STOPPED.
This causes the siginfo of the signals that are sent to match the
siginfo of the signals returned by waitid.
This is a user visible difference but I don't expect anything will
care.
In fact this is a return to historical linux behavior. In linux 2.4.0
all ptrace stops were reported through do_notify_parent with
CLD_TRAPPED. When do_notify_parent_cldstop was added the CLD_TRAPPED
logic was not included and CLD_TRAPPED for ptrace stops was lost. As
nothing was said about this case I assume it was an oversight.
When waitid was added a little earlier all stops were being
reported with do_notify_parent and all ptrace stops were setting
CLD_TRAPPED. So initially signals and waitid were in sync with
respect to setting CLD_TRAPPED.
It is also worth knowing that posix documents CLD_TRAPPED
as "Traced child has trapped."
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Ref: ca3f74aa7baa ("[PATCH] waitid system call")
Fixes: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index ff41e6ee2b5e..0d4ca87f1fee 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1886,9 +1886,9 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
* separately unless they are going to be identical.
*/
if (!identical)
- do_notify_parent_cldstop(current, true, why);
+ do_notify_parent_cldstop(current, true, CLD_TRAPPED);
if (gstop_done)
- do_notify_parent_cldstop(current, false, why);
+ do_notify_parent_cldstop(current, false, CLD_STOPPED);
/*
* Don't want to allow preemption here, because
@@ -1912,7 +1912,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
* the real parent of the group stop completion is enough.
*/
if (gstop_done)
- do_notify_parent_cldstop(current, false, why);
+ do_notify_parent_cldstop(current, false, CLD_STOPPED);
/* tasklist protects us from ptrace_freeze_traced() */
__set_current_state(TASK_RUNNING);
--
2.10.1
> Another easy entry point is to see that a multi-threaded setuid won't
> change the credentials on a zombie thread group leader. Which can allow
> sending signals to a process that the credential change should forbid.
> This is in violation of posix and the semantics we attempt to enforce in
> linux.
I might be completely wrong on this point (and I haven't looked at the
patches), but I was under the impression that multi-threaded set[ug]id
was implemented in userspace (by glibc's nptl(7) library that uses RT
signals internally to get each thread to update their credentials). And
given that, I wouldn't be surprised (as a user) that zombie threads will
have stale credentials (glibc isn't running in those threads anymore).
Am I mistaken in that belief?
</off-topic>
--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
On Tue, Jun 6, 2017 at 12:03 PM, Eric W. Biederman
<[email protected]> wrote:
>
> As this is more permissive there is no chance anything will break.
Actually, I do worry about the security issues here.
The thing is, the parent may be some system daemon that wants to catch
SIGCHLD, but we've used prctl and changed pdeath_signal to something
else (like SIGSEGV or something).
Do we really want to be able to kill a system daemon that we couldn't
use kill() on directly, just because that system daemon spawned us?
So I think those permission checks may actually be a good idea.
Although possibly they should be in prctl()..
Linus
On Tue, Jun 6, 2017 at 12:01 PM, Eric W. Biederman
<[email protected]> wrote:
>
> I am posting these patches in the hope of some review of the strategy I
> am taking and to let the individual patches be reviewed.
I'm trying to look through these, and finding (as usual) that the
signal handling and exit code is extremely scary from a correctness
and security standpoint.
I really want Oleg to review/ack these. Oleg?
I also would really really want to see the stuff that actually changes
semantics split out.
For example, I feel much less nervous about things like making the
tasklist RCU-safe. So I'd like to see changes like that be separated
out from the much scarier ones. Would that be possible? Hint hint..
Linus
Eric,
On Tue, Jun 6, 2017 at 9:03 PM, Eric W. Biederman <[email protected]> wrote:
> This fixes an old, old regression. When Roland switched from
> sending pdeath_signal with send_sig() to send_group_sig_info
> it gained a permission check, and started taking the tasklist
> lock. Roland earlier fixed the double taking of the tasklist
> lock in 3f2a0d1df938 ("[PATCH] fix pdeath_signal SMP locking")
> but pdeath_signal still performs an unnecessary permission
> check.
>
> Ordinarily I would be hesitant at fixing an ancient regression but
> a permission check for our parent sending to us is almost never
> likely to fail (so it is unlikely anyone has noticed), and it
> is stupid. It makes absolutely no sense to see if our
> parent has permission to send us a signal we requested be
> sent to us.
Another side effect of this change is that audit_signal_info() is no longer
being called. AFAICT this is a user space visible change.
--
Thanks,
//richard
Linus Torvalds <[email protected]> writes:
> On Tue, Jun 6, 2017 at 12:03 PM, Eric W. Biederman
> <[email protected]> wrote:
>>
>> As this is more permissive there is no chance anything will break.
>
> Actually, I do worry about the security issues here.
>
> The thing is, the parent may be some system daemon that wants to catch
> SIGCHLD, but we've used prctl and changed pdeath_signal to something
> else (like SIGSEGV or something).
>
> Do we really want to be able to kill a system daemon that we couldn't
> use kill() on directly, just because that system daemon spawned us?
>
> So I think those permission checks may actually be a good idea.
> Although possibly they should be in prctl()..
To be clear: the pdeath signal is the signal we receive when our parent
dies. It is the parent death signal.
AKA: when the system daemon (or whatever it is) dies, what signal does
the child process that called the prctl get?
There is no chance of killing the system daemon that spawned us,
as the signal only gets sent to ourselves.
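A minimal sketch, purely illustrative (the choice of SIGTERM is
mine), of the mechanism being discussed:

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
	if (fork() == 0) {
		/* Child: ask to receive SIGTERM when the parent dies.
		 * The signal is delivered to this process only; the
		 * parent is never signalled. */
		prctl(PR_SET_PDEATHSIG, SIGTERM);
		pause();	/* Dies from SIGTERM once the parent exits. */
		_exit(0);
	}
	sleep(1);		/* Parent exits; the child then gets SIGTERM. */
	return 0;
}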
Eric
Aleksa Sarai <[email protected]> writes:
>> Another easy entry point is to see that a multi-threaded setuid won't
>> change the credentials on a zombie thread group leader. Which can allow
>> sending signals to a process that the credential change should forbid.
>> This is in violation of posix and the semantics we attempt to enforce in
>> linux.
>
> I might be completely wrong on this point (and I haven't looked at the patches),
> but I was under the impression that multi-threaded set[ug]id was implemented in
> userspace (by glibc's nptl(7) library that uses RT signals internally to get
> each thread to update their credentials). And given that, I wouldn't be
> surprised (as a user) that zombie threads will have stale credentials (glibc
> isn't running in those threads anymore).
>
> Am I mistaken in that belief?
Would you be surprised if you learned that if your first thread
exits, it will become a zombie and persist for the lifetime of your
process?
Furthermore all non-thread specific signals will permission check
against that first zombie thread.
Which I think makes this surprising even if you know that setuid is
implemented in userspace.
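A hedged sketch of that surprise (build with -pthread; tools like ps
will show the leader in state Z while the process keeps running):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		printf("pid %d lives on; its leader is a zombie\n",
		       (int)getpid());
		sleep(5);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	pthread_exit(NULL);	/* Only the leader exits; the process,
				 * and the zombie leader, remain. */
}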
Eric
On 06/07/2017 09:36 PM, Eric W. Biederman wrote:
>>> Another easy entry point is to see that a multi-threaded setuid won't
>>> change the credentials on a zombie thread group leader. Which can allow
>>> sending signals to a process that the credential change should forbid.
>>> This is in violation of posix and the semantics we attempt to enforce in
>>> linux.
>>
>> I might be completely wrong on this point (and I haven't looked at the patches),
>> but I was under the impression that multi-threaded set[ug]id was implemented in
>> userspace (by glibc's nptl(7) library that uses RT signals internally to get
>> each thread to update their credentials). And given that, I wouldn't be
>> surprised (as a user) that zombie threads will have stale credentials (glibc
>> isn't running in those threads anymore).
>>
>> Am I mistaken in that belief?
>
> Would you be surprised if you learned that if your first thread
> exits, it will become a zombie and persist for the lifetime of your
> process?
>
> Furthermore all non-thread specific signals will permission check
> against that first zombie thread.
Ah okay, so it really is a matter of Linux's threadgroup semantics just
not being "right" on a more fundamental level than nptl.
> Which I think makes this surprising even if you know that setuid is
> implemented in userspace.
Quite surprising, thanks for the explanation.
--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Hi Eric,
I'll try very much to read this series tomorrow, can't do this today...
On 06/06, Eric W. Biederman wrote:
>
> @@ -1380,13 +1380,6 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
> return -EPERM;
> }
>
> - /* protect tsk->signal and tsk->sighand from disappearing */
> - read_lock(&tasklist_lock);
> - if (!tsk->sighand) {
> - retval = -ESRCH;
> - goto out;
> - }
Yes, the comment is wrong.
However we do need read_lock(tasklist_lock) to access ->group_leader. And the
->sighand != NULL check ensures that ->group_leader is the valid pointer.
Also, update_rlimit_cpu() is not safe without tasklist / sighand-check.
We can probably change this code to rely on rcu.
Oleg.
On 06/06, Eric W. Biederman wrote:
>
> Add a rcu_usage to task_struct and use it to reuse the delayed rcu put
> logic from release_task in finish_task_switch.
I didn't really read this patch yet, but at first glance it should work and
you can also remove the ->exit_state check/limitation in rcuwait_wait_event().
Oleg.
Oleg Nesterov <[email protected]> writes:
> Hi Eric,
>
> I'll try very much to read this series tomorrow, can't do this today...
>
> On 06/06, Eric W. Biederman wrote:
>>
>> @@ -1380,13 +1380,6 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
>> return -EPERM;
>> }
>>
>> - /* protect tsk->signal and tsk->sighand from disappearing */
>> - read_lock(&tasklist_lock);
>> - if (!tsk->sighand) {
>> - retval = -ESRCH;
>> - goto out;
>> - }
>
> Yes, the comment is wrong.
>
> However we do need read_lock(tasklist_lock) to access ->group_leader. And the
> ->sighand != NULL check ensures that ->group_leader is the valid
> pointer.
As of 4.12-rc1 the code does not access group_leader anymore.
> Also, update_rlimit_cpu() is not safe without tasklist / sighand-check.
>
> We can probably change this code to rely on rcu.
Good point: a NULL sighand will cause update_rlimit_cpu to OOPS.
Grr. There is a point in my tree where this is perfectly safe, but not
at this point. Consider this patch dropped for the moment.
Eric
Linus Torvalds <[email protected]> writes:
> On Tue, Jun 6, 2017 at 12:01 PM, Eric W. Biederman
> <[email protected]> wrote:
>>
>> I am posting these patches in the hope of some review of the strategy I
>> am taking and to let the individual patches be reviewed.
>
> I'm trying to look through these, and finding (as usual) that the
> signal handling and exit code is extremely scary from a correctness
> and security standpoint.
>
> I really want Oleg to review/ack these. Oleg?
Definitely. The more review I can get the better.
> I also would really really want to see the stuff that actually changes
> semantics split out.
>
> For example, I feel much less nervous about things like making the
> tasklist RCU-safe. So I'd like to see changes like that be separated
> out from the much scarier ones. Would that be possible? Hint hint..
The patches that I posted are the ones that I would really like to have
ready for the 4.13 merge window. They are supposed to be the least
scary patches that don't really depend on anything else, and the semantic changes.
After a first brush with code review, the non-scary patches wind up
being just these:
[PATCH 01/26] alpha: Remove unused TASK_GROUP_LEADER
[PATCH 02/26] cgroup: Don't open code tasklist_empty()
[PATCH 05/26] exit: Remove the pointless clearing of SIGPENDING in __exit_signal
[PATCH 07/26] pidns: Improve the error handling in alloc_pid
[PATCH 08/26] exit: Make the runqueue rcu safe
This turns out to have a dependency I overlooked so I am dropping it for now.
[PATCH 06/26] rlimit: Remove unnecessary grab of tasklist_lock
There is a small pair of a semantic change and its dependency:
[PATCH 03/26] signal: Do not perform permission checks when sending pdeath_signal
[PATCH 04/26] signal: Make group_send_sig_info static
There is a teeny tiny semantic change:
[PATCH 09/26] signal: Don't allow sending SIGKILL or SIGSTOP to init
The deeply related scarier changes (+ indicates it includes a semantic change):
[PATCH 10/26] ptrace: Simplify ptrace_detach & exit_ptrace
+[PATCH 11/26] wait: Properly implement __WCLONE handling in the presence of exec and ptrace
[PATCH 12/26] wait: Directly test for the two cases where wait_task_zombie is called
[PATCH 13/26] wait: Remove unused delay_group_leader
+[PATCH 14/26] wait: Move changing of ptrace from wait_consider_task into wait_task_stopped
+[PATCH 15/26] wait: Don't delay !ptrace_reparented leaders
+[PATCH 16/26] exit: Fix reporting a ptraced !reparented leader has exited
[PATCH 17/26] exit: Rework the exit states for ptracees
+[PATCH 18/26] wait: Fix WSTOPPED on a ptraced child
[PATCH 19/26] wait: Simpler code for clearing notask_error in wait_consider_task
[PATCH 20/26] wait: Don't pass the list to wait_consider_task
[PATCH 21/26] wait: Optmize waitpid
+[PATCH 22/26] exit: Fix auto-wait of ptraced children
+[PATCH 23/26] signal: Fix SIGCONT before group stop completes.
+[PATCH 24/26] signal: In ptrace_stop improve identical signal detection
+[PATCH 25/26] signal: In ptrace_stop use CLD_TRAPPED in all ptrace signals
+[PATCH 26/26] pidns: Ensure zap_pid_ns_processes always terminates
Out of my 170 or so total changes those are the bulk of the semantic
changes and definitely the scariest. Even those above are very
conservative and are really just about sorting out the weirdness in the
semantics when we are ptracing our own child, which causes wait and the
siginfo of SIGCHLD to have two sets of meanings.
I am hoping we can just get 1,2,5,7, and 8 reviewed and I can just apply
them to my for-next branch.
If I need to repost I will respect the split-out I have described above.
At this point I am happy to see that people are not scared when I
suggest killing the concept of the thread group leader.
Eric