Hello,
Eric, I can't understand why the commit ("signal: Guarantee that
SIGNAL_GROUP_EXIT is set on process exit") added the new
quick_threads counter. And why, if we forget about --quick_threads,
synchronize_group_exit() has to take siglock unconditionally.
Did I miss something obvious?
Tejun, could you check the note about css_task_iter_advance() in
the changelog? And can't it use thread_group_empty() instead?
Oleg.
do_exit() can simply decrement signal_struct->live earlier and call
synchronize_group_exit() only if group_dead is true. Also, this code
can avoid spin_lock_irq(siglock) if SIGNAL_GROUP_EXIT is already set.
The only "nontrivial" user of signal->live is css_task_iter_advance()
but it should not be affected. With or without this change the task
is the exiting/exited leader which has already passed cgroup_exit(),
atomic_read(signal->live) can race with sub-threads but this is fine.
Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/sched/signal.h | 1 -
kernel/exit.c | 12 +++++++-----
kernel/fork.c | 2 --
3 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0a0e23c45406..f98aae40d7e6 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -95,7 +95,6 @@ struct signal_struct {
refcount_t sigcnt;
atomic_t live;
int nr_threads;
- int quick_threads;
struct list_head thread_head;
wait_queue_head_t wait_chldexit; /* for wait4() */
diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..11bac37e2bc6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -800,10 +800,11 @@ static void synchronize_group_exit(struct task_struct *tsk, long code)
struct sighand_struct *sighand = tsk->sighand;
struct signal_struct *signal = tsk->signal;
+ if (READ_ONCE(signal->flags) & SIGNAL_GROUP_EXIT)
+ return;
+
spin_lock_irq(&sighand->siglock);
- signal->quick_threads--;
- if ((signal->quick_threads == 0) &&
- !(signal->flags & SIGNAL_GROUP_EXIT)) {
+ if (!(signal->flags & SIGNAL_GROUP_EXIT)) {
signal->flags = SIGNAL_GROUP_EXIT;
signal->group_exit_code = code;
signal->group_stop_count = 0;
@@ -818,7 +819,9 @@ void __noreturn do_exit(long code)
WARN_ON(irqs_disabled());
- synchronize_group_exit(tsk, code);
+ group_dead = atomic_dec_and_test(&tsk->signal->live);
+ if (group_dead)
+ synchronize_group_exit(tsk, code);
WARN_ON(tsk->plug);
@@ -833,7 +836,6 @@ void __noreturn do_exit(long code)
exit_signals(tsk); /* sets PF_EXITING */
acct_update_integrals(tsk);
- group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
/*
* If the last thread of global init has exited, panic
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..4c361d2bdc12 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1861,7 +1861,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return -ENOMEM;
sig->nr_threads = 1;
- sig->quick_threads = 1;
atomic_set(&sig->live, 1);
refcount_set(&sig->sigcnt, 1);
@@ -2590,7 +2589,6 @@ __latent_entropy struct task_struct *copy_process(
__this_cpu_inc(process_counts);
} else {
current->signal->nr_threads++;
- current->signal->quick_threads++;
atomic_inc(¤t->signal->live);
refcount_inc(¤t->signal->sigcnt);
task_join_group_stop(p);
--
2.25.1.362.g51ebf55
On 06/09, Oleg Nesterov wrote:
>
> Tejun, could you check the note about css_task_iter_advance() in
> the changelog?
Yes,
> And can't it use thread_group_empty() instead?
Please forget. I'll write another email.
Oleg.
I never understood the code in kernel/cgroup/ even remotely, most probably
I missed something, but let me ask a couple of stupid questions anyway.
cgroup_exit() does
css_set_move_task(tsk, cset, NULL, false);
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
but unless I am totally confused css_task_iter_advance() always ignores
the "dying" sub-threads, so perhaps it should do, say,
css_set_move_task(tsk, cset, NULL, false);
if (delay_group_leader(tsk))
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
and then cgroup_release() can check list_empty(cg_list) before it takes
css_set_lock.
No ?
Or, perhaps we can do even better? Can't cgroup_exit() do something like
// group_dead should be passed from do_exit()
css_set_move_task(tsk, cset, NULL, false);
if (thread_group_leader(tsk) && !group_dead)
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
else if (!thread_group_leader(tsk) && group_dead) {
leader = tsk->group_leader;
if (!list_empty(leader->cg_list) {
css_set_skip_task_iters(task_css_set(leader), leader);
list_del_init(&leader->cg_list);
}
}
and then
- kill the atomic_read(&task->signal->live)) check in
css_task_iter_advance()
- kill the code under css_set_lock in cgroup_release()
Oleg.
Sorry for the spam, forgot to mention. Either way the usage of group_dead
or atomic_read(signal->live) in these paths doesn't look "perfect", but
this is another thing. The pseudo code below tries to mimic the current
logic but again, I'm afraid I misread this code completely.
On 06/10, Oleg Nesterov wrote:
>
> I never understood the code in kernel/cgroup/ even remotely, most probably
> I missed something, but let me ask a couple of stupid questions anyway.
>
> cgroup_exit() does
>
> css_set_move_task(tsk, cset, NULL, false);
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> but unless I am totally confused css_task_iter_advance() always ignores
> the "dying" sub-threads, so perhaps it should do, say,
>
> css_set_move_task(tsk, cset, NULL, false);
> if (delay_group_leader(tsk))
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> and then cgroup_release() can check list_empty(cg_list) before it takes
> css_set_lock.
>
> No ?
>
> Or, perhaps we can do even better? Can't cgroup_exit() do something like
>
> // group_dead should be passed from do_exit()
>
> css_set_move_task(tsk, cset, NULL, false);
>
> if (thread_group_leader(tsk) && !group_dead)
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> else if (!thread_group_leader(tsk) && group_dead) {
> leader = tsk->group_leader;
> if (!list_empty(leader->cg_list) {
> css_set_skip_task_iters(task_css_set(leader), leader);
> list_del_init(&leader->cg_list);
> }
> }
>
> and then
>
> - kill the atomic_read(&task->signal->live)) check in
> css_task_iter_advance()
>
> - kill the code under css_set_lock in cgroup_release()
>
> Oleg.
Oleg Nesterov <[email protected]> writes:
> Hello,
>
> Eric, I can't understand why the commit ("signal: Guarantee that
> SIGNAL_GROUP_EXIT is set on process exit") added the new
> quick_threads counter. And why, if we forget about --quick_threads,
> synchronize_group_exit() has to take siglock unconditionally.
> Did I miss something obvious?
At a minimum it is the exact same locking as everywhere else that sets
signal->flags, signal->group_exit_code, and signal->group_stop_count
uses.
So it would probably require some significant reason to not use
the same locking and complicate reasoning about the code. I suspect
setting those values without siglock held is likely to lead to
interesting races.
May I ask which direction you are coming at this from? Are you trying
to reduce the cost of do_exit? Are you interested in untangling the
mess that is exiting threads in a process?
I have a branch around that I was slowly working through to detangle
the entire mess. And if you are interested I can dig it back up.
My memory is I had all of the known issues worked through but I still
needed to feed the code through code review and merge it in small steps
to ensure I don't introduce regressions.
That is where signal->quick_threads comes from. In the work it is a
part of I wind up moving the decrement up much sooner to the point where
individual threads decide to exit. The decrement of signal->live comes
much too late to be useful in that context.
It is also part of me wanting to be able to uniformly use
SIGNAL_GROUP_EXIT and signal->group_exit_code when talking about the
process state, and p->exit_code when talking about the per task state.
At the moment I am staring at wait_task_zombie and trying to understand
how:
status = (p->signal->flags & SIGNAL_GROUP_EXIT)
? p->signal->group_exit_code : p->exit_code;
works without any locks or barriers.
Eric
Hi Eric, thanks for looking at this.
Let me answer your questions out-of-order. But, before anything else,
do you see anything wrong in 1/1 ?
On 06/10, Eric W. Biederman wrote:
>
> May I ask which direction you are coming at this from? Are you trying
> to reduce the cost of do_exit? Are you interested in untangling the
> mess that is exiting threads in a process?
I am trying to understand why do we need another counter.
And, I'd like to cleanup the usage of task->signal->live, I think it
should be avoided (if possible) when task != current. IIRC, we even
discussed this some time ago but I can't find any reference.
See also another thread about css_task_iter_advance().
> > Eric, I can't understand why the commit ("signal: Guarantee that
> > SIGNAL_GROUP_EXIT is set on process exit") added the new
> > quick_threads counter. And why, if we forget about --quick_threads,
> > synchronize_group_exit() has to take siglock unconditionally.
> > Did I miss something obvious?
>
> At a minimum it is the exact same locking as everywhere else that sets
> signal->flags, signal->group_exit_code, and signal->group_stop_count
> uses.
>
> So it would probably require some significant reason to not use
> the same locking and complicate reasoning about the code. I suspect
> setting those values without siglock held is likely to lead to
> interesting races.
I guess I was not clear. Of course, SIGNAL_GROUP_EXIT must be always
set under ->siglock. But I think synchronize_group_exit() can just
return if SIGNAL_GROUP_EXIT is already set? If nothing else, this is
what do_group_exit() does.
Or I misunderstood you?
> That is where signal->quick_threads comes from. In the work it is a
> part of I wind up moving the decrement up much sooner to the point where
> individual threads decide to exit. The decrement of signal->live comes
> much too late to be useful in that context.
And that is why this patch moves the decrement of signal->live to the
start of do_exit().
> It is also part of me wanting to be able to uniformly use
> SIGNAL_GROUP_EXIT and signal->group_exit_code when talking about the
> process state, and p->exit_code when talking about the per task state.
Agreed,
> At the moment I am staring at wait_task_zombie and trying to understand
> how:
>
> status = (p->signal->flags & SIGNAL_GROUP_EXIT)
> ? p->signal->group_exit_code : p->exit_code;
>
> works without any locks or barriers.
Agreed, at first glance this looks worrying without siglock... I'll try
to take a look, perhaps we can simply kill the SIGNAL_GROUP_EXIT check,
not sure.
But this patch should not make any difference ?
Oleg.
Forgot to mention ...
On 06/10, Oleg Nesterov wrote:
>
> I guess I was not clear. Of course, SIGNAL_GROUP_EXIT must be always
> set under ->siglock. But I think synchronize_group_exit() can just
> return if SIGNAL_GROUP_EXIT is already set? If nothing else, this is
> what do_group_exit() does.
Just in case... I guess do_group_exit() needs some cleanups too. For
example,
if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
without siglock is racy, exit_code can be wrong (not a big problem).
LOAD(sig->flags) can make KCSAN unhappy. Etc.
But this all has nothing to do with 1/1 ?
Oleg.
On 06/10, Oleg Nesterov wrote:
>
> > At the moment I am staring at wait_task_zombie and trying to understand
> > how:
> >
> > status = (p->signal->flags & SIGNAL_GROUP_EXIT)
> > ? p->signal->group_exit_code : p->exit_code;
> >
> > works without any locks or barriers.
>
> Agreed, at first glance this looks worrying without siglock... I'll try
> to take a look, perhaps we can simply kill the SIGNAL_GROUP_EXIT check,
> not sure.
Still not sure, will check tomorrow, but probably yes. We can rely on the
the exit_state == EXIT_ZOMBIE check in the caller and tasklist_lock.
OTOH, we should probably take ptrace into account...
In short, I am still not sure ;) but I agree this code needs some changes,
the comments at least.
> But this patch should not make any difference ?
Yes.
Oleg.
On Mon, Jun 10, 2024 at 01:08:52PM +0200, Oleg Nesterov wrote:
> Sorry for the spam, forgot to mention. Either way the usage of group_dead
> or atomic_read(signal->live) in these paths doesn't look "perfect", but
> this is another thing. The pseudo code below tries to mimic the current
> logic but again, I'm afraid I misread this code completely.
The usage of signal->live there is something I added without much thinking.
I just needed something which goes off after all sub-threads are gone.
Getting @group_dead from do_exit() sounds perfectly fine to me.
Thanks.
--
tejun
Hello, Oleg.
On Mon, Jun 10, 2024 at 12:50:28PM +0200, Oleg Nesterov wrote:
> I never understood the code in kernel/cgroup/ even remotely, most probably
> I missed something, but let me ask a couple of stupid questions anyway.
>
> cgroup_exit() does
>
> css_set_move_task(tsk, cset, NULL, false);
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> but unless I am totally confused css_task_iter_advance() always ignores
> the "dying" sub-threads, so perhaps it should do, say,
>
> css_set_move_task(tsk, cset, NULL, false);
> if (delay_group_leader(tsk))
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> and then cgroup_release() can check list_empty(cg_list) before it takes
> css_set_lock.
>
> No ?
Yeah, I think so. The current code hasn't broken for quite a while but it
also hasn't received much attention after the iterator updates which were
very much in the spirit of just getting it to work. I don't think you're
missing anything.
> Or, perhaps we can do even better? Can't cgroup_exit() do something like
>
> // group_dead should be passed from do_exit()
>
> css_set_move_task(tsk, cset, NULL, false);
>
> if (thread_group_leader(tsk) && !group_dead)
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>
> else if (!thread_group_leader(tsk) && group_dead) {
> leader = tsk->group_leader;
> if (!list_empty(leader->cg_list) {
> css_set_skip_task_iters(task_css_set(leader), leader);
> list_del_init(&leader->cg_list);
> }
> }
>
> and then
>
> - kill the atomic_read(&task->signal->live)) check in
> css_task_iter_advance()
>
> - kill the code under css_set_lock in cgroup_release()
That does sound a lot better than the current code. Care to send the patch?
Thanks.
--
tejun
So...
Eric, do you agree with this patch or not?
Tejun, sorry for delay, I'll try to send the patch which cleanups
(at least in my opinion) the ->dying_tasks logic as soon as I have
time. But just in case... no, cgroup_exit() can't rely on group_exit
passed from the caller, I was wrong ;)
On 06/10, Oleg Nesterov wrote:
>
> Hi Eric, thanks for looking at this.
>
> Let me answer your questions out-of-order. But, before anything else,
> do you see anything wrong in 1/1 ?
>
> On 06/10, Eric W. Biederman wrote:
> >
> > May I ask which direction you are coming at this from? Are you trying
> > to reduce the cost of do_exit? Are you interested in untangling the
> > mess that is exiting threads in a process?
>
> I am trying to understand why do we need another counter.
>
> And, I'd like to cleanup the usage of task->signal->live, I think it
> should be avoided (if possible) when task != current. IIRC, we even
> discussed this some time ago but I can't find any reference.
>
> See also another thread about css_task_iter_advance().
>
> > > Eric, I can't understand why the commit ("signal: Guarantee that
> > > SIGNAL_GROUP_EXIT is set on process exit") added the new
> > > quick_threads counter. And why, if we forget about --quick_threads,
> > > synchronize_group_exit() has to take siglock unconditionally.
> > > Did I miss something obvious?
> >
> > At a minimum it is the exact same locking as everywhere else that sets
> > signal->flags, signal->group_exit_code, and signal->group_stop_count
> > uses.
> >
> > So it would probably require some significant reason to not use
> > the same locking and complicate reasoning about the code. I suspect
> > setting those values without siglock held is likely to lead to
> > interesting races.
>
> I guess I was not clear. Of course, SIGNAL_GROUP_EXIT must be always
> set under ->siglock. But I think synchronize_group_exit() can just
> return if SIGNAL_GROUP_EXIT is already set? If nothing else, this is
> what do_group_exit() does.
>
> Or I misunderstood you?
>
> > That is where signal->quick_threads comes from. In the work it is a
> > part of I wind up moving the decrement up much sooner to the point where
> > individual threads decide to exit. The decrement of signal->live comes
> > much too late to be useful in that context.
>
> And that is why this patch moves the decrement of signal->live to the
> start of do_exit().
>
> > It is also part of me wanting to be able to uniformly use
> > SIGNAL_GROUP_EXIT and signal->group_exit_code when talking about the
> > process state, and p->exit_code when talking about the per task state.
>
> Agreed,
>
> > At the moment I am staring at wait_task_zombie and trying to understand
> > how:
> >
> > status = (p->signal->flags & SIGNAL_GROUP_EXIT)
> > ? p->signal->group_exit_code : p->exit_code;
> >
> > works without any locks or barriers.
>
> Agreed, at first glance this looks worrying without siglock... I'll try
> to take a look, perhaps we can simply kill the SIGNAL_GROUP_EXIT check,
> not sure.
>
> But this patch should not make any difference ?
>
> Oleg.
Oleg Nesterov <[email protected]> writes:
> So...
>
> Eric, do you agree with this patch or not?
I really don't.
I think skipping some work if SIGNAL_GROUP_EXIT is already
set is not necessarily wrong.
I think we need the quick_threads count, and related cleanups.
I was hoping to be able to post a patchset with this reply
to explain things, but it looks like that is still a couple
of days off.
Where I was going, and where I think we should go with quick_threads is
an exit path that is much easier to reason about and maintain. But the
decrement needs to move into the same code that sets SIGNAL_GROUP_EXIT.
Which makes it something very different from signal->live.
Eric