Ratan Nalumasu reported that in a process with many threads doing
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.
Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.
We can make child_wait_callback() more clever later, right now it only
checks eligible_child().
Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/signal.c | 3 ++-
kernel/exit.c | 22 +++++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)
--- WAIT/kernel/signal.c~2_WAKE_PARENT 2009-06-22 17:19:55.000000000 +0200
+++ WAIT/kernel/signal.c 2009-06-22 18:35:37.000000000 +0200
@@ -1388,7 +1388,8 @@ ret:
static inline void __wake_up_parent(struct task_struct *p,
struct task_struct *parent)
{
- wake_up_interruptible_sync(&parent->signal->wait_chldexit);
+ __wake_up_sync_key(&parent->signal->wait_chldexit,
+ TASK_INTERRUPTIBLE, 1, p);
}
/*
--- WAIT/kernel/exit.c~2_WAKE_PARENT 2009-06-22 17:33:08.000000000 +0200
+++ WAIT/kernel/exit.c 2009-06-22 18:55:08.000000000 +0200
@@ -1089,6 +1089,7 @@ struct wait_opts {
int __user *wo_stat;
struct rusage __user *wo_rusage;
+ wait_queue_t child_wait;
int notask_error;
};
@@ -1558,15 +1559,29 @@ static int ptrace_do_wait(struct wait_op
return 0;
}
+static int child_wait_callback(wait_queue_t *wait, unsigned mode,
+ int sync, void *key)
+{
+ struct wait_opts *wo = container_of(wait, struct wait_opts,
+ child_wait);
+ struct task_struct *p = key;
+
+ if (!eligible_child(wo, p))
+ return 0;
+
+ return default_wake_function(wait, mode, sync, key);
+}
+
static long do_wait(struct wait_opts *wo)
{
- DECLARE_WAITQUEUE(wait, current);
struct task_struct *tsk;
int retval;
trace_sched_process_wait(wo->wo_pid);
- add_wait_queue(¤t->signal->wait_chldexit,&wait);
+ init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
+ wo->child_wait.private = current;
+ add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait);
repeat:
/*
* If there is nothing that can match our critiera just get out.
@@ -1607,7 +1622,8 @@ notask:
}
end:
__set_current_state(TASK_RUNNING);
- remove_wait_queue(¤t->signal->wait_chldexit,&wait);
+ remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait);
+
if (wo->wo_info) {
struct siginfo __user *infop = wo->wo_info;
Looks good, I'm glad to see this revived.
I note that even simpler than eligible_child() is just:
if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private != p->parent)
return 0;
IIRC that is the test that Ratan's original patch used to address the
particular application usage that first troubled him. But probably this
is already what you meant by "more clever later" (and ->parent is perhaps
not right in all cases there).
Your two patches as they are look safe and useful to me and I hope they can
go in soon.
Thanks,
Roland
On 06/24, Roland McGrath wrote:
>
> Looks good, I'm glad to see this revived.
>
> I note that even simpler than eligible_child() is just:
I think the check below is orthogonal to eligible_child(). Not sure
eligible_child() can really help, but otoh it is cheap and doesn't hurt.
But perhaps we can kill it later.
> if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private != p->parent)
> return 0;
>
> IIRC that is the test that Ratan's original patch used to address the
> particular application usage that first troubled him.
Aha, now I see what was the problem with Ratan's workload.
> But probably this
> is already what you meant by "more clever later"
I didn't mean this particular optimization, but it looks good to me.
> (and ->parent is perhaps
> not right in all cases there).
I think this is right... Except I'd like to avoid using ->parent.
> Your two patches as they are look safe and useful to me and I hope they can
> go in soon.
Thanks.
Yes I think these 2 patches should be applied first, even if eligible_child()
itself doesn't buy much. It will be cleaner if we add "real" checks on top.
Oleg.
> I think the check below is orthogonal to eligible_child().
Yes.
> Not sure
> eligible_child() can really help, but otoh it is cheap and doesn't hurt.
> But perhaps we can kill it later.
No, it's a very good check to have. It is ideal for waitpid(ONE_PID,...)
uses, which are also probably quite common.
> > (and ->parent is perhaps
> > not right in all cases there).
>
> I think this is right... Except I'd like to avoid using ->parent.
Agreed. The not-right case I had in mind was do_notify_parent_cldstop
where maybe it could be ->real_parent of the untraced group_leader.
> > Your two patches as they are look safe and useful to me and I hope they can
> > go in soon.
>
> Thanks.
>
> Yes I think these 2 patches should be applied first, even if eligible_child()
> itself doesn't buy much. It will be cleaner if we add "real" checks on top.
Agreed.
Thanks,
Roland
On 06/24, Roland McGrath wrote:
>
> > > (and ->parent is perhaps
> > > not right in all cases there).
> >
> > I think this is right... Except I'd like to avoid using ->parent.
>
> Agreed. The not-right case I had in mind was do_notify_parent_cldstop
> where maybe it could be ->real_parent of the untraced group_leader.
Hmm. Could you explain?
I feel I missed something, but can't understand.
Oleg.
> > Agreed. The not-right case I had in mind was do_notify_parent_cldstop
> > where maybe it could be ->real_parent of the untraced group_leader.
>
> Hmm. Could you explain?
>
> I feel I missed something, but can't understand.
Sorry, it wasn't very clear. (I guess I should have said "traced group
leader", though that would not have been much more obvious.)
do_notify_parent_cldstop:
if (task_ptrace(tsk))
parent = tsk->parent;
else {
tsk = tsk->group_leader;
parent = tsk->real_parent;
}
...
__wake_up_parent(tsk, parent);
In the "else" case, parent is not necessarily tsk->parent. That is, if
an untraced thread calls do_notify_parent_cldstop() but its group_leader
is ptrace_reparented(). Then waitee->group_leader->real_parent is who
gets the wakeup, but __wake_up_parent->child_wait_wakeup would check
only waiter == waitee->group_leader->parent.
Thanks,
Roland
On 06/24, Roland McGrath wrote:
>
> do_notify_parent_cldstop:
>
> if (task_ptrace(tsk))
> parent = tsk->parent;
> else {
> tsk = tsk->group_leader;
> parent = tsk->real_parent;
> }
> ...
> __wake_up_parent(tsk, parent);
>
> In the "else" case, parent is not necessarily tsk->parent. That is, if
> an untraced thread calls do_notify_parent_cldstop() but its group_leader
> is ptrace_reparented(). Then waitee->group_leader->real_parent is who
> gets the wakeup, but __wake_up_parent->child_wait_wakeup would check
> only waiter == waitee->group_leader->parent.
... and in this case we do not wake up ->group_leader->real_parent.
But this is fine? It doesn't make sense to wake up, wait_consider_task()
will notice task_ptrace() and do nothing?
I really need to think with a fresh head, but it seems to me we could add
BUG_ON(p->parent != parent) into wait_consider_task() after "if (ptrace...)"
check.
And this "proves" your check in child_wait_callback() is correct, with
__WNOTHREAD do_wait_thread(parent) is always called with parent ==
sleeper, the caller of do_wait().
No?
Btw, this reminds me that wait_consider_task() doesn't need the "parent"
argument. I noticed this after adding wait_opts, but forgot to send a
patch.
Oleg.
> > an untraced thread calls do_notify_parent_cldstop() but its group_leader
> > is ptrace_reparented(). Then waitee->group_leader->real_parent is who
> > gets the wakeup, but __wake_up_parent->child_wait_wakeup would check
> > only waiter == waitee->group_leader->parent.
>
> ... and in this case we do not wake up ->group_leader->real_parent.
>
> But this is fine? It doesn't make sense to wake up, wait_consider_task()
> will notice task_ptrace() and do nothing?
Right. I knew it needed more thought (that's why I said "maybe wrong" to
begin with ;-).
> I really need to think with a fresh head, but it seems to me we could add
> BUG_ON(p->parent != parent) into wait_consider_task() after "if (ptrace...)"
> check.
>
> And this "proves" your check in child_wait_callback() is correct, with
> __WNOTHREAD do_wait_thread(parent) is always called with parent ==
> sleeper, the caller of do_wait().
>
> No?
Yes, I think that's right.
> Btw, this reminds me that wait_consider_task() doesn't need the "parent"
> argument. I noticed this after adding wait_opts, but forgot to send a
> patch.
With the wq.private now in wait_opts, yes.
Thanks,
Roland
On 06/25, Ratan Nalumasu wrote:
>
> I understood it in the _past_, but somehow forgotten the details--I have a
> vague recollection that there was some special handling for the group leader
> stuff.
> However, to reconfirm, we believe that the
> following condition in eligible_child() is good, right:
> ===
> if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private !=
> p->parent)
> return 0;
> ===
>
> I will run it on my test machines and see if everything looks good.
OK, thanks.
The only problem it uses ->parent, this conflicts with other out-of-tree
ptrace changes...
Roland, do you think we should do this change now or later?
If now, then we can also do another nice optimization, and it is for free.
See the untested patch below. It conflicts with those patches too, but
in both cases fixups are trivial.
Oleg.
------------------------------------------------------------------------
[PATCH] do not place sub-threads on task_struct->children list
Currently we add sub-threads to ->real_parent->children list. This
buys nothing but slows down do_wait().
With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate
over sub-threads by hand. (note that we can move reparent_thread()
outside of while_each_thread() loop and simplify the code further).
>From now do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check. (note that now we can unify
do_wait_thread() and ptrace_do_wait()).
This change can confuse the optimistic search inmm_update_next_owner(),
but this is fixable and minor.
Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.
Do you see any problems?
---
kernel/fork.c | 2 +-
kernel/exit.c | 36 +++++++++++++++++-------------------
2 files changed, 18 insertions(+), 20 deletions(-)
--- WAIT/kernel/fork.c~3_CHILDREN_NO_THREADS 2009-06-19 01:12:47.000000000 +0200
+++ WAIT/kernel/fork.c 2009-06-29 04:30:34.000000000 +0200
@@ -1245,7 +1245,6 @@ static struct task_struct *copy_process(
}
if (likely(p->pid)) {
- list_add_tail(&p->sibling, &p->real_parent->children);
tracehook_finish_clone(p, clone_flags, trace);
if (thread_group_leader(p)) {
@@ -1257,6 +1256,7 @@ static struct task_struct *copy_process(
p->signal->tty = tty_kref_get(current->signal->tty);
attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
attach_pid(p, PIDTYPE_SID, task_session(current));
+ list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
__get_cpu_var(process_counts)++;
}
--- WAIT/kernel/exit.c~3_CHILDREN_NO_THREADS 2009-06-24 17:36:28.000000000 +0200
+++ WAIT/kernel/exit.c 2009-06-29 04:47:21.000000000 +0200
@@ -67,11 +67,11 @@ static void __unhash_process(struct task
detach_pid(p, PIDTYPE_PGID);
detach_pid(p, PIDTYPE_SID);
+ list_del_init(&p->sibling);
list_del_rcu(&p->tasks);
__get_cpu_var(process_counts)--;
}
list_del_rcu(&p->thread_group);
- list_del_init(&p->sibling);
}
/*
@@ -742,10 +742,10 @@ static void reparent_thread(struct task_
if (p->pdeath_signal)
group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
- list_move_tail(&p->sibling, &p->real_parent->children);
-
if (task_detached(p))
return;
+
+ list_move_tail(&p->sibling, &p->real_parent->children);
/*
* If this is a threaded reparent there is no need to
* notify anyone anything has happened.
@@ -771,7 +771,7 @@ static void reparent_thread(struct task_
static void forget_original_parent(struct task_struct *father)
{
- struct task_struct *p, *n, *reaper;
+ struct task_struct *g, *p, *n, *reaper;
LIST_HEAD(dead_children);
exit_ptrace(father);
@@ -779,13 +779,16 @@ static void forget_original_parent(struc
write_lock_irq(&tasklist_lock);
reaper = find_new_reaper(father);
- list_for_each_entry_safe(p, n, &father->children, sibling) {
- p->real_parent = reaper;
- if (p->parent == father) {
- BUG_ON(task_ptrace(p));
- p->parent = p->real_parent;
- }
- reparent_thread(father, p, &dead_children);
+ list_for_each_entry_safe(g, n, &father->children, sibling) {
+ p = g;
+ do {
+ p->real_parent = reaper;
+ if (p->parent == father) {
+ BUG_ON(task_ptrace(p));
+ p->parent = p->real_parent;
+ }
+ reparent_thread(father, p, &dead_children);
+ } while_each_thread(g, p);
}
write_unlock_irq(&tasklist_lock);
@@ -1533,14 +1536,9 @@ static int do_wait_thread(struct wait_op
struct task_struct *p;
list_for_each_entry(p, &tsk->children, sibling) {
- /*
- * Do not consider detached threads.
- */
- if (!task_detached(p)) {
- int ret = wait_consider_task(wo, tsk, 0, p);
- if (ret)
- return ret;
- }
+ int ret = wait_consider_task(wo, tsk, 0, p);
+ if (ret)
+ return ret;
}
return 0;
(change subject)
On 06/29, Oleg Nesterov wrote:
>
> On 06/25, Ratan Nalumasu wrote:
> >
> > I understood it in the _past_, but somehow forgotten the details--I have a
> > vague recollection that there was some special handling for the group leader
> > stuff.
> > However, to reconfirm, we believe that the
> > following condition in eligible_child() is good, right:
> > ===
> > if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private !=
> > p->parent)
> > return 0;
> > ===
> >
> > I will run it on my test machines and see if everything looks good.
>
> OK, thanks.
>
> The only problem it uses ->parent, this conflicts with other out-of-tree
> ptrace changes...
>
> Roland, do you think we should do this change now or later?
>
> If now, then we can also do another nice optimization, and it is for free.
> See the untested patch below. It conflicts with those patches too, but
> in both cases fixups are trivial.
See the updated patch below, slightly tested.
Changes: s/reparent_thread/reparent_leader/ and move it out of loop.
We still have to check task_detached() in reparent_leader(), we can
find EXIT_DEAD leader.
Do you see any problems?
Oleg.
------------------------------------------------------------------------
[PATCH] do not place sub-threads on task_struct->children list
Currently we add sub-threads to ->real_parent->children list. This
buys nothing but slows down do_wait().
With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate
over sub-threads by hand.
>From now do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).
This change can confuse the optimistic search inmm_update_next_owner(),
but this is fixable and minor.
Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.
---
kernel/fork.c | 2 +-
kernel/exit.c | 41 ++++++++++++++++++++---------------------
2 files changed, 21 insertions(+), 22 deletions(-)
--- WAIT/kernel/fork.c~3_CHILDREN_NO_THREADS 2009-06-30 00:57:16.000000000 +0200
+++ WAIT/kernel/fork.c 2009-06-30 00:57:50.000000000 +0200
@@ -1245,7 +1245,6 @@ static struct task_struct *copy_process(
}
if (likely(p->pid)) {
- list_add_tail(&p->sibling, &p->real_parent->children);
tracehook_finish_clone(p, clone_flags, trace);
if (thread_group_leader(p)) {
@@ -1257,6 +1256,7 @@ static struct task_struct *copy_process(
p->signal->tty = tty_kref_get(current->signal->tty);
attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
attach_pid(p, PIDTYPE_SID, task_session(current));
+ list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
__get_cpu_var(process_counts)++;
}
--- WAIT/kernel/exit.c~3_CHILDREN_NO_THREADS 2009-06-30 00:57:16.000000000 +0200
+++ WAIT/kernel/exit.c 2009-06-30 00:57:50.000000000 +0200
@@ -67,11 +67,11 @@ static void __unhash_process(struct task
detach_pid(p, PIDTYPE_PGID);
detach_pid(p, PIDTYPE_SID);
+ list_del_init(&p->sibling);
list_del_rcu(&p->tasks);
__get_cpu_var(process_counts)--;
}
list_del_rcu(&p->thread_group);
- list_del_init(&p->sibling);
}
/*
@@ -736,12 +736,9 @@ static struct task_struct *find_new_reap
/*
* Any that need to be release_task'd are put on the @dead list.
*/
-static void reparent_thread(struct task_struct *father, struct task_struct *p,
+static void reparent_leader(struct task_struct *father, struct task_struct *p,
struct list_head *dead)
{
- if (p->pdeath_signal)
- group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
-
list_move_tail(&p->sibling, &p->real_parent->children);
if (task_detached(p))
@@ -771,7 +768,7 @@ static void reparent_thread(struct task_
static void forget_original_parent(struct task_struct *father)
{
- struct task_struct *p, *n, *reaper;
+ struct task_struct *g, *p, *n, *reaper;
LIST_HEAD(dead_children);
exit_ptrace(father);
@@ -779,13 +776,20 @@ static void forget_original_parent(struc
write_lock_irq(&tasklist_lock);
reaper = find_new_reaper(father);
- list_for_each_entry_safe(p, n, &father->children, sibling) {
- p->real_parent = reaper;
- if (p->parent == father) {
- BUG_ON(task_ptrace(p));
- p->parent = p->real_parent;
- }
- reparent_thread(father, p, &dead_children);
+ list_for_each_entry_safe(g, n, &father->children, sibling) {
+ p = g;
+ do {
+ p->real_parent = reaper;
+ if (p->parent == father) {
+ BUG_ON(task_ptrace(p));
+ p->parent = p->real_parent;
+ }
+ if (p->pdeath_signal)
+ group_send_sig_info(p->pdeath_signal,
+ SEND_SIG_NOINFO, p);
+ } while_each_thread(g, p);
+
+ reparent_leader(father, g, &dead_children);
}
write_unlock_irq(&tasklist_lock);
@@ -1533,14 +1537,9 @@ static int do_wait_thread(struct wait_op
struct task_struct *p;
list_for_each_entry(p, &tsk->children, sibling) {
- /*
- * Do not consider detached threads.
- */
- if (!task_detached(p)) {
- int ret = wait_consider_task(wo, tsk, 0, p);
- if (ret)
- return ret;
- }
+ int ret = wait_consider_task(wo, tsk, 0, p);
+ if (ret)
+ return ret;
}
return 0;
> > if ((wo->wo_flags & __WNOTHREAD) && wo->child_wait.private !=
> > p->parent)
> > return 0;
> > ===
> >
> > I will run it on my test machines and see if everything looks good.
>
> OK, thanks.
>
> The only problem it uses ->parent, this conflicts with other out-of-tree
> ptrace changes...
>
> Roland, do you think we should do this change now or later?
I think it makes most sense to put that in right after the initial
wait_child_callback patch (if not rolled into it). In fact, the original
approach was to do just this "simplest" __WNOTHREAD-checking callback
first, and add the eligible_child() hacking second (not that I think that
order matters).
Thanks,
Roland
> Currently we add sub-threads to ->real_parent->children list. This
> buys nothing but slows down do_wait().
>
> With this patch ->children contains only main threads (group leaders).
> The only complication is that forget_original_parent() should iterate
> over sub-threads by hand.
This seems right to me, though I'll admit I haven't really walked through
all the exit/reparent paths afresh with this in mind.
Note that this naturally suggests moving ->sibling to signal_struct. (Of
course that can come later.) The abuse of sibling in reparent_leader adds
a wrinkle to that, but off hand it looks like it could do the same with it
living in signal_struct with a bit of contortion.
Oh, and what about the de_thread() leader-replacement case?
Thanks,
Roland
On 06/30, Roland McGrath wrote:
>
> > Currently we add sub-threads to ->real_parent->children list. This
> > buys nothing but slows down do_wait().
> >
> > With this patch ->children contains only main threads (group leaders).
> > The only complication is that forget_original_parent() should iterate
> > over sub-threads by hand.
>
> This seems right to me, though I'll admit I haven't really walked through
> all the exit/reparent paths afresh with this in mind.
>
> Note that this naturally suggests moving ->sibling to signal_struct. (Of
> course that can come later.) The abuse of sibling in reparent_leader adds
> a wrinkle to that, but off hand it looks like it could do the same with it
> living in signal_struct with a bit of contortion.
Yes, this change was suggested a long ago.
But in that case forget_original_parent() still has to iterate over all childs
to send ->pdeath_signal and change ->real_parent when we reparent to sub-thread.
> Oh, and what about the de_thread() leader-replacement case?
Ah, good point!
de_thread() should do list_replace_init().
Oleg.