2021-03-09 20:41:03

by Jim Newsome

[permalink] [raw]
Subject: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

do_wait is an internal function used to implement waitpid, waitid,
wait4, etc. To handle the general case, it does an O(n) linear scan of
the thread group's children and tracees.

This patch adds a special-case when waiting on a pid to skip these scans
and instead do an O(1) lookup. This improves performance when waiting on
a pid from a thread group with many children and/or tracees.

Signed-off-by: James Newsome <[email protected]>
---
kernel/exit.c | 53 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..c2438d4ba262 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1439,9 +1439,34 @@ void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
TASK_INTERRUPTIBLE, p);
}

+// Optimization for waiting on PIDTYPE_PID. No need to iterate through child
+// and tracee lists to find the target task.
+static int do_wait_pid(struct wait_opts *wo)
+{
+ struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
+ int retval;
+
+ if (!target)
+ return 0;
+ if (current == target->real_parent ||
+ (!(wo->wo_flags & __WNOTHREAD) &&
+ same_thread_group(current, target->real_parent))) {
+ retval = wait_consider_task(wo, /* ptrace= */ 0, target);
+ if (retval)
+ return retval;
+ }
+ if (target->ptrace && (current == target->parent ||
+ (!(wo->wo_flags & __WNOTHREAD) &&
+ same_thread_group(current, target->parent)))) {
+ retval = wait_consider_task(wo, /* ptrace= */ 1, target);
+ if (retval)
+ return retval;
+ }
+ return 0;
+}
+
static long do_wait(struct wait_opts *wo)
{
- struct task_struct *tsk;
int retval;

trace_sched_process_wait(wo->wo_pid);
@@ -1463,19 +1488,27 @@ static long do_wait(struct wait_opts *wo)

set_current_state(TASK_INTERRUPTIBLE);
read_lock(&tasklist_lock);
- tsk = current;
- do {
- retval = do_wait_thread(wo, tsk);
- if (retval)
- goto end;

- retval = ptrace_do_wait(wo, tsk);
+ if (wo->wo_type == PIDTYPE_PID) {
+ retval = do_wait_pid(wo);
if (retval)
goto end;
+ } else {
+ struct task_struct *tsk = current;

- if (wo->wo_flags & __WNOTHREAD)
- break;
- } while_each_thread(current, tsk);
+ do {
+ retval = do_wait_thread(wo, tsk);
+ if (retval)
+ goto end;
+
+ retval = ptrace_do_wait(wo, tsk);
+ if (retval)
+ goto end;
+
+ if (wo->wo_flags & __WNOTHREAD)
+ break;
+ } while_each_thread(current, tsk);
+ }
read_unlock(&tasklist_lock);

notask:
--
2.30.1


2021-03-10 17:30:35

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 03/09, Jim Newsome wrote:
>
> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.
>
> Signed-off-by: James Newsome <[email protected]>

Reviewed-by: Oleg Nesterov <[email protected]>

2021-03-10 22:43:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Jim Newsome <[email protected]> writes:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.
>
> Signed-off-by: James Newsome <[email protected]>
> ---
> kernel/exit.c | 53 +++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e35e69a..c2438d4ba262 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1439,9 +1439,34 @@ void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
> TASK_INTERRUPTIBLE, p);
> }
>
> +// Optimization for waiting on PIDTYPE_PID. No need to iterate through child
> +// and tracee lists to find the target task.

Minor nit: C++ style comments look very out of place in this file
which uses old school C /* */ comment delimiters for
all of it's block comments.

> +static int do_wait_pid(struct wait_opts *wo)
> +{
> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is subtle change in behavior.

Today on the task->children list we only place thread group leaders.

Which means that your do_wait_pid wait for thread of someone else's
process and that is a change in behavior.

So the code either needs a thread_group_leader filter on target before
the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.

I would like to make thread_group_leaders go away so I would favor two
pid_task calls. But either will work right now.

Eric


> + int retval;
> +
> + if (!target)
> + return 0;
> + if (current == target->real_parent ||
> + (!(wo->wo_flags & __WNOTHREAD) &&
> + same_thread_group(current, target->real_parent))) {
> + retval = wait_consider_task(wo, /* ptrace= */ 0, target);
> + if (retval)
> + return retval;
> + }
> + if (target->ptrace && (current == target->parent ||
> + (!(wo->wo_flags & __WNOTHREAD) &&
> + same_thread_group(current, target->parent)))) {
> + retval = wait_consider_task(wo, /* ptrace= */ 1, target);
> + if (retval)
> + return retval;
> + }
> + return 0;
> +}
> +
> static long do_wait(struct wait_opts *wo)
> {
> - struct task_struct *tsk;
> int retval;
>
> trace_sched_process_wait(wo->wo_pid);
> @@ -1463,19 +1488,27 @@ static long do_wait(struct wait_opts *wo)
>
> set_current_state(TASK_INTERRUPTIBLE);
> read_lock(&tasklist_lock);
> - tsk = current;
> - do {
> - retval = do_wait_thread(wo, tsk);
> - if (retval)
> - goto end;
>
> - retval = ptrace_do_wait(wo, tsk);
> + if (wo->wo_type == PIDTYPE_PID) {
> + retval = do_wait_pid(wo);
> if (retval)
> goto end;
> + } else {
> + struct task_struct *tsk = current;
>
> - if (wo->wo_flags & __WNOTHREAD)
> - break;
> - } while_each_thread(current, tsk);
> + do {
> + retval = do_wait_thread(wo, tsk);
> + if (retval)
> + goto end;
> +
> + retval = ptrace_do_wait(wo, tsk);
> + if (retval)
> + goto end;
> +
> + if (wo->wo_flags & __WNOTHREAD)
> + break;
> + } while_each_thread(current, tsk);
> + }
> read_unlock(&tasklist_lock);
>
> notask:

2021-03-11 00:18:24

by Jim Newsome

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 3/10/21 16:40, Eric W. Biederman wrote:
>> +// Optimization for waiting on PIDTYPE_PID. No need to iterate
through child
>> +// and tracee lists to find the target task.
>
> Minor nit: C++ style comments look very out of place in this file
> which uses old school C /* */ comment delimiters for
> all of it's block comments.

Will do

>> +static int do_wait_pid(struct wait_opts *wo)
>> +{
>> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is subtle change in behavior.
>
> Today on the task->children list we only place thread group leaders.

Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?

This is already checked later in `eligible_child`, called from
`wait_consider_task`, so I *think* the current form should already do
the right thing. Now I'm confused though how the general path (through
`do_wait_thread`) works if clone children aren't on the task->children
list...?

(In any case it seems this will need another version with at least an
explanatory comment here)

Thanks!
-Jim

2021-03-11 15:10:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 03/10, Eric W. Biederman wrote:
>
> Jim Newsome <[email protected]> writes:
>
> > +static int do_wait_pid(struct wait_opts *wo)
> > +{
> > + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is subtle change in behavior.
>
> Today on the task->children list we only place thread group leaders.

Aaah, yes, thanks Eric!

> So the code either needs a thread_group_leader filter on target before
> the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
> and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.

Agreed,

> I would like to make thread_group_leaders go away

Hmm, why?

Oleg.

2021-03-11 15:18:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 03/10, Jim Newsome wrote:
>
> On 3/10/21 16:40, Eric W. Biederman wrote:
>
> >> +static int do_wait_pid(struct wait_opts *wo)
> >> +{
> >> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > This is subtle change in behavior.
> >
> > Today on the task->children list we only place thread group leaders.
>
> Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?

Don't confuse clone child with child's sub-thread.

Oleg.

2021-03-11 16:30:03

by Jim Newsome

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)


On 3/11/21 09:15, Oleg Nesterov wrote:
> On 03/10, Jim Newsome wrote:
>> On 3/10/21 16:40, Eric W. Biederman wrote:
>>
>>>> +static int do_wait_pid(struct wait_opts *wo)
>>>> +{
>>>> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> This is subtle change in behavior.
>>>
>>> Today on the task->children list we only place thread group leaders.
>> Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?
> Don't confuse clone child with child's sub-thread.

Oops! Thanks; got it. v4 coming shortly


2021-03-11 16:32:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Jim Newsome <[email protected]> writes:

> On 3/10/21 16:40, Eric W. Biederman wrote:
>>> +// Optimization for waiting on PIDTYPE_PID. No need to iterate
> through child
>>> +// and tracee lists to find the target task.
>>
>> Minor nit: C++ style comments look very out of place in this file
>> which uses old school C /* */ comment delimiters for
>> all of it's block comments.
>
> Will do
>
>>> +static int do_wait_pid(struct wait_opts *wo)
>>> +{
>>> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> This is subtle change in behavior.
>>
>> Today on the task->children list we only place thread group leaders.
>
> Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?
>
> This is already checked later in `eligible_child`, called from
> `wait_consider_task`, so I *think* the current form should already do
> the right thing. Now I'm confused though how the general path (through
> `do_wait_thread`) works if clone children aren't on the task->children
> list...?
>
> (In any case it seems this will need another version with at least an
> explanatory comment here)

What I am worried about are not clone children. AKA ordinary children
that have a different exit signal but CLONE_THREAD children that are
never put on the children list so are naturally excluded from today's
do_wait (except in the case of ptrace). These are also known as threads.

Maybe I am missing it but I don't see anything in wait_consider_task
or in the way that you are calling it that would exclude CLONE_THREAD
children for the non-ptrace case.

Eric

2021-03-11 16:41:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Oleg Nesterov <[email protected]> writes:

> On 03/10, Eric W. Biederman wrote:
>>
>> Jim Newsome <[email protected]> writes:
>>
>> > +static int do_wait_pid(struct wait_opts *wo)
>> > +{
>> > + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> This is subtle change in behavior.
>>
>> Today on the task->children list we only place thread group leaders.
>
> Aaah, yes, thanks Eric!
>
>> So the code either needs a thread_group_leader filter on target before
>> the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
>> and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.
>
> Agreed,
>
>> I would like to make thread_group_leaders go away
>
> Hmm, why?

Mostly because we have class of very nasty bugs to fix because code
thinks one thread is special.

There has been and I think still is code that mishandles zombie thread
group leaders.

Particularly nasty are zombie thread group leaders after userspace has
called setresuid in a way that changes signal permissions.

Eric