2021-03-12 17:41:27

by Jim Newsome

Subject: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

do_wait is an internal function used to implement waitpid, waitid,
wait4, etc. To handle the general case, it does an O(n) linear scan of
the thread group's children and tracees.

This patch adds a special-case when waiting on a pid to skip these scans
and instead do an O(1) lookup. This improves performance when waiting on
a pid from a thread group with many children and/or tracees.

Signed-off-by: James Newsome <[email protected]>
---

v4: https://lkml.org/lkml/2021/3/11/1333

* Added back missing target->ptrace check.
* Added explicit thread_group_leader check instead of double-lookup.

kernel/exit.c | 69 +++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..65c862c604a7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1439,9 +1439,50 @@ void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
 			   TASK_INTERRUPTIBLE, p);
 }
 
+static bool is_effectively_child(struct wait_opts *wo, bool ptrace,
+				 struct task_struct *target)
+{
+	struct task_struct *parent =
+		!ptrace ? target->real_parent : target->parent;
+
+	return current == parent || (!(wo->wo_flags & __WNOTHREAD) &&
+				     same_thread_group(current, parent));
+}
+
+/*
+ * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
+ * and tracee lists to find the target task.
+ */
+static int do_wait_pid(struct wait_opts *wo)
+{
+	bool ptrace;
+	struct task_struct *target;
+	int retval;
+
+	target = pid_task(wo->wo_pid, PIDTYPE_PID);
+	if (!target)
+		return 0;
+
+	ptrace = false;
+	if (thread_group_leader(target) &&
+	    is_effectively_child(wo, ptrace, target)) {
+		retval = wait_consider_task(wo, ptrace, target);
+		if (retval)
+			return retval;
+	}
+
+	ptrace = true;
+	if (target->ptrace && is_effectively_child(wo, ptrace, target)) {
+		retval = wait_consider_task(wo, ptrace, target);
+		if (retval)
+			return retval;
+	}
+
+	return 0;
+}
+
 static long do_wait(struct wait_opts *wo)
 {
-	struct task_struct *tsk;
 	int retval;
 
 	trace_sched_process_wait(wo->wo_pid);
@@ -1463,19 +1504,27 @@ static long do_wait(struct wait_opts *wo)
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	read_lock(&tasklist_lock);
-	tsk = current;
-	do {
-		retval = do_wait_thread(wo, tsk);
-		if (retval)
-			goto end;
 
-		retval = ptrace_do_wait(wo, tsk);
+	if (wo->wo_type == PIDTYPE_PID) {
+		retval = do_wait_pid(wo);
 		if (retval)
 			goto end;
+	} else {
+		struct task_struct *tsk = current;
+
+		do {
+			retval = do_wait_thread(wo, tsk);
+			if (retval)
+				goto end;
 
-		if (wo->wo_flags & __WNOTHREAD)
-			break;
-	} while_each_thread(current, tsk);
+			retval = ptrace_do_wait(wo, tsk);
+			if (retval)
+				goto end;
+
+			if (wo->wo_flags & __WNOTHREAD)
+				break;
+		} while_each_thread(current, tsk);
+	}
 	read_unlock(&tasklist_lock);

notask:
--
2.30.1


2021-03-12 18:25:07

by Andrew Morton

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On Fri, 12 Mar 2021 11:38:55 -0600 Jim Newsome <[email protected]> wrote:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.

Could we please see some performance testing results to permit us to
evaluate the value of this change?

Thanks.

2021-03-12 18:43:19

by Jim Newsome

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 3/12/21 12:22, Andrew Morton wrote:
>
> Could we please see some performance testing results to permit us to
> evaluate the value of this change?

Sure. I've been doing some ad-hoc measurements with the code below. It
forks 8k children and then waits for them in reverse order (forcing a
full list traversal each time). I'll need to reboot a couple times to
get apples-to-apples measurements on bare metal, though. I'll plan to
run with NUMCHILDREN = 0 -> 8000, by 100.

Does this look like it'd be sufficient, or is there more you'd like to
see? The current form doesn't use ptrace, but I expect the results to be
similar (maybe more pronounced when tracing threaded children, since
every thread is in the tracee list instead of just the group leaders).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUMCHILDREN 8000

void fork_and_wait() {
    pid_t children[NUMCHILDREN];
    for (int i = 0; i < NUMCHILDREN; ++i) {
        pid_t forkrv = fork();
        if (forkrv < 0) {
            perror("fork");
            exit(1);
        }
        if (forkrv == 0) {
            // child
            exit(0);
        }
        // parent
        children[i] = forkrv;
    }
    // Wait in reverse order, forcing a full child-list scan each time.
    for (int i = 0; i < NUMCHILDREN; ++i) {
        int wstatus;
        if (waitpid(children[NUMCHILDREN - i - 1], &wstatus, 0) < 0) {
            perror("waitpid");
            exit(1);
        }
    }
}

2021-03-12 18:51:30

by Andrew Morton

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On Fri, 12 Mar 2021 12:39:12 -0600 Jim Newsome <[email protected]> wrote:

> On 3/12/21 12:22, Andrew Morton wrote:
> >
> > Could we please see some performance testing results to permit us to
> > evaluate the value of this change?
>
> Sure. I've been doing some ad-hoc measurements with the code below. It
> forks 8k children and then waits for them in reverse order (forcing a
> full list traversal each time). I'll need to reboot a couple times to
> get apples-to-apples measurements on bare metal, though. I'll plan to
> run with NUMCHILDREN = 0 -> 8000, by 100.
>
> Does this look like it'd be sufficient, or is there more you'd like to
> see? The current form doesn't use ptrace, but I expect the results to be
> similar; (maybe more pronounced when tracing threaded children, since
> every thread is in the tracee list instead of just the group leaders).

A very specific microbenchmark which tickles a particular corner case
is useful, as long as there's some realistic chance that someone's
workload is adequately modeled by that test.

Also very useful would be some words which describe what led you to do
this work (presumably some real-world workload was being impacted) and a
description of how the patch improves that workload (or is expected to
improve it).

IOW, please spend a bit of time selling the patch! What is the case
for including it in Linux? What benefit does it provide our users?

2021-03-12 20:03:29

by Jim Newsome

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

(Re-sent without html part, which the list rejected)

On 3/12/21 12:47, Andrew Morton wrote:

> IOW, please spend a bit of time selling the patch! What is the case
> for including it in Linux? What benefit does it provide our users?

Ah yes - I'd included some context when I first reached out to Oleg, but
that never made it to the list :).

I'm helping develop a new ptrace-based version of Shadow [1] - a tool
for simulating a (potentially large) network. Shadow runs the network
user-space applications in an emulated environment, and routes network
traffic through a model of the network accounting for latency,
bandwidth, etc. The Tor Project plans to make increasing use of Shadow,
both for focused evaluation of proposed software and parameter changes,
attacks, and defenses, and as a regular automated performance evaluation
prior to deploying new versions. Shadow is already actively used in the
research community for applications including tor and bitcoin.

We're interested in running simulations including at least tens of
thousands of processes, with a stretch goal of being able to handle 1M
processes. Since each process is being ptraced, calling an O(n) waitpid
has a huge performance penalty at this scale, and results in simulation
run time growing ~quadratically with the size of the simulation.
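
To put rough numbers on that scaling (a back-of-the-envelope sketch, not measurements from Shadow): if each O(n) waitpid scans the tracees that remain, reaping N processes one at a time costs about 1 + 2 + ... + N = N(N+1)/2 list visits in total, versus N visits with an O(1) lookup:

```python
def total_scan_work(n: int) -> int:
    """List nodes visited reaping n tracees one at a time, assuming each
    O(n) waitpid scans all tracees that remain at that point."""
    return sum(k for k in range(1, n + 1))  # closed form: n * (n + 1) // 2

# O(n) wait: quadratic total work; O(1) wait: linear total work (just n).
assert total_scan_work(10_000) == 10_000 * 10_001 // 2  # 50,005,000 visits
```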

We do have a workaround where we use a "fork proxy" thread to actually
fork all the processes, and we stop and detach inactive processes. (The
number of "active" processes is roughly fixed to the number of worker
threads, which is generally the number of CPUs available.) This keeps
the number of children and tracees small and fixed, allowing us to scale
linearly. However, having to detach and reattach tracees adds a
significant linear overhead factor. This kernel patch would allow us to
get rid of the complexity and extra overhead of this workaround, and
would benefit other applications that haven't implemented such a workaround.

We have some details and analysis of this issue in GitHub [2]. I haven't
added results with the patch yet, but plan to do so.

This should also help other applications that have a large number of
children or tracees. Ptrace-based emulation tools are probably the most
likely to benefit. e.g. I suspect it'll help User Mode Linux [3], which
IIUC uses a single tracer thread to ptrace every process running on its
kernel. Likewise it could help DetTrace [4], which uses ptrace for
deterministic software builds.

[1]: https://shadow.github.io/
[2]: https://github.com/shadow/shadow/issues/1134
[3]: https://en.wikipedia.org/wiki/User-mode_Linux
[4]: https://github.com/dettrace/dettrace

2021-03-12 20:31:27

by Eric W. Biederman

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Jim Newsome <[email protected]> writes:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.

I am going to kibitz just a little bit more.

When I looked at this a second time it became apparent that using
pid_task twice should actually be faster as it removes a dependent load
caused by thread_group_leader, and replaces it by accessing two adjacent
pointers in the same cache line.

I know the algorithmic improvement is the main advantage, but removing
60ns or so for a dependent load can't hurt.

Plus I think using the two pid types really makes it clear that one
is always a process and the other is always potentially a thread.

/*
 * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
 * and tracee lists to find the target task.
 */
static int do_wait_pid(struct wait_opts *wo)
{
        bool ptrace;
        struct task_struct *target;
        int retval;

        ptrace = false;
        target = pid_task(wo->wo_pid, PIDTYPE_TGID);
        if (target && is_effectively_child(wo, ptrace, target)) {
                retval = wait_consider_task(wo, ptrace, target);
                if (retval)
                        return retval;
        }

        ptrace = true;
        target = pid_task(wo->wo_pid, PIDTYPE_PID);
        if (target && target->ptrace &&
            is_effectively_child(wo, ptrace, target)) {
                retval = wait_consider_task(wo, ptrace, target);
                if (retval)
                        return retval;
        }

        return 0;
}

Since the patch probably needs to be respun to include the improved
description, can we look at my micro performance improvement?

Eric

2021-03-12 21:07:11

by Jim Newsome

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 3/12/21 14:29, Eric W. Biederman wrote:
> When I looked at this a second time it became apparent that using
> pid_task twice should actually be faster as it removes a dependent load
> caused by thread_group_leader, and replaces it by accessing two adjacent
> pointers in the same cache line.
>
> I know the algorithmic improvement is the main advantage, but removing
> 60ns or so for a dependent load can't hurt.
>
> Plus I think using the two pid types really makes it clear that one
> is always a process and the other is always potentially a thread.
>
> /*
>  * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
>  * and tracee lists to find the target task.
>  */
> static int do_wait_pid(struct wait_opts *wo)
> {
>         bool ptrace;
>         struct task_struct *target;
>         int retval;
>
>         ptrace = false;
>         target = pid_task(wo->wo_pid, PIDTYPE_TGID);
>         if (target && is_effectively_child(wo, ptrace, target)) {
>                 retval = wait_consider_task(wo, ptrace, target);
>                 if (retval)
>                         return retval;
>         }
>
>         ptrace = true;
>         target = pid_task(wo->wo_pid, PIDTYPE_PID);
>         if (target && target->ptrace &&
>             is_effectively_child(wo, ptrace, target)) {
>                 retval = wait_consider_task(wo, ptrace, target);
>                 if (retval)
>                         return retval;
>         }
>
>         return 0;
> }

I'm fine with either way.

Part of what made my earlier version with the double-lookup a bit
awkward was only doing the second lookup if the first lookup failed. I'm
happy to take your word though that making the second lookup conditional
is unnecessary or even detrimental :). It did cross my mind that it
might not be a very consistent branch for a branch-predictor, but I also
figured pid_task's synchronization might outweigh that.

2021-03-12 21:27:16

by Eric W. Biederman

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Jim Newsome <[email protected]> writes:

> On 3/12/21 14:29, Eric W. Biederman wrote:
>> When I looked at this a second time it became apparent that using
>> pid_task twice should actually be faster as it removes a dependent load
>> caused by thread_group_leader, and replaces it by accessing two adjacent
>> pointers in the same cache line.
>>
>> I know the algorithmic improvement is the main advantage, but removing
>> 60ns or so for a dependent load can't hurt.
>>
>> Plus I think using the two pid types really makes it clear that one
>> is always a process and the other is always potentially a thread.
>>
>> /*
>>  * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
>>  * and tracee lists to find the target task.
>>  */
>> static int do_wait_pid(struct wait_opts *wo)
>> {
>>         bool ptrace;
>>         struct task_struct *target;
>>         int retval;
>>
>>         ptrace = false;
>>         target = pid_task(wo->wo_pid, PIDTYPE_TGID);
>>         if (target && is_effectively_child(wo, ptrace, target)) {
>>                 retval = wait_consider_task(wo, ptrace, target);
>>                 if (retval)
>>                         return retval;
>>         }
>>
>>         ptrace = true;
>>         target = pid_task(wo->wo_pid, PIDTYPE_PID);
>>         if (target && target->ptrace &&
>>             is_effectively_child(wo, ptrace, target)) {
>>                 retval = wait_consider_task(wo, ptrace, target);
>>                 if (retval)
>>                         return retval;
>>         }
>>
>>         return 0;
>> }
>
> I'm fine with either way.
>
> Part of what made my earlier version with the double-lookup a bit
> awkward was only doing the second lookup if the first lookup failed. I'm
> happy to take your word though that making the second lookup conditional
> is unnecessary or even detrimental :).

Oh absolutely. The two lookups are independent.

> It did cross my mind that it
> might not be a very consistent branch for a branch-predictor, but I also
> figured pid_task's synchronization might outweigh that.

pid_task has a lot of verbiage but it is only reading a pointer,
verifying the pointer is not NULL and calling container_of on the result
of the pointer read.

Eric

2021-03-13 02:46:36

by Jim Newsome

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

Here are the micro-benchmark results. I ended up reworking it to use
google's benchmark tool [1]. For each N I timed how long it took to fork
a new child and then immediately wait on it, while already having N
other children. (Initially I tried to vfork, but that didn't play nicely
with the benchmark tool)

[1] https://github.com/google/benchmark

Without the patch:

./bench_waitpid
2021-03-12T20:03:57-06:00
Running ./bench_waitpid
Run on (4 X 2693.88 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
L3 Unified 16384 KiB (x4)
Load Average: 0.48, 0.16, 0.06
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_WaitPid/1 313354 ns 74754 ns 8575
BM_WaitPid/10 305954 ns 72424 ns 7977
BM_WaitPid/100 323126 ns 83368 ns 7861
BM_WaitPid/500 324479 ns 99071 ns 6541
BM_WaitPid/1000 377143 ns 178590 ns 3260
BM_WaitPid/5000 816597 ns 803919 ns 849
BM_WaitPid/8000 1282339 ns 1268179 ns 548

For example, this means that for a process that already has 1000
children, forking and then immediately calling waitpid on the 1001st
took an average of 179 microsecs of CPU time. The O(n) implementation
starts getting noticeably slower with around 100 children.

With the patch:

$ ./bench_waitpid
2021-03-12T20:19:52-06:00
Running ./bench_waitpid
Run on (4 X 2693.88 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
L3 Unified 16384 KiB (x4)
Load Average: 1.39, 0.49, 0.18
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_WaitPid/1 305501 ns 74028 ns 9261
BM_WaitPid/10 309644 ns 74916 ns 8838
BM_WaitPid/100 319457 ns 77193 ns 8717
BM_WaitPid/500 306929 ns 73863 ns 8327
BM_WaitPid/1000 310849 ns 74848 ns 8458
BM_WaitPid/5000 329324 ns 79386 ns 9123
BM_WaitPid/8000 317991 ns 77889 ns 7526

As expected, the cost is about the same as the baseline for a small
number of children, and stays about the same as the number of children
increases.

Source:

#include <sys/times.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

#include <vector>

#include "benchmark/benchmark.h"

static void BM_WaitPid(benchmark::State& state) {
    int num_children = state.range(0);

    // Create num_children
    std::vector<pid_t> children;
    children.reserve(num_children);
    for (int i = 0; i < num_children; ++i) {
        pid_t forkrv = fork();
        if (forkrv < 0) {
            perror("fork");
            exit(1);
        }
        if (forkrv == 0) {
            // child
            exit(0);
        }
        children.push_back(forkrv);
    }

    // The body of this loop is what gets timed.
    for (auto _ : state) {
        pid_t forkrv = fork();
        if (forkrv < 0) {
            perror("fork");
            exit(1);
        }
        if (forkrv == 0) {
            // child
            exit(0);
        }
        if (waitpid(forkrv, NULL, 0) < 0) {
            perror("waitpid");
            exit(1);
        }
    }

    // Tear down the other children
    for (pid_t child : children) {
        if (waitpid(child, NULL, 0) < 0) {
            perror("waitpid");
            exit(1);
        }
    }
}
// Register the function as a benchmark
BENCHMARK(BM_WaitPid)->Arg(1)->Arg(10)->Arg(100)->Arg(500)->Arg(1000)->Arg(5000)->Arg(8000);
// Run the benchmark
BENCHMARK_MAIN();

2021-03-13 17:31:41

by Oleg Nesterov

Subject: Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

On 03/12, Eric W. Biederman wrote:
>
> I am going to kibitz just a little bit more.
>
> When I looked at this a second time it became apparent that using
> pid_task twice should actually be faster as it removes a dependent load
> caused by thread_group_leader, and replaces it by accessing two adjacent
> pointers in the same cache line.

Heh, I think this code should be optimized for the reader in the first place ;)

But as I said, I won't argue; readability is very subjective, and I am
fine with your version too.

Oleg.