2002-09-22 11:51:47

by Ulrich Drepper

Subject: first NPT vs. NGPT vs. LinuxThreads benchmark results

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is a first mail with additional information about the NPTL
library, this time about performance. The announcement's mention of
100,000 threads seems to have generated some interest. Most people
concentrated on the scalability aspect, which is important.

Now we are able to release a few initial benchmark results. The
benchmarking was done using a simple benchmark program we developed
during the NPTL development. This does not mean it is tailored to
show NPTL in especially favorable light. To the contrary: we used it
to analyze potential weak points of the 1-on-1 approach.

What is measured is simply the time to create and destroy threads,
under various conditions. Only a certain, variable number of
threads exists at any one time. If the maximum number of parallel threads
is reached the program waits for a thread to terminate before creating
a new one. This keeps resource requirements at a manageable level.
New threads are created by possibly more than one thread; the exact
number is the second variable in the test series.

The tests performed were:

create threads in a row:

  for 1 to 20 toplevel threads creating new threads
    create for each toplevel thread up to 1 to 10 children


(The number of times we repeated the thread creation operation is
100,000 - this was only done to get a measurable test time and should
not be confused with earlier 'start up 100,000 parallel threads at once'
tests.)


The result is a table with 200 times. Each time is indexed with the
number of toplevel threads and the maximum number of threads each
toplevel thread can create before having to wait for one to finish.
The created threads do no work at all, they just finish.
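
For readers who want to reproduce the setup, the following is a minimal
sketch of such a benchmark using plain POSIX threads. It is not the
benchmark program described in this mail; the names run_bench,
bench_args, MAX_TOPLEVEL and MAX_CHILDREN are hypothetical, and waiting
for "a thread to terminate" is simplified to joining the oldest child
still outstanding.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define MAX_TOPLEVEL 20   /* assumed upper bound, taken from the test series */
#define MAX_CHILDREN 10   /* assumed upper bound, taken from the test series */

struct bench_args {
    int  max_parallel;    /* children allowed to exist at once */
    long creations;       /* threads this toplevel thread must create */
    long created;         /* result: threads actually created */
};

/* A child does no work at all; it just finishes. */
static void *child(void *arg) { (void)arg; return NULL; }

/* One toplevel thread: never keep more than max_parallel children alive.
 * Simplification: when the limit is reached, join the oldest child. */
static void *toplevel(void *p)
{
    struct bench_args *a = p;
    pthread_t ring[MAX_CHILDREN];
    long created = 0, joined = 0;

    while (created < a->creations) {
        if (created - joined == a->max_parallel) {
            pthread_join(ring[joined % a->max_parallel], NULL);
            joined++;
        }
        if (pthread_create(&ring[created % a->max_parallel],
                           NULL, child, NULL) != 0)
            break;
        created++;
    }
    while (joined < created) {            /* drain the remaining children */
        pthread_join(ring[joined % a->max_parallel], NULL);
        joined++;
    }
    a->created = created;
    return NULL;
}

/* Run one cell of the result table. Returns the elapsed wall-clock time
 * in microseconds, or a negative value if the arguments are out of range. */
double run_bench(int ntop, int max_parallel, long per_top, long *created_out)
{
    pthread_t top[MAX_TOPLEVEL];
    struct bench_args args[MAX_TOPLEVEL];
    struct timespec t0, t1;
    long total = 0;
    int started = 0;

    if (ntop < 1 || ntop > MAX_TOPLEVEL ||
        max_parallel < 1 || max_parallel > MAX_CHILDREN)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ntop; i++) {
        args[i] = (struct bench_args){ max_parallel, per_top, 0 };
        if (pthread_create(&top[i], NULL, toplevel, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(top[i], NULL);
        total += args[i].created;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (created_out)
        *created_out = total;
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Calling run_bench(n, m, count, &created) for n = 1..20 and m = 1..10
would fill a 20 x 10 table like the one described above. Build with
-pthread; how the real program splits the 100,000 creations among the
toplevel threads is not stated here, so per_top is an assumption.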


The problem the benchmark program has to solve is ideal for an M-on-N
implementation. It consists simply of creating a new thread, running
it for a tiny bit of time, and terminating it. In an M-on-N
implementation this only means modifying a few user-level data
structures. For the startup a data structure describing the thread
must be created and the context (registers etc.) must be set up (NGPT
uses makecontext() which is implemented completely at user level for
x86 on Linux). At termination time the data structure is disposed of
and that is it. No system call is required.
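
To illustrate the user-level path, here is a minimal sketch of creating,
running and disposing of one user-level thread with the ucontext
functions. It is not NGPT code; run_user_thread, user_threads_run and
STACK_SIZE are hypothetical names and values. Note also that on current
glibc, getcontext() and swapcontext() save and restore the signal mask,
which as far as I know involves a system call, so this sketch is not
entirely free of kernel entries.

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)    /* assumed stack size for the sketch */

static ucontext_t scheduler_ctx;  /* where to continue when a thread ends */
static int runs;                  /* number of user-level threads that ran */

/* The "thread" does a tiny bit of work and returns. */
static void thread_body(void) { runs++; }

/* Create, run and dispose of one user-level thread.
 * Returns 0 on success, -1 on failure. */
int run_user_thread(void)
{
    ucontext_t ctx;
    void *stack = malloc(STACK_SIZE);      /* data structure for the thread */

    if (stack == NULL)
        return -1;
    if (getcontext(&ctx) == -1) {
        free(stack);
        return -1;
    }
    ctx.uc_stack.ss_sp = stack;
    ctx.uc_stack.ss_size = STACK_SIZE;
    ctx.uc_link = &scheduler_ctx;          /* resumed when thread_body returns */
    makecontext(&ctx, thread_body, 0);     /* set up registers and stack */

    /* Switch to the new context; control comes back here via uc_link. */
    if (swapcontext(&scheduler_ctx, &ctx) == -1) {
        free(stack);
        return -1;
    }
    free(stack);                           /* dispose of the data structure */
    return 0;
}

/* Accessor so callers can check how many threads have run. */
int user_threads_run(void) { return runs; }
```

Calling run_user_thread() in a loop corresponds to the create/run/
terminate cycle described above, with one caller acting as the only
scheduler.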

For a 1-on-1 implementation the situation is a bit more difficult.
The kernel thread underlying the user-level thread has to be created
by the kernel. Similarly, to terminate a kernel thread the kernel has
to be called. In total we have at least two exits from the kernel,
two entries and two context switches.


The whole series was run for the old LinuxThreads implementation, NGPT
2.0.2, and the NPTL 0.1 code. All tests are performed using the
2.5.37 kernel. The huge number of threads created proved to be enough
to stabilize the measurements. Running the same benchmark twice
showed little variation (< 1%). Nevertheless these measurements should
be taken with the usual grain of salt even though we stand behind the
correctness of the result. All results are reported in µsec
(microseconds).

We summarize the result of the benchmark runs in two graphs. In both
cases we flatten one dimension of the measurement result matrix with a
minimum function.
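
As a sketch of what this flattening means, assume the 200 measurements
are stored in a 20 x 10 matrix indexed by (toplevel threads, children
per toplevel thread). The function names and the TOPLEVEL_MAX and
CHILDREN_MAX constants below are hypothetical.

```c
#define TOPLEVEL_MAX 20   /* rows: number of toplevel threads, 1..20 */
#define CHILDREN_MAX 10   /* columns: children per toplevel thread, 1..10 */

/* For each toplevel-thread count, keep the minimum over all child limits. */
void min_over_children(double t[TOPLEVEL_MAX][CHILDREN_MAX],
                       double out[TOPLEVEL_MAX])
{
    for (int i = 0; i < TOPLEVEL_MAX; i++) {
        out[i] = t[i][0];
        for (int j = 1; j < CHILDREN_MAX; j++)
            if (t[i][j] < out[i])
                out[i] = t[i][j];
    }
}

/* For each child limit, keep the minimum over all toplevel-thread counts. */
void min_over_toplevel(double t[TOPLEVEL_MAX][CHILDREN_MAX],
                       double out[CHILDREN_MAX])
{
    for (int j = 0; j < CHILDREN_MAX; j++) {
        out[j] = t[0][j];
        for (int i = 1; i < TOPLEVEL_MAX; i++)
            if (t[i][j] < out[j])
                out[j] = t[i][j];
    }
}
```

Each function reduces the matrix to a single curve that can be plotted
per thread library.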

http://people.redhat.com/drepper/perf-s-100000-pro.pdf

This graph shows the result for the different numbers of toplevel
threads creating the actual threads we count. The value used is the
minimum time over all the runs with different numbers of
threads which can run in parallel.

What we can see is that NGPT is indeed a significant improvement over
LinuxThreads; NGPT is twice as fast. The thread creation process of
LinuxThreads was really complicated and slow. What might be surprising
is that the difference to NPTL is so large (a factor of four).

The second summary looks similar:

http://people.redhat.com/drepper/perf-s-100000-par.pdf

This time we display the minimum time needed based on the number of
toplevel threads. The optimal number of threads which are used by each
toplevel thread determines the time.

In this graph we see the scalability effects. If too many threads in
parallel try to create even more threads, all implementations are
impacted, some more, some less.



The results of this test series are:

- - LinuxThreads indeed had several problems

- - NGPT indeed runs much faster (twice the performance)

- - NPTL runs four times faster than NGPT in a benchmark which by all
means should favor an M-on-N implementation.


We will soon have more benchmarks showing the thread libraries in
other real-world situations, such as IO-intensive workloads.

Anyone who wishes to run their own tests should feel encouraged to do so.
Please share the results especially if they show problems in the NPTL
implementation.


Interested parties are encouraged to join the mailing list created for
the purpose of discussing NPTL:

https://listman.redhat.com/mailman/listinfo/phil-list


Ulrich Drepper
Ingo Molnar


- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9jbBD2ijCOnn/RHQRAkcoAKCS8Gw5ZdanA+N0FT3P7D+A2Q7EKwCeJ3Zo
PhGvJ7TfoXY+MhoE1zG4BrA=
=cssc
-----END PGP SIGNATURE-----


2002-09-23 01:59:44

by Bill Huey

Subject: Re: first NPT vs. NGPT vs. LinuxThreads benchmark results

On Sun, Sep 22, 2002 at 04:57:52AM -0700, Ulrich Drepper wrote:
> The results of this test series are:
>
> - - LinuxThreads indeed had several problems
>
> - - NGPT indeed run much faster (twice the performance)
>
> - - NPTL runs four times faster than NGPT in a benchmark which by all
> means should favor an M-on-N implementation.

Which could mean that they, NGPT, have slower thread allocation algorithms
for many reasons. Some M:N systems will red zone protect a page of the thread
stack, adding overhead to creation and deletion (FreeBSD's -current does
this), the memory allocation algorithms might not be able to take advantage
of short term stack recycling, and other things. It's not clear that these
benchmarks are meaningful without outlining the conditions that surround them.

Not to take the show away from you folks, but it's definitely something
that I immediately thought about once I saw the graphs.

> We will soon have more benchmarks showing the thread libraries in
> other real-world situations, such as IO-intensive workloads.

bill

2002-09-23 02:19:11

by Ulrich Drepper

Subject: Re: first NPT vs. NGPT vs. LinuxThreads benchmark results

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bill Huey (Hui) wrote:

> Which could mean that they, NGPT, have slower thread allocation algorithms
> for many reason. Some M:N systems will red zone protect a page of the thread
> stack adding overhead to creation and deletion

This is required by the standard and LinuxThreads and NPTL do this of
course. Don't know about NGPT, probably yes.
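
For reference, the red zone discussed here corresponds to the POSIX
guardsize thread attribute. The following is a minimal sketch of how to
query the default guard size and create a thread with an explicit one;
default_guard_size and create_with_guard are hypothetical helper names,
and the sketch makes no claim about what any of the three libraries do
internally.

```c
#include <pthread.h>
#include <stddef.h>

/* Store the default guard size of a fresh attribute object in *size.
 * Returns 0 on success or an error number. */
int default_guard_size(size_t *size)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_getguardsize(&attr, size);
    pthread_attr_destroy(&attr);
    return rc;
}

static void *noop(void *arg) { (void)arg; return NULL; }

/* Create and join one thread whose stack uses the given guard size.
 * A guard size of 0 asks for no guard area at all.
 * Returns 0 on success or an error number. */
int create_with_guard(size_t guard)
{
    pthread_attr_t attr;
    pthread_t t;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setguardsize(&attr, guard);
    if (rc == 0)
        rc = pthread_create(&t, &attr, noop, NULL);
    pthread_attr_destroy(&attr);
    if (rc == 0)
        rc = pthread_join(t, NULL);
    return rc;
}
```

Timing create_with_guard(0) against create_with_guard(default) in a
loop would be one way to estimate how much the guard page contributes
to thread creation cost on a given library.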

- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9jnuX2ijCOnn/RHQRAmp/AKCO18uINcoAK8ezNQrp1T5GCtIYMwCgoBLP
mhfjJZxNrec3ZdcM4TXuy/w=
=MH/m
-----END PGP SIGNATURE-----