2008-08-13 00:45:36

by Pardo

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

[First send rejected by vger.kernel.org due to HTML and/or test
program attachment. Re-sent without them; please contact me for the
test program.]

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL's pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles -- which is under 100 threads per
second. Larger stacks reduce the severity of slowdown but also make
slowdown happen after allocating a few thousand threads. Costs vary
with platform, stack size, etc., but thread allocation rates drop
suddenly on all of a half-dozen platforms I tried.

The cause is that NPTL allocates stacks with code of the form (e.g., glibc
2.7 nptl/allocatestack.c):

sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
if (sto == MAP_FAILED)
    sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location. Thus, once low addresses run
out, every stack allocation does a failing mmap() followed by a
successful mmap(). The failing mmap() is slow because it does a
linear search of all low-space vma's.
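The two-step pattern can be sketched as a self-contained C fragment (the helper name here is mine, not glibc's, and MAP_32BIT is guarded since it exists only on x86-64):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0 /* flag is x86-64-only; elsewhere this degrades to a plain mmap */
#endif

/* Hedged sketch of the NPTL-style two-step stack allocation; the
 * helper name is mine, not glibc's. */
static void *alloc_stack_low_pref(size_t size) {
  /* First try the low 4GB... */
  void *sto = mmap(0, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
  if (sto == MAP_FAILED) {
    /* ...then anywhere. Once the low 4GB is full, every call pays
     * for the failing mmap() above -- the slow path described. */
    sto = mmap(0, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return sto;
}
```

Until the low 4GB is exhausted the first mmap() succeeds; after that, every call performs the failing linear vma search before the second mmap() succeeds.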

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits. Slow
allocation was discussed in 2003 but without resolution. See, e.g.,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html. With
increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks. I measured
thread-to-thread context switches on two AMD processors and five Intel
processors. Tests used the same code with 32b or 64b stack pointers;
tests covered varying numbers of threads switched and varying methods
of allocating stacks. Two systems gave indistinguishable performance
with 32b or 64b stacks, four gave 5%-10% better performance using 64b
stacks, and of the systems I tested, only the P4 microarchitecture
x86-64 system gave better performance for 32b stacks, in that case
vastly better. Most systems had thread-to-thread switch costs around
800-1200 cycles. The P4 microarchitecture system had 32b context
switch costs around 3,000 cycles and 64b context switches around 4,800
cycles.

It appears the kernel's 64-bit switch path handles all 32-bit cases.
So on machines with a fast 64-bit path, context switch speed would
presumably be improved yet further by eliminating the special 32-bit
path. It appears this would also collapse the task state's fs and
fsindex fields, and the gs and gsindex fields. These could further
reduce memory, cache, and branch predictor pressure.

Various things would address the slow pthread_create(). Choices include:
- Be more platform-aware about when to use MAP_32BIT.
- Abandon use of MAP_32BIT entirely, with worse performance on some machines.
- Change the mmap() algorithm to be faster on allocation failure
(avoid a linear search of vmas).

Options to improve context switch times include:

- Do nothing.
- Be more platform-aware about when to use different 32b and 64b paths.
- Get rid of the 32b path, which it appears would also make contexts smaller.

[Not] Attached is a program to measure context switch costs.


2008-08-13 10:45:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Pardo <[email protected]> wrote:

> As example, in one case creating new threads goes from about 35,000
> cycles up to about 25,000,000 cycles -- which is under 100 threads per
> second. [...]

> Various things would address the slow pthread_create(). Choices
> include:
> - Be more platform-aware about when to use MAP_32BIT.
> - Abandon use of MAP_32BIT entirely, with worse performance on some machines.
> - Change the mmap() algorithm to be faster on allocation failure
> (avoid a linear search of vmas).

Sigh, unfortunately MAP_32BIT use in 64-bit apps for stacks was
apparently created without foresight about what would happen in the MM
when thread stacks exhaust 4GB.

The problem is that MAP_32BIT is used both as a performance hack for
64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot
just start disregarding MAP_32BIT in the kernel - we'd break 32-bit
compat apps and/or compat 32-bit libraries.

There are various other options to solve the (severe!) performance
breakdown:

1- glibc could start not using MAP_32BIT for 64-bit thread stacks (the
boxes where context-switching is slow probably do not matter all that
much anymore - they were very slow at everything 64-bit anyway)

Pros: easiest solution.
Cons: slows down the affected machines and needs a new glibc.

2- We could introduce a new MAP_64BIT_STACK flag, which we could
propagate into MAP_32BIT on those old CPUs. It would be
disregarded on modern CPUs and thread stacks would be 64-bit.

Pros: cleanest solution.
Cons: needs both new glibc and new kernel to take advantage of.

3- We could detect the first-4G-is-full condition and cache it. Problem
is, there will likely be small holes in it so it's rather hard to do
it in a sane way. Also, every munmap() of a thread stack will
invalidate this - triggering a slow linear search every now and then.

Pros: only needs a new kernel to take advantage of.
Cons: is the most complex and messiest solution with no clear
benefit to other workloads. Also, does not 100% solve the
performance problem and prolongs the 4GB stack threads
hack.
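A userspace analogue of this caching idea, as a hedged sketch (names are mine; this is not the kernel-side cache, just an illustration of the same trade-off, including the invalidation caveat):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0
#endif

/* Once a MAP_32BIT attempt fails, remember it so later calls issue
 * only one mmap(). As noted above, a real implementation would have
 * to invalidate this on any munmap() below 4GB, so it does not fully
 * solve the problem. */
static int low_4gb_exhausted;

static void *alloc_stack_cached(size_t size) {
  void *sto = MAP_FAILED;
  if (!low_4gb_exhausted) {
    sto = mmap(0, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    if (sto == MAP_FAILED)
      low_4gb_exhausted = 1;
  }
  if (sto == MAP_FAILED)
    sto = mmap(0, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return sto;
}
```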

i'd go for 1) or 2).

Ingo

2008-08-13 13:37:44

by Arjan van de Ven

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

On Wed, 13 Aug 2008 12:44:45 +0200
Ingo Molnar <[email protected]> wrote:


> There are various other options to solve the (severe!) performance
> breakdown:
>
> 1- glibc could start not using MAP_32BIT for 64-bit thread stacks
> (the boxes where context-switching is slow probably do not matter all
> that much anymore - they were very slow at everything 64-bit anyway)
>
> Pros: easiest solution.
> Cons: slows down the affected machines and needs a new glibc.
>
>
> i'd go for 1) or 2).

I would go for 1) clearly; it's the cleanest thing going forward for
sure.



--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-08-13 14:26:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> Arjan van de Ven wrote:
> >> i'd go for 1) or 2).
> >
> > I would go for 1) clearly; it's the cleanest thing going forward for
> > sure.
>
> I want to see numbers first. If there are problems visible I
> definitely would want to see 2. Andi, at the time I wrote that code,
> was very adamant that I use the flag.

not sure exactly what numbers you mean, but there are lots of numbers in
the first mail, attached below. For example:

| As example, in one case creating new threads goes from about 35,000
| cycles up to about 25,000,000 cycles -- which is under 100 threads per
| second. Larger stacks reduce the severity of slowdown but also make

being able to create only 100 threads per second brings us back to 33
MHz 386 DX Linux performance.

Ingo

---------------------->

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL's pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles -- which is under 100 threads per
second. Larger stacks reduce the severity of slowdown but also make
slowdown happen after allocating a few thousand threads. Costs vary
with platform, stack size, etc., but thread allocation rates drop
suddenly on all of a half-dozen platforms I tried.

The cause is that NPTL allocates stacks with code of the form (e.g., glibc
2.7 nptl/allocatestack.c):

sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
if (sto == MAP_FAILED)
    sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location. Thus, once low addresses run
out, every stack allocation does a failing mmap() followed by a
successful mmap(). The failing mmap() is slow because it does a
linear search of all low-space vma's.

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits. Slow
allocation was discussed in 2003 but without resolution. See, e.g.,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html. With
increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks. I measured
thread-to-thread context switches on two AMD processors and five Intel
processors. Tests used the same code with 32b or 64b stack pointers;
tests covered varying numbers of threads switched and varying methods
of allocating stacks. Two systems gave indistinguishable performance
with 32b or 64b stacks, four gave 5%-10% better performance using 64b
stacks, and of the systems I tested, only the P4 microarchitecture
x86-64 system gave better performance for 32b stacks, in that case
vastly better. Most systems had thread-to-thread switch costs around
800-1200 cycles. The P4 microarchitecture system had 32b context
switch costs around 3,000 cycles and 64b context switches around 4,800
cycles.

It appears the kernel's 64-bit switch path handles all 32-bit cases.
So on machines with a fast 64-bit path, context switch speed would
presumably be improved yet further by eliminating the special 32-bit
path. It appears this would also collapse the task state's fs and
fsindex fields, and the gs and gsindex fields. These could further
reduce memory, cache, and branch predictor pressure.

Various things would address the slow pthread_create(). Choices include:
- Be more platform-aware about when to use MAP_32BIT.
- Abandon use of MAP_32BIT entirely, with worse performance on some machines.
- Change the mmap() algorithm to be faster on allocation failure
(avoid a linear search of vmas).

Options to improve context switch times include:

- Do nothing.
- Be more platform-aware about when to use different 32b and 64b paths.
- Get rid of the 32b path, which it appears would also make contexts smaller.

[Not] Attached is a program to measure context switch costs.

2008-08-13 14:28:21

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Arjan van de Ven wrote:
>> i'd go for 1) or 2).
>
> I would go for 1) clearly; it's the cleanest thing going forward for
> sure.

I want to see numbers first. If there are problems visible I definitely
would want to see 2. Andi, at the time I wrote that code, was very
adamant that I use the flag.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 14:36:57

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ingo Molnar wrote:
> not sure exactly what numbers you mean, but there are lots of numbers in
> the first mail, attached below. For example:

I mean numbers indicating that it doesn't hurt performance on any of
today's machines. If there are machines where it makes a difference
then we need the flag to indicate the _preference_ for a low stack, as
opposed to indicating a _requirement_.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 15:11:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> Ingo Molnar wrote:
> > not sure exactly what numbers you mean, but there are lots of numbers in
> > the first mail, attached below. For example:
>
> I mean numbers indicating that it doesn't hurt performance on any of
> today's machines. If there are machines where it makes a difference
> then we need the flag to indicate the _preference_ for a low stack, as
> opposed to indicating a _requirement_.

there were a few numbers about that as well, and a test-app. The test
app is below. The numbers were:

| I measured thread-to-thread context switches on two AMD processors and
| five Intel processors. Tests used the same code with 32b or 64b stack
| pointers; tests covered varying numbers of threads switched and
| varying methods of allocating stacks. Two systems gave
| indistinguishable performance with 32b or 64b stacks, four gave 5%-10%
| better performance using 64b stacks, and of the systems I tested, only
| the P4 microarchitecture x86-64 system gave better performance for 32b
| stacks, in that case vastly better. Most systems had thread-to-thread
| switch costs around 800-1200 cycles. The P4 microarchitecture system
| had 32b context switch costs around 3,000 cycles and 64b context
| switches around 4,800 cycles.

i find it pretty unacceptable these days that we limit any aspect of
pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
[other than the small execution model which is 2GB obviously.]

Ingo

--------------------->
// switch.cc -- measure thread-to-thread context switch times
// using either low-address stacks or high-address stacks

#include <sys/mman.h>
#include <sys/types.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

const int kRequestedSwaps = 10000;
const int kNumThreads = 2;
const int kRequestedSwapsPerThread = kRequestedSwaps / kNumThreads;
const int kStackSize = 64 * 1024;
const int kTrials = 100;



typedef long long Tsc;
// (2^62 - 1); parenthesized so the shift binds before the subtraction.
#define LARGEST_TSC ((static_cast<Tsc>(1) << (8 * sizeof(Tsc) - 2)) - 1)

Tsc now() {
unsigned int eax_lo, edx_hi;
Tsc now;
asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
now = ((Tsc)eax_lo) | ((Tsc)(edx_hi) << 32);
return now;
}



// Use 0/1 for size to allow array subscripting.
const int pointer_sizes[] = { 32, 64 };
#define SZ_N (sizeof(pointer_sizes) / sizeof(pointer_sizes[0]))
typedef int PointerSize;

PointerSize address_size(const void *vaddr) {
intptr_t iaddr = reinterpret_cast<intptr_t>(vaddr);
return ((iaddr >> 32) == 0) ? 0 : 1;
}



// One instance pointed to by every PerThread.
struct SharedArgs {
// Read-only during a given test:
cpu_set_t cpu; // Only one bit set; all threads run on this CPU.

// Read/write during a given test:
pthread_barrier_t start_barrier;
pthread_barrier_t stop_barrier;
};

// One per thread.
struct PerThread {
// Thread args
SharedArgs *shared_args;
Tsc *stamps;

// Per-thread storage.
pthread_t thread;
void *stack[SZ_N]; // mmap()'d storage
pthread_attr_t attr;
};



// Distinguish between start/stop timestamp for each iteration
typedef enum { START, STOP } StartStop;

// Record each timestamp in isolation for minimum runtime cache footprint;
// after a run, copy each timestamp to one of these so we can sort and also
// track start/stop, etc.
struct Event {
Tsc time;
StartStop start_stop;
int thread_num;
int iter;
};

// Sort events in increasing time order.
int event_pred(const void *ve0, const void *ve1) {
const Event *e0 = static_cast<const Event *>(ve0);
const Event *e1 = static_cast<const Event *>(ve1);
// Compare explicitly: subtracting and truncating to int could
// misorder events whose timestamps differ by more than 2^31 cycles.
if (e0->time < e1->time) return -1;
if (e0->time > e1->time) return 1;
return 0;
}

// Data to aggregate across runs. Print only after runs are all over, in order
// to minimize possible overlap of I/O and benchmark.
struct Result {
int pointer_size;
int swaps;
Tsc fastest;
};



// Each thread runs this worker.
void *worker(void *v_per_thread) {
const PerThread *per_thread = static_cast<const PerThread *>(v_per_thread);
SharedArgs *shared_args = per_thread->shared_args;

// Run all threads on the same CPU.
const cpu_set_t *cpu = &shared_args->cpu;
int cc = sched_setaffinity(0/*self*/, sizeof(*cpu), cpu);
if (cc != 0) {
perror("sched_setaffinity");
exit(1);
}

// Wait for all workers to be ready before running the inner loop.
cc = pthread_barrier_wait(&shared_args->start_barrier);
if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
perror("pthread_barrier_wait");
exit(1);
}

// Inner loop: track time before and after a swap. In principle we
// can use just one timestamp per iteration, but that gives more
// variance between timestamps from overheads such as cache misses
// not related to the context switch.
Tsc *stamp = per_thread->stamps;
for (int i = 0; i < kRequestedSwapsPerThread; ++i) {
// Run timed critical section in as much isolation as possible.
// Notably, read stamps but avoid saving them to memory and taking
// cache misses until after both %tsc reads.
asm volatile ("nop" ::: "memory");
Tsc start = now();
sched_yield();
Tsc stop = now();
asm volatile ("nop" ::: "memory");
*stamp++ = start;
*stamp++ = stop;
}

// Release the manager to clean up.
cc = pthread_barrier_wait(&shared_args->stop_barrier);
if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
perror("pthread_barrier_wait");
exit(1);
}

return NULL;
}


// Manager code that creates and starts worker threads, waits, then cleans up.
void run_test(PerThread *per_thread, PointerSize ps) {
// Create worker threads.
for (int th = 0; th < kNumThreads; ++th) {
int cc = pthread_attr_setstack(&per_thread[th].attr,
per_thread[th].stack[ps], kStackSize);
if (cc != 0) {
perror("pthread_attr_setstack");
exit(1);
}

cc = pthread_create(&per_thread[th].thread, &per_thread[th].attr,
worker, &per_thread[th]);
if (cc != 0) {
perror("pthread_create");
exit(1);
}
}

// Release all worker threads to run their inner loop,
// then wait for all to finish before joining any.
SharedArgs *shared_args = per_thread->shared_args;
int cc = pthread_barrier_wait(&shared_args->start_barrier);
if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
perror("pthread_barrier_wait");
exit(1);
}
cc = pthread_barrier_wait(&shared_args->stop_barrier);
if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
perror("pthread_barrier_wait");
exit(1);
}

// Clean up worker threads.
for (int th = 0; th < kNumThreads; ++th) {
int cc = pthread_join(per_thread[th].thread, NULL);
if (cc != 0) {
perror("pthread_join");
exit(1);
}
}
}


// After a run, find out which sched_yield() calls actually did a yield,
// then find out the fastest sched_yield() that occurred during the run.
Result process_data(Event *event, const PerThread per_thread[],
int requested_swaps_per_thread, PointerSize pointer_size) {
// Copy timestamps in to a struct to associate timestamps with thread number.
int event_num = 0;
for (int th = 0; th < kNumThreads; ++th) {
const Tsc *stamps = per_thread[th].stamps;
int stamp_num = 0;
StartStop start_stop = START;
// 2* because there's a start stamp and stop stamp for each swap
for (int iter = 0; iter < (2 * requested_swaps_per_thread); ++iter) {
event[event_num].time = stamps[stamp_num++];
event[event_num].start_stop = start_stop;
start_stop = (start_stop == START) ? STOP : START;
event[event_num].thread_num = th;
event[event_num].iter = iter;
++event_num;
}
}
int num_events = event_num;

// Sort data in timestamp order.
qsort(event, num_events, sizeof(event[0]), event_pred);

// A context switch occurred if two adjacent stamps are for
// different threads. A requested context switch very likely
// occurred if a context switch was between a START stamp in the
// first thread and a STOP stamp in the second. Note that some
// non-requested context switches also get logged. As example, a
// preemptive context switch could have occurred, and the following
// sched_yield() may have done a yield-to-self.
Tsc fastest = LARGEST_TSC;
int swaps = 0;
for (int e = 0; e < (num_events - 1); ++e) {
if ((event[e].thread_num != event[e+1].thread_num) &&
(event[e].start_stop == START) && (event[e+1].start_stop == STOP)) {
++swaps;
Tsc t = event[e+1].time - event[e].time;
if (t < fastest)
fastest = t;
}
}

Result result;
result.pointer_size = pointer_size;
result.swaps = swaps;
result.fastest = fastest;
return result;
}


// Dump results for one run. Also aggregate "best of best" and "worst of best".
void dump_one_run(Tsc best[SZ_N], Tsc worst[SZ_N], int trial_num,
const Result *result) {
Tsc t = result->fastest;
PointerSize ps = result->pointer_size;
int cc = printf("run: %d pointer-size: %d requested-swaps: %d got-swaps: %d fastest: %lld\n",
trial_num, pointer_sizes[ps],
kRequestedSwaps, result->swaps, result->fastest);
if (cc < 0) {
perror("printf");
exit(1);
}
if (t < best[ps])
best[ps] = t;
if (t > worst[ps])
worst[ps] = t;
}

void *mmap_stack(PointerSize pointer_size) {
int location_flag;
switch(pointer_sizes[pointer_size]) {
case 32: location_flag = MAP_32BIT; break;
case 64: location_flag = 0x0; break;
default:
fprintf(stderr, "Implementation error: unhandled stack placement\n");
exit(1);
}

void *stack = mmap(0, kStackSize, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|location_flag, -1/*fd: none for anonymous*/, 0);
if (stack == MAP_FAILED) {
perror("mmap");
exit(1);
}

// Check we got the stack location we requested
PointerSize got = address_size(stack);
if (got != pointer_size) {
// Note: MS Windows and Linux are asymmetrical about %p: one prints
// with a leading 0x, the other does not. Assume here it does not matter.
fprintf(stderr, "Did not get requested pointer size\n");
exit(1);
}

return stack;
}

void munmap_stack(void *stack) {
int cc = munmap(stack, kStackSize);
if (cc != 0) {
perror("munmap");
exit(1);
}
}

int main(int argc, char **argv) {
SharedArgs shared_args;

// Find the highest-numbered CPU; all threads run on that CPU only.
{
cpu_set_t set;
int sz = sched_getaffinity(0, sizeof(set), &set);
// Documentation says sched_getaffinity() returns the size used by
// the kernel, but by experiment it returns zero on some 2.6.18
// systems, though with a sensible mask nonetheless.
if (sz < 0) {
perror ("sched_getaffinity");
exit(1);
}
// Find an available processor/core. If possible grab something other
// than CPU 0 to minimize interference from interrupts preferentially
// delivered to core 0.
int proc;
for (proc=CPU_SETSIZE-1; proc>=0; --proc)
if (CPU_ISSET(proc, &set))
break;
if (proc < 0) {
fprintf (stderr, "No virtual processors!?\n");
exit(1);
}
CPU_ZERO(&shared_args.cpu);
CPU_SET(proc, &shared_args.cpu);
}

// Reusable per-thread setup
PerThread per_thread[kNumThreads];
for (int th = 0; th < kNumThreads; ++th) {
per_thread[th].stamps = new Tsc[2 * kRequestedSwaps];
per_thread[th].shared_args = &shared_args;
for (int ps = 0; ps < SZ_N; ++ps)
per_thread[th].stack[ps] = mmap_stack(static_cast<PointerSize>(ps));
int cc = pthread_attr_init(&per_thread[th].attr);
if (cc != 0) {
perror("pthread_attr_init");
exit(1);
}
}

// Storage for post-processing timestamps from one trial run.
// 2 stamps per iteration. 'new' the storage since long runs
// otherwise overflow the stack.
Event *event = new Event[kNumThreads * (2 * kRequestedSwaps)];

// Post-processed data for all trial runs. Written during the "run
// tests" phase and read during the "dump data" phase.
const int kNumRuns = kTrials * SZ_N;
Result result[kNumRuns];
int result_num = 0;

// Pthread barriers are cyclic, so we can reuse them. +1 for the manager thread
pthread_barrier_init(&shared_args.start_barrier, NULL, kNumThreads + 1);
pthread_barrier_init(&shared_args.stop_barrier, NULL, kNumThreads + 1);

// Warming runs
{
run_test(per_thread, static_cast<PointerSize>(0/*32b*/));
run_test(per_thread, static_cast<PointerSize>(1/*64b*/));
}

// Run tests
for (int trial = 0; trial < kTrials; ++trial) {
int requested_swaps_per_thread = kRequestedSwaps / kNumThreads;
for (int ps = 0; ps < SZ_N; ++ps) {
PointerSize pointer_size = static_cast<PointerSize>(ps);
run_test(per_thread, pointer_size);

// Process data and save to RAM. Do not do explicit I/O here, on the
// basis that background activity may interfere with context switches.
result[result_num++] = process_data(event,
per_thread,
requested_swaps_per_thread,
pointer_size);
}
}

// Cleanup
pthread_barrier_destroy(&shared_args.start_barrier);
pthread_barrier_destroy(&shared_args.stop_barrier);

for (int th = 0; th < kNumThreads; ++th) {
delete[] per_thread[th].stamps;
for (int ps = 0; ps < SZ_N; ++ps)
munmap_stack(per_thread[th].stack[ps]);
int cc = pthread_attr_destroy(&per_thread[th].attr);
if (cc != 0) {
perror("pthread_attr_destroy");
exit(1);
}
}
delete[] event;

// Dump data from RAM to stdout.
Tsc best[SZ_N] = { LARGEST_TSC, LARGEST_TSC };
Tsc worst[SZ_N] = { 0, 0 };
for (int r = 0; r < result_num; ++r)
dump_one_run(best, worst, r, &result[r]);
for (int sz = 0; sz < SZ_N; ++sz) {
int cc = printf("best-of-best[%d]: %lld\nworst-of-best[%d]: %lld\n",
pointer_sizes[sz], best[sz], pointer_sizes[sz], worst[sz]);
if (cc < 0) {
perror("printf");
exit(1);
}
}
}

2008-08-13 15:23:31

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ingo Molnar wrote:
> i find it pretty unacceptable these days that we limit any aspect of
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).

Sure, but if we can pin-point the sub-archs for which it is the problem
then a flag to optionally request it is even easier to handle. You'd
simply ignore the flag for anything but the P4 architecture.

I personally have no problem removing the whole thing because I have no
such machine running anymore. But there are people out there who have.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 15:41:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> Ingo Molnar wrote:
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> Sure, but if we can pin-point the sub-archs for which it is the
> problem then a flag to optionally request it is even easier to handle.
> You'd simply ignore the flag for anything but the P4 architecture.

i suspect you are talking about option #2 i described. It is the option
which will take the most time to trickle down to people.

> I personally have no problem removing the whole thing because I have
> no such machine running anymore. But there are people out there who
> have.

hm, i think the set of people running on such boxes _and_ then upgrading
to a new glibc and expecting everything to be just as fast to the
microsecond as before should be minuscule. Those P4-derived 64-bit boxes
were astonishingly painful in 64-bit mode - most of that hw is running
32-bit i suspect, because 64-bit on it was really a joke.

Btw., can you see any problems with option #1: simply removing MAP_32BIT
from 64-bit stack allocations in glibc unconditionally? It's the fastest
to execute and also the most obvious solution. A +1 usec overhead in the
64-bit context-switch path on those old slow boxes won't matter much.

10 _millisecs_ to start a single thread on top-of-the-line hw is quite
unacceptable. (and there's little sane we can do in the kernel about
allocation overhead when we have an imperfectly filled 4GB box for all
allocations)

Ingo

2008-08-13 15:56:44

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ingo Molnar wrote:
> Btw., can you see any problems with option #1: simply removing MAP_32BIT
> from 64-bit stack allocations in glibc unconditionally?

Yes, as we both agree, there are still such machines out there.

The real problem is: what to do if somebody complains? If we would have
the extra flag such people could be accommodated. If there is no such
flag then distributions cannot just add the flag (it's part of the
kernel API) and they would be caught between a rock and a hard place.
Option #2 provides the biggest flexibility.

If the upstream kernel truly doesn't care about such machines anymore,
there are two options:

- really do nothing at all

- at least reserve a flag in case somebody wants/has to implement option
#2

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 16:03:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> Ingo Molnar wrote:
> > Btw., can you see any problems with option #1: simply removing MAP_32BIT
> > from 64-bit stack allocations in glibc unconditionally?
>
> Yes, as we both agree, there are still such machines out there.
>
> The real problem is: what to do if somebody complains? If we would
> have the extra flag such people could be accommodated. If there is no
> such flag then distributions cannot just add the flag (it's part of
> the kernel API) and they would be caught between a rock and a hard
> place. Option #2 provides the biggest flexibility.
>
> If the upstream kernel truly doesn't care about such machines anymore,
> there are two options:
>
> - really do nothing at all

do nothing at all is not an option - thread creation can take 10 msecs
on top-of-the-line hardware.

> - at least reserve a flag in case somebody wants/has to implement option
> #2

yeah, i already had a patch for that when i wrote my first mail
[attached below] and listed it as option #4 - then erased the comment
figuring that we'd want to do #1 ;-)

As unimplemented flags just get ignored by the kernel, if this flag goes
into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a
plain old 64-bit [47-bit] allocation), then you could do the glibc
change straight away, correct? Then if people complain we can fix it
purely in the kernel.
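The ignored-unknown-flag behavior this relies on can be sketched as follows (the 0x20000 value matches the patch below; the assumption is that kernels without the feature simply disregard the bit for MAP_PRIVATE mappings):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* The flag value proposed for MAP_64BIT_STACK. Kernels that do not
 * implement it disregard the unknown bit, so the mapping still
 * succeeds (it just comes back as a plain 64-bit allocation). */
#define MAP_64BIT_STACK_PROPOSED 0x20000

static void *map_with_proposed_flag(size_t size) {
  return mmap(0, size, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_64BIT_STACK_PROPOSED,
              -1, 0);
}
```

So glibc could start passing the flag immediately; on kernels that never learn about it, behavior is identical to a plain allocation.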

how about this then?

Ingo

--------------------->
Subject: mmap: add MAP_64BIT_STACK
From: Ingo Molnar <[email protected]>
Date: Wed Aug 13 12:41:54 CEST 2008

Signed-off-by: Ingo Molnar <[email protected]>
---
include/asm-x86/mman.h | 1 +
1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/mman.h
===================================================================
--- linux.orig/include/asm-x86/mman.h
+++ linux/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_64BIT_STACK 0x20000 /* give out 32bit addresses on old CPUs */

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

2008-08-13 16:09:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ulrich Drepper wrote:
> Ingo Molnar wrote:
>> i find it pretty unacceptable these days that we limit any aspect of
>> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> Sure, but if we can pin-point the sub-archs for which it is the problem
> then a flag to optionally request it is even easier to handle. You'd
> simply ignore the flag for anything but the P4 architecture.
>
> I personally have no problem removing the whole thing because I have no
> such machine running anymore. But there are people out there who have.
>

This could also be done entirely in glibc (thus removing the dependency
on the kernel): set the flag if and only if you detect a P4 CPU. You
don't even need to enumerate all the CPUs in the system (which would be
more painful) if you make the CPUID test wide enough that it catches all
compatible CPUs.

-hpa

2008-08-13 17:10:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?



On Wed, 13 Aug 2008, Ulrich Drepper wrote:
>
> The real problem is: what to do if somebody complains?

Ulrich, I don't understand why you worry more about a _potential_ (and
fairly unlikely) complaint, than about a real one today.

Thinking ahead may be good, but you take it to absolutely ridiculous
heights, to the point where you make potential problems be bigger than
-actual- problems.

Linus

2008-08-13 18:05:43

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Linus Torvalds wrote:
> Ulrich, I don't understand why you worry more about a _potential_ (and
> fairly unlikely) complaint, than about a real one today.

Of course I care. All I try to do is to prevent going from one extreme
(all focus on P4s) to the other (ignore P4s completely).

Even ignoring this one case here, I think it's in any case useful for
userlevel to tell the kernel that an anonymous memory region is needed
for a stack. This might allow better optimizations and/or security
implementations.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 18:17:04

by Arjan van de Ven

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

On Wed, 13 Aug 2008 11:04:29 -0700
Ulrich Drepper <[email protected]> wrote:

> Linus Torvalds wrote:
> > Ulrich, I don't understand why you worry more about a _potential_
> > (and fairly unlikely) complaint, than about a real one today.
>
> Of course I care. All I try to do is to prevent going from one
> extreme (all focus on P4s) to the other (ignore P4s completely).

(fwiw as far as I know this is only about early 64 bit P4s, not later
generations)
>
> Even ignoring this one case here, I think it's in any case useful for
> userlevel to tell the kernel that an anonymous memory region is needed
> for a stack. This might allow better optimizations and/or security
> implementations.

yeah maybe we should also tell it we expect it to be used downwards.
Oh wait.. MAP_GROWSDOWN ?

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-08-13 18:23:28

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Arjan van de Ven wrote:
> yeah maybe we should also tell it we expect it to be used downwards.
> Oh wait.. MAP_GROWSDOWN ?

MAP_GROWSDOWN is unusable because we have to allocate the entire address
range for the stack. Otherwise some other allocation happens in that
range and all of a sudden the stack cannot grow as much as needed anymore.

These flags really can be removed. They should not be used because they
are outright dangerous.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-08-13 20:43:25

by Andi Kleen

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ingo Molnar <[email protected]> writes:
>
> i find it pretty unacceptable these days that we limit any aspect of
> pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).

It's not limited to 2GB, there's a fallback to >4GB of course. Ok
admittedly the fallback is slow, but it's there.

I would prefer not to slow down the P4s. There are **lots** of them in
the field, and they still run 64-bit quite well. Also, back then I
benchmarked on early K8 and it made a difference there too (though I
admit I forgot the numbers).

I think it would be better to fix the VM, because there are other
applications that prefer to allocate in a lower area. For example, Java
JVMs now widely use a technique called pointer compression, where they
dynamically adjust the pointer size based on how much memory the
process uses. For that you have to get low memory in the 47-bit VM too.
The VM should deal with that gracefully.

To be honest, I always thought the linear search in the VMA list was a
little dumb. I'm sure there are other cases where it hurts too. Perhaps
this would really be an opportunity to do something about it :)

-Andi

2008-08-13 20:58:51

by Andrew Morton

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

On Wed, 13 Aug 2008 22:42:48 +0200
Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok
> admittedly the fallback is slow, but it's there.
>
> I would prefer to not slow down the P4s. There are **lots** of them in
> field. And they ran 64bit still quite well. Also back then I
> benchmarked on early K8 and it also made a difference there (but I
> admit I forgot the numbers)
>
> I think it would be better to fix the VM because there are
> other use cases of applications who prefer to allocate in a lower area.
> For example Java JVMs now widely use a technique called pointer
> compression where they dynamically adjust the pointer size based
> on how much memory the process uses. For that you have to get
> low memory in the 47bit VM too. The VM should deal with that gracefully.
>
> To be honest I always thought the linear search in the VMA list
> was a little dumb. I'm sure there are other cases where it hurts
> too. Perhaps this would be really an opportunity to do something about it :)
>

Yes, the free_area_cache is always going to have failure modes - I
think we've been kind of waiting for it to explode.

I do think that we need an O(log(n)) search in there. It could still
be on the fallback path, so we retain the mostly-O(1) benefits of
free_area_cache.

2008-08-13 21:45:46

by Andi Kleen

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

> Yes, the free_area_cache is always going to have failure modes - I
> think we've been kind of waiting for it to explode.
>
> I do think that we need an O(log(n)) search in there. It could still
> be on the fallback path, so we retain the mostly-O(1) benefits of
> free_area_cache.

The standard dumb way to do that would be to have two parallel trees:
one to index free space (similar to, e.g., the free-space btrees in
XFS) and the other to index the objects (like today). That would
increase the constant factor somewhat by bloating the VMAs, increasing
cache overhead, etc., and it would be more brute force than elegant.
But it would be simple and straightforward.

Perhaps the combined data-structure experience of linux-kernel can come
up with something better: a data structure that allows looking up both
efficiently?

This would also be an opportunity to reevaluate rbtrees for the object
index. One drawback of rbtrees is that they are not really optimized to
be cache friendly, because their nodes are too small.

-Andi

2008-08-15 12:44:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
> >
> > i find it pretty unacceptable these days that we limit any aspect of
> > pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
>
> It's not limited to 2GB, there's a fallback to >4GB of course. Ok
> admittedly the fallback is slow, but it's there.

Of course - what you are missing is that _10 milliseconds_ of thread
creation overhead is completely unacceptable: it is so bad that we
might as well not support it at all.

> I would prefer to not slow down the P4s. There are **lots** of them in
> field. And they ran 64bit still quite well. [...]

Nonsense, i had such a P4 based 64-bit box and it was painful. Everyone
with half a brain used them as 32-bit machines. Nor is the
context-switch overhead in any way significant. Plus, as Arjan
mentioned, only the earliest P4 64-bit CPUs had this problem.

> [...] Also back then I benchmarked on early K8 and it also made a
> difference there (but I admit I forgot the numbers)

that's a lot of handwaving with no actual numbers. The numbers in this
discussion show that the context-switch overhead is small and that the
overhead on perfectly good systems that hit this limit is absurdly
high.

I'd love to zap MAP_32BIT this very minute from the kernel, but you
originally shaped the whole thing in such a stupid way that makes its
elimination impossible now due to ABI constraints. It would have cost
you _nothing_ to have added MAP_64BIT_STACK back then, but the quick &
sloppy solution was to reuse MAP_32BIT for 64-bit tasks. And you are
stupid about it even now. Bleh.

The correct solution is to eliminate this flag from glibc right now, and
maybe add the MAP_64BIT_STACK flag as well, as i posted it - if anyone
with such old boxes still cares (i doubt anyone does). That flag then
will take its usual slow route. Ulrich?

Ingo

2008-08-15 13:32:12

by Andi Kleen

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

On Fri, Aug 15, 2008 at 02:43:50PM +0200, Ingo Molnar wrote:
> i had such a P4 based 64-bit box and it was painful.

I used them as 64bit machines and they weren't painful at all.

> I'd love to zap MAP_32BIT this very minute from the kernel, but you
> originally shaped the whole thing in such a stupid way that makes its
> elimination impossible now due to ABI constraints. It would have cost

MAP_32BIT was not actually added for this originally. It
was originally added for the X server's old dynamic loader, which
needed 2GB memory.

Its main failing, which I freely admit, was to not call it MAP_31BIT.

> you _nothing_ to have added MAP_64BIT_STACK back then, but the quick &

Not sure what the semantics of that would be. To me it seems ugly to
hardcode specific stack semantics in the kernel ("mechanism, not
policy").

But for most of the possible semantics I can think of, the data
structure would still need to be fixed, I think.

> The correct solution is to eliminate this flag from glibc right now, and

IMHO the correct solution is to fix the data structure to not have such
a bad complexity in this corner case. We typically do this for all
other data structures as we discover such cases. No reason the VMAs
should be any different.

-Andi

2008-08-15 15:55:45

by Jamie Lokier

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Ingo Molnar wrote:
> As unimplemented flags just get ignored by the kernel, if this flag goes
> into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a
> plain old 64-bit [47-bit] allocation), then you could do the glibc
> change straight away, correct? So then if people complain we can fix it
> in the kernel purely.
>
> how about this then?

> +#define MAP_64BIT_STACK 0x20000 /* give out 32bit addresses on old CPUs */

I think the flag makes sense, but its name is confusing - 64BIT for a
flag which means "maybe request a 32-bit stack"! Suggest:

+#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
+ /* whichever is faster on this CPU */

Also, is this _only_ useful for thread stacks, or are there other
memory allocations where 31-bitness affects execution speed on old P4s?

-- Jamie

2008-08-15 16:04:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Jamie Lokier <[email protected]> wrote:

> > how about this then?
>
> > +#define MAP_64BIT_STACK 0x20000 /* give out 32bit addresses on old CPUs */
>
> I think the flag makes sense, but its name is confusing - 64BIT for a
> flag which means "maybe request a 32-bit stack"! Suggest:
>
> +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> + /* whichever is faster on this CPU */

ok. I've applied the patch below to tip/x86/urgent.

> Also, is this _only_ useful for thread stacks, or are there other
> memory allocations where 31-bitness affects execution speed on old
> P4s?

just about anything i guess - but since those CPUs do not really matter
anymore in terms of bleeding-edge performance, what we care about is the
intended current use of this flag: thread stacks.

Ingo

-------------------->
From 4812c2fddc7f5a3a4480d541a4cb2b7e4ec21dcb Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Wed, 13 Aug 2008 18:02:18 +0200
Subject: [PATCH] x86: add MAP_STACK mmap flag

as per this discussion:

http://lkml.org/lkml/2008/8/12/423

Pardo reported that 64-bit threaded apps, if their stacks exceed the
combined size of ~4GB, slow down drastically in pthread_create() - because
glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
a legacy hack - to speed up context switching on certain early model
64-bit P4 CPUs.

So introduce a new flag to be used by glibc instead, to not constrain
64-bit apps like this.

glibc can switch to this new flag straight away - it will be ignored
by the kernel. If those old CPUs ever matter to anyone, support for
it can be implemented.

Signed-off-by: Ingo Molnar <[email protected]>
---
include/asm-x86/mman.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/mman.h b/include/asm-x86/mman.h
index c1682b5..e5852b5 100644
--- a/include/asm-x86/mman.h
+++ b/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_STACK 0x20000 /* give out 32bit stack addresses on old CPUs */

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

2008-08-15 17:13:56

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

Jamie Lokier wrote:
> Suggest:
>
> +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> + /* whichever is faster on this CPU */

I agree. Except for the comment.


> Also, is this _only_ useful for thread stacks, or are there other
> memory allocations where 31-bitness affects execution speed on old P4s?

Actually, I would define the flag as "do whatever is best assuming the
allocation is used for stacks".

For instance, minimally, the /proc/*/maps output could show "[user
stack]" or something like this. For security, perhaps setting PROT_EXEC
can be prevented.

2008-08-15 17:19:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> Jamie Lokier wrote:
> > Suggest:
> >
> > +#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
> > + /* whichever is faster on this CPU */
>
> I agree. Except for the comment.
>
>
> > Also, is this _only_ useful for thread stacks, or are there other
> > memory allocations where 31-bitness affects execution speed on old P4s?
>
> Actually, I would define the flag as "do whatever is best assuming the
> allocation is used for stacks".
>
> For instance, minimally, the /proc/*/maps output could show "[user
> stack]" or something like this. For security, perhaps setting PROT_EXEC
> can be prevented.

makes sense. Updated patch below. I've also added your Acked-by. Queued
it up in tip/x86/urgent, for v2.6.27 merging.

( also, just to make sure: all Linux kernel versions will ignore such
extra flags, so you can just update glibc to use this flag
unconditionally, correct? )

Ingo

--------------------------->
From 2fdc86901d2ab30a12402b46238951d2a7891590 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Wed, 13 Aug 2008 18:02:18 +0200
Subject: [PATCH] x86: add MAP_STACK mmap flag

as per this discussion:

http://lkml.org/lkml/2008/8/12/423

Pardo reported that 64-bit threaded apps, if their stacks exceed the
combined size of ~4GB, slow down drastically in pthread_create() - because
glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
a legacy hack - to speed up context switching on certain early model
64-bit P4 CPUs.

So introduce a new flag to be used by glibc instead, to not constrain
64-bit apps like this.

glibc can switch to this new flag straight away - it will be ignored
by the kernel. If those old CPUs ever matter to anyone, support for
it can be implemented.

Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Ulrich Drepper <[email protected]>
---
include/asm-x86/mman.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/mman.h b/include/asm-x86/mman.h
index c1682b5..90bc410 100644
--- a/include/asm-x86/mman.h
+++ b/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

2008-08-15 17:24:20

by Ulrich Drepper

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?

On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <[email protected]> wrote:
> ( also, just to make sure: all Linux kernel versions will ignore such
> extra flags, so you can just update glibc to use this flag
> unconditionally, correct? )

As soon as the patch hits Linus' tree I can change the code.

2008-08-15 19:01:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: pthread_create() slow for many threads; also time to revisit 64b context switch optimization?


* Ulrich Drepper <[email protected]> wrote:

> On Fri, Aug 15, 2008 at 10:19 AM, Ingo Molnar <[email protected]> wrote:
> > ( also, just to make sure: all Linux kernel versions will ignore such
> > extra flags, so you can just update glibc to use this flag
> > unconditionally, correct? )
>
> As soon as the patch hits Linus' tree I can change the code.

it's upstream now:

| commit cd98a04a59e2f94fa64d5bf1e26498d27427d5e7
| Author: Ingo Molnar <[email protected]>
| Date: Wed Aug 13 18:02:18 2008 +0200
|
| x86: add MAP_STACK mmap flag

thanks everyone,

Ingo