2000-12-08 17:24:14

by Peter Berger

Subject: Pthreads, linux, gdb, oh my! (fwd)



Hi. I have the following tiny test program which fails dramatically,
using pthreads, in a number of fascinating ways on various versions of
linux, using various versions of glibc, under various (current) versions
of GDB. I am honestly not sure if this is a linux bug, a glibc bug, or a
gdb bug, but it runs fine under gdb 5.0 under FreeBSD (running the port of
the linux Pthreads package, even).

So I am sending the test code here; can -anyone- get this to run
correctly, on any version of linux, under gdb?

All the program does is create a thread, wait for that thread to exit,
then iterate and do it again, and again, until MAX_COUNT_SEQ_THREADS is
reached. So no more than 2 threads should be running at once.

I have seen two failure modes: on my machine (linux 2.2.5-22, glibc
2.1.1), when run under gdb 5.0, the created pthreads stick around as
zombies until the machine runs out of resources. On some friends'
machines (kernel 2.2.15, glibc 2.1.94), the program creates one pthread,
waits for it to exit, and then exits.

The code is enclosed at the end of this message. Can people try this out
and let me know what results you get? Does anyone have any opinions as to
where the bug is? And, if the bug is in my code, I will be both relieved
and happy, and look forward to finding out what it is. If it's a kernel
bug, I submit that this makes pthreads unusable, and want to inquire if
anyone is working on fixing this?

Peter Berger, Network Dilettante
http://peterb.telerama.com [email protected]
--------------------------thread_test.c------------------------

#include <pthread.h>
#include <stdio.h>

#define MAX_COUNT_SEQ_THREADS 100000

struct thread_group_s {
    pthread_cond_t cond;
    pthread_mutex_t lock;
    int created;
    int running;
    int done;
};

/*
 * This child thread just runs and exits, always being careful
 * to take the mutex whenever it does anything even
 * remotely interesting.
 */
int
threads_test_count_seq_proc(struct thread_group_s *tg)
{
    int broadcast = 0;

    /* We spend a lot of effort to do nothing here! */
#ifdef DEBUG
    printf("Hello...");
#endif /* DEBUG */
    pthread_mutex_lock(&tg->lock);
    tg->running++;
    if (tg->running >= tg->created) {
        broadcast = 1;
    }
    pthread_mutex_unlock(&tg->lock);
    if (broadcast) {
        pthread_cond_broadcast(&tg->cond);
    }
    broadcast = 0;

    pthread_mutex_lock(&tg->lock);
    tg->done++;
    if (tg->done >= tg->running) {
        broadcast = 1;
    }
    pthread_mutex_unlock(&tg->lock);
    if (broadcast) {
        pthread_cond_broadcast(&tg->cond);
    }
#ifdef DEBUG
    printf("goodbye.\n");
#endif /* DEBUG */
    pthread_exit(0);
}

/*
 * This test is designed to ensure that we can create
 * and destroy threads _in sequence_ for as long as
 * we please. There are only ever 2 threads running
 * at one time. The main routine creates a thread,
 * and waits for it to exit before creating the next
 * one.
 *
 * If you should find an error in my concurrency/mutex
 * handling, please let me know.
 */
int
main(int argc, char argv[])
{
    struct thread_group_s *tg;
    pthread_t *thread;
    pthread_attr_t *attr;

    int i, rc, detached;

    rc = 0; detached = 0;

    thread = (pthread_t *)malloc(sizeof(*thread));
    tg = (struct thread_group_s *)malloc(sizeof(*tg));
    attr = (pthread_attr_t *)malloc(sizeof(*attr));

    /* The mutex and condition variable must be initialized before use. */
    pthread_mutex_init(&tg->lock, NULL);
    pthread_cond_init(&tg->cond, NULL);

    printf("Starting test.\n");
    for(i = 1;
        ((rc == 0) && (i <= MAX_COUNT_SEQ_THREADS));
        i++)
    {
        tg->created = 0; tg->running = 0; tg->done = 0;
        rc = pthread_attr_init(attr);
        if (rc) {
            printf("threads_test: failed initializing pthread attr object: %s\n",
                   strerror(rc));
        }

        rc = pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED);
        if (rc) {
            printf("threads_test: couldn't set thread state to detached: %s\n",
                   strerror(rc));
        }

        /* Let's double-check, just to be paranoid. */
        rc = pthread_attr_getdetachstate(attr, &detached);
        if (detached != PTHREAD_CREATE_DETACHED) {
            printf("threads_test: thread will not be created detached (fatal).\n");
            exit(1);
        }

        /* Create a thread that will run and exit. */
        rc = pthread_create(thread, attr, (void *)threads_test_count_seq_proc, tg);
        if (rc) {
            printf("threads_test: failed creating seq thread #%d with %s\n",
                   i, strerror(rc));
            return(rc);
        }
        pthread_mutex_lock(&tg->lock);
        tg->created++ ;
        pthread_mutex_unlock(&tg->lock);

        printf("\nthreads_test: thread #%d created...", i);

        /* We wait for all (one) of the threads we have created
           to start. */
        pthread_mutex_lock(&tg->lock);
        while(tg->running < tg->created) {
            pthread_cond_wait(&tg->cond, &tg->lock);
        }
        pthread_mutex_unlock(&tg->lock);

        /* Wait for the thread we created to exit. */
        pthread_mutex_lock(&tg->lock);
        while(tg->done < tg->running) {
            pthread_cond_wait(&tg->cond, &tg->lock);
        }
        pthread_mutex_unlock(&tg->lock);

        printf("done. ");

        /*
         * Let's yield just to make sure our other thread has time
         * to clean up.
         */
        rc = sched_yield();
        if (rc) {
            printf("threads_test: error in sched_yield: %s\n", strerror(rc));
        }
    }

    printf("threads_test: Test over.\n");
    return(0);
}




2000-12-08 17:48:46

by Alan

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

> I have seen two failure modes: on my machine (linux 2.2.5-22, glibc
> 2.1.1), when run under gdb 5.0, the created pthreads stick around as

glibc 2.1.1 definitely has problems with several bits of pthreads. You
want 2.1.3 or higher I believe.

> zombies until the machine runs out of resources. On some friends'
> machines (kernel 2.2.15, glibc 2.1.94), the program creates one pthread,
> waits for it to exit, and then exits.
>
> and happy, and look forward to finding out what it is. If it's a kernel
> bug, I submit that this makes pthreads unusable, and want to inquire if
> anyone is working on fixing this?

It's unlikely to be remotely kernel related.

> tg->running++;
> if (tg->running >= tg->created) {

tg->created may be out of date

> /* Create a thread that will run and exit. */
> rc = pthread_create(thread, attr, (void *)threads_test_count_seq_proc, tg

You can create it, count it, then up tg->created out of order



2000-12-08 20:14:26

by Peter Berger

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

On Fri, 8 Dec 2000, Alan Cox wrote:
> > I have seen two failure modes: on my machine (linux 2.2.5-22, glibc
> > 2.1.1), when run under gdb 5.0, the created pthreads stick around as
> glibc 2.1.1 definitely has problems with several bits of pthreads. You
> want 2.1.3 or higher I believe.

So you're saying that you got this to work? Because I certainly couldn't
get it working with a higher version either. I would really love a
positive ack from someone -- anyone -- that can get this working on any
version of linux, with any version of glibc. Likewise, if there is
someone running a 'current' glibc who can verify for me that this fails, I
think that would be a useful datapoint.

> It's unlikely to be remotely kernel related.
I'm not confident that we have enough data to make that assertion yet,
although I'm certainly willing to believe it! Fortunately there are ways
of testing this (I suggest one below).

> tg->created may be out of date
...
> You can create it, count it, then up tg->created out of order

Well, you're right, but this is picking lint. Making this change (see
http://peterb.telerama.com/thread-test.c for the corrected version)
certainly doesn't make the problem go away (nor would I expect it to).

I apologize for my ignorance -- I frankly don't know the intricacies of
linux kernel development; all I know is I wrote what might be the simplest
of all possible concurrency tests and it is failing. If someone could
point me to a version or combination of linux and glibc where it doesn't
fail, I'd be happy.

Glibc 2.2 (allegedly) works on both linux and the Hurd. Are there any
readers of linux-kernel that are running hurd installations? Could you
run my test program under gdb and see if it evinces the same behavior?
Assuming we see the same broken behavior on my linux box with glibc-2.2
(I'm compiling it now..), as on the Hurd box, we can presume it is a glibc
problem. If it works on the Hurd but not on linux with the same glibc, we
can presume it is a linux problem. I'd do the Hurd test myself, but I
haven't yet played with Hurd enough (read: at all) to be confident that I
was setting up the test correctly.

Likewise if the problem magically goes away on my linux box once I use
glibc-2.2, I'll be sure to report back.

-Peter

2000-12-08 20:52:29

by Petr Vandrovec

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

On 8 Dec 00 at 14:43, Peter Berger wrote:
> > tg->created may be out of date
> ...
> > You can create it, count it, then up tg->created out of order
>
> Well, you're right, but this is picking lint. Making this change (see
> http://peterb.telerama.com/thread-test.c for the corrected version)
> certainly doesn't make the problem go away (nor would I expect it to).

Can you tell me again (private, probably), which problem do you have?

After I fixed source to get it to compile with -W -Wall -Werror
(missing includes, wrong parameters to main...), and compiling with
-D_REENTRANT, I received nice ./a.out, which runs under 2.4.0-test12-pre7,
glibc-2.2 both standalone and in gdb (gdb 5.0) (all tools except
kernel as of today woody). In gdb I had to do
'handle SIG32 noprint nostop pass', as by default gdb stops on SIG32
arrival...

Now it runs and runs and runs... I do not see any unreaped children.
After thread 100000 it finished.
Best regards,
Petr Vandrovec
[email protected]

2000-12-08 22:45:59

by Peter Berger

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)


Petr,

Thanks for testing this and finding a working counterexample! I am still
professionally interested to know if the difference is that you are
running a 2.4 kernel, or the glibc. Anyone running a 2.2 kernel with
glibc 2.2 want to drop me a line?

-Peter

(gdb) run
...
[New Thread 25452]
threads_test: thread #210 created...done.
[New Thread 25453]
threads_test: thread #211 created...done.
[New Thread 25454]
threads_test: thread #212 created...done.
Cannot access memory at address 0xa4dffe8c
(gdb)

[3]+ Stopped gdb ./thread
[peterb@deedee src]$ ps axuw
bash: fork: Resource temporarily unavailable
[peterb@deedee src]$

[kill a process]

[peterb@deedee src]$ ps axuw

peterb 25679 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25680 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25681 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25682 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25683 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25684 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
....several hundred lines later...
peterb 25889 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25890 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25891 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25892 0.0 0.0 0 0 ttya0 Z 17:12 0:00 [thread <defunct>]
peterb 25893 0.0 0.2 2704 1060 ttya0 R 17:12 0:00 ps axuw
[2]- Done netscape http://www.slashdot.org (wd: /home/peterb/p4base/tests/common)
(wd now: ~/src)
[

2000-12-08 23:03:10

by David Relson

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

Petr,

It ran fine on my stock Mandrake 7.2 system - linux-2.2.17-21mdk and
glibc-2.2-5mdk. The program ran fine in both environments - command line
and gdb-5.0. Loadavg creeps up slowly as the program continues to run. At
thread #37000, loadavg is 3.65. The ps command indicates 4 threads for the
program (including gdb).

David

At 05:15 PM 12/8/00, Peter Berger wrote:

>Petr,
>
>Thanks for testing this and finding a working counterexample! I am still
>professionally interested to know if the difference is that you are
>running a 2.4 kernel, or the glibc. Anyone running a 2.2 kernel with
>glibc 2.2 want to drop me a line?
>
>-Peter

--------------------------------------------------------
David Relson Osage Software Systems, Inc.
[email protected] 514 W. Keech Ave.
http://www.osagesoftware.com Ann Arbor, MI 48103
voice: 734.821.8800 fax: 734.821.8800

2000-12-09 00:31:34

by Alan

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

> So you're saying that you got this to work? Because I certainly couldn't
> get it working with a higher version either. I would really love a

I read straight down it and realised you referenced obsolete versions of
tg->created and thus broadcast incorrectly

> I apologize for my ignorance -- I frankly don't know the intricacies of
> linux kernel development; all I know is I wrote what might be the simplest
> of all possible concurrency tests and it is failing. If someone could
> point me to a version or combination of linux and glibc where it doesn't
> fail, I'd be happy.

The way it works on the Linux side for threads is


Kernel provides
	Shared resources
	clone() - fork with sharing of files/memory etc

glibc provides
	POSIX semantics
	pthreads API
	thread locking on top of its own spin locks and kernel locks



Alan

2000-12-09 07:05:04

by buhr

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)

Peter Berger <[email protected]> writes:
>
> Hi. I have the following tiny test program which fails dramatically,
> using pthreads, in a number of fascinating ways on various versions of
> linux, using various versions of glibc, under various (current) versions
> of GDB.

It looks like a GDB bug. GDB contains code to recognize when the
"pthreads" shared library has been loaded. When this happens, it sets
itself up to properly handle threads (including setting up correct
SIG32 signal handling). If you trick GDB into thinking "pthreads"
hasn't been loaded and set the SIG32 stuff up yourself, like so:

buhr@saurus:~/src$ gdb thread-test
GNU gdb 19990928
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
(gdb) set auto-solib-add 0
(gdb) handle SIG32 nostop noprint pass
Signal Stop Print Pass to program Description
SIG32 No No Yes Real-time event 32
(gdb) run

then your program works fine. Ergo, this is strong evidence that
GDB's thread handling is broken. Presumably, it's simply not reaping
threads when they exit.

I don't know if Petr just happened to run his GDB 5.0 with "set
auto-solib-add 0" in his ".gdbinit" (a rather necessary precaution if
you're hacking on, say, Mozilla) and didn't see any problem for that
reason, or if some aspect of his configuration has fixed the problem.
Alternatively, maybe Petr is using a GDB that doesn't actually support
threads---if so, he'd *have* to set the SIG32 stuff up manually, and
he wouldn't notice any problem.

Kevin <[email protected]>

2000-12-10 03:15:20

by Peter Berger

Subject: Re: Pthreads, linux, gdb, oh my! (fwd)


> It looks like a GDB bug. GDB contains code to recognize when the
> "pthreads" shared library has been loaded. When this happens, it sets
> itself up to properly handle threads (including setting up correct
> SIG32 signal handling). If you trick GDB into thinking "pthreads"
> hasn't been loaded and set the SIG32 stuff up yourself, like so:
[elided]

Kevin,

This sure looks like it -- I was able to get it working using your
technique. Thank you! It is a relief to know that this was just an
application layer issue rather than something deeper.

My apologies for soaking up cycles on linux-kernel for what turned out to be
a non-kernel issue -- but a big THANKS to everyone that helped track the
problem down -- let me know the next time you're in Pittsburgh, and I'll buy
you a beer (or the beverage of your choice).

-p