2002-03-11 18:21:26

by Brad Pepers

[permalink] [raw]
Subject: Multi-threading

There was a message posted by Jim Starkey about his experiences using threads
on Linux and the problems debugging them. It came down to two things:

1. Using gdb to debug multi-threaded applications was almost useless because
of interactions between the thread model and ptrace.

2. Linux is missing an atomic use-count mechanism which returns values like
the Microsoft InterlockedIncrement/Decrement functions do.

I'll include the text of what he wrote below but I was wondering what anyone
here could say on the subject. The code being worked on is the Firebird
database with FB2 being the next release.

-------------------------------------------------------------------------------
A primary goal of FB2 is to implement fine granuality multi-threading
to take advantage of multiple processors in an SMP configuration.
I've been doing quite a bit of work in that general area on a different
project, and thought I'd share some observations on pitfalls and
challenges, particularly on Linux and similar systems.

The first problem is that gdb (and presumably any other debugger)
can't be used on a multi-threaded system under load. The problem
is an unfortunate interaction between ptrace (the Unix debugging
mechanism) and the Linux implementation of pthreads. The basic
ptrace mechanism works this way: a signal intended for the target
process causes the target process to stall while the signal
is delivered to the debugging process, which figures out what to
do, then restarts the target process. The Linux threading
mechanism uses a separate OS process for each thread. Synchronization
is performed by the pthread mutex calls. When one thread (process)
fails to acquire a mutex, it goes to sleep. When another thread
releases the mutex, it sends a signal to all sleeping threads
(in fact, all threads) to wake up and look around. Unfortunately,
when the target process/threads are under control of the debugger,
there is a very complex multi-process dance involving (apparently)
multiple debugger interactions per wake up. Kinda like the
guys who designed the threads didn't talk to the guys who designed
ptrace or one or the other didn't care.

The bottom line is when debugging a multi-threaded application
under load on Linux, gdb consumes between 45% and 100% of the
cpu. The practical implication is that Linux is not a viable
debugging platform. NT, happily, doesn't exhibit the same
characteristics.

A partial but painful workaround is for the debugging image
to set up handlers for SIGSEGV and SIGILL to invoke gdb via
the system() call to attach to the process. This catches
crashes, but makes ordinary debugging tedious indeed.

A second problem is implementing a use-count mechanism to
control object lifetimes in a multi-threaded environment.
The two alternatives are to use mutexes or other synchronization
mechanisms to protect all addRef/release calls (very, very
expensive) or to use interlocked increment/decrement mechanisms.
Unfortunately, while Microsoft provides intrinsic
InterlockedIncrement/InterlockedDecrement functions that perform
atomic multiprocessor interlocked operations that correctly
return the result of the operation. Unfortunately, there
are no such functions available on Linux. Atomic.h provides
interlocked increment/decrement, but they don't return values.
Interestingly enough, Google couldn't find any example of
the Intel instruction sequences required to implement the
necessary atomic operations using the GNU assembler dialect.

All this means two things. First, there is no alternative
to inline assembler code in FB2. Distasteful, yes, but get
used to it. Second, fine granularity threading is alien
to Linux, so prepared to be a pioneer in a nasty environment
without a workable debugger.

Jim Starkey
----------------------------------------------------------------

--
Brad Pepers
[email protected]


2002-03-11 19:34:18

by Dipankar Sarma

[permalink] [raw]
Subject: Re: Multi-threading


In article <[email protected]> Brad Pepers wrote:
> There was a message posted by Jim Starkey about his experiences using threads
> on Linux and the problems debugging them. It came down to two things:

> 2. Linux is missing an atomic use-count mechanism which returns values like
> the Microsoft InterlockedIncrement/Decrement functions do.

Can't this be done using atomic_dec_and_test() and the likes ?
Google tells me that windoze InterlockedIncrement/Decrement stuff
does the almost same thing. Why can't refcounting be
implemented using just atomic_inc/dec and/or atomic_inc/dec_and_test ?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-03-11 21:55:02

by Andi Kleen

[permalink] [raw]
Subject: Re: Multi-threading

Brad Pepers <[email protected]> writes:
> there is a very complex multi-process dance involving (apparently)
> multiple debugger interactions per wake up. Kinda like the
> guys who designed the threads didn't talk to the guys who designed
> ptrace or one or the other didn't care.

I guess the new futex mechanism that is currently designed/debugged/discussed
will take care of that. It doesn't require signals anymore. Unfortunately
it is probably some time off until it can be used generally, but at least
it is worked on

[the only problem i currently see in using futexes for pthread mutexes is that
if there are really architectures that need a special mapping for them to work
these won't be able to use them for this]

It is known that gdb is not the best thread debugger in the world :/

> A second problem is implementing a use-count mechanism to
> control object lifetimes in a multi-threaded environment.
> The two alternatives are to use mutexes or other synchronization
> mechanisms to protect all addRef/release calls (very, very
> expensive) or to use interlocked increment/decrement mechanisms.
> Unfortunately, while Microsoft provides intrinsic
> InterlockedIncrement/InterlockedDecrement functions that perform
> atomic multiprocessor interlocked operations that correctly
> return the result of the operation. Unfortunately, there
> are no such functions available on Linux. Atomic.h provides
> interlocked increment/decrement, but they don't return values.
> Interestingly enough, Google couldn't find any example of
> the Intel instruction sequences required to implement the
> necessary atomic operations using the GNU assembler dialect.

atomic_dec_and_test() ?

atomic_dec_and_return() doesn't strike me as too useful, because
it would need to lie to you. When you have a reference count
and you unlink the object first from public view you can trust
a 0 check (as atomic_dec_and_test does). As long as the object
is in public view someone else can change the counts and any
"atomic return" of that would be just lying. You can of course
always use atomic_read(), but it's not really atomic. I doubt the
microsoft equivalent is atomic neither, they are probably equivalent
to atomic_inc(); atomic_read(); or atomic_dec(); atomic_read() and
some hand weaving.

BTW regarding atomic.h: while it is nicely usable on i386 in userspace
it isn't completely portable. Some architectures require helper functions
that are hard to implement in user space.

-Andi

2002-03-11 23:54:58

by Brad Pepers

[permalink] [raw]
Subject: Re: Multi-threading

On Monday 11 March 2002 12:36, Dipankar Sarma wrote:
> In article <[email protected]> Brad Pepers
wrote:
> > There was a message posted by Jim Starkey about his experiences using
> > threads on Linux and the problems debugging them. It came down to two
> > things:
> >
> > 2. Linux is missing an atomic use-count mechanism which returns values
> > like the Microsoft InterlockedIncrement/Decrement functions do.
>
> Can't this be done using atomic_dec_and_test() and the likes ?
> Google tells me that windoze InterlockedIncrement/Decrement stuff
> does the almost same thing. Why can't refcounting be
> implemented using just atomic_inc/dec and/or atomic_inc/dec_and_test ?

The atomic_dec_and_test certainly handles the most often used case and is
good enough. Apparently it would be nice to have the value back sometimes
too though.

The atomic_inc_and_test is not much good though since the case you most often
want to track is the 0 to 1 transition and not -1 to 0!

--
Brad Pepers
[email protected]

2002-03-12 00:03:39

by Brad Pepers

[permalink] [raw]
Subject: Re: Multi-threading

On Monday 11 March 2002 14:54, Andi Kleen wrote:
> Brad Pepers <[email protected]> writes:
> > there is a very complex multi-process dance involving (apparently)
> > multiple debugger interactions per wake up. Kinda like the
> > guys who designed the threads didn't talk to the guys who designed
> > ptrace or one or the other didn't care.
>
> I guess the new futex mechanism that is currently
> designed/debugged/discussed will take care of that. It doesn't require
> signals anymore. Unfortunately it is probably some time off until it can be
> used generally, but at least it is worked on

I'll watch the futex development than and wait for it do be available.

> atomic_dec_and_test() ?

Handles the most common case but not general enough for all cases and its
sister function atomic_inc_and_test is pretty useless.

> atomic_dec_and_return() doesn't strike me as too useful, because
> it would need to lie to you. When you have a reference count
> and you unlink the object first from public view you can trust
> a 0 check (as atomic_dec_and_test does). As long as the object
> is in public view someone else can change the counts and any
> "atomic return" of that would be just lying. You can of course
> always use atomic_read(), but it's not really atomic. I doubt the
> microsoft equivalent is atomic neither, they are probably equivalent
> to atomic_inc(); atomic_read(); or atomic_dec(); atomic_read() and
> some hand weaving.

Apparently the Microsoft one really is in Windows 98 and up (not in 95).
I've had it explained that the asm code (semi-pseudo code here) is like this:

movl reg, #-1
lock xadd reg, counter
decl reg
movl result, reg

This is in comparison to the current code which does something like this:

lock decl counter
sete result

I don't see how the first asm code lies to you. It is returning the value as
it was decremented to and the lock on the xadd keeps it safe.

> BTW regarding atomic.h: while it is nicely usable on i386 in userspace
> it isn't completely portable. Some architectures require helper functions
> that are hard to implement in user space.

Its too bad Linux didn't have a nice wrapper around atom inc/dec/test that
was completely portable. Do you know what arch's have trouble implementing
this?

--
Brad Pepers
[email protected]

2002-03-12 07:10:30

by Andi Kleen

[permalink] [raw]
Subject: Re: Multi-threading

On Mon, Mar 11, 2002 at 05:02:50PM -0700, Brad Pepers wrote:
> > atomic_dec_and_return() doesn't strike me as too useful, because
> > it would need to lie to you. When you have a reference count
> > and you unlink the object first from public view you can trust
> > a 0 check (as atomic_dec_and_test does). As long as the object
> > is in public view someone else can change the counts and any
> > "atomic return" of that would be just lying. You can of course
> > always use atomic_read(), but it's not really atomic. I doubt the
> > microsoft equivalent is atomic neither, they are probably equivalent
> > to atomic_inc(); atomic_read(); or atomic_dec(); atomic_read() and
> > some hand weaving.
>
> Apparently the Microsoft one really is in Windows 98 and up (not in 95).
> I've had it explained that the asm code (semi-pseudo code here) is like this:
>
> movl reg, #-1
> lock xadd reg, counter
> decl reg
> movl result, reg
>
> This is in comparison to the current code which does something like this:
>
> lock decl counter
> sete result
>
> I don't see how the first asm code lies to you. It is returning the value as
> it was decremented to and the lock on the xadd keeps it safe.

Just it might change immediately afterwards if you don't remove the
object from public view first. An "atomic" value that you cannot depend
on is not very useful. The Linux convention of using atomic_dec_and_test()
is also only safe when you remove it first, but the dec_and_test encourages
this at least. atomic_inc_and_read() could only be safe when you remove
the object first too, but I don't see how it could be ever useful assuming
you keep the convention that reference count == 0 means freeable object.

>
> > BTW regarding atomic.h: while it is nicely usable on i386 in userspace
> > it isn't completely portable. Some architectures require helper functions
> > that are hard to implement in user space.
>
> Its too bad Linux didn't have a nice wrapper around atom inc/dec/test that
> was completely portable. Do you know what arch's have trouble implementing
> this?

sparc32 at least.
I think pa-risc32 too.

-Andi

2002-03-13 07:51:54

by David Schwartz

[permalink] [raw]
Subject: Re: Multi-threading


>Just it might change immediately afterwards if you don't remove the
>object from public view first.

If it was in public view, whatever held it in public view would be using it,
and hence its use count could not drop to zero.

DS


2002-03-13 08:23:36

by Andi Kleen

[permalink] [raw]
Subject: Re: Multi-threading

On Tue, Mar 12, 2002 at 11:51:29PM -0800, David Schwartz wrote:
>
> >Just it might change immediately afterwards if you don't remove the
> >object from public view first.
>
> If it was in public view, whatever held it in public view would be using it,
> and hence its use count could not drop to zero.

That's not correct at least in the usual linux kernel pattern of using
reference counts for objects. Hash tables don't hold reference counts,
only users do. If you think about it a hash table or global list holding
a reference count doesn't make too much sense.

-Andi

2002-03-13 09:02:32

by David Schwartz

[permalink] [raw]
Subject: Re: Multi-threading


On Wed, 13 Mar 2002 09:23:06 +0100, Andi Kleen wrote:


>>If it was in public view, whatever held it in public view would be
>> using it,
>>and hence its use count could not drop to zero.

>That's not correct at least in the usual linux kernel pattern of using
>reference counts for objects. Hash tables don't hold reference counts,
>only users do. If you think about it a hash table or global list holding
>a reference count doesn't make too much sense.

That's the way I've always done it and it has saved me a lot of heartache. A
use count of 'zero' means that it's really not used at all, and hence nothing
would have any way of finding it. Anything with a future interest in
something should 'use' it.

In any event, hash tables require locks themselves anyway. So if you find an
object in a hash table, you must be holding some lock when you find it, so
you can increment the use count under the protection of that lock. The trick
becomes the decrement operation, because ideally you'd prefer not to have to
lock the hash table again unless you have to remove the object.

I believe, however, that you are completely safe if you decrement the use
count atomically, and if it's zero, you grab the hash lock, confirm that the
use count is still zero, and then remove the object.

Since the use count is always locked for the first time in any usage chain
with the hash lock held (lock it when you find it), an increment from zero to
one can only occur while the lock is held. So if you hold the lock, an
increment from zero to one cannot occur. No race.

DS


2002-03-14 04:48:41

by Rusty Russell

[permalink] [raw]
Subject: Re: Multi-threading

On Wed, 13 Mar 2002 09:23:06 +0100
Andi Kleen <[email protected]> wrote:

> On Tue, Mar 12, 2002 at 11:51:29PM -0800, David Schwartz wrote:
> >
> > >Just it might change immediately afterwards if you don't remove the
> > >object from public view first.
> >
> > If it was in public view, whatever held it in public view would be
> > using it, and hence its use count could not drop to zero.
>
> That's not correct at least in the usual linux kernel pattern of using
> reference counts for objects. Hash tables don't hold reference counts,
> only users do. If you think about it a hash table or global list holding
> a reference count doesn't make too much sense.

Depends where you are talking. In the conntrack code (and I thought the
rest of the networking code), 0 means "free me now, NOONE has a pointer",
ie. the hash table holds 1.

dcache holds zero-count entries because their semantic requirements are
different, hence the "atomic_dec_and_lock()" stuff.

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.