2007-01-29 14:53:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Hello.

I'm pleased to announce initial userspace M-on-N threading model
implementation (for hackers) called NTL.
This is first alpha release, which indeed has bugs and limitations.

Userspace M-on-N threading model is based on the idea, that when signal
is delivered, kernel saves all information related to previous context
in stack, so it is possible to find it and replace.

M-on-N threading model compared to usual NPTL 1-on-1 model has following
advantages and disadvantages:

Benefits.

1. Fast scheduling.
There is no need to cross userspace/kernelspace boundary to schedule new
thread execution (just watch what happens with userspace network stack
compared to kernel's one when there are a lot of syscalls performed for
small packets receiving/sending).

2. Fast thread creation and destruction.
It just becomes an allocation of the structure in the userspace, no need
for full creation process which is performed in clone() syscall.

3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache
locality is increased greatly with reduced number of misses.

Drawbacks.

1. Scheduling fairness.
Since kernel does not know about multiple threads behind given process,
it can not add it appropriate number of timeslices for execution.
Can be solved either by more tight collaboarion of the userspace and
kernelspace schedulers or simply by increasing process' nice value.

2. All communications are performed through one kevent pipe. (TODO)
All syscalls are going to be converted into non-blocking operations
(including nanosleep() and the like), and keep a track of what each
context performed. In practice glibc rewrite is not what I would like to
do, but instead some layer on top of it will be implemented, which will
convert syscalls into kevent operations, and become a rescheduling
point.

3. Complex code for good SMP scalability and userspace scheduler.
Not a problem. (TESTING)

SMP scalability in M-on-N threading model.

Since only kernel can schedule thread (actually not even thread or
process, but its own kernel's representation, so called kernel's virtual
process) to run on specified CPU, M-on-N threading model should have
several real threads (for example several current POSIX threads), its
number should be equal to number of real CPUs, and then library layer
will schedule execution of context of different real threads, each of
which in turn can run on separate CPU.

So, userspace will create new real threads when pthread_create() is
called until number of them is less than number of real CPUs, each real
thread in turn is a context in the global set of contexts, where fake
context will be added with all subsequent pthread_create() calls, and
userspace scheduler (backed by real threads) will pick up several
contexts from the tree and execute them on the real CPUs.

I would be possible to use existing Linux clone() syscall, but due to
complete absence of hte documentation (which is sometimes plain wrong)
and ery strong encryption of glibc sources it is quite complex task.

As NPTL, M-on-N threading library uses stack rlimit for thread stack
allocation.

Benchmarks.

I only ran simple benchmark of empty thread creation (its function just
exits).
After I started to use atomic locks ("lock" prefix on x86) instead of
semaphores, thread start/empty exec/stop was reduced down to 0.3
microseconds compared to 14 microsecods for POSIX NPTL case.

But there are problems.
First one is that I perform initial context setup through signal
invokation, which is at least two syscalls. They are slow.
Another one is that thread is really started only after rescheduling,
which is another signal, so another two syscalls.
Third on is that there must exist different locking primitives - for
signal context and for process context, which must block signals, which
in turn adds additional overhead of sigprocmask() syscall.

After I fixed all above issues (actually not fixed, but confirmed that
they must exist), performance reduced to 9 microseconds compared to 14
microsecods for POSIX NPTL case for empty thread creation/destruction.

(Test machine is Core Duo 2.4 Ghz (run at 3.7) with 2 GB of ram).

This can be fixed, if I would have created arch-specific
getcontext()-like calls, which would be mutually transformable into
signal context information (existing getcontext() and friends produces
different data than signal context has at least on x86). But I can not
right now, since I do not know enough x86 ABI (I learned a lot for past
several days, as you can notice from this blog, but it is still even
remotely not enough).

Currently M-on-N threading model uses ugly arch-specific hacks to start
new threads, which actually are something remotely similar to
makecontext().
So, the solution, which will rock M-on-N threading implementation is to
convert or create getcontext() and friends calls which can be used with
signal context information.

Another limitations are:
* x86 only (I do not have different test boxes to learn different asm)
* does not work if compiled with position-independent code support
* does not work if some functions are inlined (so -fno-inline flag)
* no support for run-time syscall substitution (to make it rescheduling
point) yet
* looks like a real hack

Advantages:
* it is faster (noticebly faster)
* it is simpler
* code is not encrypted like in glibc sources

Sources and developement comments can be downloaded from NTL homepage at
http://tservice.net.ru/~s0mbre/old/?section=projects&item=threading

Archive is also attached for interested reader.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>

--
Evgeniy Polyakov


Attachments:
(No filename) (5.53 kB)
threading-2007_01_29.tar.gz (9.23 kB)
Download all attachments

2007-01-29 14:57:45

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

P.S. I'm not subscribed to any of the above lists, please Cc: me in
replies.


--
Evgeniy Polyakov

2007-01-29 16:40:54

by Chris Friesen

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov wrote:
> Hello.
>
> I'm pleased to announce initial userspace M-on-N threading model
> implementation (for hackers) called NTL.

If you haven't already, I suggest you look into the story of NGPT and
also read the NPTL white paper
(http://people.redhat.com/drepper/nptl-design.pdf) especially section
5.1 describing why they went with a 1:1 model.

Chris

2007-01-30 01:18:24

by Samuel Thibault

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Hi,

Evgenity, le Mon 29 Jan 2007 16:47:36 +0100, a ?crit :
> Userspace M-on-N threading model is based on the idea, that when signal
> is delivered, kernel saves all information related to previous context
> in stack, so it is possible to find it and replace.

You may want to have a look at some existing implementations:

- Good old `FSU Pthreads' http://moss.csc.ncsu.edu/~mueller/pthreads/
- fully POSIX-compliant `GnuPth' http://www.gnu.org/software/pth/
- server-targetted `Capriccio'
http://www.cs.berkeley.edu/~jcondit/capriccio-sosp-2003.pdf
- efficient `ELiTE/Erlangen'
http://www4.informatik.uni-erlangen.de/Projects/FORTWIHR/ELiTE/
- and our portable, flexible, efficient `Marcel'
http://runtime.futurs.inria.fr/marcel/

Samuel

2007-01-30 09:47:03

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Mon, Jan 29, 2007 at 10:40:42AM -0600, Chris Friesen ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Hello.
> >
> >I'm pleased to announce initial userspace M-on-N threading model
> >implementation (for hackers) called NTL.
>
> If you haven't already, I suggest you look into the story of NGPT and
> also read the NPTL white paper
> (http://people.redhat.com/drepper/nptl-design.pdf) especially section
> 5.1 describing why they went with a 1:1 model.

Of course I read this, but it does not change anything.

NGPT had 2 times slower start time than NPTL, and NTL has 2-20 times
faster than NPTL, so I think NGPT had too major problems to get it
into comparison.

I described in details why and how M:N model better, and its drawbacks
include all issues mentioned by Ulrich Drepper, but nevertheless its
advantages are far too superiour than those which can be provided by 1:1
model.

> Chris

--
Evgeniy Polyakov

2007-01-30 09:57:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Tue, Jan 30, 2007 at 02:18:17AM +0100, Samuel Thibault ([email protected]) wrote:
> Hi,
>
> Evgenity, le Mon 29 Jan 2007 16:47:36 +0100, a écrit :
> > Userspace M-on-N threading model is based on the idea, that when signal
> > is delivered, kernel saves all information related to previous context
> > in stack, so it is possible to find it and replace.
>
> You may want to have a look at some existing implementations:

I saw most of them.
As far as I recall, only PTL (is not shown here) has preemptible
scheduler. NTL has it too, but is based on different approach.

> - Good old `FSU Pthreads' http://moss.csc.ncsu.edu/~mueller/pthreads/
> - fully POSIX-compliant `GnuPth' http://www.gnu.org/software/pth/
> - server-targetted `Capriccio'
> http://www.cs.berkeley.edu/~jcondit/capriccio-sosp-2003.pdf
> - efficient `ELiTE/Erlangen'
> http://www4.informatik.uni-erlangen.de/Projects/FORTWIHR/ELiTE/
> - and our portable, flexible, efficient `Marcel'
> http://runtime.futurs.inria.fr/marcel/
>
> Samuel

--
Evgeniy Polyakov

2007-01-30 10:43:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Tue, Jan 30, 2007 at 11:24:51AM +0100, Samuel Thibault ([email protected]) wrote:
> Evgeniy Polyakov, le Tue 30 Jan 2007 12:53:16 +0300, a écrit :
> > > You may want to have a look at some existing implementations:
> >
> > I saw most of them.
> > As far as I recall, only PTL (is not shown here) has preemptible
> > scheduler. NTL has it too, but is based on different approach.
>
> Marcel has a preemptible scheduler too, based on signals and
> setjmp/longjmp.

Do some documentation and benchmarks exist for that library - site seems
to only described environment is was created for? How does blocking
problem solved?

> Samuel

--
Evgeniy Polyakov

2007-01-30 10:52:16

by Samuel Thibault

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov, le Tue 30 Jan 2007 12:53:16 +0300, a ?crit :
> > You may want to have a look at some existing implementations:
>
> I saw most of them.
> As far as I recall, only PTL (is not shown here) has preemptible
> scheduler. NTL has it too, but is based on different approach.

Marcel has a preemptible scheduler too, based on signals and
setjmp/longjmp.

Samuel

2007-01-30 21:30:33

by Kaz Kylheku

[permalink] [raw]
Subject: RE: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Evgeniy Polyakov wrote:
> I described in details why and how M:N model better, and its drawbacks
> include all issues mentioned by Ulrich Drepper, but nevertheless its
> advantages are far too superiour than those which can be
> provided by 1:1
> model.

M:N threading is an unnecessary performance hack that's needed by people
who are living in a C or C++ exile away from some language that has
lexical closures, generators or first-class continuations. Not having
these niceties, they resort to emulating them with threads. The proper
thing to do is to rewrite the code to use state machines which can be
driven by any available thread. Or else, write yourself a
source-to-source transformer that will give C the lexical closure,
generator, or continuation features that you need to express the
solution that way.

There is no need to retain any vestiges of a user space threading
implementation when you have the real thing.

Programs which appear to benefit from that model are badly optimized or
badly designed. A smartly written program uses an available thread to do
as much work as possible, until that thread happens to block or its time
slice burns up.

2007-01-31 08:48:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Tue, Jan 30, 2007 at 01:16:22PM -0800, Kaz Kylheku ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > I described in details why and how M:N model better, and its drawbacks
> > include all issues mentioned by Ulrich Drepper, but nevertheless its
> > advantages are far too superiour than those which can be
> > provided by 1:1
> > model.
>
> M:N threading is an unnecessary performance hack that's needed by people
> who are living in a C or C++ exile away from some language that has
> lexical closures, generators or first-class continuations. Not having
> these niceties, they resort to emulating them with threads. The proper
> thing to do is to rewrite the code to use state machines which can be
> driven by any available thread. Or else, write yourself a
> source-to-source transformer that will give C the lexical closure,
> generator, or continuation features that you need to express the
> solution that way.
>
> There is no need to retain any vestiges of a user space threading
> implementation when you have the real thing.
>
> Programs which appear to benefit from that model are badly optimized or
> badly designed. A smartly written program uses an available thread to do
> as much work as possible, until that thread happens to block or its time
> slice burns up.

Do not mix languages like Erlang, specialy designed for concurrent
programming, with M:N threading model - they are completely different,
but you do not want to see this. As you pointed, one thread can do as
much as it need until it is blocked, and what next? Allocate new real
thread? You may want to see how things like JVM work, I seriously doubt
spwning new thread each time task blocks is a way to go. Even having
epoll does not help in many cases. And you forgot the price of
rescheduling in kernelspace and userspace - even with signals it differs
two times, with more intellegent case it differs in 20 times!
Virtual machine can have thousands of threads, actually it cant, since
it will kill Linux in rescheduling.

--
Evgeniy Polyakov

2007-02-01 04:28:10

by Lee Revell

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On 1/29/07, Evgeniy Polyakov <[email protected]> wrote:
> 1. Scheduling fairness.
> Since kernel does not know about multiple threads behind given process,
> it can not add it appropriate number of timeslices for execution.
> Can be solved either by more tight collaboarion of the userspace and
> kernelspace schedulers or simply by increasing process' nice value.

nice value is only meaningful for SCHED_OTHER. How will you handle a
multithreaded realtime application that uses SCHED_OTHER as well as
SCHED_FIFO threads?

Lee

2007-02-01 07:10:09

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Wed, Jan 31, 2007 at 11:28:07PM -0500, Lee Revell ([email protected]) wrote:
> On 1/29/07, Evgeniy Polyakov <[email protected]> wrote:
> >1. Scheduling fairness.
> >Since kernel does not know about multiple threads behind given process,
> >it can not add it appropriate number of timeslices for execution.
> >Can be solved either by more tight collaboarion of the userspace and
> >kernelspace schedulers or simply by increasing process' nice value.
>
> nice value is only meaningful for SCHED_OTHER. How will you handle a
> multithreaded realtime application that uses SCHED_OTHER as well as
> SCHED_FIFO threads?

Threads created inside one process obviously can not compete with RT
threads created by other process. Instead process, which threads have RT
priority itself should change its priority to RT to compete with other
RT process. (By RT I mean any cases except SCHED_OTHER which is
default).

> Lee

--
Evgeniy Polyakov

2007-02-02 16:12:45

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Kaz Kylheku wrote:
> Evgeniy Polyakov wrote:
>> I described in details why and how M:N model better, and its drawbacks
>> include all issues mentioned by Ulrich Drepper, but nevertheless its
>> advantages are far too superiour than those which can be
>> provided by 1:1
>> model.
>
> M:N threading is an unnecessary performance hack that's needed by people
> who are living in a C or C++ exile away from some language that has
> lexical closures, generators or first-class continuations.

Yes, that's called the "real world." Arguments of the "I don't need it,
in a perfect world you wouldn't either, therefore it's a bad idea" type
simply contribute nothing.

Because user threading can avoid context switches, there will always be
cases where it will outperform o/s threads for hardware reasons.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-02-03 15:26:19

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.


> Because user threading can avoid context switches, there will always be
> cases where it will outperform o/s threads for hardware reasons.

actually.. switching from one "real" thread to another in Linux is not
an actual context switch in the hardware sense... at least this part of
your argument seems to be incorrect ;)

>
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-04 20:12:23

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Arjan van de Ven wrote:
>> Because user threading can avoid context switches, there will always be
>> cases where it will outperform o/s threads for hardware reasons.
>
> actually.. switching from one "real" thread to another in Linux is not
> an actual context switch in the hardware sense... at least this part of
> your argument seems to be incorrect ;)
>
How does that work? Switching between kernel threads requires going into
the kernel, user level thread switches are all done in user mode.

Do you have some way to change o/s threads w/o going into the kernel?

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-02-04 20:21:01

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Sun, Feb 04, 2007 at 03:12:32PM -0500, Bill Davidsen wrote:
> Arjan van de Ven wrote:
> >>Because user threading can avoid context switches, there will always be
> >>cases where it will outperform o/s threads for hardware reasons.
> >
> >actually.. switching from one "real" thread to another in Linux is not
> >an actual context switch in the hardware sense... at least this part of
> >your argument seems to be incorrect ;)
> >
> How does that work? Switching between kernel threads requires going into
> the kernel, user level thread switches are all done in user mode.
>
> Do you have some way to change o/s threads w/o going into the kernel?

But going into kernel is not very expensive on Linux.

On the other side, the overhead you need to add for every single syscall
that might block for the M:N threads and the associated complications
which make it far harder to conform to POSIX IMHO far outweight the costs
of going into the kernel for a context switch.

Jakub

2007-02-04 21:42:34

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Jakub Jelinek wrote:
> On Sun, Feb 04, 2007 at 03:12:32PM -0500, Bill Davidsen wrote:
>
>> Arjan van de Ven wrote:
>>
>>>> Because user threading can avoid context switches, there will always be
>>>> cases where it will outperform o/s threads for hardware reasons.
>>>>
>>> actually.. switching from one "real" thread to another in Linux is not
>>> an actual context switch in the hardware sense... at least this part of
>>> your argument seems to be incorrect ;)
>>>
>>>
>> How does that work? Switching between kernel threads requires going into
>> the kernel, user level thread switches are all done in user mode.
>>
>> Do you have some way to change o/s threads w/o going into the kernel?
>>
>
> But going into kernel is not very expensive on Linux.
>
> On the other side, the overhead you need to add for every single syscall
> that might block for the M:N threads and the associated complications
> which make it far harder to conform to POSIX IMHO far outweight the costs
> of going into the kernel for a context switch.

That really wasn't my question, Arjan said that switching real threads
wasn't a context switch in the hardware sense, and I was asking if I
missed something. It may be cheap, but it would seem to be a context
switch none-the-less.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-02-04 22:52:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

On Sun, 4 Feb 2007, Jakub Jelinek wrote:

> On Sun, Feb 04, 2007 at 03:12:32PM -0500, Bill Davidsen wrote:
> > Arjan van de Ven wrote:
> > >>Because user threading can avoid context switches, there will always be
> > >>cases where it will outperform o/s threads for hardware reasons.
> > >
> > >actually.. switching from one "real" thread to another in Linux is not
> > >an actual context switch in the hardware sense... at least this part of
> > >your argument seems to be incorrect ;)
> > >
> > How does that work? Switching between kernel threads requires going into
> > the kernel, user level thread switches are all done in user mode.
> >
> > Do you have some way to change o/s threads w/o going into the kernel?
>
> But going into kernel is not very expensive on Linux.
>
> On the other side, the overhead you need to add for every single syscall
> that might block for the M:N threads and the associated complications
> which make it far harder to conform to POSIX IMHO far outweight the costs
> of going into the kernel for a context switch.

Agreed, definitely. A libpcl (using swapcontext(3)) cobench is about 50
times faster than an context switch measured by lmbench (although I have
serious doubts about about the ability of lat_ctx to measure it - but
that's another story) on an Opteron 254. One may say "Wow! Really?!?".
The point is, who cares. We are talking about differences between
super-fast (~2us) and ultra-fast (~0.04us).
The time (and code) that you'll have to drop in the syscall path to handle
M:N is very likely to make you lose way more of what you gain by avoiding
an OS context switch (a "soft" context switch you still have to do it).
Either use N:N (requires locking, but spread over multiple CPUs) or 1:N
(I/O driven state machines or coroutines - no locking, once-CPU bound).



- Davide


2007-02-05 06:14:23

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.


> > On the other side, the overhead you need to add for every single syscall
> > that might block for the M:N threads and the associated complications
> > which make it far harder to conform to POSIX IMHO far outweight the costs
> > of going into the kernel for a context switch.
>
> That really wasn't my question, Arjan said that switching real threads
> wasn't a context switch in the hardware sense, and I was asking if I
> missed something.

a hardware context switch is basically a CR3 change with associated tlb
flush. That is the part that is the most expensive of a context switch.
Just going into the kernel and getting out with a different EIP/ESP is
really cheap, in the order of "a few hundred cycles"; not a heck of a
lot more expensive than a simple getpid or other simple system call.


> It may be cheap, but it would seem to be a context
> switch none-the-less.

it includes a privilege level switch, not so much a full context
switch...

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-14 12:20:07

by Pavel Machek

[permalink] [raw]
Subject: Re: [ANN] Userspace M-on-N threading model implementation. Alpha release.

Hi!

> >>How does that work? Switching between kernel threads requires going into
> >>the kernel, user level thread switches are all done in user mode.
> >>
> >>Do you have some way to change o/s threads w/o going into the kernel?
> >>
> >
> >But going into kernel is not very expensive on Linux.
> >
> >On the other side, the overhead you need to add for every single syscall
> >that might block for the M:N threads and the associated complications
> >which make it far harder to conform to POSIX IMHO far outweight the costs
> >of going into the kernel for a context switch.
>
> That really wasn't my question, Arjan said that switching real threads
> wasn't a context switch in the hardware sense, and I was asking if I
> missed something. It may be cheap, but it would seem to be a context
> switch none-the-less.

It is not reloading %cr3.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html