2003-11-11 23:52:45

by kirk bae

[permalink] [raw]
Subject: So, Poll is not scalable... what to do?

If poll is not scalable, which method should I use when writing
multithreaded socket server?

What is the most efficient model to use?

Is there a "standard" model to use when writing a scalable multithreaded
socket serve such as "io completion ports" on windows?

>From the "Microbenchmark comparing poll, kqueue, and /dev/poll", kqueue is
the way to go. Am I correct?

What is the best solution to use on Linux?

Also, why is it that poll doesn not return with "close signal" when a
thread-1 calls poll and thread-2 calls close on a sockfd1? It seems that
poll only handles close signal when a client disconnects from the server.
I've seen this mentioned here before, has it been fixed?

Thank you~~~

_________________________________________________________________
>From Beethoven to the Rolling Stones, your favorite music is always playing
on MSN Radio Plus. No ads, no talk. Trial month FREE!
http://join.msn.com/?page=offers/premiumradio


2003-11-12 03:52:41

by Jeff Garzik

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

kirk bae wrote:
> If poll is not scalable, which method should I use when writing
> multithreaded socket server?
>
> What is the most efficient model to use?


epoll, thread pools, AIO...

2003-11-12 05:32:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

On Tue, Nov 11, 2003 at 05:52:42PM -0600, kirk bae wrote:
> If poll is not scalable, which method should I use when writing
> multithreaded socket server?

Honnestly, if you're using threads (I mean lots of threads, such as one
per connection), I don't think that poll performance will be your worst
ennemy. The first thing to do is to handle the task switching yourself
either with a publicly available coroutine library or with one of your own.

Take a look here for more a comparison of several available methods :

http://www.kegel.com/c10k.html

epoll is compared to other methods with numbers here :

http://www.xmailserver.org/linux-patches/nio-improve.html

Cheers,
Willy

2003-11-12 23:33:44

by Bill Davidsen

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

In article <[email protected]>,
kirk bae <[email protected]> wrote:
| If poll is not scalable, which method should I use when writing
| multithreaded socket server?
|
| What is the most efficient model to use?
|
| Is there a "standard" model to use when writing a scalable multithreaded
| socket serve such as "io completion ports" on windows?
|

In many cases people just run a thread per socket and use blocking i/o.
Then the thread either does the work required or make an entry on a
"work to do" queue.

You've had other suggestions, this is just for completeness.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-11-12 23:37:24

by Bill Davidsen

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

In article <[email protected]>,
Willy Tarreau <[email protected]> wrote:
| On Tue, Nov 11, 2003 at 05:52:42PM -0600, kirk bae wrote:
| > If poll is not scalable, which method should I use when writing
| > multithreaded socket server?
|
| Honnestly, if you're using threads (I mean lots of threads, such as one
| per connection), I don't think that poll performance will be your worst
| ennemy. The first thing to do is to handle the task switching yourself
| either with a publicly available coroutine library or with one of your own.

It's not clear that with 2.6 this is necessary or desirable. I'll let
someone who worked on the new thread and/or futex development say more
if they will, but I'm reasonable convinced that in most cases the kernel
will do it better.
|
| Take a look here for more a comparison of several available methods :
|
| http://www.kegel.com/c10k.html
|
| epoll is compared to other methods with numbers here :
|
| http://www.xmailserver.org/linux-patches/nio-improve.html
|
| Cheers,
| Willy
|
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to [email protected]
| More majordomo info at http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at http://www.tux.org/lkml/
|


--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-11-13 00:33:46

by Davide Libenzi

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

On 12 Nov 2003, bill davidsen wrote:

> In article <[email protected]>,
> Willy Tarreau <[email protected]> wrote:
> | On Tue, Nov 11, 2003 at 05:52:42PM -0600, kirk bae wrote:
> | > If poll is not scalable, which method should I use when writing
> | > multithreaded socket server?
> |
> | Honnestly, if you're using threads (I mean lots of threads, such as one
> | per connection), I don't think that poll performance will be your worst
> | ennemy. The first thing to do is to handle the task switching yourself
> | either with a publicly available coroutine library or with one of your own.
>
> It's not clear that with 2.6 this is necessary or desirable. I'll let
> someone who worked on the new thread and/or futex development say more
> if they will, but I'm reasonable convinced that in most cases the kernel
> will do it better.

Pros & Cons:

*) Coroutines cost is basically its stack (8-16Kb). Threads there's a
little bit more under the hood

*) No locks at all with coroutines

*) Coroutine context switch time was about 20 times faster last time I
tried. I used this:

http://www.xmailserver.org/libpcl.html

against O(1)

*) Coroutines require a more careful coding then threads, expecially
stackwise

*) Achieving SMP scalability/balancing with coroutines is not trivial
while with threads it comes along (well, almost)


Coroutines are not the only alternative though. I/O driven state machines
can make you save some stack space at expenses of less trivial coding.



- Davide





2003-11-13 00:54:54

by Nick Piggin

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?



Davide Libenzi wrote:

>On 12 Nov 2003, bill davidsen wrote:
>
>
>>In article <[email protected]>,
>>Willy Tarreau <[email protected]> wrote:
>>| On Tue, Nov 11, 2003 at 05:52:42PM -0600, kirk bae wrote:
>>| > If poll is not scalable, which method should I use when writing
>>| > multithreaded socket server?
>>|
>>| Honnestly, if you're using threads (I mean lots of threads, such as one
>>| per connection), I don't think that poll performance will be your worst
>>| ennemy. The first thing to do is to handle the task switching yourself
>>| either with a publicly available coroutine library or with one of your own.
>>
>>It's not clear that with 2.6 this is necessary or desirable. I'll let
>>someone who worked on the new thread and/or futex development say more
>>if they will, but I'm reasonable convinced that in most cases the kernel
>>will do it better.
>>
>
>Pros & Cons:
>
>*) Coroutines cost is basically its stack (8-16Kb). Threads there's a
>little bit more under the hood
>
>*) No locks at all with coroutines
>
>*) Coroutine context switch time was about 20 times faster last time I
>tried. I used this:
>

According to lmbench, 2.4 does a context switch in 0.67us (on one type
of uniprocessor cpu - the machines at osdl). 2.6.0-test9 manages 1.17us
(174% longer). I have some patches that brings this to 0.80us (119%).
This is with 2 active tasks, so O(1) doesn't get to show its advantage.

If you remove the realtime priority array (disable RT scheduling support)
you could probably drop this figure below 2.4's. I should get some numbers

Perhaps if there is demand I could make RT scheduling a config option in
my patchset. I think Andrea does something fancy like that. I'll have to
take a look at his code.


That said, I'd be inclined to think an application that it context switch
bound is broken by design. Although maybe there are some special cases.


2.4: http://khack.osdl.org/stp/282982/results/summary_report
2.6: http://khack.osdl.org/stp/282354/results/summary_report
np: http://khack.osdl.org/stp/283054/results/summary_report


2003-11-13 01:06:17

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

In article <[email protected]> you wrote:
> In many cases people just run a thread per socket and use blocking i/o.
> Then the thread either does the work required or make an entry on a
> "work to do" queue.
>
> You've had other suggestions, this is just for completeness.

This is true for Windows and Java, but how many popular Linux Applications
do it that way? There are much more non-blocking (erlang, boa) or process
based (apache, ...)

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-11-13 07:53:09

by David Schwartz

[permalink] [raw]
Subject: RE: So, Poll is not scalable... what to do?


> If poll is not scalable, which method should I use when writing
> multithreaded socket server?

'poll' is quite scalable. It's only not scalable if you make stupid
assumptions about how people use it.

> What is the most efficient model to use?
>
> Is there a "standard" model to use when writing a scalable multithreaded
> socket serve such as "io completion ports" on windows?

If all your sockets are equally active, having a bunch of threads each
'poll'ing 1,024 descriptors is perfectly scalable. 64,000 is not a problem,
and that's about the kernel limit if memory serves me.

If you have active and inactive sockets, sort them. Have some threads that
'poll' on large numbers of inactive sockets and some that 'poll' on small
numbers of active sockets. This will scale quite well as the expensive
'poll' calls are infrequent compared to the cheap ones.

> From the "Microbenchmark comparing poll, kqueue, and /dev/poll",
> kqueue is
> the way to go. Am I correct?

Yes, if you're using a kernel that supports it. It will pretty much always
be more efficient than 'poll'.

> Also, why is it that poll doesn not return with "close signal" when a
> thread-1 calls poll and thread-2 calls close on a sockfd1? It seems that
> poll only handles close signal when a client disconnects from the server.
> I've seen this mentioned here before, has it been fixed?

You can never ever make this work. There will always be a race condition no
matter what you do. (How can you ever be sure thread-2 doesn't all 'close'
jsut before thread-1 calls 'poll'?) So do not ever do this. Do not ever
release a shared resource while a thread is or might be using it because it
might also be about to use it!

DS


2003-11-13 12:13:24

by Bill Davidsen

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

On Thu, 13 Nov 2003, Nick Piggin wrote:

>
>
> Davide Libenzi wrote:
>
> >On 12 Nov 2003, bill davidsen wrote:
> >
> >
> >>In article <[email protected]>,
> >>Willy Tarreau <[email protected]> wrote:
> >>| On Tue, Nov 11, 2003 at 05:52:42PM -0600, kirk bae wrote:
> >>| > If poll is not scalable, which method should I use when writing
> >>| > multithreaded socket server?
> >>|
> >>| Honnestly, if you're using threads (I mean lots of threads, such as one
> >>| per connection), I don't think that poll performance will be your worst
> >>| ennemy. The first thing to do is to handle the task switching yourself
> >>| either with a publicly available coroutine library or with one of your own.
> >>
> >>It's not clear that with 2.6 this is necessary or desirable. I'll let
> >>someone who worked on the new thread and/or futex development say more
> >>if they will, but I'm reasonable convinced that in most cases the kernel
> >>will do it better.
> >>
> >
> >Pros & Cons:
> >
> >*) Coroutines cost is basically its stack (8-16Kb). Threads there's a
> >little bit more under the hood
> >
> >*) No locks at all with coroutines
> >
> >*) Coroutine context switch time was about 20 times faster last time I
> >tried. I used this:
> >
>
> According to lmbench, 2.4 does a context switch in 0.67us (on one type
> of uniprocessor cpu - the machines at osdl). 2.6.0-test9 manages 1.17us
> (174% longer). I have some patches that brings this to 0.80us (119%).
> This is with 2 active tasks, so O(1) doesn't get to show its advantage.

It would seem that in the case of many sockets that there is still a need
to determine when to run a coroutine. Using pthreads and NPTL lets the
kernel do that by unblocking the thread, handles SMP cases, etc. And is
source code portable.

On the other hand, I'm sure that with enough effort some solution-specific
code could do better, assuming that there's nothing else in the machine.
And depending on the base hardware and the computation involved, there
could be a big win with SMP of the HT flavor.

I think any problem like this is an "it depends" case, but the assumption
that polling doesn't scale is a good starting point.

>
> If you remove the realtime priority array (disable RT scheduling support)
> you could probably drop this figure below 2.4's. I should get some numbers
>
> Perhaps if there is demand I could make RT scheduling a config option in
> my patchset. I think Andrea does something fancy like that. I'll have to
> take a look at his code.
>
>
> That said, I'd be inclined to think an application that it context switch
> bound is broken by design. Although maybe there are some special cases.
>
>
> 2.4: http://khack.osdl.org/stp/282982/results/summary_report
> 2.6: http://khack.osdl.org/stp/282354/results/summary_report
> np: http://khack.osdl.org/stp/283054/results/summary_report
>
>

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-11-13 18:25:48

by Dan Kegel

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

Kirk Bae wrote:
> If poll is not scalable, which method should I use when writing
> multithreaded socket server?

People who write multithreaded servers tend to use thread pools
and never need to use poll(). (Well, ok, I've written multithreaded
servers that used poll(), but the threads were there for disk I/O,
not networking.)

> What is the most efficient model to use?
>
> Is there a "standard" model to use when writing a scalable multithreaded
> socket serve such as "io completion ports" on windows?

Depends on your definition of efficient.

If by 'efficient' you mean 'runs like a bat out of hell,
and I don't care how long it takes to write', and
you're willing to write all the code from scratch, and
you're handy with state machines, the right way to go is
an edge-triggered readiness notification method (sys_epoll or kqueue,
if available, else sigio).

A handy wrapper library that lets
you use these without worrying about exactly which method
your operating system supports is at http://kegel.com/rn/
It handles two flavors of epoll as well as sigio at the moment,
and as a proof of concept, I've modified betaftpd to use it;
the resulting ftp server scales very nicely (except for the
calls to the system password checking routine, which is a bottleneck).

On the other hand, if by 'efficient' you mean 'doesn't take
long for somebody used to doing things on Windows to write',
or if you have to use any third-party libraries that use blocking I/O,
or if you're not very handy with state machines,
go ahead and use a thread pool.

> From the "Microbenchmark comparing poll, kqueue, and /dev/poll", kqueue is
> the way to go. Am I correct?

kqueue is for BSD. sys_epoll is the equivalent on (recent enough) linux.

> What is the best solution to use on Linux?

Depends on your needs; see above.

> Also, why is it that poll doesn not return with "close signal" when a
> thread-1 calls poll and thread-2 calls close on a sockfd1? It seems that
> poll only handles close signal when a client disconnects from the server.
> I've seen this mentioned here before, has it been fixed?

The Single Unix Standard (http://www.opengroup.org/onlinepubs/007904975/functions/close.html)
doesn't say what should happen if you close a file descriptor
being used by another thread.
Linus long ago decided that Linux would handle this
without waking the other thread.
I know, other operating systems (and Java!) behave differently,
but Linux is perfectly within its rights to behave as it does here,
and it's not likely to change.

- Dan

p.s. And, yes, http://kegel.com/c10k.html might be a handy reference for you :-)



2003-11-13 23:10:56

by kirk bae

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

>Kirk Bae wrote:
>>If poll is not scalable, which method should I use when writing
>>multithreaded socket server?
>
>People who write multithreaded servers tend to use thread pools
>and never need to use poll(). (Well, ok, I've written multithreaded
>servers that used poll(), but the threads were there for disk I/O,
>not networking.)

By thread pool, do you mean, one thread per socket?, or a work-crew model
where a specified number of threads handle many sockets?


>>What is the most efficient model to use?
>>
>>Is there a "standard" model to use when writing a scalable multithreaded
>>socket serve such as "io completion ports" on windows?
>
>Depends on your definition of efficient.
>
>If by 'efficient' you mean 'runs like a bat out of hell,
>and I don't care how long it takes to write', and
>you're willing to write all the code from scratch, and
>you're handy with state machines, the right way to go is
>an edge-triggered readiness notification method (sys_epoll or kqueue,
>if available, else sigio).

My definition of "efficient" is a method that is most widely used or
accepted for writing a robust scalable server. So I guess, "'runs like a bat
out of hell, and I don't care how long it takes to write'" is close.

Would it be thread pool or epoll? Is it uncommon to mix these two?

Currently, I have a thread-1 calling poll, and dispatching its workload to
thread-2 and thread-3 in blocking sockets. I dispatch the workload to worker
threads because some of the operations take considerable time.

Is mixing threads with poll uncommon? Is this a bad design? any comments
would be appreciated.

_________________________________________________________________
Great deals on high-speed Internet access as low as $26.95.
https://broadband.msn.com (Prices may vary by service area.)

2003-11-14 00:27:48

by Dan Kegel

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

Kirk Bae wrote:
>>>If poll is not scalable, which method should I use when writing
>>>multithreaded socket server?
>>
>>People who write multithreaded servers tend to use thread pools
>>and never need to use poll(). (Well, ok, I've written multithreaded
>>servers that used poll(), but the threads were there for disk I/O,
>>not networking.)
>
> By thread pool, do you mean, one thread per socket?, or a work-crew model
> where a specified number of threads handle many sockets?

The latter. But I don't have much recent experience writing threaded
servers, as I usually use somthing like epoll.

> My definition of "efficient" is a method that is most widely used or
> accepted for writing a robust scalable server. So I guess, "'runs like a bat
> out of hell, and I don't care how long it takes to write'" is close.
>
> Would it be thread pool or epoll?

My personal bias is towards epoll (suitably wrapped for portability;
no need to require epoll when sigio is nearly as good, and universally deployed).

> Is it uncommon to mix these two?

Folks who know how to program with things like epoll and aio tend
to use threads carefully, and try to avoid using blocking I/O for networking.
Blocking I/O is unavoidable for things like opening files, though,
so it's perfectly fine to have a thread that handles stuff like that.

> Currently, I have a thread-1 calling poll, and dispatching its workload to
> thread-2 and thread-3 in blocking sockets. I dispatch the workload to worker
> threads because some of the operations take considerable time.
>
> Is mixing threads with poll uncommon? Is this a bad design? any comments
> would be appreciated.

What are the operations that take considerable time? Are they networking
calls, or disk I/O, or ...? If they're just networking calls,
you might consider turning your code inside out to do everything with
nonblocking I/O and state machines, but only if you're hitting a bottleneck
because of the number of threads active.

Whatever design you're comfortable with is fine until you fail to hit your
performance requirement. No point in optimizing one little part
of the system if the whole system is fast enough, or if the real
bottleneck is elsewhere...
- Dan


2003-11-14 00:53:48

by Mark Mielke

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?

On Thu, Nov 13, 2003 at 05:10:50PM -0600, kirk bae wrote:
> Is mixing threads with poll uncommon? Is this a bad design? any comments
> would be appreciated.

For a portable solution involving threads, I am fond of using threads to
prioritize the scheduling. For example, a high priority thread could poll()
or select() for (the expected few) high priority events, a medium for medium,
and a low for low. Periodic events, or events that may frequently occur,
may deserve their own blocking thread.

The model probably isn't the simplest to code, though... :-)

I'm most fond of this model as the medium or lower priority threads shouldn't
wake up as often.

Another variant on this that may further reduce the overhead of poll()
is to put events that are expected to occur in the high priority
thread, and events that won't occur for some time in the low priority
thread. This variant would cause problems in the case that priority is
important, and mis-classification of an event would have an
exaggerated cost.

I suppose it really depends on the target application, as to which model
would work best...

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2003-11-14 18:59:52

by Frederic Rossi

[permalink] [raw]
Subject: Re: So, Poll is not scalable... what to do?



Hi,

Dan Kegel wrote:
> Kirk Bae wrote:
> >>>If poll is not scalable, which method should I use when writing
> >>>multithreaded socket server?
> >>
> >>People who write multithreaded servers tend to use thread pools
> >>and never need to use poll(). (Well, ok, I've written multithreaded
> >>servers that used poll(), but the threads were there for disk I/O,
> >>not networking.)
> >
> > By thread pool, do you mean, one thread per socket?, or a work-crew
> model
> > where a specified number of threads handle many sockets?
>
> The latter. But I don't have much recent experience writing threaded
> servers, as I usually use somthing like epoll.
>
> > My definition of "efficient" is a method that is most widely used or
> > accepted for writing a robust scalable server. So I guess, "'runs like
> a bat
> > out of hell, and I don't care how long it takes to write'" is close.
> >
> > Would it be thread pool or epoll?
>
> My personal bias is towards epoll (suitably wrapped for portability;
> no need to require epoll when sigio is nearly as good, and universally
> deployed).
>
> > Is it uncommon to mix these two?
>
> Folks who know how to program with things like epoll and aio tend
> to use threads carefully, and try to avoid using blocking I/O for
> networking.
> Blocking I/O is unavoidable for things like opening files, though,
> so it's perfectly fine to have a thread that handles stuff like that.
>
> > Currently, I have a thread-1 calling poll, and dispatching its
> workload to
> > thread-2 and thread-3 in blocking sockets. I dispatch the workload to
> worker
> > threads because some of the operations take considerable time.
> >
> > Is mixing threads with poll uncommon? Is this a bad design? any
> comments
> > would be appreciated.
>
> What are the operations that take considerable time? Are they
> networking
> calls, or disk I/O, or ...? If they're just networking calls,
> you might consider turning your code inside out to do everything with
> nonblocking I/O and state machines, but only if you're hitting a
> bottleneck
> because of the number of threads active.
>
> Whatever design you're comfortable with is fine until you fail to hit
> your
> performance requirement. No point in optimizing one little part
> of the system if the whole system is fast enough, or if the real
> bottleneck is elsewhere...
> - Dan

I think you're right when talking about the design decision and
regarding the initial question (since it is not referenced on the
C10K page) I'd like to suggest
http://aem.sourceforge.net
which provides another possible approach for that kind of problem.

The question of I/O completion port was also raised and I think
it must be taken into account in the initial design as well.
In many situations people use threads to handle this asynchronously
AEM is targeting event handling but provides a native asynchronous
socket interface that brings data directly to the applications.
This is one way of doing it.

On the other hand, it might not be the perfect solution for
the problem above. See it as an alternative.

Frederic