2007-05-26 12:50:53

by Arunachalam

[permalink] [raw]
Subject: epoll,threading


Hello all,

I want to understand, in detail, what the event mechanisms (epoll, /dev/poll, or
select) achieve in contrast to a thread per client.

I can have a thread per client and use the send and recv system calls directly,
right? Why would I go for these event mechanisms?

Please help me to understand this.


Thanks in advance,
Arunachalam
--
View this message in context: http://www.nabble.com/epoll%2Cthreading-tf3820388.html#a10815949
Sent from the linux-kernel mailing list archive at Nabble.com.


2007-05-26 13:15:17

by Ingo Oeser

[permalink] [raw]
Subject: Re: epoll,threading

Hi Arunachalam,

On Saturday 26 May 2007, Arunachalam wrote:
> I want to know in detail about , what the events (epoll or /dev/poll or
> select ) achieve in contrast to thread per client.
>
> i can have a thread per client and use send and recv system call directly
> right? Why do i go for these event mechanisms?

Try 30,000 clients or more on an x86 32-bit box.
That will show you the difference quite nicely :-)

More seriously: thread per client only scales to a certain number of clients
per amount of RAM. If you want to scale beyond that, you need to minimize your
state per client. If you have a thread, then you have a task structure as
unswappable memory in the kernel, plus a per-thread stack, which eats into
your virtual memory per process (you have only around 3GB of virtual
memory per process on 32-bit x86 Linux).

So one uses a process or thread pool to scale beyond that.
The pool size is typically related to the number of CPU cores in the system.

Regards

Ingo Oeser

2007-05-26 23:02:51

by David Schwartz

[permalink] [raw]
Subject: RE: epoll,threading


> Hello all,
>
> I want to know in detail about , what the events (epoll or /dev/poll or
> select ) achieve in contrast to thread per client.
>
> i can have a thread per client and use send and recv system call directly
> right? Why do i go for these event mechanisms?
>
> Please help me to understand this.

Aside from the obvious, consider a server that needs to do a little bit of
work on each of 1,000 clients on a single-CPU system. With a
thread-per-client approach, 1,000 context switches will be needed. With an
epoll thread-pool approach, none are strictly needed, and five or six are typical.

Both get you the major advantages of threading. You can take full advantage
of multiple CPUs. You can write code that blocks occasionally without
bringing the whole server down. A page fault doesn't stall your whole
server.

DS


2007-05-30 06:38:52

by Tejun Heo

[permalink] [raw]
Subject: Re: epoll,threading

David Schwartz wrote:
>> I want to know in detail about , what the events (epoll or /dev/poll or
>> select ) achieve in contrast to thread per client.
>>
>> i can have a thread per client and use send and recv system call directly
>> right? Why do i go for these event mechanisms?
>>
>> Please help me to understand this.
>
> Aside from the obvious, consider a server that needs to do a little bit of
> work on each of 1,000 clients on a single CPU system. With a
> thread-per-client approach, 1,000 context switches will be needed. With an
> epoll thread pool approach, none are needed and five or six are typical.
>
> Both get you the major advantages of threading. You can take full advantage
> of multiple CPUs. You can write code that blocks occasionally without
> bringing the whole server down. A page fault doesn't stall your whole
> server.

It all depends on the workload, but thread switching overhead can be
negligible - after all, all it does is enter the kernel, schedule, swap
processor context, and return. Maybe using a separate stack has a minor
impact on the cache hit ratio. You need benchmark numbers to claim it one way
or the other.

In my experience with web caches, epoll or similar for idle clients and a
thread per active client scaled and performed pretty well - it needed
more memory, but the performance wasn't worse than an asynchronous design,
and writing a complex server in the async model is a lot of pain.

--
tejun

2007-05-30 07:25:45

by Willy Tarreau

[permalink] [raw]
Subject: Re: epoll,threading

On Wed, May 30, 2007 at 04:35:56AM +0900, Tejun Heo wrote:
> David Schwartz wrote:
> >> I want to know in detail about , what the events (epoll or /dev/poll or
> >> select ) achieve in contrast to thread per client.
> >>
> >> i can have a thread per client and use send and recv system call directly
> >> right? Why do i go for these event mechanisms?
> >>
> >> Please help me to understand this.
> >
> > Aside from the obvious, consider a server that needs to do a little bit of
> > work on each of 1,000 clients on a single CPU system. With a
> > thread-per-client approach, 1,000 context switches will be needed. With an
> > epoll thread pool approach, none are needed and five or six are typical.
> >
> > Both get you the major advantages of threading. You can take full advantage
> > of multiple CPUs. You can write code that blocks occasionally without
> > bringing the whole server down. A page fault doesn't stall your whole
> > server.
>
> It all depends on the workload but thread switching overhead can be
> negligible - after all all it does is entering kernel, schedule, swap
> processor context and return. Maybe using separate stack has minor
> impact on cache hit ratio. You need benchmark numbers to claim one way
> or the other.

In my experience, it's not so much the context switch itself that causes
performance degradation, but the fact that with threads you have to put
mutexes everywhere. And frankly, walking a list with locks everywhere
is quite a bit slower than doing it in one run at a rate of 3 or 4 cycles per
entry. Also, static local storage in functions is not possible anymore,
and some functions even need to malloc() instead of returning statically
allocated data. I believe this is the reason openssl is twice as
slow when compiled thread-safe than in native mode.

So in fact, converting a threaded program to a pure async model should
not improve it much, because of the initial architectural design. But a
program written from scratch to be purely async should perform better,
simply because it has fewer operations to perform. And there's no magic
here: fewer cycles spent synchronizing and locking = more cycles available
for the real job.

> In my experience with web caches, epoll or similar for idle clients and
> thread per active client scaled and performed pretty well - it needed
> more memory but the performance wasn't worse than asynchronous design
> and doing complex server in async model is a lot of pain.

It's true that the async model is a lot of pain. But it's always where I
got the best performance. For instance, with epoll(), I can achieve
20000 HTTP reqs/s with 40000 concurrent sessions. The best performance
I have observed from threaded competitors was an order of magnitude below
on either value (sometimes both).

However, I agree that few uses really justify spending the time writing and
debugging async programs.

Regards,
Willy

2007-05-30 07:51:05

by David Schwartz

[permalink] [raw]
Subject: RE: epoll,threading


I mostly agree with your comments, so I'm only responding to the points I
disagree with.

> So in fact, converting a threaded program to a pure async model should
> not improve it much because of the initial architectural design. But a
> program written from scratch to be purely async should perform better
> simply because it has less operations to perform. And there's no magics
> here : less cycles spend synchronizing and locking = more cycles available
> for the real job.

On the contrary, converting threaded programs to pure async is a disaster.
For one thing, you can never, ever block under any circumstances, on pain of
total disaster. This means every single line of code is performance
critical. For all but the most trivial applications, this alone is a deal
killer.

The other problem is that no matter how careful you are, there is still
going to be some unexpected blocking. For example, suppose one connection
triggers some kind of error condition and the code to handle that error has
never been faulted in. Now your entire application stalls until the
error-handling code can be faulted in, which can be a while if the disk is
slow/busy.

> It's true that an async model is a lot of pain. But it's always where I
> got the best performance. For instance, with epoll(), I can achieve
> 20000 HTTP reqs/s with 40000 concurrent sessions. The best performance
> I have observed from threaded competitors was an order of magnitude below
> on either value (sometimes both).

I can't see how adding a second thread to your program could possibly make
it all that much worse. And when you do block unexpectedly, there's that
second thread to keep on working. And, of course, you can take advantage of
another CPU.

If adding a few more threads to your design makes it much worse, there's
something wrong with it. Perhaps you are conflating "threaded competitors"
with "thread-per-client competitors".

I also must be misunderstanding something you're saying, or you are very
confused. At one point you say that threaded programs are an order of
magnitude below pure async. At another point you say converting a threaded
program to pure async should not improve it much. I don't see how both of
these can be true unless you mean something different by "threaded" and
"async" in the two statements.

Thread-per-client works reasonably well up to some number of clients. On
most UNIXes, that number is around 300. On Windows, it's around 800. I
personally would only recommend it for applications that plan to handle 100
clients or fewer, or on platforms where you know the threading library
works well this way.

Pure async is just too painful. If you only have a single thread, the
program becomes unmaintainable because you must never, ever block anywhere.
And you still can't avoid blocking sometimes, and that can cause a backlog
from which you never recover. (If you've ever wondered why your servers
were bursty, this is why.)

The best approach, in my experience, is to use multiple threads so that you
can take advantage of multiple CPUs, not be screwed if you block
unexpectedly, and even block intentionally in code that's not performance
critical. On Linux, 'epoll' is the most efficient socket discovery
technique, so you should use it in any design that's not thread-per-client.

DS


2007-05-30 08:07:48

by David Schwartz

[permalink] [raw]
Subject: RE: epoll,threading


> In my experience, it's not much the context switch by itself which causes
> performance degradation, but the fact that with threads, you have to put
> mutexes everywhere. And frankly, walking a list with locks everywhere
> is quite slower than doing it in one run at a rate of 3 or 4 cycles per
> entry. Also, local storage in function returns is not possible anymore,
> and some functions even need to malloc() instead of returning statically
> allocated data. I believe this is the reason for openssl being twice as
> slow when compiled thread-safe than in native mode.

I don't know where you got this idea from, but several experiments of mine
failed to confirm it. Here, for example, are two identical builds of
OpenSSL, one with 'threads' and one with 'no-threads'. Same compiler, same
other flags, same hardware (Linux 2.6.21.2, P3-1GHz):

OpenSSL 0.9.8d 28 Sep 2006
built on: Wed May 30 00:55:25 PDT 2007
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -mcpu=pentium -DL_ENDIAN -DTERMIO -O3 -fomit-frame-pointer -Wall -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DSHA1_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5               5295.13k    18837.99k    56614.66k   113037.65k   157575.85k
des cbc          21280.46k    22224.36k    22536.87k    22626.65k    22686.38k
                     sign     verify    sign/s  verify/s
rsa 1024 bits   0.007653s  0.000381s     130.7    2626.1

OpenSSL 0.9.8d 28 Sep 2006
built on: Wed May 30 00:55:23 PDT 2007
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -mcpu=pentium -DL_ENDIAN -DTERMIO -O3 -fomit-frame-pointer -Wall -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DSHA1_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5               5355.29k    18880.64k    56515.93k   113145.51k   157908.99k
des cbc          21361.71k    22185.28k    22494.72k    22654.29k    22659.07k
                     sign     verify    sign/s  verify/s
rsa 1024 bits   0.007612s  0.000380s     131.4    2629.5

DS



2007-05-30 10:13:18

by Tejun Heo

[permalink] [raw]
Subject: Re: epoll,threading

Hello,

Willy Tarreau wrote:
> In my experience, it's not much the context switch by itself which
> causes performance degradation, but the fact that with threads, you
> have to put mutexes everywhere. And frankly, walking a list with
> locks everywhere is quite slower than doing it in one run at a rate
> of 3 or 4 cycles per entry. Also, local storage in function returns
> is not possible anymore, and some functions even need to malloc()
> instead of returning statically allocated data. I believe this is the
> reason for openssl being twice as slow when compiled thread-safe than
> in native mode.
>
> So in fact, converting a threaded program to a pure async model
> should not improve it much because of the initial architectural
> design. But a program written from scratch to be purely async should
> perform better simply because it has less operations to perform. And
> there's no magics here : less cycles spend synchronizing and locking
> = more cycles available for the real job.

The thing is that the synchronization overhead is something you'll have
to pay anyway to support multiple processors. Actually, supporting
multiple processors in an async program is beyond painful. Either you
have to restrict all locking to busy locks, or introduce new state for
each possibly blocking synchronization point - and what happens if they
have to nest? You kind of end up with a stackable-state thingie - an
extremely restricted stack.

If you're really serious about performance and scalability, you just
have to support multiple processors, and if you do it right the
performance overhead shouldn't be too high. Common servers will soon
have 8 cores on two physical processors - paying some overhead for
synchronization is a pretty good deal for scalability.

>> In my experience with web caches, epoll or similar for idle clients
>> and thread per active client scaled and performed pretty well - it
>> needed more memory but the performance wasn't worse than
>> asynchronous design and doing complex server in async model is a
>> lot of pain.
>
> It's true that an async model is a lot of pain. But it's always where
> I got the best performance. For instance, with epoll(), I can achieve
> 20000 HTTP reqs/s with 40000 concurrent sessions. The best
> performance I have observed from threaded competitors was an order of
> magnitude below on either value (sometimes both).

Well, it all depends on how you do it, but an order of magnitude
performance difference sounds like too much to me. Memory-wise, scalability
can be worse by orders of magnitude. You need to restrict the per-thread
stack size and use epoll for idle clients, if you wanna scale. Workers
+ async monitoring of idle clients scales pretty well.

> However, I agree that few uses really require to spend time writing
> and debugging async programs.

Yeap, there are also several things which are just too painful in an async
server - e.g. adding coordination with another server (virus scan,
sharing cached data), implementing a pluggable extension framework for
third parties (and what happens if they should be able to stack!), and
maintaining the damn thing while trying to add a few features. :-)

IMHO, a complex pure async server doesn't really make sense anymore.

Thanks.

--
tejun

2007-05-30 12:58:20

by Willy Tarreau

[permalink] [raw]
Subject: Re: epoll,threading

On Wed, May 30, 2007 at 01:07:08AM -0700, David Schwartz wrote:
>
> > In my experience, it's not much the context switch by itself which causes
> > performance degradation, but the fact that with threads, you have to put
> > mutexes everywhere. And frankly, walking a list with locks everywhere
> > is quite slower than doing it in one run at a rate of 3 or 4 cycles per
> > entry. Also, local storage in function returns is not possible anymore,
> > and some functions even need to malloc() instead of returning statically
> > allocated data. I believe this is the reason for openssl being twice as
> > slow when compiled thread-safe than in native mode.
>
> I don't know where you got this idea from, but several experiments of mine
> failed to confirm it. Here, for example, are two identical builds of
> OpenSSL, one with 'threads' and one with 'no-threads'. Same compiler, same
> other flags, same hardware (Linux 2.6.21.2, P3-1Ghz):

I can see that your numbers are equal in both cases. But you have openssl
0.9.8d, and you are benchmarking individual functions. I observed this on
openssl 0.9.6 + apache 1.3 + mod_ssl. The difference in terms of hits/s was
1 to 2, just with -D_REENTRANT.

Most probably some water has flowed under the bridges since... Anyway, that
makes 0.9.8 interesting to look at ;-)

Willy


2007-05-30 13:12:39

by Willy Tarreau

[permalink] [raw]
Subject: Re: epoll,threading

On Wed, May 30, 2007 at 07:12:01PM +0900, Tejun Heo wrote:
> Hello,
>
> Willy Tarreau wrote:
> > In my experience, it's not much the context switch by itself which
> > causes performance degradation, but the fact that with threads, you
> > have to put mutexes everywhere. And frankly, walking a list with
> > locks everywhere is quite slower than doing it in one run at a rate
> > of 3 or 4 cycles per entry. Also, local storage in function returns
> > is not possible anymore, and some functions even need to malloc()
> > instead of returning statically allocated data. I believe this is the
> > reason for openssl being twice as slow when compiled thread-safe than
> > in native mode.
> >
> > So in fact, converting a threaded program to a pure async model
> > should not improve it much because of the initial architectural
> > design. But a program written from scratch to be purely async should
> > perform better simply because it has less operations to perform. And
> > there's no magics here : less cycles spend synchronizing and locking
> > = more cycles available for the real job.
>
> The thing is that the synchronization overhead is something you'll have
> to pay anyway to support multiple processors.

But you don't need to sync *everything*. It is doable to have 1 thread
per processor, each with its own data, and sync only the minimum information
(e.g. statistics).

> Actually, supporting
> multiple processors on an async program is beyond painful. Either you
> have to restrict all locking to busy locks or introduce new state for
> each possibly blocking synchronization point and what happens if they
> have to nest? You kind of end up with stackable state thingie - an
> extremely restricted stack.

I did not say it is simple; I said that when it is justified, it is doable.

> If you're really serious about performance and scalability, you just
> have to support multiple processors and if you do it right the
> performance overhead shouldn't be too high. Common servers will soon
> have 8 cores on two physical processors - paying some overhead for
> synchronization is pretty good deal for scalability.
>
> >> In my experience with web caches, epoll or similar for idle clients
> >> and thread per active client scaled and performed pretty well - it
> >> needed more memory but the performance wasn't worse than
> >> asynchronous design and doing complex server in async model is a
> >> lot of pain.
> >
> > It's true that an async model is a lot of pain. But it's always where
> > I got the best performance. For instance, with epoll(), I can achieve
> > 20000 HTTP reqs/s with 40000 concurrent sessions. The best
> > performance I have observed from threaded competitors was an order of
> > magnitude below on either value (sometimes both).
>
> Well, it all depends on how you do it but an order of magnitude
> performance difference sounds too much to me. Memory-wise scalability
> can be worse by orders of magnitude.

It is very often a problem because system limits have not evolved as fast
as requirements.

> You need to restrict per-thread
> stack size and use epoll for idle threads, if you wanna scale. Workers
> + async monitoring of idle clients scale pretty well.

I agree with a small pool of workers. But they must be dedicated to CPU
only, and perform no I/O. Then you can have 1 thread/CPU.

> > However, I agree that few uses really require to spend time writing
> > and debugging async programs.
>
> Yeap, also there are several things which just are too painful in async
> server - e.g. adding coordination with another server (virus scan,
> sharing cached data), implementing pluggable extension framwork for
> third parties (and what happens if they should be able to stack!), and
> maintaining the damn thing while trying to add a few features. :-)
>
> IMHO, complex pure async server doesn't really make sense anymore.

That's clearly not my opinion, but I don't want to start a flamewar on
the subject; it's not interesting. As long as people like us keep pushing
the system to its limits using either model, at least there will be
reference points for comparison :-)

Cheers
Willy

2007-05-30 13:43:25

by Tejun Heo

[permalink] [raw]
Subject: Re: epoll,threading

Hello,

Willy Tarreau wrote:
>> The thing is that the synchronization overhead is something you'll have
>> to pay anyway to support multiple processors.
>
> But you don't need to sync *everything*. It is doable to have 1 thread
> per processor, each with their own data, and sync the minimum information
> (eg: statistics).

Agreed, it's doable, but the list of things to synchronize tends to grow,
and it really hurts when it does. Also, that list becomes much less
different between the async and threaded models once you add multiprocessor
support.

[--snip--]
>> You need to restrict per-thread
>> stack size and use epoll for idle threads, if you wanna scale. Workers
>> + async monitoring of idle clients scale pretty well.
>
> I agree with a small pool of workers. But they must be dedicated to CPU
> only, and perform no I/O. Then you can have 1 thread/CPU.

Well, if you go that far, it's basically the async model with multiprocessor
support. Not my favorite one. :-)

>>> However, I agree that few uses really require to spend time writing
>>> and debugging async programs.
>> Yeap, also there are several things which just are too painful in async
>> server - e.g. adding coordination with another server (virus scan,
>> sharing cached data), implementing pluggable extension framwork for
>> third parties (and what happens if they should be able to stack!), and
>> maintaining the damn thing while trying to add a few features. :-)
>>
>> IMHO, complex pure async server doesn't really make sense anymore.
>
> That's clearly not my opinion, but I don't want to enter a flamewar on
> the subject, it's not interesting. As long as people like us will push
> the system to limits using either model, at least there will be
> references for comparisons :-)

Yeap, it's getting off-topic. Let's see what the future brings. :-)

Thanks.

--
tejun