2007-02-25 21:40:33

by Ingo Molnar

Subject: threadlets as 'naive pool of threads', epoll, some measurements


* Davide Libenzi <[email protected]> wrote:

> > i don't understand - this confuses the client because there's no
> > Content-Length field. Did you insert a Content-Length field
> > manually? What i'm trying to figure out, are you relying on a
> > keepalive client or not? I.e. is there a -k option to 'ab' as well,
> > besides the ones you mentioned?
>
> You don't need a Content-Length if you're closing the connection ;)

yeah. But there's apparently a bug in the code because it doesn't close
's'.

> In any case, Evgeniy's test "servers" do not handle the delivery of
> the content properly, since it is assumed that an open+sendfile is a
> continuous non-blocking operation. [...]
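
(as an illustration of that point - this is not code from either server, the
function name is made up, and it assumes the usual <sys/sendfile.h> and
<errno.h> headers - a delivery path that does not assume a single sendfile()
call finishes the job would look roughly like this:)

/*
 * Illustrative sketch only: keep pushing until the whole file went out,
 * instead of assuming one sendfile() call is enough. 's' is the client
 * socket, 'fd' the open file, 'size' its length (e.g. from fstat()).
 */
static int send_whole_file(int s, int fd, off_t size)
{
        off_t offset = 0;

        while (offset < size) {
                ssize_t ret = sendfile(s, fd, &offset, size - offset);

                if (ret > 0)
                        continue;       /* sendfile() advanced 'offset' */
                if (ret < 0 && errno == EINTR)
                        continue;
                if (ret < 0 && errno == EAGAIN)
                        return 1;       /* would block: wait for POLLOUT, retry */
                return -1;              /* real error, or file shorter than 'size' */
        }
        return 0;                       /* fully sent */
}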

i have tried the one Evgeniy provided in the URL:

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c

and 'ab -k -c8000 -n80000' almost always aborts with:

apr_socket_recv: Connection reset by peer (104)

in the few cases it finishes, i got the following epoll result, over
gigabit ethernet, on an UP Athlon64 box:

eserver_epoll: 7800 reqs/sec

the same test with the most naive implementation of it, using
threadlets:

eserver_threadlet: 5800 reqs/sec

Extrapolating from Evgeniy's numbers i'd expect kevents to do around
14000 reqs/sec.

i've attached eserver_threadlet.c [compile it in the async-test/
directory of the syslet userspace testcode, and change MAX_PENDING to
9000 in sys.h] - it uses the request handling function from
eserver_epoll - no tuning or anything.

while keeping in mind that these are raw, preliminary numbers with
little analysis of them, i'm surprised that 8000 async threads perform
so well. Obviously this testcase and implementation just degrades
threadlets down to a pool of 8000 user-space threads, so there's no win
at all from context caching, no intelligent queueing back to the head
context, etc.

Ingo

---------{ eserver_threadlet.c }--------------->
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/futex.h>
#include <sys/time.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sched.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <assert.h>

#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>

#define DEBUG 0

#include "syslet.h"
#include "sys.h"
#include "threadlet.h"

struct request {
        struct request *next_free;
        /*
         * The threadlet stack is part of the request structure
         * and is thus reused as threadlets complete:
         */
        unsigned long threadlet_stack;

        /*
         * These are all the request-specific parameters:
         */
        long sock;
};

//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)

static long handle_request(void *__req)
{
        struct request *req = __req;
        int s = req->sock, err, fd;
        off_t offset;
        int count;
        char path[] = "/tmp/index.html";
        char buf[4096];
        struct timeval tm;

read_again:
        count = 40960;
        offset = 0;

        err = recv(s, buf, sizeof(buf), 0);
        if (err < 0) {
                ulog_err("Failed to read data from s=%d", s);
                goto err_out_remove;
        }
        if (err == 0) {
                gettimeofday(&tm, NULL);
                ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
                goto err_out_remove;
        }

        fd = open(path, O_RDONLY);
        if (fd == -1) {
                ulog_err("Failed to open '%s'", path);
                err = -1;
                goto err_out_remove;
        }
#if 0
        do {
                err = read(fd, buf, sizeof(buf));
                if (err <= 0)
                        break;
                err = send(s, buf, err, 0);
                if (err <= 0)
                        break;
        } while (1);
#endif
        err = sendfile(s, fd, &offset, count);
        {
                int on = 0;
                setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
        }

        close(fd);
        if (err < 0) {
                ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
                goto err_out_remove;
        }

        gettimeofday(&tm, NULL);
        ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);

        goto read_again;

err_out_remove:
        printf("blah\n");
        close(s);
        return complete_threadlet_fn(req, &async_head);
}

/*
* Freelist to recycle requests:
*/
static struct request *freelist;

/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
        struct request *req;

        /*
         * Occasionally we have to refill the new-thread stack
         * entry:
         */
        if (!async_head.new_thread_stack) {
                async_head.new_thread_stack = thread_stack_alloc();
                pr("allocated new thread stack: %08lx\n",
                   async_head.new_thread_stack);
        }

        if (freelist) {
                req = freelist;
                pr("reusing req %p, threadlet stack %08lx\n",
                   req, req->threadlet_stack);
                freelist = freelist->next_free;
                req->next_free = NULL;
                return req;
        }

        req = calloc(1, sizeof(struct request));
        pr("allocated req %p\n", req);
        req->threadlet_stack = thread_stack_alloc();
        pr("allocated thread stack %08lx\n", req->threadlet_stack);

        return req;
}

/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
        unsigned long completed = 0;
        struct request *req;

        for (;;) {
                req = (void *)completion_ring[async_head.user_ring_idx];
                if (!req)
                        return completed;
                completed++;
                pr("completed req %p (threadlet stack %08lx)\n",
                   req, req->threadlet_stack);

                req->next_free = freelist;
                freelist = req;

                /*
                 * Clear the completion pointer. To make sure the
                 * kernel never stomps upon still unhandled completions
                 * in the ring the kernel only writes to a NULL entry,
                 * so user-space has to clear it explicitly:
                 */
                completion_ring[async_head.user_ring_idx] = NULL;
                async_head.user_ring_idx++;
                if (async_head.user_ring_idx == MAX_PENDING)
                        async_head.user_ring_idx = 0;
        }
}

static unsigned int pending_requests;

/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
        unsigned int nr;

        if (done) {
                /*
                 * This is the cached case - free the request:
                 */
                pr("cache completed req %p (threadlet stack %08lx)\n",
                   req, req->threadlet_stack);
                req->next_free = freelist;
                freelist = req;
                return;
        }
        /*
         * 'cachemiss' case - the syslet is not finished
         * yet. We will be notified about its completion
         * via the completion ring:
         */
        assert(pending_requests < MAX_PENDING-1);

        pending_requests++;
        pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
        /*
         * Attempt to complete requests - this is a fast
         * check if there's no completions:
         */
        nr = complete();
        pending_requests -= nr;

        /*
         * If the ring is full then wait a bit:
         */
        while (pending_requests == MAX_PENDING-1) {
                pr("sys_async_wait()");
                /*
                 * Wait for 4 events - to batch things a bit:
                 */
                sys_async_wait(4, async_head.user_ring_idx, &async_head);
                nr = complete();
                pending_requests -= nr;
                pr("after wait: completed %d requests - still pending: %d\n",
                   nr, pending_requests);
        }
}

static void webserver(int l_sock)
{
        struct sockaddr addr;
        struct request *req;
        socklen_t addrlen = sizeof(addr);
        long done;
        int sock;

        async_head_init();

        /*
         * We don't use syslets for the first request (or for oversized
         * and/or erroneous requests):
         */
        for (;;) {
                req = alloc_req();
                if (!req)
                        break;

                sock = accept(l_sock, &addr, &addrlen);

                if (sock < 0)
                        break;

                req->sock = sock;
                done = threadlet_exec(handle_request, req,
                                      req->threadlet_stack, &async_head);

                handle_submitted_request(req, done);
        }
}

static int tcp_listen_socket(int *l_socket)
{
        int ret;
        int one = 1;
        struct sockaddr_in ad;

        /*
         * Setup for server socket
         */
        memset(&ad, 0, sizeof(ad));
        ad.sin_family = AF_INET;
        ad.sin_addr.s_addr = htonl(INADDR_ANY);
        ad.sin_port = htons(2222);

        *l_socket = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (*l_socket < 0) {
                perror("socket (server)");
                return *l_socket;
        }

        ret = setsockopt(*l_socket, SOL_SOCKET, SO_REUSEADDR, (char *)&one,
                         sizeof(int));
        if (ret < 0) {
                perror("setsockopt");
                return ret;
        }

        ret = bind(*l_socket, (struct sockaddr *)&ad, sizeof(ad));
        if (ret < 0) {
                perror("bind");
                return ret;
        }

        ret = listen(*l_socket, 1000);
        if (ret < 0) {
                perror("listen");
                return ret;
        }

        return 0;
}

int main(int argc, char *argv[])
{
        int ret;
        int l_sock;
        struct sched_param p = { sched_priority: 0 };

        sched_setscheduler(getpid(), 3 /* SCHED_BATCH */, &p);

        ret = tcp_listen_socket(&l_sock);
        if (ret)
                exit(ret);
        printf("listening on port 2222\n");
        printf("using threadlets\n");

        webserver(l_sock);
        async_head_exit();

        exit(0);
}


2007-02-26 10:51:25

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


update:

> i have tried the one Evgeniy provided in the URL:
>
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>
> and 'ab -k -c8000 -n80000' almost always aborts with:
>
> apr_socket_recv: Connection reset by peer (104)
>
> in the few cases it finishes, i got the following epoll result, over
> gigabit ethernet, on an UP Athlon64 box:
>
> eserver_epoll: 7800 reqs/sec

eserver_epoll.c had a number of bugs. The most serious one was the
apparently buggy use of "EPOLLET" (edge-triggered events). Removing that
and moving epoll to level-triggered (which is slower but does not result
in missed events) gives:

eserver_epoll: 9400 reqs/sec
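
(to make the difference concrete - a generic sketch, not a diff against
evserver_epoll.c, with 'efd' and 'sock' standing in for the epoll and
client descriptors:)

        /* level-triggered registration - the variant used for the numbers above: */
        struct epoll_event ev;

        ev.events = EPOLLIN;
        ev.data.fd = sock;
        epoll_ctl(efd, EPOLL_CTL_ADD, sock, &ev);

        /*
         * Edge-triggered would set ev.events = EPOLLIN | EPOLLET, and then
         * the handler must keep reading until recv() returns -1/EAGAIN,
         * otherwise a partially-drained socket is never reported by
         * epoll_wait() again.
         */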

> the same test with the most naive implementation of it, using
> threadlets:
>
> eserver_threadlet: 5800 reqs/sec

eserver_epoll_threadlet: 9400 reqs/sec

as expected, the level of extra blocking triggered by this is low - even
if the full request function runs without nonblock assumptions.
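
(structurally, the epoll+threadlet combination is little more than an
epoll_wait() loop that hands every ready fd to a threadlet - a rough sketch
only, reusing the request/threadlet helpers from the threadlet server posted
earlier; 'efd' is an assumed epoll fd and the accept/registration path is
omitted, so this is not the actual eserver_epoll_threadlet.c:)

        for (;;) {
                struct epoll_event events[128];
                int i, n = epoll_wait(efd, events, 128, -1);

                for (i = 0; i < n; i++) {
                        struct request *req = alloc_req();

                        req->sock = events[i].data.fd;
                        /*
                         * If the handler blocks, the kernel turns it into a
                         * real thread and its completion arrives via the
                         * ring; otherwise it returns like a plain function
                         * call:
                         */
                        handle_submitted_request(req,
                                threadlet_exec(handle_request, req,
                                               req->threadlet_stack,
                                               &async_head));
                }
        }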

Ingo

2007-02-26 11:55:42

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


yet another performance update - with the fixed 'heaps of stupid
threads' evserver_threadlet.c code attached below i got:

> evserver_epoll: 9400 reqs/sec
> evserver_epoll_threadlet: 9400 reqs/sec

evserver_threadlet: 9000 reqs/sec

so the overhead, instead of the 10x slowdown Evgeniy predicted/feared,
is 4% for this particular, very event-centric workload.

why? because Evgeniy still overlooks what i've mentioned so many times:
that there is lots of inherent 'caching' possible even in this
particular '8000 clients' workload, which even the most stupid threadlet
queueing model is able to take advantage of. The maximum level of
parallelism that i've measured during this test was 161 threads.

Ingo

-----------{ evserver_threadlet.c }--------------->
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/futex.h>
#include <sys/time.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sched.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <assert.h>

#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>

#define DEBUG 0

#include "syslet.h"
#include "sys.h"
#include "threadlet.h"

struct request {
        struct request *next_free;
        /*
         * The threadlet stack is part of the request structure
         * and is thus reused as threadlets complete:
         */
        unsigned long threadlet_stack;

        /*
         * These are all the request-specific parameters:
         */
        long sock;
};

//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)

static long handle_request(void *__req)
{
        struct request *req = __req;
        int s = req->sock, err, fd;
        off_t offset;
        int count;
        char path[] = "/tmp/index.html";
        char buf[4096];
        struct timeval tm;

        count = 40960;
        offset = 0;

        err = recv(s, buf, sizeof(buf), 0);
        if (err < 0) {
                ulog_err("Failed to read data from s=%d", s);
                goto err_out_remove;
        }
        if (err == 0) {
                gettimeofday(&tm, NULL);
                ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
                goto err_out_remove;
        }

        fd = open(path, O_RDONLY);
        if (fd == -1) {
                ulog_err("Failed to open '%s'", path);
                err = -1;
                goto err_out_remove;
        }
#if 0
        do {
                err = read(fd, buf, sizeof(buf));
                if (err <= 0)
                        break;
                err = send(s, buf, err, 0);
                if (err <= 0)
                        break;
        } while (1);
#endif
        err = sendfile(s, fd, &offset, count);
        {
                int on = 0;
                setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
        }

        close(fd);
        if (err < 0) {
                ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
                goto err_out_remove;
        }
        close(s);

        gettimeofday(&tm, NULL);
        ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);

        return complete_threadlet_fn(req, &async_head);

err_out_remove:
        printf("blah\n");
        close(s);
        return complete_threadlet_fn(req, &async_head);
}

/*
* Freelist to recycle requests:
*/
static struct request *freelist;

/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
        struct request *req;

        /*
         * Occasionally we have to refill the new-thread stack
         * entry:
         */
        if (!async_head.new_thread_stack) {
                async_head.new_thread_stack = thread_stack_alloc();
                pr("allocated new thread stack: %08lx\n",
                   async_head.new_thread_stack);
        }

        if (freelist) {
                req = freelist;
                pr("reusing req %p, threadlet stack %08lx\n",
                   req, req->threadlet_stack);
                freelist = freelist->next_free;
                req->next_free = NULL;
                return req;
        }

        req = calloc(1, sizeof(struct request));
        pr("allocated req %p\n", req);
        req->threadlet_stack = thread_stack_alloc();
        pr("allocated thread stack %08lx\n", req->threadlet_stack);

        return req;
}

/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
        unsigned long completed = 0;
        struct request *req;

        for (;;) {
                req = (void *)completion_ring[async_head.user_ring_idx];
                if (!req)
                        return completed;
                completed++;
                pr("completed req %p (threadlet stack %08lx)\n",
                   req, req->threadlet_stack);

                req->next_free = freelist;
                freelist = req;

                /*
                 * Clear the completion pointer. To make sure the
                 * kernel never stomps upon still unhandled completions
                 * in the ring the kernel only writes to a NULL entry,
                 * so user-space has to clear it explicitly:
                 */
                completion_ring[async_head.user_ring_idx] = NULL;
                async_head.user_ring_idx++;
                if (async_head.user_ring_idx == MAX_PENDING)
                        async_head.user_ring_idx = 0;
        }
}

static unsigned int pending_requests;

/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
        unsigned int nr;

        if (done) {
                /*
                 * This is the cached case - free the request:
                 */
                pr("cache completed req %p (threadlet stack %08lx)\n",
                   req, req->threadlet_stack);
                req->next_free = freelist;
                freelist = req;
                return;
        }
        /*
         * 'cachemiss' case - the syslet is not finished
         * yet. We will be notified about its completion
         * via the completion ring:
         */
        assert(pending_requests < MAX_PENDING-1);

        pending_requests++;
        pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
        /*
         * Attempt to complete requests - this is a fast
         * check if there's no completions:
         */
        nr = complete();
        pending_requests -= nr;

        /*
         * If the ring is full then wait a bit:
         */
        while (pending_requests == MAX_PENDING-1) {
                pr("sys_async_wait()");
                /*
                 * Wait for 4 events - to batch things a bit:
                 */
                sys_async_wait(4, async_head.user_ring_idx, &async_head);
                nr = complete();
                pending_requests -= nr;
                pr("after wait: completed %d requests - still pending: %d\n",
                   nr, pending_requests);
        }
}

static void webserver(int l_sock)
{
        struct sockaddr addr;
        struct request *req;
        socklen_t addrlen = sizeof(addr);
        long done;
        int sock;

        async_head_init();

        /*
         * We don't use syslets for the first request (or for oversized
         * and/or erroneous requests):
         */
        for (;;) {
                req = alloc_req();
                if (!req)
                        break;

                sock = accept(l_sock, &addr, &addrlen);

                if (sock < 0)
                        break;

                req->sock = sock;
                done = threadlet_exec(handle_request, req,
                                      req->threadlet_stack, &async_head);

                handle_submitted_request(req, done);
        }
}

static int tcp_listen_socket(int *l_socket)
{
        int ret;
        int one = 1;
        struct sockaddr_in ad;

        /*
         * Setup for server socket
         */
        memset(&ad, 0, sizeof(ad));
        ad.sin_family = AF_INET;
        ad.sin_addr.s_addr = htonl(INADDR_ANY);
        ad.sin_port = htons(2222);

        *l_socket = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (*l_socket < 0) {
                perror("socket (server)");
                return *l_socket;
        }

        ret = setsockopt(*l_socket, SOL_SOCKET, SO_REUSEADDR, (char *)&one,
                         sizeof(int));
        if (ret < 0) {
                perror("setsockopt");
                return ret;
        }

        ret = bind(*l_socket, (struct sockaddr *)&ad, sizeof(ad));
        if (ret < 0) {
                perror("bind");
                return ret;
        }

        ret = listen(*l_socket, 1000);
        if (ret < 0) {
                perror("listen");
                return ret;
        }

        return 0;
}

int main(int argc, char *argv[])
{
        int ret;
        int l_sock;
        struct sched_param p = { sched_priority: 0 };

        sched_setscheduler(getpid(), 3 /* SCHED_BATCH */, &p);

        ret = tcp_listen_socket(&l_sock);
        if (ret)
                exit(ret);
        printf("listening on port 2222\n");
        printf("using threadlets\n");

        webserver(l_sock);
        async_head_exit();

        exit(0);
}

2007-02-26 12:27:52

by Evgeniy Polyakov

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Mon, Feb 26, 2007 at 12:48:58PM +0100, Ingo Molnar ([email protected]) wrote:
>
> yet another performance update - with the fixed 'heaps of stupid
> threads' evserver_threadlet.c code attached below i got:
>
> > evserver_epoll: 9400 reqs/sec
> > evserver_epoll_threadlet: 9400 reqs/sec
>
> evserver_threadlet: 9000 reqs/sec
>
> so the overhead, instead of the 10x slowdown Evgeniy predicted/feared,
> is 4% for this particular, very event-centric workload.
>
> why? because Evgeniy still overlooks what i've mentioned so many times:
> that there is lots of inherent 'caching' possible even in this
> particular '8000 clients' workload, which even the most stupid threadlet
> queueing model is able to take advantage of. The maximum level of
> parallelism that i've measured during this test was 161 threads.

:)

I feared _ONLY_ the situation when thousands of threads are eating my brain
- so the case when 161 threads are running simultaneously is not that bad
compared to what a micro-design can do (at its best/worst) at all!

So, caching is good - threadlets do not spawn a new thread and kevent
returns immediately; but when things are not that shiny, threadlets
spawn a new thread, while kevent processes the next request or waits
for all completed ones.

I'm a bit stuck right now with my benchmarks - the Intel Core 2 machine
requires reinstallation (it is installed with the amd64 arch of Debian
testing, and the admins at my paid work have throttled my internet
connection down to miserable bytes per second - who said that a
hacked/social-engineered 1mb/sec could live forever? - so expect it
tomorrow); the VIA EPIA one is under stress testing right now.

> Ingo

--
Evgeniy Polyakov

2007-02-26 12:57:17

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


* Evgeniy Polyakov <[email protected]> wrote:

> > yet another performance update - with the fixed 'heaps of stupid
> > threads' evserver_threadlet.c code attached below i got:
> >
> > > evserver_epoll: 9400 reqs/sec
> > > evserver_epoll_threadlet: 9400 reqs/sec
> >
> > evserver_threadlet: 9000 reqs/sec
> >
> > so the overhead, instead of the 10x slowdown Evgeniy
> > predicted/feared, is 4% for this particular, very event-centric
> > workload.
> >
> > why? because Evgeniy still overlooks what i've mentioned so many
> > times: that there is lots of inherent 'caching' possible even in
> > this particular '8000 clients' workload, which even the most stupid
> > threadlet queueing model is able to take advantage of. The maximum
> > level of parallelism that i've measured during this test was 161
> > threads.
>
> :)
>
> I feared _ONLY_ the situation when thousands of threads are eating my
> brain - so the case when 161 threads are running simultaneously is not
> that bad compared to what a micro-design can do (at its best/worst) at
> all!

even with ten thousand threads it is still pretty fast. Certainly not
'10 times slower' as you claimed. And it takes only a single, trivial
outer event loop to lift it up to the performance levels of a pure event
based server.

conclusion: currently i don't see a compelling need for the kevents
subsystem. epoll is a pretty nice API and it covers most of the event
sources and nicely builds upon our existing poll() infrastructure.

furthermore, i very much contest your claim that a high-performance,
highly scalable webserver needs a kevent+nonblock design. Even if i
ignore all the obvious usability and maintenance-cost advantages of
threadlets.

> So, caching is good - threadlets do not spawn a new thread and kevent
> returns immediately; but when things are not that shiny, threadlets
> spawn a new thread, while kevent processes the next request or waits
> for all completed ones.

no. Please read the evserver_threadlet.c code. There's no kevent in
there. There's no epoll() in there. All that you can see there is the
natural behavior of pure threadlets. And it's not a workload /I/ picked
for threadlets - it is a workload, filesize, parallelism level and
request handling function /you/ picked for "event-servers".

Ingo

2007-02-26 14:34:21

by Evgeniy Polyakov

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Mon, Feb 26, 2007 at 01:50:54PM +0100, Ingo Molnar ([email protected]) wrote:
> > I feared _ONLY_ the situation when thousands of threads are eating my
> > brain - so the case when 161 threads are running simultaneously is not
> > that bad compared to what a micro-design can do (at its best/worst) at
> > all!
>
> even with ten thousand threads it is still pretty fast. Certainly not
> '10 times slower' as you claimed. And it takes only a single, trivial
> outer event loop to lift it up to the performance levels of a pure event
> based server.

I did not claim that it would be 10 times slower, I said that it would be
slower; my '10 times slower', which is actually '15% of the total time',
is a reply to your 'as fast as sync' model - no need to repaint the picture :)

> conclusion: currently i don't see a compelling need for the kevents
> subsystem. epoll is a pretty nice API and it covers most of the event
> sources and nicely builds upon our existing poll() infrastructure.
>
> furthermore, i very much contest your claim that a high-performance,
> highly scalable webserver needs a kevent+nonblock design. Even if i
> ignore all the obvious usability and maintenance-cost advantages of
> threadlets.

Ok, I see your point - you insult something you never even tried to
understand; that is your right.

> > So, caching is good - threadlets do not spawn a new thread and kevent
> > returns immediately; but when things are not that shiny, threadlets
> > spawn a new thread, while kevent processes the next request or waits
> > for all completed ones.
>
> no. Please read the evserver_threadlet.c code. There's no kevent in
> there. There's no epoll() in there. All that you can see there is the
> natural behavior of pure threadlets. And it's not a workload /I/ picked
> for threadlets - it is a workload, filesize, parallelism level and
> request handling function /you/ picked for "event-servers".

I know that there are no kevents there - it would be really strange if
you tested them in your environment after all those empty kevent
releases.

Enough, you say micro-thread design is superior - ok, that is your
point.

> Ingo

--
Evgeniy Polyakov

2007-02-26 20:31:33

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


* Evgeniy Polyakov <[email protected]> wrote:

> > no. Please read the evserver_threadlet.c code. There's no kevent in
> > there. There's no epoll() in there. All that you can see there is
> > the natural behavior of pure threadlets. And it's not a workload /I/
> > picked for threadlets - it is a workload, filesize, parallelism
> > level and request handling function /you/ picked for
> > "event-servers".
>
> I know that there are no kevents there - it would be really strange if
> you tested them in your environment after all those empty kevent
> releases.

i haven't got around to figuring out the last v2.6.20-based kevent release,
and your git tree is v2.6.21-rc1 based. Do you have some easy URL for me
to fetch the last v2.6.20 kevent release?

> Enough, you say micro-thread design is superior - ok, that is your
> point.

note that threadlets are not 'micro-threads'. A threadlet is more of an
'optional thread' (as i mentioned it earlier): whenever it does anything
that makes it distinct from a plain function call, it's converted into a
separate thread by the kernel. Otherwise it behaves like a plain
function call and returns.
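
in code terms (this is just the calling convention of the test servers in
this thread, slightly abridged):

        done = threadlet_exec(handle_request, req,
                              req->threadlet_stack, &async_head);
        if (done) {
                /* nothing blocked: it ran like a plain function call */
        } else {
                /*
                 * it blocked somewhere: the kernel moved it into its own
                 * thread, and its completion shows up in the completion
                 * ring later.
                 */
        }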

Ingo

2007-02-26 21:22:39

by Davide Libenzi

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Mon, 26 Feb 2007, Ingo Molnar wrote:

>
> update:
>
> > i have tried the one Evgeniy provided in the URL:
> >
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> >
> > and 'ab -k -c8000 -n80000' almost always aborts with:
> >
> > apr_socket_recv: Connection reset by peer (104)
> >
> > in the few cases it finishes, i got the following epoll result, over
> > gigabit ethernet, on an UP Athlon64 box:
> >
> > eserver_epoll: 7800 reqs/sec
>
> eserver_epoll.c had a number of bugs. The most serious one was the
> apparently buggy use of "EPOLLET" (edge-triggered events). Removing that
> and moving epoll to level-triggered (which is slower but does not result
> in missed events) gives:
>
> eserver_epoll: 9400 reqs/sec
>
> > the same test with the most naive implementation of it, using
> > threadlets:
> >
> > eserver_threadlet: 5800 reqs/sec
>
> eserver_epoll_threadlet: 9400 reqs/sec
>
> as expected, the level of extra blocking triggered by this is low - even
> if the full request function runs without nonblock assumptions.

That looks pretty good. I have started (spare time allowing) laying down
the basis for a more realistic test as far as webserver-like benchmarking
goes. I want to compare three solutions that use the same internal code
(as far as request parsing and content delivery go).

1) A fully threaded classical web "server". This does the trivial
accept+pthread_create, and the per-connection thread then does
open+fstat+sendhdrs+sendfile (a minimal sketch of this pattern follows
the list). This is already done:

http://www.xmailserver.org/thrhttp.c

Do a `gcc -o thrhttp thrhttp.c -lpthread` and you're done.

2) A coronet (coroutine+epoll library) based handling of network
events/dispatch, plus GUASI (Generic Userspace Asynchronous Syscall
Interface) to handle generic IO.

libpcl: http://www.xmailserver.org/libpcl.html
coronet: http://www.xmailserver.org/coronet-lib.html
GUASI: http://www.xmailserver.org/guasi-lib.html

cghttpd: http://www.xmailserver.org/cghttpd-home.html

All these are configure+make+make_install installable.
The cghttpd server has the same parsing/content-delivery features as
thrhttp, but it uses coroutine+epoll to handle network events, and
GUASI (an old pthread-based userspace implementation of async execution)
to give open/fstat/sendfile an async behaviour. It hosts the
epoll_wait() (conet_events_wait() actually, but that maps directly to
epoll_wait()) inside an async GUASI request, and the GUASI
dispatch loop handles that case with:

        if (cookie == conet_events_wait_cookie)
                handle_network_events();

3) Finally, an implementation very similar to cghttpd, but this time
using the *syslets* to handle async requests. That should be a pretty
easy change to cghttpd, by means of replacing the CGHTTPD_SYSCALL()
macro with proper syslet code. The big advantage of the syslets here is
that they do have the cachehit optimization, while GUASI always has to
trigger a queueing.
Of course, since the only machine I have with enough RAM to keep many
thousands of sessions active is a dual Opteron, I'd need to have an
x86-64 version of the patch. That shouldn't be a big problem though.
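
(for reference, the thread-per-connection core of solution #1 boils down to
the classic pattern below - a generic sketch assuming <pthread.h> and the
usual socket headers, not the actual thrhttp.c code, with handle_client()
standing in for the request parsing and content delivery:)

static void *conn_thread(void *arg)
{
        int sock = (int)(long)arg;

        handle_client(sock);    /* parse request, open+fstat, send headers, sendfile */
        close(sock);
        return NULL;
}

static void serve(int lsock)
{
        for (;;) {
                pthread_t tid;
                int sock = accept(lsock, NULL, NULL);

                if (sock < 0)
                        continue;
                if (pthread_create(&tid, NULL, conn_thread, (void *)(long)sock))
                        close(sock);    /* could not spawn a thread for it */
                else
                        pthread_detach(tid);
        }
}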


This hopefully may prove two points. First, a fully threaded solution does
not scale well when dealing with thousands and thousands of sessions.
Second, the cachehit syslet trick is The Man in the syslet code, and kicks
userspace solution #2's ass.
In the meantime, I think Jens' tests are more meaningful, as far as field
usage goes, than any network-based test.




- Davide


2007-02-27 08:23:33

by Evgeniy Polyakov

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Mon, Feb 26, 2007 at 09:23:38PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > > no. Please read the evserver_threadlet.c code. There's no kevent in
> > > there. There's no epoll() in there. All that you can see there is
> > > the natural behavior of pure threadlets. And it's not a workload /I/
> > > picked for threadlets - it is a workload, filesize, parallelism
> > > level and request handling function /you/ picked for
> > > "event-servers".
> >
> > I know that there are no kevents there - it would be really strange if
> > you tested them in your environment after all those empty kevent
> > releases.
>
> i haven't got around to figuring out the last v2.6.20-based kevent release,
> and your git tree is v2.6.21-rc1 based. Do you have some easy URL for me
> to fetch the last v2.6.20 kevent release?

I use the kevent-36 release patches on top of the 2.6.20 tree.
There is some syscall-number overlap with the threadlet patches, but
the rejects are trivial.

> > Enough, you say micro-thread design is superior - ok, that is your
> > point.
>
> note that threadlets are not 'micro-threads'. A threadlet is more of an
> 'optional thread' (as i mentioned it earlier): whenever it does anything
> that makes it distinct from a plain function call, it's converted into a
> separate thread by the kernel. Otherwise it behaves like a plain
> function call and returns.

I know.
But for most situations it is a rare case that things do not block,
so I called it a micro-thread, since it spawns a new thread (taken from
a preallocated pool) for parallel processing.

> Ingo

--
Evgeniy Polyakov

2007-02-27 08:35:38

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


* Evgeniy Polyakov <[email protected]> wrote:

> > > Enough, you say micro-thread design is superior - ok, that is your
> > > point.
> >
> > note that threadlets are not 'micro-threads'. A threadlet is more of
> > an 'optional thread' (as i mentioned it earlier): whenever it does
> > anything that makes it distinct from a plain function call, it's
> > converted into a separate thread by the kernel. Otherwise it behaves
> > like a plain function call and returns.
>
> I know.
> But for most situations it is a rare case that things do not block,
> so I called it a micro-thread, since it spawns a new thread (taken from
> a preallocated pool) for parallel processing.

ugh. Because 'it spawns a new thread from a preallocated pool' you are
arbitrarily renaming threadlets to 'micro-threads'?? The kernel could be
using a transparent thread pool for ordinary pthread recycling itself
(and will possibly do so in the future) - that does not make them
micro-threads one iota. So please stop calling them micro-threads;
threadlets are a distinctly separate concept ...

( And i guess you should know it perfectly well from my past mails in
this thread that i don't like micro-thread concepts at all, so are you
perhaps calling threadlets 'micro-threads' intentionally, just to
force a predictably negative reaction from me? Maybe i should start
renaming your code too and refer to kevents as 'kpoll'? That too makes
absolutely zero sense. This is getting really silly. )

Ingo

2007-02-27 10:38:14

by Evgeniy Polyakov

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Tue, Feb 27, 2007 at 09:27:57AM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > > > Enough, you say micro-thread design is superior - ok, that is your
> > > > point.
> > >
> > > note that threadlets are not 'micro-threads'. A threadlet is more of
> > > an 'optional thread' (as i mentioned it earlier): whenever it does
> > > anything that makes it distinct from a plain function call, it's
> > > converted into a separate thread by the kernel. Otherwise it behaves
> > > like a plain function call and returns.
> >
> > I know.
> > But for most situations it is a rare case that things do not block,
> > so I called it a micro-thread, since it spawns a new thread (taken from
> > a preallocated pool) for parallel processing.
>
> ugh. Because 'it spawns a new thread from a preallocated pool' you are
> arbitrarily renaming threadlets to 'micro-threads'?? The kernel could be
> using a transparent thread pool for ordinary pthread recycling itself
> (and will possibly do so in the future) - that does not make them
> micro-threads one iota. So please stop calling them micro-threads;
> threadlets are a distinctly separate concept ...
>
> ( And i guess you should know it perfectly well from my past mails in
> this thread that i don't like micro-thread concepts at all, so are you
> perhaps calling threadlets 'micro-threads' intentionally, just to
> force a predictably negative reaction from me? Maybe i should start
> renaming your code too and refer to kevents as 'kpoll'? That too makes
> absolutely zero sense. This is getting really silly. )

I have already thought about renaming kevent AIO, since it uses the kaio
name, which you frequently reference too, but you definitely did not have
kevent in mind.

And out of curiosity, how masochistic would I look if I intentionally
wanted to receive a negative reaction from you :)

As you can recall, in all syslet-related threads I was always for
them, and definitely against micro-threads, but when we come to the land
of IO processing using an event-driven model - here I cannot agree with
you.

So, ok, no micro-thread name.

> Ingo

--
Evgeniy Polyakov

2007-02-27 12:22:29

by Ingo Molnar

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements


* Evgeniy Polyakov <[email protected]> wrote:

> So, ok, no micro-thread name.

thanks!

Ingo

2007-02-27 12:25:39

by Evgeniy Polyakov

Subject: Re: threadlets as 'naive pool of threads', epoll, some measurements

On Tue, Feb 27, 2007 at 01:15:42PM +0100, Ingo Molnar ([email protected]) wrote:
> > So, ok, no micro-thread name.
>
> thanks!

:) no problem!

> Ingo

--
Evgeniy Polyakov