LinuxLists.cc - PROBLEM: high system usage / poor SMP network performance

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

2002-01-28 19:35:14

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

----- Original Message -----
From: "Alan Cox" <[email protected]>
To: "Vincent Sweeney" <[email protected]>
Cc: <[email protected]>
Sent: Sunday, January 27, 2002 10:54 PM
Subject: Re: PROBLEM: high system usage / poor SMP network performance

> > CPU0 states: 27.2% user, 62.4% system, 0.0% nice, 9.2% idle
> > CPU1 states: 28.4% user, 62.3% system, 0.0% nice, 8.1% idle
>
> The important bit here is ^^^^^^^^ that one. Something is causing
> horrendous lock contention it appears. Is the e100 driver optimised for
SMP
> yet ? Do you get better numbers if you use the eepro100 driver ?

I've switched a server over to the default eepro100 driver as supplied in
2.4.17 (compiled as a module). This is tonights snapshot with about 10%
higher user count than above (2200 connections per ircd)

7:25pm up 5:44, 2 users, load average: 0.85, 1.01, 1.09
38 processes: 33 sleeping, 5 running, 0 zombie, 0 stopped
CPU0 states: 27.3% user, 69.3% system, 0.0% nice, 2.2% idle
CPU1 states: 26.1% user, 71.2% system, 0.0% nice, 2.0% idle
Mem: 385096K av, 232960K used, 152136K free, 0K shrd, 4724K
buff
Swap: 379416K av, 0K used, 379416K free 21780K
cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
659 ircd 15 0 74976 73M 660 R 96.7 19.4 263:21 ircd
666 ircd 14 0 75004 73M 656 R 95.5 19.4 253:10 ircd

So as you can see the numbers are almost the same, though they were worse at
lower users than the e100 driver (~45% system per cpu at 1000 users per ircd
with eepro100, ~30% with e100).

I will try the profiling tomorrow with the eepro100 driver compiled into the
kernel, I was unable to do the same for the Intel e100 driver today as I
discovered that the Intel driver can currenty only be compiled as a module.

Vince.

2002-01-28 19:40:54

by Rik van Riel

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

On Mon, 28 Jan 2002, Vincent Sweeney wrote:

> > > CPU0 states: 27.2% user, 62.4% system, 0.0% nice, 9.2% idle
> > > CPU1 states: 28.4% user, 62.3% system, 0.0% nice, 8.1% idle
> >
> > The important bit here is ^^^^^^^^ that one. Something is causing
> > horrendous lock contention it appears.
>
> I've switched a server over to the default eepro100 driver as supplied
> in 2.4.17 (compiled as a module). This is tonights snapshot with about
> 10% higher user count than above (2200 connections per ircd)

Hummm ... poll() / select() ? ;)

> I will try the profiling tomorrow

readprofile | sort -n | tail -20

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 16:32:58

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

----- Original Message -----
From: "Rik van Riel" <[email protected]>
To: "Vincent Sweeney" <[email protected]>
Cc: "Alan Cox" <[email protected]>; <[email protected]>
Sent: Monday, January 28, 2002 7:40 PM
Subject: Re: PROBLEM: high system usage / poor SMP network performance

> On Mon, 28 Jan 2002, Vincent Sweeney wrote:
>
> > > > CPU0 states: 27.2% user, 62.4% system, 0.0% nice, 9.2% idle
> > > > CPU1 states: 28.4% user, 62.3% system, 0.0% nice, 8.1% idle
> > >
> > > The important bit here is ^^^^^^^^ that one. Something is causing
> > > horrendous lock contention it appears.
> >
> > I've switched a server over to the default eepro100 driver as supplied
> > in 2.4.17 (compiled as a module). This is tonights snapshot with about
> > 10% higher user count than above (2200 connections per ircd)
>
> Hummm ... poll() / select() ? ;)
>
> > I will try the profiling tomorrow
>
> readprofile | sort -n | tail -20
>
> kind regards,
>
> Rik

Right then, here is the results from today so far (snapshot taken with 2000
users per ircd). Kernel profiling enabled with the eepro100 driver compiled
statically.

---
> readprofile -r ; sleep 60; readprofile | sort -n | tail -30

11 tcp_rcv_established 0.0055
12 do_softirq 0.0536
12 nf_hook_slow 0.0291
13 __free_pages_ok 0.0256
13 kmalloc 0.0378
13 rmqueue 0.0301
13 tcp_ack 0.0159
14 __kfree_skb 0.0455
14 tcp_v4_rcv 0.0084
15 __ip_conntrack_find 0.0441
16 handle_IRQ_event 0.1290
16 tcp_packet 0.0351
17 speedo_rx 0.0227
17 speedo_start_xmit 0.0346
18 ip_route_input 0.0484
23 speedo_interrupt 0.0301
30 ipt_do_table 0.0284
30 tcp_sendmsg 0.0065
116 __pollwait 0.7838
140 poll_freewait 1.7500
170 sys_poll 0.1897
269 do_pollfd 1.4944
462 remove_wait_queue 12.8333
474 add_wait_queue 9.1154
782 fput 3.3707
1216 default_idle 23.3846
1334 fget 16.6750
1347 sock_poll 33.6750
2408 tcp_poll 6.9195
9366 total 0.0094

> top

4:30pm up 2:57, 2 users, load average: 0.76, 0.85, 0.82
36 processes: 33 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 21.4% user, 68.1% system, 0.0% nice, 9.3% idle
CPU1 states: 23.4% user, 67.1% system, 0.0% nice, 8.3% idle
Mem: 382916K av, 191276K used, 191640K free, 0K shrd, 1444K
buff
Swap: 379416K av, 0K used, 379416K free 23188K
cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
613 ircd 16 0 67140 65M 660 R 89.6 17.5 102:00 ircd
607 ircd 16 0 64868 63M 656 S 88.7 16.9 98:50 ircd

---

So with my little knowledge of what this means I would say this is purely
down to poll(), but surely even with 4000 connections to the box that
shouldn't stretch a dual P3-800 box as much as it does?

Vince.

2002-01-29 17:55:08

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

"Vincent Sweeney" <[email protected]> wrote:
> > > > CPU0 states: 27.2% user, 62.4% system, 0.0% nice, 9.2% idle
> > > > CPU1 states: 28.4% user, 62.3% system, 0.0% nice, 8.1% idle
> > >
> > > The important bit here is ^^^^^^^^ that one. Something is causing
> > > horrendous lock contention it appears.
> ...
> Right then, here is the results from today so far (snapshot taken with 2000
> users per ircd). Kernel profiling enabled with the eepro100 driver compiled
> statically.
> readprofile -r ; sleep 60; readprofile | sort -n | tail -30
> ...
> 170 sys_poll 0.1897
> 269 do_pollfd 1.4944
> 462 remove_wait_queue 12.8333
> 474 add_wait_queue 9.1154
> 782 fput 3.3707
> 1216 default_idle 23.3846
> 1334 fget 16.6750
> 1347 sock_poll 33.6750
> 2408 tcp_poll 6.9195
> 9366 total 0.0094
> ...
> So with my little knowledge of what this means I would say this is purely
> down to poll(), but surely even with 4000 connections to the box that
> shouldn't stretch a dual P3-800 box as much as it does?

My oldish results,
http://www.kegel.com/dkftpbench/Poller_bench.html#results
show that yes, 4000 connections can really hurt a Linux program
that uses poll(). It is very tempting to port ircd to use
the Poller library (http://www.kegel.com/dkftpbench/dkftpbench-0.38.tar.gz);
that would let us compare poll(), realtimesignals, and /dev/epoll
to see how well they do on your workload.
- Dan

2002-01-30 20:39:37

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

----- Original Message -----
From: "Dan Kegel" <[email protected]>
To: "Vincent Sweeney" <[email protected]>;
<[email protected]>
Sent: Tuesday, January 29, 2002 6:00 PM
Subject: Re: PROBLEM: high system usage / poor SMP network performance

> "Vincent Sweeney" <[email protected]> wrote:
> > > > > CPU0 states: 27.2% user, 62.4% system, 0.0% nice, 9.2% idle
> > > > > CPU1 states: 28.4% user, 62.3% system, 0.0% nice, 8.1% idle
> > > >
> > > > The important bit here is ^^^^^^^^ that one. Something is
causing
> > > > horrendous lock contention it appears.
> > ...
> > Right then, here is the results from today so far (snapshot taken with
2000
> > users per ircd). Kernel profiling enabled with the eepro100 driver
compiled
> > statically.
> > readprofile -r ; sleep 60; readprofile | sort -n | tail -30
> > ...
> > 170 sys_poll 0.1897
> > 269 do_pollfd 1.4944
> > 462 remove_wait_queue 12.8333
> > 474 add_wait_queue 9.1154
> > 782 fput 3.3707
> > 1216 default_idle 23.3846
> > 1334 fget 16.6750
> > 1347 sock_poll 33.6750
> > 2408 tcp_poll 6.9195
> > 9366 total 0.0094
> > ...
> > So with my little knowledge of what this means I would say this is
purely
> > down to poll(), but surely even with 4000 connections to the box that
> > shouldn't stretch a dual P3-800 box as much as it does?
>
> My oldish results,
> http://www.kegel.com/dkftpbench/Poller_bench.html#results
> show that yes, 4000 connections can really hurt a Linux program
> that uses poll(). It is very tempting to port ircd to use
> the Poller library
(http://www.kegel.com/dkftpbench/dkftpbench-0.38.tar.gz);
> that would let us compare poll(), realtimesignals, and /dev/epoll
> to see how well they do on your workload.
> - Dan
>

So basically you are telling me these are my options:

1) Someone is going to have to recode the ircd source we use and
possibly a modified kernel in the *hope* that performance improves.
2) Convert the box to FreeBSD which seems to have a better poll()
implementation, and where I could support 8K clients easily as other admins
on my chat network do already.
3) Move the ircd processes to some 400Mhz Ultra 5's running Solaris-8
which run 3-4K users at 60% cpu!

Now I want to run Linux but unless I get this issue resolved I'm essentialy
not utilizing my hardware to the best of its ability.

Vince.

2002-01-31 05:19:07

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

Vincent Sweeney wrote:
> So basically you are telling me these are my options:
>
> 1) Someone is going to have to recode the ircd source we use and
> possibly a modified kernel in the *hope* that performance improves.
> 2) Convert the box to FreeBSD which seems to have a better poll()
> implementation, and where I could support 8K clients easily as other admins
> on my chat network do already.
> 3) Move the ircd processes to some 400Mhz Ultra 5's running Solaris-8
> which run 3-4K users at 60% cpu!
>
> Now I want to run Linux but unless I get this issue resolved I'm essentialy
> not utilizing my hardware to the best of its ability.

No need to use a modified kernel; plain old 2.4.18 or so should do
fine, it supports the rtsig stuff. But yeah, you may want to
see if the core of ircd can be recoded. Can you give me the URL
for the source of the version you use? I can peek at it.
It only took me two days to recode betaftpd to use Poller...

I do know that the guys working on aio for linux say they
have code that will make poll() much more efficient, so
I suppose another option is to join the linux-aio list and
say "So you folks say you can make plain old poll() more efficient, eh?
Here's a test case for you." :-)

- Dan

2002-02-03 07:58:34

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

Vincent Sweeney wrote:
> > > [I want to use Linux for my irc server, but performance sucks.]
> > > 1) Someone is going to have to recode the ircd source we use and
> > > possibly a modified kernel in the *hope* that performance improves.
> > > 2) Convert the box to FreeBSD which seems to have a better poll()
> > > implementation, and where I could support 8K clients easily as other
> > > admins on my chat network do already....
> >
> > No need to use a modified kernel; plain old 2.4.18 or so should do
> > fine, it supports the rtsig stuff. But yeah, you may want to
> > see if the core of ircd can be recoded. Can you give me the URL
> > for the source of the version you use? I can peek at it.
> > It only took me two days to recode betaftpd to use Poller...
>
> http://dev-com.b2irc.net/ : Undernet's IRCD + Lain 1.1.2 patch

Hmm. Have a look at
http://www.mail-archive.com/[email protected]/msg00060.html
It looks like the mainline Undernet ircd was rewritten around May 2001
to support high efficiency techniques like /dev/poll and kqueue.
The source you pointed to is way behind Undernet's current sources.

Undernet's ircd has engine_{select,poll,devpoll,kqueue}.c,
but not yet an engine_rtsig.c, as far as I know.
If you want ircd to handle zillions of simultaneous connections
on a stock 2.4 Linux kernel, rtsignals are the way to go at the
moment. What's needed is to write ircd's engine_rtsig.c, and
modify ircd's os_linux.c to notice EWOULDBLOCK
return values and feed them to engine_rtsig.c (that's the icky
part about the way linux currently does this kind of event
notification - signals are used for 'I'm ready now', but return
values from I/O functions are where you learn 'I'm no longer ready').

So I dunno if I'm going to go ahead and do that myself, but at least I've
scoped out the situation. Before I did any work, I'd measure CPU
usage under a simulated load of 2000 clients, just to verify that
poll() was indeed a bottleneck (ok, can't imagine it not being a
bottleneck, but it's nice to have a baseline to compare the improved
version against).
- Dan

2002-02-03 08:37:36

by Andrew Morton

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

Dan Kegel wrote:
>
> Before I did any work, I'd measure CPU
> usage under a simulated load of 2000 clients, just to verify that
> poll() was indeed a bottleneck (ok, can't imagine it not being a
> bottleneck, but it's nice to have a baseline to compare the improved
> version against).

I half-did this earlier in the week. It seems that Vincent's
machine is calling poll() maybe 100 times/second. Each call
is taking maybe 10 milliseconds, and is returning approximately
one measly little packet.

select and poll suck for thousands of fds. Always did, always
will. Applications need to work around this.

And the workaround is rather simple:

....
+ usleep(100000);
poll(...);

This will add up to 0.1 seconds latency, but it means that
the poll will gather activity on ten times as many fds,
and that it will be called ten times less often, and that
CPU load will fall by a factor of ten.

This seems an appropriate hack for an IRC server. I guess it
could be souped up a bit:

usleep(nr_fds * 50);

-

2002-02-03 19:10:29

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Arjen Wolfs wrote:
> The ircu version that supports kqueue and /dev/poll is currently being
> beta-tested on a few servers on the Undernet. The graph at
> http://www.break.net/ircu10-to-11.png shows the load average (multiplied by
> 100) on a on a server with 3000-4000 clients using poll(), and /dev/poll.
> The difference is obviously quite dramatic, and the same effect is being
> seen with kqueue. You could also try some of the /dev/poll patches for
> linux, which migth save you writing a new engine. Note that ircu 2.10.11 is
> still beta though, and is known to crash in mysterious ways from time to time.

None of the original /dev/poll patches for Linux were much
good, I seem to recall; they had scaling problems and bugs.

The /dev/epoll patch is good, but the interface is different enough
from /dev/poll that ircd would need a new engine_epoll.c anyway.
(It would look like a cross between engine_devpoll.c and engine_rtsig.c,
as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
but no 'I'm not ready anymore' notification.)

And then there's /dev/yapoll (http://www.distributopia.com), which
I haven't tried yet (I don't think the author ever published the patch?).

Anyway, the new engine wouldn't be too hard to write, and
would let irc run fast without a patched kernel.

- Dan

2002-02-03 19:22:52

by Kev

[permalink] [raw]

Subject: Re: PROBLEM: high system usage / poor SMP network performance

> Hmm. Have a look at
> http://www.mail-archive.com/[email protected]/msg00060.html
> It looks like the mainline Undernet ircd was rewritten around May 2001
> to support high efficiency techniques like /dev/poll and kqueue.
> The source you pointed to is way behind Undernet's current sources.

This code is still in beta testing, by the way. It's certainly not the
prettiest way of doing it, though, and I've started working on a new
implementation of the basic idea in a library, which I will then use in
a future version of Undernet's ircd.

> Undernet's ircd has engine_{select,poll,devpoll,kqueue}.c,
> but not yet an engine_rtsig.c, as far as I know.
> If you want ircd to handle zillions of simultaneous connections
> on a stock 2.4 Linux kernel, rtsignals are the way to go at the
> moment. What's needed is to write ircd's engine_rtsig.c, and
> modify ircd's os_linux.c to notice EWOULDBLOCK
> return values and feed them to engine_rtsig.c (that's the icky
> part about the way linux currently does this kind of event
> notification - signals are used for 'I'm ready now', but return
> values from I/O functions are where you learn 'I'm no longer ready').

I haven't examined the usage of the realtime signals stuff, but I did
originally choose not to bother with it. It may be possible to set up
an engine that uses it, and if anyone gets it working, I sure wouldn't
mind seeing the patches. Still, I'd say that the best bet is probably
to either use the /dev/poll patch for linux, or grab the /dev/epoll patch
and implement a new engine to use it. (I should note that I haven't tried
either of these patches, yet, so YMMV.)

> So I dunno if I'm going to go ahead and do that myself, but at least I've
> scoped out the situation. Before I did any work, I'd measure CPU
> usage under a simulated load of 2000 clients, just to verify that
> poll() was indeed a bottleneck (ok, can't imagine it not being a
> bottleneck, but it's nice to have a baseline to compare the improved
> version against).

I'm very certain that poll() is a bottle-neck in any piece of software like
ircd. I have some preliminary data which suggests that not only does the
/dev/poll engine reduce the load averages, but that it scales much better:
Load averages on that beta test server dropped from about 1.30 to about
0.30 for the same number of clients, and adding more clients increases the
load much less than under the previous version using poll(). Of course,
I haven't compared loads under the same server version with two different
engines--it's possible other changes we made have resulted in much of that
load difference.

I should probably note that the beta test server I am refering to is running
Solaris; I have not tried to use the Linux /dev/poll patch as of yet...
--
Kevin L. Mitchell <[email protected]>

2002-02-04 00:07:49

by Kev

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

> The /dev/epoll patch is good, but the interface is different enough
> from /dev/poll that ircd would need a new engine_epoll.c anyway.
> (It would look like a cross between engine_devpoll.c and engine_rtsig.c,
> as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
> Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
> but no 'I'm not ready anymore' notification.)

I don't understand what it is you're saying here. The ircu server uses
non-blocking sockets, and has since long before EfNet and Undernet branched,
so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
--
Kevin L. Mitchell <[email protected]>

2002-02-04 00:31:57

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Kev wrote:
>
> > The /dev/epoll patch is good, but the interface is different enough
> > from /dev/poll that ircd would need a new engine_epoll.c anyway.
> > (It would look like a cross between engine_devpoll.c and engine_rtsig.c,
> > as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
> > Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
> > but no 'I'm not ready anymore' notification.)
>
> I don't understand what it is you're saying here. The ircu server uses
> non-blocking sockets, and has since long before EfNet and Undernet branched,
> so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.

Right. poll() and Solaris /dev/poll are programmer-friendly; they give
you the current readiness status for each socket. ircu handles them fine.

/dev/epoll and Linux 2.4's rtsig feature, on the other hand, are
programmer-hostile; they don't tell you which sockets are ready.
Instead, they tell you when sockets *become* ready;
your only indication that those sockets have become *unready*
is when you see an EWOULDBLOCK from them.

If this didn't make any sense, maybe seeing how it's used might help.
Look at Poller::clearReadiness() in
http://www.kegel.com/dkftpbench/doc/Poller.html#DOC.9.11 or
http://www.kegel.com/dkftpbench/dkftpbench-0.38/Poller_sigio.cc
and the calls to Poller::clearReadiness() in
http://www.kegel.com/dkftpbench/dkftpbench-0.38/ftp_client_pipe.cc

- Dan

2002-02-04 00:52:39

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Sun, 3 Feb 2002, Dan Kegel wrote:

> Kev wrote:
> >
> > > The /dev/epoll patch is good, but the interface is different enough
> > > from /dev/poll that ircd would need a new engine_epoll.c anyway.
> > > (It would look like a cross between engine_devpoll.c and engine_rtsig.c,
> > > as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
> > > Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
> > > but no 'I'm not ready anymore' notification.)
> >
> > I don't understand what it is you're saying here. The ircu server uses
> > non-blocking sockets, and has since long before EfNet and Undernet branched,
> > so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
>
> Right. poll() and Solaris /dev/poll are programmer-friendly; they give
> you the current readiness status for each socket. ircu handles them fine.

I would have to agree with this comment. Hybrid-ircd deals with poll()
and /dev/poll just fine. We have attempted to make it use rtsig, but it
just doesn't seem to agree with the i/o model we are using, which btw, is
the same model that Squid (is/will be?) using. I haven't played with
/dev/epoll yet, but I pray it is nothing like rtsig.

Basically what we need is, something like poll() but not so nasty.
/dev/poll is okay, but its a hack. The best thing I've seen so far, but
it too seems to take the idea so far is FreeBSD's kqueue stuff(which
Hybrid-ircd handles quite nicely).

Regards,

Aaron

2002-02-04 01:10:13

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Aaron Sethman wrote:
>
> On Sun, 3 Feb 2002, Dan Kegel wrote:
>
> > Kev wrote:
> > >
> > > > The /dev/epoll patch is good, but the interface is different enough
> > > > from /dev/poll that ircd would need a new engine_epoll.c anyway.
> > > > (It would look like a cross between engine_devpoll.c and engine_rtsig.c,
> > > > as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
> > > > Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
> > > > but no 'I'm not ready anymore' notification.)
> > >
> > > I don't understand what it is you're saying here. The ircu server uses
> > > non-blocking sockets, and has since long before EfNet and Undernet branched,
> > > so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
> >
> > Right. poll() and Solaris /dev/poll are programmer-friendly; they give
> > you the current readiness status for each socket. ircu handles them fine.
>
> I would have to agree with this comment. Hybrid-ircd deals with poll()
> and /dev/poll just fine. We have attempted to make it use rtsig, but it
> just doesn't seem to agree with the i/o model we are using...

I'd like to know how it disagrees.
I believe rtsig requires you to tweak your I/O code in three ways:
1. you need to pick a realtime signal number to use for an event queue
2. you need to wrap your read()/write() calls on the socket with code
that notices EWOULDBLOCK
3. you need to fall back to poll() on signal queue overflow.

For what it's worth, my Poller library takes care of fallback to poll
transparantly, and makes the EWOULDBLOCK stuff fairly easy. I gather
from the way you quoted my previous messsage, though, that you
consider rtsig too awful to even think about.

> I haven't played with /dev/epoll yet, but I pray it is nothing like rtsig.

Unfortunately, it is exactly like rtsig in how you need to handle
EWOULDBLOCK.

> Basically what we need is, something like poll() but not so nasty.
> /dev/poll is okay, but its a hack. The best thing I've seen so far, but
> it too seems to take the idea so far is FreeBSD's kqueue stuff(which
> Hybrid-ircd handles quite nicely).

Yes, kqueue is quite easy to use, and doesn't require the gyrations
that rtsig or /dev/epoll require. The only thing that makes rtsig or /dev/epoll
usable are user-space wrapper libraries that let you forget about the
gyrations (mostly).
- Dan

2002-02-04 01:23:37

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Sun, 3 Feb 2002, Dan Kegel wrote:

> I'd like to know how it disagrees.
> I believe rtsig requires you to tweak your I/O code in three ways:
> 1. you need to pick a realtime signal number to use for an event queue
Did that.

> 2. you need to wrap your read()/write() calls on the socket with code
> that notices EWOULDBLOCK
This is perhaps the part we it disagrees with our code. I will
investigate this part. The way we normally do things is have callbacks
per fd, that get called when our event occurs doing the read, or, write
directly. We do check for the EWOULDBLOCK stuff and re-register the
event. The thing we do not currently do is, attempt to read or write
unless we've received notification first. This is what I am assuming is
breaking it.

> 3. you need to fall back to poll() on signal queue overflow.
Did that part too.

Regards,

Aaron

2002-02-04 01:33:10

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Aaron Sethman wrote:
>
> > 2. you need to wrap your read()/write() calls on the socket with code
> > that notices EWOULDBLOCK
> This is perhaps the part we it disagrees with our code. I will
> investigate this part. The way we normally do things is have callbacks
> per fd, that get called when our event occurs doing the read, or, write
> directly.

That sounds totally fine; in fact, it's how my Poller library works.

> We do check for the EWOULDBLOCK stuff and re-register the
> event.

But do you remember that this fd is ready until EWOULDBLOCK?
i.e. if you're notified that an fd is ready, and then you
don't for whatever reason continue to do I/O on it until EWOULDBLOCK,
you'll never ever be notified that it's ready again.
If your code assumes that it will be notified again anyway,
as with poll(), it will be sorely disappointed.

> The thing we do not currently do is, attempt to read or write
> unless we've received notification first. This is what I am assuming is
> breaking it.

Yeah, that would break it, too, I think.

- Dan

2002-02-04 02:55:49

by Kev

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

> > I don't understand what it is you're saying here. The ircu server uses
> > non-blocking sockets, and has since long before EfNet and Undernet branched,
> > so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
>
> Right. poll() and Solaris /dev/poll are programmer-friendly; they give
> you the current readiness status for each socket. ircu handles them fine.
>
> /dev/epoll and Linux 2.4's rtsig feature, on the other hand, are
> programmer-hostile; they don't tell you which sockets are ready.
> Instead, they tell you when sockets *become* ready;
> your only indication that those sockets have become *unready*
> is when you see an EWOULDBLOCK from them.

If I'm reading Poller_sigio::waitForEvents correctly, the rtsig stuff at
least tries to return a list of which sockets have become ready, and your
implementation falls back to some other interface when the signal queue
overflows. It also seems to extract what state the socket's in at that
point.

If that's true, I confess I can't quite see your point even still. Once
the event is generated, ircd should read or write as much as it can, then
not pay any attention to the socket until readiness is again signaled by
the generation of an event. Sorry if I'm being dense here...
--
Kevin L. Mitchell <[email protected]>

2002-02-04 03:19:22

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Kev wrote:
>
> > > I don't understand what it is you're saying here. The ircu server uses
> > > non-blocking sockets, and has since long before EfNet and Undernet branched,
> > > so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
> >
> > Right. poll() and Solaris /dev/poll are programmer-friendly; they give
> > you the current readiness status for each socket. ircu handles them fine.
> >
> > /dev/epoll and Linux 2.4's rtsig feature, on the other hand, are
> > programmer-hostile; they don't tell you which sockets are ready.
> > Instead, they tell you when sockets *become* ready;
> > your only indication that those sockets have become *unready*
> > is when you see an EWOULDBLOCK from them.
>
> If I'm reading Poller_sigio::waitForEvents correctly, the rtsig stuff at
> least tries to return a list of which sockets have become ready, and your
> implementation falls back to some other interface when the signal queue
> overflows. It also seems to extract what state the socket's in at that
> point.
>
> If that's true, I confess I can't quite see your point even still. Once
> the event is generated, ircd should read or write as much as it can, then
> not pay any attention to the socket until readiness is again signaled by
> the generation of an event. Sorry if I'm being dense here...

If you actually do read or write *until an EWOULDBLOCK*, no problem.
If your code has a path where it fails to do so, it will get stuck,
as no further readiness events will be forthcoming. That's all.
- Dan

2002-02-04 04:31:37

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Sun, 3 Feb 2002, Dan Kegel wrote:
>
> But do you remember that this fd is ready until EWOULDBLOCK?
> i.e. if you're notified that an fd is ready, and then you
> don't for whatever reason continue to do I/O on it until EWOULDBLOCK,
> you'll never ever be notified that it's ready again.
> If your code assumes that it will be notified again anyway,
> as with poll(), it will be sorely disappointed.

Yeah that was the problem and I figured out how to work around it in the
code. If you are interested I can point out the code we have been working
with.

Regards,

Aaron

2002-02-04 04:41:00

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Sun, 3 Feb 2002, Dan Kegel wrote:

> Kev wrote:
> > If that's true, I confess I can't quite see your point even still. Once
> > the event is generated, ircd should read or write as much as it can, then
> > not pay any attention to the socket until readiness is again signaled by
> > the generation of an event. Sorry if I'm being dense here...
>
> If you actually do read or write *until an EWOULDBLOCK*, no problem.
> If your code has a path where it fails to do so, it will get stuck,
> as no further readiness events will be forthcoming. That's all.

It seems kind of odd, at first, but it does make sense in a inverted sort
of way. Basically you aren't going to get any signals from the kernel
until the EWOULDBLOCK state clears. Consider what would happen if you
received a signal every time you could, say send. Your process would be
flooded with signals, which of course wouldn't work. If you want to take
a look at the Hybrid-7 cvs tree, let me know and I can give you a copy of
it. I just got the sigio stuff working correctly in their.

Regards,

Aaron

2002-02-04 05:10:58

by Kev

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

> > If I'm reading Poller_sigio::waitForEvents correctly, the rtsig stuff at
> > least tries to return a list of which sockets have become ready, and your
> > implementation falls back to some other interface when the signal queue
> > overflows. It also seems to extract what state the socket's in at that
> > point.
> >
> > If that's true, I confess I can't quite see your point even still. Once
> > the event is generated, ircd should read or write as much as it can, then
> > not pay any attention to the socket until readiness is again signaled by
> > the generation of an event. Sorry if I'm being dense here...
>
> If you actually do read or write *until an EWOULDBLOCK*, no problem.
> If your code has a path where it fails to do so, it will get stuck,
> as no further readiness events will be forthcoming. That's all.

Ah ha! And you may indeed have a point there...
--
Kevin L. Mitchell <[email protected]>

2002-02-04 05:30:03

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

Aaron Sethman wrote:
>
> On Sun, 3 Feb 2002, Dan Kegel wrote:
> >
> > But do you remember that this fd is ready until EWOULDBLOCK?
> > i.e. if you're notified that an fd is ready, and then you
> > don't for whatever reason continue to do I/O on it until EWOULDBLOCK,
> > you'll never ever be notified that it's ready again.
> > If your code assumes that it will be notified again anyway,
> > as with poll(), it will be sorely disappointed.
>
> Yeah that was the problem and I figured out how to work around it in the
> code. If you are interested I can point out the code we have been working
> with.

Yes, I would like to see it; is it part of the mainline undernet ircd cvs tree?
- Dan

2002-02-04 05:36:44

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Sun, 3 Feb 2002, Dan Kegel wrote:

> Aaron Sethman wrote:
> >
> > On Sun, 3 Feb 2002, Dan Kegel wrote:
> > >
> > > But do you remember that this fd is ready until EWOULDBLOCK?
> > > i.e. if you're notified that an fd is ready, and then you
> > > don't for whatever reason continue to do I/O on it until EWOULDBLOCK,
> > > you'll never ever be notified that it's ready again.
> > > If your code assumes that it will be notified again anyway,
> > > as with poll(), it will be sorely disappointed.
> >
> > Yeah that was the problem and I figured out how to work around it in the
> > code. If you are interested I can point out the code we have been working
> > with.
>
> Yes, I would like to see it; is it part of the mainline undernet ircd cvs tree?

This is part of the Hybrid ircd tree I've been talking about.
http://squeaker.ratbox.org/ircd-hybrid-7.tar.gz has the latest snapshot of
the tree. Look at src/s_bsd_sigio.c for the sigio code.

Regards,

Aaron

2002-02-04 06:06:57

by Daniel Phillips

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On February 4, 2002 01:59 am, Aaron Sethman wrote:
> On Sun, 3 Feb 2002, Dan Kegel wrote:
>
> > Kev wrote:
> > >
> > > > The /dev/epoll patch is good, but the interface is different enough
> > > > from /dev/poll that ircd would need a new engine_epoll.c anyway.
> > > > (It would look like a cross between engine_devpoll.c and engine_rtsig.c,
> > > > as it would need to be notified by os_linux.c of any EWOULDBLOCK return values.
> > > > Both rtsigs and /dev/epoll only provide 'I just became ready' notification,
> > > > but no 'I'm not ready anymore' notification.)
> > >
> > > I don't understand what it is you're saying here. The ircu server uses
> > > non-blocking sockets, and has since long before EfNet and Undernet branched,
> > > so it already handles EWOULDBLOCK or EAGAIN intelligently, as far as I know.
> >
> > Right. poll() and Solaris /dev/poll are programmer-friendly; they give
> > you the current readiness status for each socket. ircu handles them fine.
>
> I would have to agree with this comment. Hybrid-ircd deals with poll()
> and /dev/poll just fine. We have attempted to make it use rtsig, but it
> just doesn't seem to agree with the i/o model we are using, which btw, is
> the same model that Squid (is/will be?) using. I haven't played with
> /dev/epoll yet, but I pray it is nothing like rtsig.
>
> Basically what we need is, something like poll() but not so nasty.
> /dev/poll is okay, but its a hack. The best thing I've seen so far, but
> it too seems to take the idea so far is FreeBSD's kqueue stuff(which
> Hybrid-ircd handles quite nicely).

In an effort to somehow control the mushrooming number of IO interface
strategies, why not take a look at the work Ben and Suparna are doing in aio,
and see if there's an interface mechanism there that can be repurposed?

Surparna's writeup, for quick orientation:

http://lse.sourceforge.net/io/bionotes.txt

--
Daniel

2002-02-04 06:19:41

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Mon, 4 Feb 2002, Daniel Phillips wrote:
> In an effort to somehow control the mushrooming number of IO interface
> strategies, why not take a look at the work Ben and Suparna are doing in aio,
> and see if there's an interface mechanism there that can be repurposed?

When AIO no longer sucks on pretty much every platform on the face of the
planet I think people will reconsider. In the mean time, we've got to
deal with that is there. That leaves us writing for at least 6 very
similiar, I/O models with varying attributes.

Regards,

Aaron

2002-02-04 06:24:51

by Daniel Phillips

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On February 4, 2002 07:26 am, Aaron Sethman wrote:
> On Mon, 4 Feb 2002, Daniel Phillips wrote:
> > In an effort to somehow control the mushrooming number of IO interface
> > strategies, why not take a look at the work Ben and Suparna are doing in aio,
> > and see if there's an interface mechanism there that can be repurposed?
>
> When AIO no longer sucks on pretty much every platform on the face of the
> planet I think people will reconsider.

What is the hang, as you see it?

> In the mean time, we've got to
> deal with that is there. That leaves us writing for at least 6 very
> similiar, I/O models with varying attributes.

This is really an unfortunate situation.

--
Daniel

2002-02-04 06:32:24

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMPnetwork performance

On Mon, 4 Feb 2002, Daniel Phillips wrote:

> On February 4, 2002 07:26 am, Aaron Sethman wrote:
> > On Mon, 4 Feb 2002, Daniel Phillips wrote:
> > > In an effort to somehow control the mushrooming number of IO interface
> > > strategies, why not take a look at the work Ben and Suparna are doing in aio,
> > > and see if there's an interface mechanism there that can be repurposed?
> >
> > When AIO no longer sucks on pretty much every platform on the face of the
> > planet I think people will reconsider.
>
> What is the hang, as you see it?
Well on many platforms its implemented via pthreads, which in general
isn't terribly acceptable when you need to deal with 5000 connections in
one process. I would like to see something useful that works well, and
performs well. I think the FreeBSD guys had the right idea with their
kqueue interface, shame they couldn't have written it around the posix aio
interface. But I suppose it would be trivial to write a wrapper around
it.

But the real issue is, that the standard interfaces, select() and poll()
are inadequate in the face of current requirements. Posix AIO seems like
its heading down the right path, but it just isn't ready in any mature
implementation yet, thus pushing people away from it, making the problem
worse.

> > In the mean time, we've got to
> > deal with that is there. That leaves us writing for at least 6 very
> > similiar, I/O models with varying attributes.
>
> This is really an unfortunate situation.

I agree with you 150% on that statement. Lots of wasted time reinventing
tires for the latest and greatest wheel.

Regards,

Aaron

2002-02-04 14:58:27

by Darren Smith

[permalink] [raw]

Subject: RE: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

Hi

I've been testing the modified Undernet (2.10.10) code with Vincent
Sweeney based on the simple usleep(100000) addition to s_bsd.c

PRI NICE SIZE RES STATE C TIME WCPU CPU | # USERS
2 0 96348K 96144K poll 0 29.0H 39.01% 39.01% | 1700 <- Without
Patch
10 0 77584K 77336K nanslp 0 7:08 5.71% 5.71% | 1500 <- With
Patch

Spot the difference!

It doesn't appear to be lagging, yet is using 1/7th the cpu!

Anyone else tried this?

Regards

Darren Smith

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Andrew Morton
Sent: 03 February 2002 08:36
To: Dan Kegel
Cc: Vincent Sweeney; [email protected];
[email protected]; Kevin L. Mitchell
Subject: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network
performance

Dan Kegel wrote:
>
> Before I did any work, I'd measure CPU
> usage under a simulated load of 2000 clients, just to verify that
> poll() was indeed a bottleneck (ok, can't imagine it not being a
> bottleneck, but it's nice to have a baseline to compare the improved
> version against).

I half-did this earlier in the week. It seems that Vincent's
machine is calling poll() maybe 100 times/second. Each call
is taking maybe 10 milliseconds, and is returning approximately
one measly little packet.

select and poll suck for thousands of fds. Always did, always
will. Applications need to work around this.

And the workaround is rather simple:

....
+ usleep(100000);
poll(...);

This will add up to 0.1 seconds latency, but it means that
the poll will gather activity on ten times as many fds,
and that it will be called ten times less often, and that
CPU load will fall by a factor of ten.

This seems an appropriate hack for an IRC server. I guess it
could be souped up a bit:

usleep(nr_fds * 50);

-

2002-02-04 17:34:33

[permalink] [raw]

Subject: RE: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

On Mon, 4 Feb 2002, Darren Smith wrote:

> Hi
>
> I've been testing the modified Undernet (2.10.10) code with Vincent
> Sweeney based on the simple usleep(100000) addition to s_bsd.c
>
> PRI NICE SIZE RES STATE C TIME WCPU CPU | # USERS
> 2 0 96348K 96144K poll 0 29.0H 39.01% 39.01% | 1700 <- Without
> Patch
> 10 0 77584K 77336K nanslp 0 7:08 5.71% 5.71% | 1500 <- With
> Patch
Were you not putting a delay argument into poll(), or perhaps not letting
it delay long enough? If you just do poll with a timeout of 0, its going
to suck lots of cpu.

Regards,

Aaron

2002-02-04 18:11:38

by Darren Smith

[permalink] [raw]

Subject: RE: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

I mean I added a usleep() before the poll in s_bsd.c for the undernet
2.10.10 code.

timeout = (IRCD_MIN(delay2, delay)) * 1000;
+ usleep(100000); <- New Line
nfds = poll(poll_fds, pfd_count, timeout);

And now we're using 1/8th the cpu! With no noticeable effects.

Regards

Darren.

-----Original Message-----
From: Aaron Sethman [mailto:[email protected]]
Sent: 04 February 2002 17:41
To: Darren Smith
Cc: 'Andrew Morton'; 'Dan Kegel'; 'Vincent Sweeney';
[email protected]; [email protected]; 'Kevin L.
Mitchell'
Subject: RE: [Coder-Com] Re: PROBLEM: high system usage / poor SMP
network performance

On Mon, 4 Feb 2002, Darren Smith wrote:

> Hi
>
> I've been testing the modified Undernet (2.10.10) code with Vincent
> Sweeney based on the simple usleep(100000) addition to s_bsd.c
>
> PRI NICE SIZE RES STATE C TIME WCPU CPU | # USERS
> 2 0 96348K 96144K poll 0 29.0H 39.01% 39.01% | 1700 <- Without
> Patch
> 10 0 77584K 77336K nanslp 0 7:08 5.71% 5.71% | 1500 <- With
> Patch
Were you not putting a delay argument into poll(), or perhaps not
letting
it delay long enough? If you just do poll with a timeout of 0, its
going
to suck lots of cpu.

Regards,

Aaron

2002-02-04 18:23:58

[permalink] [raw]

Subject: RE: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

On Mon, 4 Feb 2002, Darren Smith wrote:

> I mean I added a usleep() before the poll in s_bsd.c for the undernet
> 2.10.10 code.
>
> timeout = (IRCD_MIN(delay2, delay)) * 1000;
> + usleep(100000); <- New Line
> nfds = poll(poll_fds, pfd_count, timeout);
Why not just add the additional delay into the poll() timeout? It just
seems like you were not doing enough of a delay in poll().

Regards,

Aaron

2002-02-04 18:49:10

by Kev

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

> > I mean I added a usleep() before the poll in s_bsd.c for the undernet
> > 2.10.10 code.
> >
> > timeout = (IRCD_MIN(delay2, delay)) * 1000;
> > + usleep(100000); <- New Line
> > nfds = poll(poll_fds, pfd_count, timeout);
> Why not just add the additional delay into the poll() timeout? It just
> seems like you were not doing enough of a delay in poll().

Wouldn't have the effect. The original point was that adding the usleep()
gives some time for some more file descriptors to become ready before calling
poll(), thus increasing the number of file descriptors poll() can return
per system call. Adding the time to timeout would have no effect.
--
Kevin L. Mitchell <[email protected]>

2002-02-04 18:52:32

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

On Mon, 4 Feb 2002, Kev wrote:
> Wouldn't have the effect. The original point was that adding the usleep()
> gives some time for some more file descriptors to become ready before calling
> poll(), thus increasing the number of file descriptors poll() can return
> per system call. Adding the time to timeout would have no effect.

My fault, I'm not thinking straight today. I don't believe I've had my
daily allowance of caffine yet.

Regards,

Aaron

2002-02-04 18:54:50

by Doug McNaught

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

Aaron Sethman <[email protected]> writes:

> On Mon, 4 Feb 2002, Darren Smith wrote:
>
> > I mean I added a usleep() before the poll in s_bsd.c for the undernet
> > 2.10.10 code.

> Why not just add the additional delay into the poll() timeout? It just
> seems like you were not doing enough of a delay in poll().

No, because the poll() delay only has an effect if there are no
readable fd's. What the usleep() does is allow time for more fd's to
become readable/writeable before poll() is called, spreading the
poll() overhead over more actual work.

-Doug
--
Let us cross over the river, and rest under the shade of the trees.
--T. J. Jackson, 1863

2002-02-08 22:18:08

by James Antill

[permalink] [raw]

Subject: Re: [Coder-Com] Re: PROBLEM: high system usage / poor SMP network performance

"Darren Smith" <[email protected]> writes:

> I mean I added a usleep() before the poll in s_bsd.c for the undernet
> 2.10.10 code.
>
> timeout = (IRCD_MIN(delay2, delay)) * 1000;
> + usleep(100000); <- New Line
> nfds = poll(poll_fds, pfd_count, timeout);
>
> And now we're using 1/8th the cpu! With no noticeable effects.

Note that something else you want to do is call poll() with a 0
timeout first (and if that doesn't return anything call again with the
timeout), this removes all the wait queue manipulation inside the
kernel when something is ready (most of the time).

--
# James Antill -- [email protected]
:0:
* ^From: .*james@and\.org
/dev/null

2002-02-12 18:48:51