I've got some early SPECWeb [*] results with 2.5.33 and TSO on e1000. I
get 2906 simultaneous connections, 99.2% conforming (i.e. faster than the
320 kbps cutoff), at 0% idle with TSO on. For comparison, with 2.5.25, I
got 2656, and with 2.5.29 I got 2662, (both 99+% conformance and 0% idle) so
TSO and 2.5.33 look like a Big Win.
I'm having trouble testing with TSO off (I changed the #define NETIF_F_TSO
to "0" in include/linux/netdevice.h to turn it off). I am getting errors.
NETDEV WATCHDOG: eth1: transmit timed out
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
Those adapter resets pushed my SPECWeb results with TSO off down below 2500
connections (it is only that one adapter, BTW), so these TSO-off results
shouldn't be considered valid.
eth1 is the only adapter with errors, and they all look like RX overruns.
For comparison:
eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:48621378 errors:8890 dropped:8890 overruns:8890 frame:0
TX packets:64342993 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:3637004554 (3468.5 Mb) TX bytes:1377740556 (1313.9 Mb)
Interrupt:61 Base address:0x1200 Memory:fc020000-0
eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:37130540 errors:0 dropped:0 overruns:0 frame:0
TX packets:49061277 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2774988658 (2646.4 Mb) TX bytes:3290541711 (3138.1 Mb)
Interrupt:44 Base address:0x2040 Memory:fe120000-0
I'm still working on getting a clean run with TSO off. If anyone has any
ideas for me about the timeout errors, I'd appreciate the clue.
Thanks,
- Troy
* SPEC(tm) and the benchmark name SPECweb(tm) are registered
trademarks of the Standard Performance Evaluation Corporation.
This benchmarking was performed for research purposes only,
and is non-compliant, with the following deviations from the
rules -
1 - It was run on hardware that does not meet the SPEC
availability-to-the-public criteria. The machine is
an engineering sample.
2 - access_log wasn't kept for full accounting. It was
being written, but deleted every 200 seconds.
Troy Wilson wrote:
> I've got some early SPECWeb [*] results with 2.5.33 and TSO
> on e1000. I get 2906 simultaneous connections, 99.2%
> conforming (i.e. faster than the 320 kbps cutoff), at 0% idle
> with TSO on. For comparison, with 2.5.25, I
> got 2656, and with 2.5.29 I got 2662, (both 99+% conformance
> and 0% idle) so TSO and 2.5.33 look like a Big Win.
A 10% bump is good. Thanks for running the numbers.
> I'm having trouble testing with TSO off (I changed the
> #define NETIF_F_TSO to "0" in include/linux/netdevice.h to
> turn it off). I am getting errors.
Sorry, I should have made a CONFIG switch. Just hack the driver for now to
turn it off:
--- linux-2.5/drivers/net/e1000/e1000_main.c	Fri Aug 30 19:26:57 2002
+++ linux-2.5-no_tso/drivers/net/e1000/e1000_main.c	Thu Sep  5 13:38:44 2002
@@ -428,9 +428,11 @@ e1000_probe(struct pci_dev *pdev,
 	}
 
 #ifdef NETIF_F_TSO
+#if 0
 	if(adapter->hw.mac_type >= e1000_82544)
 		netdev->features |= NETIF_F_TSO;
 #endif
+#endif
 
 	if(pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
-scott
Hey, thanks for crossposting to netdev
So if I understood correctly (looking at the Intel site), the main value
add of this feature is probably in having the CPU avoid reassembling and
retransmitting. I am willing to bet that the real value in your results is
in saving on retransmits; I would think shoving the data down the NIC
and avoiding the fragmentation shouldn't give you that much of a CPU
saving. Do you have any stats from the hardware that could show
retransmits etc.? Have you tested this with zero copy as well (sendfile)?
Again, if I am right you shouldn't see much benefit from that either.
I would think it probably works well with things like partial ACKs too?
(I am almost sure it does or someone needs to be spanked, so just
checking).
cheers,
jamal
> So if i understood correctly (looking at the intel site) the main value
> add of this feature is probably in having the CPU avoid reassembling and
> retransmitting.
Quoting David S. Miller:
dsm> The performance improvement comes from the fact that the card
dsm> is given huge 64K packets, then the card (using the given ip/tcp
dsm> headers as a template) spits out 1500 byte mtu sized packets.
dsm>
dsm> Less data DMA'd to the device per normal-mtu packet and less
dsm> per-packet data structure work by the cpu is where the improvement
dsm> comes from.
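As a concept sketch of that header templating (a userspace illustration only,
with invented names; not the e1000 driver or firmware code):

#include <stdint.h>
#include <string.h>

#define MTU_PAYLOAD 1448   /* 1500-byte MTU minus 20-byte IP and 32-byte TCP headers */
#define HDR_LEN     66     /* Ethernet + IP + TCP header template */

/*
 * Conceptual model only: the stack hands the card one large buffer plus a
 * single header template; the card replays the template for every MTU-sized
 * chunk, patching the IP id/length, TCP sequence number and checksums as it
 * goes.  This sketch shows just the slicing arithmetic.
 */
static int tso_segment(const uint8_t *tmpl_hdr, const uint8_t *payload,
                       size_t len, uint32_t seq,
                       void (*emit)(const uint8_t *hdr, const uint8_t *data,
                                    size_t n))
{
    int frames = 0;

    while (len > 0) {
        size_t chunk = len > MTU_PAYLOAD ? MTU_PAYLOAD : len;
        uint8_t hdr[HDR_LEN];

        memcpy(hdr, tmpl_hdr, HDR_LEN);
        /* A real device would patch seq, ip->tot_len, ip->id and recompute
         * the checksums here before putting the frame on the wire. */
        emit(hdr, payload, chunk);

        payload += chunk;
        seq     += chunk;
        len     -= chunk;
        frames++;
    }
    return frames;    /* a 64K send becomes ~46 wire frames from one DMA */
}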
> Do you have any stats from the hardware that could show
> retransmits etc;
I'll gather netstat -s after runs with and without TSO enabled.
Anything else you'd like to see?
> have you tested this with zero copy as well (sendfile)
Yes. My webserver is Apache 2.0.36, which uses sendfile for anything
over 8k in size. But, iirc, Apache sends the http headers using writev.
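For reference, the split described above looks roughly like this (a minimal
sketch of the writev-headers-plus-sendfile-body pattern, not Apache's actual
code):

#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send an HTTP response roughly the way Apache 2.0 does for static files:
 * the small header block via writev(), the file body via sendfile(), which
 * is the zero-copy path on the server side. */
static int send_response(int sock, const char *path, const char *hdrs)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
        goto fail;

    struct iovec iov = { (void *)hdrs, strlen(hdrs) };
    if (writev(sock, &iov, 1) < 0)          /* headers: ordinary copy */
        goto fail;

    off_t off = 0;
    while (off < st.st_size)                /* body: zero-copy sendfile */
        if (sendfile(sock, fd, &off, st.st_size - off) <= 0)
            goto fail;

    close(fd);
    return 0;
fail:
    close(fd);
    return -1;
}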
Thanks,
- Troy
Quoting Troy Wilson <[email protected]>:
> > Do you have any stats from the hardware that could show
> > retransmits etc;
>
> I'll gather netstat -s after runs with and without TSO enabled.
> Anything else you'd like to see?
Troy, this is pointing out the obvious, but make sure
you have the before stats as well :)...
> > have you tested this with zero copy as well (sendfile)
>
> Yes. My webserver is Apache 2.0.36, which uses sendfile for
> anything
> over 8k in size. But, iirc, Apache sends the http headers using
> writev.
SpecWeb99 doesnt execute the path that might benefit the
most from this patch - sendmsg() of large files - large writes
going down..
thanks,
Nivedita
Quoting jamal <[email protected]>:
> So if i understood correctly (looking at the intel site) the main
> value add of this feature is probably in having the CPU avoid
> reassembling and retransmitting. I am willing to bet that the real
Er, even just assembling and transmitting? I'm thinking of the
reduction in things like separate memory allocation calls and looking
up the route, etc..??
> value in your results is in saving on retransmits; I would think
> shoving the data down the NIC and avoid the fragmentation shouldnt
> give you that much significant CPU savings. Do you have any stats
Why do say that? Wouldnt the fact that youre now reducing the
number of calls down the stack by a significant number provide
a significant saving?
> from the hardware that could show retransmits etc; have you tested
> this with zero copy as well (sendfile) again, if i am right you
> shouldnt see much benefit from that either?
thanks,
Nivedita
Nivedita Singhvi wrote:
> SpecWeb99 doesnt execute the path that might benefit the
> most from this patch - sendmsg() of large files - large writes
> going down..
For those of you who don't know SPECweb well, the average size of a request
is about 14.5 kB. Each directory in the file set totals roughly 5 MB, but the
largest individual files top out at just under a meg.
--
Dave Hansen
[email protected]
On Thu, 5 Sep 2002, Nivedita Singhvi wrote:
>
> > value in your results is in saving on retransmits; I would think
> > shoving the data down the NIC and avoid the fragmentation shouldnt
> > give you that much significant CPU savings. Do you have any stats
>
> Why do say that? Wouldnt the fact that youre now reducing the
> number of calls down the stack by a significant number provide
> a significant saving?
I am not sure; if he gets a busy system in a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is sufficient savings if it doesn't
involve a lot of retransmits. I am also wondering how smart this NIC
is in doing the retransmits; for example, I have doubts whether this idea is
brilliant to begin with; does it handle SACKs for example? What about
the algorithm du jour -- would you have to upgrade the NIC or can it be
taught some new tricks, etc. etc.?
[also I can see why it makes sense to use this feature only with sendfile;
it's pretty much useless for interactive apps]
Troy, I am not interested in the netstat -s data, rather the TCP stats
this NIC has exposed. Unless those somehow show up magically in netstat.
cheers,
jamal
Quoting jamal <[email protected]>:
> I am not sure; if he gets a busy system in a congested network, I
> can see the offloading savings, i.e. I am not sure the amortization
> of the calls away from the CPU is sufficient savings if it
> doesn't involve a lot of retransmits. I am also wondering how smart
> this NIC is in doing the retransmits; for example, I have doubts whether
> this idea is brilliant to begin with; does it handle SACKs for example?
do you mean sack data being sent as a tcp option?
dont know, lots of other questions arise (like timestamp
on all the segments would be the same?).
> Troy, I am not interested in the netstat -s data, rather the TCP
> stats this NIC has exposed. Unless those somehow show up magically
> in netstat.
most recent (dont know how far back) versions of netstat
display /proc/net/snmp and /proc/net/netstat (with the
Linux TCP MIB), so netstat -s should show you most of
whats interesting. Or were you referring to something else?
ifconfig -a and netstat -rn would also be nice to have..
thanks,
Nivedita
From: jamal <[email protected]>
Date: Thu, 5 Sep 2002 16:59:47 -0400 (EDT)
I would think shoving the data down the NIC
and avoid the fragmentation shouldnt give you that much significant
CPU savings.
It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
is limited by the DMA throughput of the PCI host controller. In
particular some controllers are limited to smaller DMA bursts to
work around hardware bugs.
Ie. the headers that don't need to go across the bus are the critical
resource saved by TSO.
I think I've said this a million times, perhaps the next person who
tries to figure out where the gains come from can just reply with
a pointer to a URL of this email I'm typing right now :-)
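Back-of-the-envelope arithmetic for that (illustrative numbers, not
measurements):

#include <stdio.h>

/* Rough per-send DMA cost with and without TSO, counting only the protocol
 * headers that cross the PCI bus (descriptor traffic ignored).  Numbers are
 * illustrative, not measured data. */
int main(void)
{
    const int send_bytes = 64 * 1024;          /* one TSO super-packet    */
    const int mss        = 1448;               /* 1500 MTU, timestamps on */
    const int hdr        = 14 + 20 + 32;       /* eth + ip + tcp w/ opts  */

    int segs = (send_bytes + mss - 1) / mss;   /* ~46 wire packets        */

    printf("headers DMA'd without TSO: %d bytes\n", segs * hdr);
    printf("headers DMA'd with TSO:    %d bytes\n", hdr);
    printf("plus %d fewer descriptor/doorbell round trips\n", segs - 1);
    return 0;
}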
From: jamal <[email protected]>
Date: Thu, 5 Sep 2002 21:47:35 -0400 (EDT)
I am not sure; if he gets a busy system in a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is sufficient savings if it doesn't
involve a lot of retransmits. I am also wondering how smart this NIC
is in doing the retransmits; for example, I have doubts whether this idea is
brilliant to begin with; does it handle SACKs for example? What about
the algorithm du jour -- would you have to upgrade the NIC or can it be
taught some new tricks, etc. etc.?
[also I can see why it makes sense to use this feature only with sendfile;
it's pretty much useless for interactive apps]
Troy, I am not interested in the netstat -s data, rather the TCP stats
this NIC has exposed. Unless those somehow show up magically in netstat.
There are no retransmits happening, the card does not analyze
activity on the TCP connection to retransmit things itself
it's just a simple header templating facility.
Read my other emails about where the benefits come from.
In fact, when a connection is sick (i.e. retransmits and SACKs occur)
we disable TSO completely for that socket.
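Conceptually, the transmit path keeps a per-connection gate like the
following (field and helper names are invented for illustration; this is not
the 2.5 source):

/* Conceptual gate in the transmit path: only build a TSO super-packet when
 * the output device supports it and the connection is healthy.  Field and
 * helper names are invented for illustration. */
struct conn_state {
    int dev_can_tso;      /* NETIF_F_TSO advertised by the output device */
    int retrans_out;      /* segments currently being retransmitted      */
    int sacked_out;       /* segments SACKed by the peer                 */
};

static int may_use_tso(const struct conn_state *c)
{
    if (!c->dev_can_tso)
        return 0;
    if (c->retrans_out || c->sacked_out)   /* "sick" connection: fall back */
        return 0;                          /*  to normal MTU-sized sends   */
    return 1;
}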
From: Nivedita Singhvi <[email protected]>
Date: Thu, 5 Sep 2002 20:38:10 -0700
most recent (dont know how far back) versions of netstat
display /proc/net/snmp and /proc/net/netstat (with the
Linux TCP MIB), so netstat -s should show you most of
whats interesting. Or were you referring to something else?
ifconfig -a and netstat -rn would also be nice to have..
TSO gets turned off during retransmits/SACK and the card does not do
retransmits.
Can we move on in this conversation now? :-)
From: Nivedita Singhvi <[email protected]>
Date: Thu, 5 Sep 2002 21:20:47 -0700
Sure :). The motivation for seeing the stats though would
be to get an idea of how much retransmission/SACK etc
activity _is_ occurring during Troy's SpecWeb runs, which
would give us an idea of how often we're actually doing
segmentation offload, and better idea of how much gain
its possible to further get from this(ahem) DMA coalescing :).
Some of Troy's early runs had a very large number of
packets dropped by the card.
One thing to do is make absolutely sure that flow control is
enabled and supported by all devices on the link from the
client to the test SPECweb server.
Troy, can you do that for us along with the statistics
dumps?
Thanks.
Quoting "David S. Miller" <[email protected]>:
> > ifconfig -a and netstat -rn would also be nice to have..
>
> TSO gets turned off during retransmits/SACK and the card does not
> do
> retransmits.
>
> Can we move on in this conversation now? :-)
Sure :). The motivation for seeing the stats though would
be to get an idea of how much retransmission/SACK etc
activity _is_ occurring during Troy's SpecWeb runs, which
would give us an idea of how often we're actually doing
segmentation offload, and better idea of how much gain
its possible to further get from this(ahem) DMA coalescing :).
Some of Troy's early runs had a very large number of
packets dropped by the card.
thanks,
Nivedita
> I would think shoving the data down the NIC
> and avoid the fragmentation shouldnt give you that much significant
> CPU savings.
>
> It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> is limited by the DMA throughput of the PCI host controller. In
> particular some controllers are limited to smaller DMA bursts to
> work around hardware bugs.
>
> Ie. the headers that don't need to go across the bus are the critical
> resource saved by TSO.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there's 4 - 8 gigabit ethernet cards in this beast spread around
different buses (Troy - are we still just using 4? ... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon, but still, this is Apache,
not Tux. You mentioned CPU load as another advantage of TSO ...
anything we've done to reduce CPU load enables us to run more and
more connections (I think we started at about 260 or something, so
2900 ain't too bad ;-)).
Just to throw another firework into the fire whilst people are
awake, NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of
our interrupt load problems ... seems that half the machine goes
idle, the number of simultaneous connections drop way down, and
everything's blocked on ... something ... not sure what ;-)
Any guesses at why, or ways to debug this?
M.
PS. Anyone else running NAPI on SMP? (ideally at least 4-way?)
From: "Martin J. Bligh" <[email protected]>
Date: Thu, 05 Sep 2002 23:48:42 -0700
Just to throw another firework into the fire whilst people are
awake, NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of
our interrupt load problems ...
Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
NAPI is also not the panacea to all problems in the world.
I bet your greatest gain would be obtained from going to Tux,
using appropriate IRQ affinity settings, and making sure
Tux threads bind to the same CPU as the device where they accept
connections.
It is the standard method for obtaining peak SPECweb performance.
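The usual recipe looks something like this sketch (the IRQ number 61 is just
eth1's from the ifconfig output earlier in the thread, and sched_setaffinity()
is shown with the later glibc interface, so treat it as illustrative only):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Steer a NIC's interrupts to one CPU and pin the serving process to the
 * same CPU so the accept/transmit work stays cache-hot on that processor. */
static int bind_irq_and_self(int irq, int cpu)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", 1u << cpu);    /* hex CPU mask, e.g. 2 for CPU 1 */
    fclose(f);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);    /* 0 = this task */
}

int main(void)
{
    if (bind_irq_and_self(61, 1) < 0) {    /* eth1's IRQ 61 -> CPU 1 */
        perror("bind_irq_and_self");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}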
"David S. Miller" wrote:
>
> ...
>
> NAPI is also not the panacea to all problems in the world.
>
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.
http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
From: Andrew Morton <[email protected]>
Date: Fri, 06 Sep 2002 00:36:04 -0700
"David S. Miller" wrote:
> NAPI is also not the panacea to all problems in the world.
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.
http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
Unfortunately it is not listed what e1000 and core NAPI
patch was used. Also, not listed, are the RX/TX mitigation
and ring sizes given to the kernel module upon loading.
Robert can comment on optimal settings
Robert and Jamal can make a more detailed analysis of Mala's
graphs than I.
On Fri, 6 Sep 2002, David S. Miller wrote:
> Mala did some testing on this a couple of weeks back. It appears that
> NAPI damaged performance significantly.
>
> http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
>
> Unfortunately it is not listed what e1000 and core NAPI
> patch was used. Also, not listed, are the RX/TX mitigation
> and ring sizes given to the kernel module upon loading.
>
> Robert can comment on optimal settings
>
> Robert and Jamal can make a more detailed analysis of Mala's
> graphs than I.
I looked at those graphs, but the lack of information makes them useless.
For example, there are too many variables to the tests -- what is the
effect of the message size? And then look at the socket buffer size: would
you set it to 64K if you are trying to show performance numbers? What
other TCP settings are there?
Manfred Spraul about a year back complained about some performance issues
in low-load setups (which is what this IBM setup seems to be if you count
the pps to the server); it's one of those things that have been low in
the TODO deck.
The issue may be legit, not because NAPI is bad but because it is too good.
I don't have the e1000, but I have some D-Link gige cards still in boxes and I
have a two-CPU SMP machine; I'll set up the testing this weekend.
In the case of Manfred, we couldn't reproduce the tests because he had this
odd, weird NIC; in this case at least access to the e1000 doesn't require
a visit to the museum.
cheers,
jamal
David S. Miller writes:
> Mala did some testing on this a couple of weeks back. It appears that
> NAPI damaged performance significantly.
>
> http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
>
> Robert can comment on optimal settings
Hopefully yes...
I see other numbers so we have to sort out the differences. Andrew Morton
pinged me about this test last week. So I've had a chance to run some tests.
Some comments:
Scaling to CPU can be a dangerous measure with NAPI due to its adaptive
behaviour, where RX interrupts decrease in favour of successive polls.
And the NAPI scheme behaves differently since we cannot assume that all network
traffic is well-behaved like TCP. The system has to be manageable and to "perform"
under any network load, not only for well-behaved TCP. So of course we will
see some differences -- there is no free lunch. Simply put, we cannot blindly
look at one test. IMO NAPI is the best overall performer. The numbers speak
for themselves.
Here is the most recent test...
NAPI kernel path is included in 2.4.20-pre4 the comparison below is mainly
between e1000 driver w and w/o NAPI and the NAPI port to e1000 is still
evolving.
Linux 2.4.20-pre4/UP PIII @ 933 MHz w. Intel's e1000 2-port GigE adapter.
e1000 4.3.2-k1 (current kernel version) and current NAPI patch. For NAPI the
e1000 driver uses RxIntDelay=1; RxIntDelay=0 caused problems. The non-NAPI
driver uses RxIntDelay=64 (the default).
Three tests: TCP, UDP, packet forwarding.
Netperf. TCP socket size 131070, Single TCP stream. Test length 30 s.
M-size e1000 NAPI-e1000
============================
4 20.74 20.69 Mbit/s data received.
128 458.14 465.26
512 836.40 846.71
1024 936.11 937.93
2048 940.65 939.92
4096 940.86 937.59
8192 940.87 939.95
16384 940.88 937.61
32768 940.89 939.92
65536 940.90 939.48
131070 940.84 939.74
Netperf. UDP_STREAM. 1440 pkts. Single UDP stream. Test length 30 s.
e1000 NAPI-e1000
====================================
955.7 955.7 Mbit/s data received.
Forwarding test. 1 Mpkts injected at 970 kpps.
        e1000    NAPI-e1000
=============================================
T-put   305      298580    pkts routed.
NOTE!
With the non-NAPI driver this system is "dead" and performs nothing.
Cheers.
--ro
> Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
>
> NAPI is also not the panacea to all problems in the world.
No, but I didn't expect throughput to drop by 40% or so either,
which is (very roughly) what happened. Interrupts are a pain to
manage and do affinity with, so NAPI should (at least in theory)
be better for this kind of setup ... I think.
> I bet your greatest gain would be obtained from going to Tux
> and using appropriate IRQ affinity settings and making sure
> Tux threads bind to same cpu as device where they accept
> connections.
>
> It is standard method to obtain peak specweb performance.
Ah, but that's not really our goal - what we're trying to do is
use specweb as a tool to simulate a semi-realistic customer
workload, and improve the Linux kernel performance, using that
as our yardstick for measuring ourselves. For that I like the
setup we have reasonably well, even though it won't get us the
best numbers.
To get the best benchmark numbers, you're absolutely right though.
M.
> And the NAPI scheme behaves differently since we cannot assume that all network
> traffic is well-behaved like TCP. The system has to be manageable and to "perform"
> under any network load, not only for well-behaved TCP. So of course we will
> see some differences -- there is no free lunch. Simply put, we cannot blindly
> look at one test. IMO NAPI is the best overall performer. The numbers speak
> for themselves.
I don't doubt it's a win for most cases, we just want to reap the benefit
for the large SMP systems as well ... the fundamental mechanism seems
very scalable to me, we probably just need to do a little tuning?
> NAPI kernel path is included in 2.4.20-pre4 the comparison below is mainly
> between e1000 driver w and w/o NAPI and the NAPI port to e1000 is still
> evolving.
We are running from 2.5.latest ... any updates needed for NAPI for the
driver in the current 2.5 tree, or is that OK?
Thanks,
Martin.
Martin J. Bligh wrote:
> Just to throw another firework into the fire whilst people are
> awake, NAPI does not seem to scale to this sort of load, which
> was disappointing, as we were hoping it would solve some of
> our interrupt load problems ... seems that half the machine goes
> idle, the number of simultaneous connections drop way down, and
> everything's blocked on ... something ... not sure what ;-)
> Any guesses at why, or ways to debug this?
I thought that I already tried to explain this to you. (although it could
have been on one of those too-much-coffee-days :)
Something strange happens to the clients when NAPI is enabled on the
SPECweb server. Somehow they start using a lot more CPU. The increased
idle time on the server is because the _clients_ are CPU maxed. I have
some preliminary oprofile data for the clients, but it appears that this is
another case of Specweb code just really sucking.
The real question is why NAPI causes so much more work for the client. I'm
not convinced that it is much, much greater, because I believe that I was
already at the edge of the cliff with my clients and NAPI just gave them a
little shove :). Specweb also takes a while to ramp up (even during the
real run), so sometimes it takes a few minutes to see the clients get
saturated.
--
Dave Hansen
[email protected]
Martin J. Bligh writes:
> We are running from 2.5.latest ... any updates needed for NAPI for the
> driver in the current 2.5 tree, or is that OK?
Should be OK. Get the latest kernel e1000 to get Intel's and the maintainer's
latest work and apply the e1000 NAPI patch. RH includes this patch?
And yes, there is plenty of room for improvement...
Cheers.
--ro
Martin J. Bligh wrote:
>>Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
>>
>>NAPI is also not the panacea to all problems in the world.
>
> No, but I didn't expect throughput to drop by 40% or so either,
> which is (very roughly) what happened. Interrupts are a pain to
> manage and do affinity with, so NAPI should (at least in theory)
> be better for this kind of setup ... I think.
No, no. Bad Martin! Throughput didn't drop, "Specweb compliance" dropped.
Those are two very, very different things. I've found that the server
can produce a lot more throughput, although it doesn't have the
characteristics that Specweb considers compliant. Just have Troy enable
mod-status and look at the throughput that Apache tells you that it is
giving during a run. _That_ is real throughput, not number of compliant
connections.
_And_ NAPI is for receive only, right? Also, my compliance drop occurs
with the NAPI checkbox disabled. There is something else in the new driver
that causes our problems.
--
Dave Hansen
[email protected]
> No, no. Bad Martin! Throughput didn't drop, "Specweb compliance"
> dropped. Those are two very, very different things. I've found
> that the server can produce a lot more throughput, although it
> doesn't have the characteristics that Specweb considers compliant.
> Just have Troy enable mod-status and look at the throughput that
> Apache tells you that it is giving during a run. _That_ is real
> throughput, not number of compliant connections.
By throughput I meant number of compliant connections, not bandwidth.
It may well be latency that's going out the window, rather than
bandwidth. Yes, I should use more precise terms ...
> _And_ NAPI is for receive only, right? Also, my compliance drop
> occurs with the NAPI checkbox disabled. There is something else
> in the new driver that causes our problems.
Not sure about that - I was told once that there were transmission
completion interrupts as well? What happens to those? Or am I
confused again ...
M.
> I thought that I already tried to explain this to you. (although
> it could have been on one of those too-much-coffee-days :)
You told me, but I'm far from convinced this is the problem. I think
it's more likely this is a side-effect of a server issue - something
like a lot of dropped packets and retransmits, though not necessarily
that.
> Something strange happens to the clients when NAPI is enabled on
> the SPECweb server. Somehow they start using a lot more CPU.
> The increased idle time on the server is because the _clients_ are
> CPU maxed. I have some preliminary oprofile data for the clients,
> but it appears that this is another case of Specweb code just
> really sucking.
Hmmm ... if you change something on the server, and all the clients
go wild, I'm suspicious of whatever you did to the server. You need
to have a lot more data before leaping to the conclusion that it's
because the specweb client code is crap.
Troy - I think your UP clients weren't anywhere near maxed out on
CPU power, right? Can you take a peek at the clients under NAPI load?
Dave - did you ever try running 4 specweb clients bound to each of
the 4 CPUs in an attempt to make the clients scale better? I'm
suspicious that you're maxing out 4 4-way machines, and Troy's
16 UPs are cruising along just fine.
M.
Quoting Dave Hansen <[email protected]>:
> No, no. Bad Martin! Throughput didn't drop, "Specweb compliance"
> dropped. Those are two very, very different things. I've found that
> the server can produce a lot more throughput, although it doesn't
> have the characteristics that Specweb considers compliant.
> Just have Troy enable mod-status and look at the throughput that
> Apache tells you that it is giving during a run.
> _That_ is real throughput, not number of compliant connections.
> _And_ NAPI is for receive only, right? Also, my compliance drop
> occurs with the NAPI checkbox disabled. There is something else in
> the new driver that causes our problems.
Thanks, Dave, you saved me a bunch of typing...
Just looking at a networking benchmark result is worse than
useless. You really need to look at the stats, settings,
and the profiles. eg, for most of the networking stuff:
ifconfig -a
netstat -s
netstat -rn
/proc/sys/net/ipv4/
/proc/sys/net/core/
before and after the run.
Dave, although in your setup the clients are maxed out, I'm
not sure that's the case for Mala's and Troy's clients. (Don't
know, of course.) But I'm fairly sure they aren't using
single quad NUMAs and they may not be seeing the same
effects..
thanks,
Nivedita
In message <18563262.1031269721@[10.10.2.3]>, "Martin J. Bligh" writes:
> > I would think shoving the data down the NIC
> > and avoid the fragmentation shouldnt give you that much significant
> > CPU savings.
> >
> > It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> > is limited by the DMA throughput of the PCI host controller. In
> > particular some controllers are limited to smaller DMA bursts to
> > work around hardware bugs.
>
> I think we're CPU limited (there's no idle time on this machine),
> which is odd for an 8 CPU 900MHz P3 Xeon, but still, this is Apache,
> not Tux. You mentioned CPU load as another advantage of TSO ...
> anything we've done to reduce CPU load enables us to run more and
> more connections (I think we started at about 260 or something, so
> 2900 ain't too bad ;-)).
Troy, is there any chance you could post an oprofile from any sort
of reasonably conformant run? I think that might help enlighten
people a bit as to what we are fighting with. The last numbers I
remember seemed to indicate that we were spending about 1.25 CPUs
in network/e1000 code with 100% CPU utilization and crappy SpecWeb
throughput.
One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.
gerrit
Martin J. Bligh wrote:
>>Something strange happens to the clients when NAPI is enabled on
>>the SPECweb server. Somehow they start using a lot more CPU.
>>The increased idle time on the server is because the _clients_ are
>>CPU maxed. I have some preliminary oprofile data for the clients,
>>but it appears that this is another case of Specweb code just
>>really sucking.
>
> Hmmm ... if you change something on the server, and all the clients
> go wild, I'm suspicious of whatever you did to the server.
Me too :) All that was changed was adding the new e1000 driver. NAPI was
disabled.
> You need
> to have a lot more data before leaping to the conclusion that it's
> because the specweb client code is crap.
I'll let the profile speak for itself...
oprofile summary:op_time -d
1 0.0000 0.0000 /bin/sleep
2 0.0001 0.0000 /lib/ld-2.2.5.so.dpkg-new (deleted)
2 0.0001 0.0000 /lib/libpthread-0.9.so
2 0.0001 0.0000 /usr/bin/expr
3 0.0001 0.0000 /sbin/init
4 0.0001 0.0000 /lib/libproc.so.2.0.7
12 0.0004 0.0000 /lib/libc-2.2.5.so.dpkg-new (deleted)
17 0.0005 0.0000 /usr/lib/libcrypto.so.0.9.6.dpkg-new (deleted)
20 0.0006 0.0000 /bin/bash
30 0.0010 0.0000 /usr/sbin/sshd
151 0.0048 0.0000 /usr/bin/vmstat
169 0.0054 0.0000 /lib/ld-2.2.5.so
300 0.0095 0.0000 /lib/modules/2.4.18+O1/oprofile/oprofile.o
1115 0.0354 0.0000 /usr/local/bin/oprofiled
3738 0.1186 0.0000 /lib/libnss_files-2.2.5.so
58181 1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
249186 7.9056 0.0000 /home/dave/specweb99/build/client
582281 18.4733 0.0000 /lib/libc-2.2.5.so
2256792 71.5986 0.0000 /usr/src/linux/vmlinux
top of oprofile from the client:
08051b3c 2260 0.948938 check_for_timeliness
08051cfc 2716 1.14041 ascii_cat
08050f24 4547 1.90921 HTTPGetReply
0804f138 4682 1.9659 workload_op
08050890 6111 2.56591 HTTPDoConnect
08049a30 7330 3.07775 SHMmalloc
08052244 7433 3.121 HTParse
08052628 8482 3.56146 HTSACopy
08051d88 10288 4.31977 get_some_line
08052150 13070 5.48788 scan
08051a10 65314 27.4243 assign_port_number
0804bd30 83789 35.1817 LOG
#define LOG(x) do {} while(0)
Voila! 35% more CPU!
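A minimal illustration of that compile-time switch (only the empty definition
above is from the client code; the enabled branch is a guess at what a
logging build might do):

#include <stdio.h>

/* The client's LOG() macro accounts for ~35% of client CPU in the profile.
 * Compiling it out is a one-line change; the enabled branch below is only a
 * guess at what a logging build would do. */
#ifdef CLIENT_DEBUG_LOG
#define LOG(x) do { fprintf(stderr, "%s\n", (x)); } while (0)
#else
#define LOG(x) do { } while (0)    /* the no-op definition quoted above */
#endif

int main(void)
{
    LOG("request complete");       /* compiles to nothing by default */
    return 0;
}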
Top of Kernel profile:
c022c850 33085 1.46602 number
c0106e59 42693 1.89176 restore_all
c01dfe68 42787 1.89592 sys_socketcall
c01df39c 54185 2.40097 sys_bind
c01de698 62740 2.78005 sockfd_lookup
c01372c8 97886 4.3374 fput
c022c110 125306 5.55239 __generic_copy_to_user
c01373b0 181922 8.06109 fget
c020958c 199054 8.82022 tcp_v4_get_port
c0106e10 199934 8.85921 system_call
c022c158 214014 9.48311 __generic_copy_from_user
c0216ecc 257768 11.4219 inet_bind
"oprofpp -k -dl -i /lib/libc-2.2.5.so"
just gives:
vma samples %-age symbol name linenr info image name
00000000 582281 100 (no symbol) (no location information)
/lib/libc-2.2.5.so
I've never really tried to profile anything but the kernel before. Any ideas?
> Troy - I think your UP clients weren't anywhere near maxed out on
> CPU power, right? Can you take a peek at the clients under NAPI load?
Make sure you wait a minute or two. The client tends to ramp up.
"vmstat 2" after the client has told the master that it is running:
U S I
----------
4 15 81
5 17 79
7 16 77
7 17 76
7 21 72
11 25 64
3 16 82
2 14 84
7 23 70
16 50 34
24 75 0
27 73 0
28 72 0
24 76 0
...
> Dave - did you ever try running 4 specweb clients bound to each of
> the 4 CPUs in an attempt to make the clients scale better? I'm
> suspicious that you're maxing out 4 4-way machines, and Troy's
> 16 UPs are cruising along just fine.
No, but I'm not sure it will do any good. They don't run often enough and
I have the feeling that there are very few cache locality benefits to be had.
--
Dave Hansen
[email protected]
From: Gerrit Huizenga <[email protected]>
Date: Fri, 06 Sep 2002 10:26:04 -0700
One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.
Why does Tux keep you from taking advantage of all the
features of Apache? Anything Tux doesn't handle in its
fast path is simply fed up to Apache.
In message <[email protected]>, "David S. Miller" writes:
> From: Gerrit Huizenga <[email protected]>
> Date: Fri, 06 Sep 2002 10:26:04 -0700
>
> One of our goals is to actually take the next generation of the most
> common "large system" web server and get it to scale along the lines
> of Tux or some of the other servers which are more common on the
> small machines. For some reasons, big corporate customers want lots
> of features that are in a web server like apache and would also like
> the performance on their 8-CPU or 16-CPU machine to not suck at the
> same time. High ideals, I know, wanting all features *and* performance
> from the same tool... Next thing you know they'll want reliability
> or some such thing.
>
> Why does Tux keep you from taking advantage of all the
> features of Apache? Anything Tux doesn't handle in its
> fast path is simply fed up to Apache.
You have to ask the hard questions... Some of this is rooted in
the past, when Tux was emerging as a technology rather than being
ubiquitously available. And, combined with the fact that most customers tend to
lag the technology curve, Apache 1.x or, in our case, IBM HTTPD was
simply a customer drop-in with standard configuration support that
roughly matched that on all other platforms, e.g. AIX, Solaris, HPUX,
Linux, etc. So, doing a one-off for Linux at a very heterogeneous,
large customer adds pain, and that pain becomes cost for the customer in
terms of consulting, training, sys admin, system management, etc.
We also had some bad starts with using Tux in terms of performance
and scalability on 4-CPU and 8-CPU machines, especially when combining
it with things like squid or other caching products from various third
parties.
Then there is the problem that 90%+ of our customers seem to have
dynamic-only web servers. Static content is limited to a couple of
banners and images that need to be tied into some kind of caching
content server. So, Tux's benefits for static serving turned out to
be only additional overhead because there were no static pages to be
served up.
And, honestly, I'm a kernel guy much more than an applications guy, so
I'll admit that I'm not up to speed on what Tux2 can do with dynamic
content. The last I knew was that it could pass it off to another server.
So we are focused on making the most common case for our customer situations
scale well. As you are probably aware, there are no specweb results
posted using Apache, but web crawler stats suggest that Apache is the
most common server. The problem is that performance on Apache sucks
but people like the features. Hence we are working to make Apache
suck less, and finding that part of the problem is the way it uses the
kernel. Other parts are the interface for specweb in particular which
we have done a bunch of work on with Greg Ames. And we are feeding
data back to the Apache 2.0 team which should help Apache in general.
gerrit
> c0106e59 42693 1.89176 restore_all
> c01dfe68 42787 1.89592 sys_socketcall
> c01df39c 54185 2.40097 sys_bind
> c01de698 62740 2.78005 sockfd_lookup
> c01372c8 97886 4.3374 fput
> c022c110 125306 5.55239 __generic_copy_to_user
> c01373b0 181922 8.06109 fget
> c020958c 199054 8.82022 tcp_v4_get_port
> c0106e10 199934 8.85921 system_call
> c022c158 214014 9.48311 __generic_copy_from_user
> c0216ecc 257768 11.4219 inet_bind
The profile looks bogus. The NIC driver is nowhere in sight. Normally
its mmap IO for interrupts and device registers should show. I would
double-check it (e.g. with a normal profile).
In case it is not bogus:
Most of these are either atomic_inc/dec of reference counters or some
form of lock. The system_call could be the int 0x80 (using the SYSENTER
patches would help), which also does atomic operations implicitly.
restore_all is IRET, which could also likely be sped up by using SYSEXIT.
If NAPI hurts here then it is surely not because of eating CPU time.
-Andi
>> One of our goals is to actually take the next generation of the most
>> common "large system" web server and get it to scale along the lines
>> of Tux or some of the other servers which are more common on the
>> small machines. For some reasons, big corporate customers want lots
>> of features that are in a web server like apache and would also like
>> the performance on their 8-CPU or 16-CPU machine to not suck at the
>> same time. High ideals, I know, wanting all features *and* performance
>> from the same tool... Next thing you know they'll want reliability
>> or some such thing.
>>
>> Why does Tux keep you from taking advantage of all the
>> features of Apache? Anything Tux doesn't handle in its
>> fast path is simply fed up to Apache.
>
> You have to ask the hard questions...
Ultimately, to me at least, the server doesn't really matter, and
neither do the absolute benchmark numbers. Linux should scale under
any reasonable workload. The point of this is to look at the Linux
kernel, not the webserver, or specweb ... they're just hammers to
beat on the kernel with.
The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing,
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)
M.
On Fri, Sep 06, 2002 at 08:26:46PM +0200, Andi Kleen wrote:
> > c0216ecc 257768 11.4219 inet_bind
>
> The profile looks bogus. The NIC driver is nowhere in sight. Normally
> its mmap IO for interrupts and device registers should show. I would
> double check it (e.g. with normal profile)
The system summary shows:
58181 1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
so it won't show up in the monolithic kernel profile. You can probably
get a combined comparison with
op_time -dnl | grep -E 'vmlinux|acenic'
regards
john
--
"Are you willing to go out there and save the lives of our children, even if it means losing your own life ?
Yes I am.
I believe you, Jeru... you're ready."
Andi Kleen wrote:
>>c0106e59 42693 1.89176 restore_all
>>c01dfe68 42787 1.89592 sys_socketcall
>>c01df39c 54185 2.40097 sys_bind
>>c01de698 62740 2.78005 sockfd_lookup
>>c01372c8 97886 4.3374 fput
>>c022c110 125306 5.55239 __generic_copy_to_user
>>c01373b0 181922 8.06109 fget
>>c020958c 199054 8.82022 tcp_v4_get_port
>>c0106e10 199934 8.85921 system_call
>>c022c158 214014 9.48311 __generic_copy_from_user
>>c0216ecc 257768 11.4219 inet_bind
>
> The profile looks bogus. The NIC driver is nowhere in sight. Normally
> its mmap IO for interrupts and device registers should show. I would
> double check it (e.g. with normal profile)
Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well, but it was only 1.3%
of the CPU:
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
--
Dave Hansen
[email protected]
From: Gerrit Huizenga <[email protected]>
Date: Fri, 06 Sep 2002 11:19:11 -0700
And, honestly, I'm a kernel guy much more than an applications guy, so
I'll admit that I'm not up to speed on what Tux2 can do with dynamic
content.
TUX can optimize dynamic content just fine.
The last I knew was that it could pass it off to another server.
Not true.
The problem is that performance on Apache sucks
but people like the features.
Tux's design allows it to be a drop in acceleration method
which does not require you to relinquish Apache's feature set.
>
> The real question is why NAPI causes so much more work for the client.
>
[Just a summary from my results from last year. All testing with a
simple NIC without hw interrupt mitigation, on a Cyrix P150]
My assumption was that NAPI increases the cost of receiving a single
packet: instead of one hw interrupt with one device access (ack
interrupt) and the softirq processing, the hw interrupt must ack &
disable the interrupt, then the processing occurs in softirq context,
and the interrupt is re-enabled from softirq context.
The second point was that interrupt mitigation must remain enabled, even
with NAPI: the automatic mitigation doesn't work with process-space
limited loads (e.g. TCP: the backlog queue is drained quickly, but the
system is busy processing the prequeue or receive queue).
jamal, is it possible for a driver to use both NAPI and the normal
interface, or would that break fairness?
Use netif_rx until it returns dropping. If that happens, disable the
interrupt, and call netif_rx_schedule().
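A rough sketch of that hybrid handler (not a real driver; it assumes a
NAPI-capable tree, and the my_hw_*() helpers are invented stand-ins):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/*
 * Hybrid scheme: deliver packets the classic way with netif_rx() while the
 * backlog keeps up; as soon as netif_rx() reports a drop, mask the NIC's RX
 * interrupt and hand the device over to polling via netif_rx_schedule().
 * my_hw_fetch_packet() and my_hw_disable_rx_irq() are invented helpers for
 * the driver's own ring/IRQ handling.
 */
extern struct sk_buff *my_hw_fetch_packet(struct net_device *dev);
extern void my_hw_disable_rx_irq(struct net_device *dev);

static void my_nic_rx_interrupt(struct net_device *dev)
{
	struct sk_buff *skb;

	while ((skb = my_hw_fetch_packet(dev)) != NULL) {
		if (netif_rx(skb) == NET_RX_DROP) {
			/* Backlog congested: switch to polled mode. */
			my_hw_disable_rx_irq(dev);
			netif_rx_schedule(dev);	/* dev->poll() drains the ring */
			return;
		}
	}
}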
Is it possible to determine the average number of packets that are
processed for each netif_rx_schedule()?
--
Manfred
From: "Martin J. Bligh" <[email protected]>
Date: Fri, 06 Sep 2002 11:26:49 -0700
The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing,
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)
Tux does not obviate Apache from the equation.
See my other emails.
From: Dave Hansen <[email protected]>
Date: Fri, 06 Sep 2002 11:33:10 -0700
Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well, but it was only 1.3%
of the CPU:
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
We thought you were using e1000 in these tests?
From: Manfred Spraul <[email protected]>
Date: Fri, 06 Sep 2002 20:35:08 +0200
The second point was that interrupt mitigation must remain enabled, even
with NAPI: the automatic mitigation doesn't work with process space
limited loads (e.g. TCP: backlog queue is drained quickly, but the
system is busy processing the prequeue or receive queue)
Not true. NAPI is in fact a 100% replacement for hw interrupt
mitigation strategies. The CPU usage elimination afforded by
hw interrupt mitigation is also afforded by NAPI, and then some.
See Jamal's paper.
Franks a lot,
David S. Miller
[email protected]
> Actually, oprofile separated out the acenic module from the rest of the
> kernel. I should have included that breakout as well, but it was only 1.3%
> of the CPU:
> 1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
>
> We thought you were using e1000 in these tests?
e1000 on the server, those profiles were client side.
M.
> The fact that we're doing something different from everyone else
> and turning up a different set of kernel issues is a good thing,
> to my mind. You're right, we could use Tux if we wanted to ... but
> that doesn't stop Apache being interesting ;-)
>
> Tux does not obviate Apache from the equation.
> See my other emails.
That's not the point ... we're getting sidetracked here. The
point is: "is this a realistic-ish stick to beat the kernel
with and expect it to behave" ... I feel the answer is yes.
The secondary point is "what are customers doing in the field?"
(not what *should* they be doing ;-)). Moreover, I think the
Apache + Tux combination has been fairly well beaten on already
by other people in the past, though I'm sure it could be done
again.
I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)
M.
From: "Martin J. Bligh" <[email protected]>
Date: Fri, 06 Sep 2002 11:51:29 -0700
I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)
Of course.
I just don't want propaganda being spread that using Tux means you
lose any sort of web server functionality whatsoever.
From: "Martin J. Bligh" <[email protected]>
Date: Fri, 06 Sep 2002 11:45:17 -0700
> Actually, oprofile separated out the acenic module from the rest of the
> kernel. I should have included that breakout as well, but it was only 1.3%
> of the CPU:
> 1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
>
> We thought you were using e1000 in these tests?
e1000 on the server, those profiles were client side.
Ok. BTW the acenic is packet-rate limited by the speed of the
MIPS cpus on the card.
It might be instructive to disable HW checksumming in the
acenic driver and see what this does to your results.
In message <[email protected]>, "David S. Miller" writes:
> From: Gerrit Huizenga <[email protected]>
> Date: Fri, 06 Sep 2002 11:19:11 -0700
>
> TUX can optimize dynamic content just fine.
>
> The last I knew was that it could pass it off to another server.
Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager. In a few of the cases that I know of,
there isn't really any static content to cache... And why is this
something that Apache couldn't/shouldn't be doing?
gerrit
In message <[email protected]>, "David S. Miller" writes:
> From: "Martin J. Bligh" <[email protected]>
> Date: Fri, 06 Sep 2002 11:51:29 -0700
>
> I see no reason why turning on NAPI should make the Apache setup
> we have perform worse ... quite the opposite. Yes, we could use
> Tux, yes we'd get better results. But that's not the point ;-)
>
> Of course.
>
> I just don't want propaganda being spread that using Tux means you
> lose any sort of web server functionality whatsoever.
Ah sorry - I never meant to imply that Tux was detrimental, other
than one case where it seemed to have no benefit and the performance
numbers while tuning for TPC-W *seemed* worse but were never analyzed
completely. That was the actual event that I meant when I said:
We also had some bad starts with using Tux in terms of performance
and scalability on 4-CPU and 8-CPU machines, especially when
combining it with things like squid or other caching products from
various third parties.
Those results were never quantified but for various reasons we had a
team that decided to take Tux out of the picture. I think the problem
was more likely lack of knowledge and lack of time to do analysis on
the particular problems. Another combination of solutions was used.
So, any comments I made which might have implied that Tux/Tux2 made things
worse have no substantiated data to prove that and it is quite possible
that there is no such problem. Also, this was run nearly a year ago and
the state of Tux/Tux2 might have been a bit different at the time.
gerrit
From: Gerrit Huizenga <[email protected]>
Date: Fri, 06 Sep 2002 11:57:39 -0700
Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager. In a few of the cases that I know of,
there isn't really any static content to cache... And why is this
something that Apache couldn't/shouldn't be doing?
The kernel exec's the CGI process from the TUX server and pipes the
output directly into a networking socket.
Because it is cheaper to create a fresh new user thread from within
the kernel (i.e. we don't have to fork() apache and thus dup its
address space), it is faster.
From: Gerrit Huizenga <[email protected]>
Date: Fri, 06 Sep 2002 12:05:27 -0700
So, any comments I made which might have implied that Tux/Tux2 made things
worse have no substantiated data to prove that and it is quite possible
that there is no such problem. Also, this was run nearly a year ago and
the state of Tux/Tux2 might have been a bit different at the time.
Thanks for clearing things up.
Quoting Andi Kleen <[email protected]>:
> > c0106e59 42693 1.89176 restore_all
> > c01dfe68 42787 1.89592 sys_socketcall
> > c01df39c 54185 2.40097 sys_bind
> > c01de698 62740 2.78005 sockfd_lookup
> > c01372c8 97886 4.3374 fput
> > c022c110 125306 5.55239 __generic_copy_to_user
> > c01373b0 181922 8.06109 fget
> > c020958c 199054 8.82022 tcp_v4_get_port
> > c0106e10 199934 8.85921 system_call
> > c022c158 214014 9.48311 __generic_copy_from_user
> > c0216ecc 257768 11.4219 inet_bind
>
> The profile looks bogus. The NIC driver is nowhere in sight.
> Normally its mmap IO for interrupts and device registers
> should show. I would double check it (e.g. with normal profile)
Separately compiled acenic..
I'm surprised by this profile a bit too - on the client side,
since the requests are small, and the client is receiving
all those files, I would have thought that __generic_copy_to_user
would have been way higher than *from_user.
inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock, and traverse the hash table,
which now has many hundreds of entries. Also, because
of the varied length of the connections, the clients'
sockets are not freed in the same order they are allocated
a port, hence the fragmentation of the port space.
There is some cacheline thrashing hurting the NUMA boxes
more than other systems here too..
If you just wanted to speed things up, you could get the
clients to specify ports instead of letting the kernel
cycle through for a free port..:)
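That client-side change would look something like this (sketch only; how the
clients partition the port space among themselves is up to the client code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Instead of letting the kernel hunt for a free ephemeral port (the
 * inet_bind()/tcp_v4_get_port() work in the profile above), the client
 * binds an explicit source port from its own private range first. */
static int connect_from_port(const char *server_ip, uint16_t server_port,
                             uint16_t local_port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    struct sockaddr_in local, remote;
    memset(&local, 0, sizeof(local));
    memset(&remote, 0, sizeof(remote));

    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(local_port);          /* caller-chosen port */

    remote.sin_family = AF_INET;
    remote.sin_port = htons(server_port);
    inet_pton(AF_INET, server_ip, &remote.sin_addr);

    if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0 ||
        connect(s, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
        close(s);
        return -1;
    }
    return s;
}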
thanks,
Nivedita
> If you just wanted to speed things up, you could get the
> clients to specify ports instead of letting the kernel
> cycle through for a free port..:)
Better would probably be to change the kernel to keep a limited
list of free ports on a free list. Then grabbing a free port would
be an O(1) operation.
I'm not entirely sure it is worth it in this case. The locks are
probably the majority of the cost.
-Andi
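A toy model of that free-list idea (userspace sketch, not kernel code; real
kernel code would want per-CPU or per-bucket structures):

#include <pthread.h>
#include <stdint.h>

/*
 * Keep unused local ports on an explicit free list so "give me any free
 * port" is a single pop instead of a search over the bind-hash chains.
 * This only shows the shape of the idea.
 */
#define PORT_LOW  1024
#define PORT_HIGH 65535

static uint16_t free_stack[PORT_HIGH - PORT_LOW + 1];
static int free_top;
static pthread_mutex_t free_lock = PTHREAD_MUTEX_INITIALIZER;

static void port_pool_init(void)
{
    for (int p = PORT_HIGH; p >= PORT_LOW; p--)
        free_stack[free_top++] = (uint16_t)p;
}

static int port_alloc(void)                  /* O(1): pop one free port */
{
    pthread_mutex_lock(&free_lock);
    int p = free_top ? free_stack[--free_top] : -1;
    pthread_mutex_unlock(&free_lock);
    return p;                                /* -1 means the space is exhausted */
}

static void port_free(uint16_t p)            /* O(1): push it back */
{
    pthread_mutex_lock(&free_lock);
    free_stack[free_top++] = p;
    pthread_mutex_unlock(&free_lock);
}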
From: Nivedita Singhvi <[email protected]>
Date: Fri, 6 Sep 2002 12:19:14 -0700
inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock and traverse the hash table
which has now many hundred entries. Also, because
of the varied length of the connections, the clients
get freed not in the same order they are allocated
a port, hence the fragmentation of the port space..
Tthere is some cacheline thrashing hurting the NUMA
more than other systems here too..
There are methods to eliminate the centrality of the
port allocation locking.
Basically, kill tcp_portalloc_lock and make the port rover be per-cpu.
The only tricky case is the "out of ports" situation. Because there
is no centralized locking being used to serialize port allocation,
it is difficult to be sure that the port space is truly exhausted.
Another idea, which doesn't eliminate the tcp_portalloc_lock but
has other good SMP properties, is to apply a "cpu salt" to the
port rover value. For example, shift the local cpu number into
the upper parts of a 'u16', then 'xor' that with tcp_port_rover.
Alexey and I have discussed this several times but never became
bored enough to experiment :-)
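The salt itself reduces to a couple of lines (sketch; the rover update and
port-range handling are simplified):

#include <stdint.h>

/*
 * "CPU salt" sketch: each CPU perturbs the shared rover value by a constant
 * derived from its CPU number, so concurrent allocators tend to probe
 * different bind-hash chains (and therefore take different chain locks).
 */
#define PORT_RANGE_LOW 1024u

static uint16_t pick_port_candidate(uint16_t port_rover, unsigned int cpu)
{
    uint16_t salt = (uint16_t)(cpu << 11);     /* cpu number in the high bits */
    uint16_t p    = (uint16_t)(port_rover ^ salt);

    if (p < PORT_RANGE_LOW)                    /* stay out of reserved ports */
        p += PORT_RANGE_LOW;
    return p;
}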
From: Andi Kleen <[email protected]>
Date: Fri, 6 Sep 2002 21:26:19 +0200
I'm not entirely sure it is worth it in this case. The locks are
probably the majority of the cost.
You can more localize the lock accesses (since we use per-chain
locks) by applying a cpu salt to the port numbers you allocate.
See my other email.
From: Manfred Spraul <[email protected]>
Date: Fri, 06 Sep 2002 21:40:09 +0200
Dave, do you have interrupt rates from the clients with and without NAPI?
Robert does.
David S. Miller wrote:
> From: Manfred Spraul <[email protected]>
> Date: Fri, 06 Sep 2002 20:35:08 +0200
>
> The second point was that interrupt mitigation must remain enabled, even
> with NAPI: the automatic mitigation doesn't work with process space
> limited loads (e.g. TCP: backlog queue is drained quickly, but the
> system is busy processing the prequeue or receive queue)
>
> Not true. NAPI is in fact a 100% replacement for hw interrupt
> mitigation strategies. The CPU usage elimination afforded by
> hw interrupt mitigation is also afforded by NAPI, and then some.
>
> See Jamal's paper.
>
I've read his paper: it's about MLFFR. There is no alternative to NAPI
if packets arrive faster than they are processed by the backlog queue.
But what if the backlog queue is empty all the time? Then NAPI thinks
that the system is idle, and reenables the interrupts after each packet :-(
In my tests, I've used a pentium class system (I have no GigE cards -
that was the only system where I could saturate the cpu with 100MBit
ethernet). IIRC 30% cpu time was needed for the copy_to_user(). The
receive queue was filled, the backlog queue empty. With NAPI, I got 1
interrupt for each packet, with hw interrupt mitigation the throughput
was 30% higher for MTU 600.
Dave, do you have interrupt rates from the clients with and without NAPI?
--
Manfred
> Tthere is some cacheline thrashing hurting the NUMA
> more than other systems here too..
There is no NUMA here ... the clients are 4 single node SMP
systems. We're using the old quads to make them, but they're
all split up, not linked together into one system.
Sorry if we didn't make that clear.
M.
Quoting "David S. Miller" <[email protected]>:
> There are methods to eliminate the centrality of the
> port allocation locking.
>
> Basically, kill tcp_portalloc_lock and make the port rover be
> per-cpu.
Aha! Exactly what I started to do quite a while ago..
> The only tricky case is the "out of ports" situation. Because
> there is no centralized locking being used to serialize port
> allocation, it is difficult to be sure that the port space is truly
> exhausted.
I decided to use a stupid global flag to signal this.. It did become
messy and I didn't finalize everything. Then my day job
intervened :). Still hoping for spare time*5 to complete
this if no one comes up with something before then..
> Another idea, which doesn't eliminate the tcp_portalloc_lock but
> has other good SMP properties, is to apply a "cpu salt" to the
> port rover value. For example, shift the local cpu number into
> the upper parts of a 'u16', then 'xor' that with tcp_port_rover.
nice..any patch extant? :)
thanks,
Nivedita
In message <[email protected]>, "David S. Miller" writes:
> From: Gerrit Huizenga <[email protected]>
> Date: Fri, 06 Sep 2002 11:57:39 -0700
>
> Out of curiosity, and primarily for my own edification, what kind
> of optimization does it do when everything is generated by a java/
> perl/python/homebrew script and pasted together by something which
> consults a content manager. In a few of the cases that I know of,
> there isn't really any static content to cache... And why is this
> something that Apache couldn't/shouldn't be doing?
>
> The kernel exec's the CGI process from the TUX server and pipes the
> output directly into a networking socket.
>
> Because it is cheaper to create a new fresh user thread from within
> the kernel (i.e. we don't have to fork() apache and thus dup its
> address space), it is faster.
So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?
Some additional overhead for the user/kernel syscall path but probably
pretty minor, right?
Or did I miss a piece of data, like the time to call clone() as a function
from within the kernel being 2x or 10x more than the same syscall?
gerrit
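As a rough user-space illustration of the clone()-based model being
asked about (the flag set and handler below are hypothetical, not what
TUX or Apache actually do):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pretend request handler: runs in the parent's address space. */
static int handle_request(void *arg)
{
	printf("worker handling fd %d without duplicating the VM\n",
	       *(int *)arg);
	return 0;
}

int main(void)
{
	int fd = 42;			/* stand-in for an accept()ed socket */
	size_t stksz = 64 * 1024;
	char *stack = malloc(stksz);
	pid_t pid;

	if (!stack)
		return 1;
	/*
	 * CLONE_VM/CLONE_FS/CLONE_FILES give a thread-like child: no page
	 * tables or file tables are copied, which is the cheapness being
	 * discussed relative to a full fork().
	 */
	pid = clone(handle_request, stack + stksz,
		    CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, &fd);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}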
From: Gerrit Huizenga <[email protected]>
Date: Fri, 06 Sep 2002 12:52:15 -0700
So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?
Apache would need to do some more, such as do something about
cpu affinity and do the non-blocking VFS tricks Tux does too.
To be honest, I'm not going to sit here all day long and explain how
Tux works. I'm not even too knowledgeable about the precise details of
its implementation. Besides, the code is freely available and not
too complex, so you can go have a look for yourself :-)
In message <[email protected]>, > : "David S. Miller" w
rites:
> From: Gerrit Huizenga <[email protected]>
> Date: Fri, 06 Sep 2002 12:52:15 -0700
>
> So if apache were using a listen()/clone()/accept()/exec() combo rather than a
> full listen()/fork()/exec() model it would see most of the same benefits?
>
> Apache would need to do some more, such as do something about
> cpu affinity and do the non-blocking VFS tricks Tux does too.
>
> To be honest, I'm not going to sit here all day long and explain how
> Tux works. I'm not even too knowledgable about the precise details of
> it's implementation. Besides, the code is freely available and not
> too complex, so you can go have a look for yourself :-)
Aw, and you are such a good tutor, too. :-) But thanks - my particular
goal isn't to fix apache since there is already a group of folks working
on that, but as we look at kernel traces, this should give us a good
idea if we are at the bottleneck of the apache architecture or if we
have other kernel bottlenecks. At the moment, the latter seems to be
true, and I think we have some good data from Troy and Dave to validate
that. I think we have already seen the affinity problem, or at least
talked about it, as that was somewhat visible, and Apache 2.0 does seem
to have some solutions for helping with that. And when the kernel does
the best it can with Apache's architecture, we have more data to convince
them to fix the architecture problems.
thanks again!
gerrit
On Fri, 2002-09-06 at 19:51, Martin J. Bligh wrote:
> The secondary point is "what are customers doing in the field?"
> (not what *should* they be doing ;-)). Moreover, I think the
> Apache + Tux combination has been fairly well beaten on already
> by other people in the past, though I'm sure it could be done
> again.
Tux has been proven in the field. A glance at some of the interesting
porn domain names using it would show that 8)
> > It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> > is limited by the DMA throughput of the PCI host controller. In
> > particular some controllers are limited to smaller DMA bursts to
> > work around hardware bugs.
>
> I'm not sure that's entirely true in this case - the Netfinity
> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> there's 4 - 8 gigabit ethernet cards in this beast spread around
> different buses (Troy - are we still just using 4?
My machine is not exactly an 8500r. It's an Intel pre-release
engineering sample (8-way 900MHz PIII) box that is similar to an
8500r... there are some differences when going across the coherency
filter (the bus that ties the two 4-way "halves" of the machine
together). Bill Hartner has a test program that illustrates the
differences-- but more on that later.
I've got 4 PCI busses, two 33 MHz, and two 66MHz, all 64-bit.
I'm configured as follows:
PCI Bus 0 eth1 --- 3 clients
33 MHz eth2 --- Not in use
PCI Bus 1 eth3 --- 2 clients
33 MHz eth4 --- Not in use
PCI Bus 3 eth5 --- 6 clients
66 MHz eth6 --- Not in use
PCI Bus 4 eth7 --- 6 clients
66 MHz eth8 --- Not in use
> ... and what's
> the raw bandwidth of data we're pushing? ... it's not huge).
2900 simultaneous connections, each at ~320 kbps translates to
928000 kbps, which is slightly less than the full bandwidth of a
single e1000. We're spreading that over 4 adapters, and 4 busses.
- Troy
> Do you have any stats from the hardware that could show
> retransmits etc;
**********************************
* netstat -s before the workload *
**********************************
Ip:
433 total packets received
0 forwarded
0 incoming packets discarded
409 incoming packets delivered
239 requests sent out
Icmp:
24 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 24
24 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 24
Tcp:
0 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
2 connections established
300 segments received
183 segments send out
0 segments retransmited
0 bad segments received.
2 resets sent
Udp:
8 packets received
24 packets to unknown port received.
0 packet receive errors
32 packets sent
TcpExt:
ArpFilter: 0
5 delayed acks sent
4 packets directly queued to recvmsg prequeue.
35 packets header predicted
TCPPureAcks: 5
TCPHPAcks: 160
TCPRenoRecovery: 0
TCPSackRecovery: 0
TCPSACKReneging: 0
TCPFACKReorder: 0
TCPSACKReorder: 0
TCPRenoReorder: 0
TCPTSReorder: 0
TCPFullUndo: 0
TCPPartialUndo: 0
TCPDSACKUndo: 0
TCPLossUndo: 0
TCPLoss: 0
TCPLostRetransmit: 0
TCPRenoFailures: 0
TCPSackFailures: 0
TCPLossFailures: 0
TCPFastRetrans: 0
TCPForwardRetrans: 0
TCPSlowStartRetrans: 0
TCPTimeouts: 0
TCPRenoRecoveryFail: 0
TCPSackRecoveryFail: 0
TCPSchedulerFailed: 0
TCPRcvCollapsed: 0
TCPDSACKOldSent: 0
TCPDSACKOfoSent: 0
TCPDSACKRecv: 0
TCPDSACKOfoRecv: 0
TCPAbortOnSyn: 0
TCPAbortOnData: 0
TCPAbortOnClose: 0
TCPAbortOnMemory: 0
TCPAbortOnTimeout: 0
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0
*********************************
* netstat -s after the workload *
*********************************
Ip:
425317106 total packets received
3648 forwarded
0 incoming packets discarded
425313332 incoming packets delivered
203629600 requests sent out
Icmp:
58 ICMP messages received
12 input ICMP message failed.
ICMP input histogram:
destination unreachable: 58
58 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 58
Tcp:
64 active connections openings
16690445 passive connection openings
56552 failed connection attempts
0 connection resets received
3 connections established
425311551 segments received
203629500 segments send out
4241408 segments retransmited
0 bad segments received.
298883 resets sent
Udp:
8 packets received
34 packets to unknown port received.
0 packet receive errors
42 packets sent
TcpExt:
ArpFilter: 0
8884840 TCP sockets finished time wait in fast timer
12913162 delayed acks sent
17292 delayed acks further delayed because of locked socket
Quick ack mode was activated 102351 times
54977 times the listen queue of a socket overflowed
54977 SYNs to LISTEN sockets ignored
157 packets directly queued to recvmsg prequeue.
51 packets directly received from prequeue
16925947 packets header predicted
51 packets header predicted and directly queued to user
TCPPureAcks: 169071816
TCPHPAcks: 176510836
TCPRenoRecovery: 30090
TCPSackRecovery: 0
TCPSACKReneging: 0
TCPFACKReorder: 0
TCPSACKReorder: 0
TCPRenoReorder: 464
TCPTSReorder: 5
TCPFullUndo: 6
TCPPartialUndo: 29
TCPDSACKUndo: 0
TCPLossUndo: 1
TCPLoss: 0
TCPLostRetransmit: 0
TCPRenoFailures: 218884
TCPSackFailures: 0
TCPLossFailures: 35561
TCPFastRetrans: 145529
TCPForwardRetrans: 0
TCPSlowStartRetrans: 3463096
TCPTimeouts: 373473
TCPRenoRecoveryFail: 1221
TCPSackRecoveryFail: 0
TCPSchedulerFailed: 0
TCPRcvCollapsed: 0
TCPDSACKOldSent: 0
TCPDSACKOfoSent: 0
TCPDSACKRecv: 1
TCPDSACKOfoRecv: 0
TCPAbortOnSyn: 0
TCPAbortOnData: 0
TCPAbortOnClose: 0
TCPAbortOnMemory: 0
TCPAbortOnTimeout: 0
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0
From: Troy Wilson <[email protected]>
Date: Fri, 6 Sep 2002 18:56:04 -0500 (CDT)
4241408 segments retransmited
Is hw flow control being negotiated and enabled properly on the
gigabit interfaces?
There should be no reason for these kinds of retransmits to
happen.
> ifconfig -a and netstat -rn would also be nice to have..
These counters may have wrapped over the course of the full-length
( 3 x 20 minute runs + 20 minute warmup + rampup + rampdown) SPECWeb run.
*******************************
* ifconfig -a before workload *
*******************************
eth0 Link encap:Ethernet HWaddr 00:04:AC:23:5E:99
inet addr:9.3.192.209 Bcast:9.3.192.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:208 errors:0 dropped:0 overruns:0 frame:0
TX packets:104 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:22562 (22.0 Kb) TX bytes:14356 (14.0 Kb)
Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038
eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:10 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:5940 (5.8 Kb) TX bytes:256 (256.0 b)
Interrupt:61 Base address:0x1200 Memory:fc020000-0
eth2 Link encap:Ethernet HWaddr 00:02:B3:A8:35:C1
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:54 Base address:0x1220 Memory:fc060000-0
eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:44 Base address:0x2040 Memory:fe120000-0
eth4 Link encap:Ethernet HWaddr 00:02:B3:A3:46:F9
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:784 (784.0 b) TX bytes:256 (256.0 b)
Interrupt:36 Base address:0x2060 Memory:fe160000-0
eth5 Link encap:Ethernet HWaddr 00:02:B3:A3:47:88
inet addr:192.168.5.1 Bcast:192.168.5.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:32 Base address:0x3000 Memory:fe420000-0
eth6 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:A0
inet addr:192.168.6.1 Bcast:192.168.6.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:64 (64.0 b) TX bytes:256 (256.0 b)
Interrupt:28 Base address:0x3020 Memory:fe460000-0
eth7 Link encap:Ethernet HWaddr 00:02:B3:A3:47:39
inet addr:192.168.7.1 Bcast:192.168.7.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:24 Base address:0x4000 Memory:fe820000-0
eth8 Link encap:Ethernet HWaddr 00:02:B3:A3:47:87
inet addr:192.168.8.1 Bcast:192.168.8.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:20 Base address:0x4020 Memory:fe860000-0
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:56 errors:0 dropped:0 overruns:0 frame:0
TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:5100 (4.9 Kb) TX bytes:5100 (4.9 Kb)
******************************
* ifconfig -a after workload *
******************************
eth0 Link encap:Ethernet HWaddr 00:04:AC:23:5E:99
inet addr:9.3.192.209 Bcast:9.3.192.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3434 errors:0 dropped:0 overruns:0 frame:0
TX packets:1408 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:336578 (328.6 Kb) TX bytes:290474 (283.6 Kb)
Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038
eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:74893662 errors:3 dropped:3 overruns:0 frame:0
TX packets:100464074 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:1286843881 (1227.2 Mb) TX bytes:2106085286 (2008.5 Mb)
Interrupt:61 Base address:0x1200 Memory:fc020000-0
eth2 Link encap:Ethernet HWaddr 00:02:B3:A8:35:C1
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:54 Base address:0x1220 Memory:fc060000-0
eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:50054881 errors:0 dropped:0 overruns:0 frame:0
TX packets:67122955 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:3730406436 (3557.5 Mb) TX bytes:3034087396 (2893.5 Mb)
Interrupt:44 Base address:0x2040 Memory:fe120000-0
eth4 Link encap:Ethernet HWaddr 00:02:B3:A3:46:F9
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:48 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:7342 (7.1 Kb) TX bytes:256 (256.0 b)
Interrupt:36 Base address:0x2060 Memory:fe160000-0
eth5 Link encap:Ethernet HWaddr 00:02:B3:A3:47:88
inet addr:192.168.5.1 Bcast:192.168.5.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:149206960 errors:2861 dropped:2861 overruns:0 frame:0
TX packets:200247016 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2530107402 (2412.8 Mb) TX bytes:3331495154 (3177.1 Mb)
Interrupt:32 Base address:0x3000 Memory:fe420000-0
eth6 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:A0
inet addr:192.168.6.1 Bcast:192.168.6.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:13 errors:0 dropped:0 overruns:0 frame:0
TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:832 (832.0 b) TX bytes:640 (640.0 b)
Interrupt:28 Base address:0x3020 Memory:fe460000-0
eth7 Link encap:Ethernet HWaddr 00:02:B3:A3:47:39
inet addr:192.168.7.1 Bcast:192.168.7.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:151162569 errors:2993 dropped:2993 overruns:0 frame:0
TX packets:202895482 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2673954954 (2550.0 Mb) TX bytes:2456469394 (2342.6 Mb)
Interrupt:24 Base address:0x4000 Memory:fe820000-0
eth8 Link encap:Ethernet HWaddr 00:02:B3:A3:47:87
inet addr:192.168.8.1 Bcast:192.168.8.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:20 Base address:0x4020 Memory:fe860000-0
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:100 errors:0 dropped:0 overruns:0 frame:0
TX packets:100 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8696 (8.4 Kb) TX bytes:8696 (8.4 Kb)
***************
* netstat -rn *
***************
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.7.0 0.0.0.0 255.255.255.0 U 40 0 0 eth7
192.168.6.0 0.0.0.0 255.255.255.0 U 40 0 0 eth6
192.168.5.0 0.0.0.0 255.255.255.0 U 40 0 0 eth5
192.168.4.0 0.0.0.0 255.255.255.0 U 40 0 0 eth1
192.168.3.0 0.0.0.0 255.255.255.0 U 40 0 0 eth3
192.168.2.0 0.0.0.0 255.255.255.0 U 40 0 0 eth2
192.168.1.0 0.0.0.0 255.255.255.0 U 40 0 0 eth4
9.3.192.0 0.0.0.0 255.255.255.0 U 40 0 0 eth0
192.168.8.0 0.0.0.0 255.255.255.0 U 40 0 0 eth8
127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
0.0.0.0 9.3.192.1 0.0.0.0 UG 40 0 0 eth0
Quoting Troy Wilson <[email protected]>:
> > Do you have any stats from the hardware that could show
> > retransmits etc;
Troy,
Are tcp_sack, tcp_fack, tcp_dsack turned on?
thanks,
Nivedita
> Are tcp_sack, tcp_fack, tcp_dsack turned on?
tcp_fack and tcp_dsack are on, tcp_sack is off.
- Troy
Manfred Spraul:
>> But what if the backlog queue is empty all the time? Then NAPI thinks
>> that the system is idle, and reenables the interrupts after each packet :-(
Yes, and this happens even without NAPI. Just set RxIntDelay=X and send
pkts at >= X+1 interval.
>> Dave, do you have interrupt rates from the clients with and without NAPI?
DaveM:
> Robert does.
Yes, we get into this interesting discussion now... Since with NAPI we can
safely use RxIntDelay=0 (e1000 terminology). With the classical IRQ scheme we
simply had to add latency (an RxIntDelay of 64-128 us is common for GigE) just
to survive at higher speeds (GigE max is 1.48 Mpps), and with the interrupt
latency also comes higher network latency... IMO this delay was a "work-around"
for the old interrupt scheme.
So we now have the option of removing it... But we are trading less latency
for more interrupts. So yes, Manfred is correct...
So is there a decent setting/compromise?
Well, the first approximation is just to do what DaveM suggested:
RxIntDelay=0. This solved many problems with buggy hardware and complicated
tuning; RxIntDelay used to be combined with other mitigation parameters to
compensate for different packet sizes etc. This led to very "fragile"
performance, where a NIC could perform excellently with a single TCP stream
but be seriously broken in many other tests. So tuning to just one "test"
can cause a lot of mis-tuning as well.
Anyway. A tulip NAPI variant added mitigation when we reached "some load" to
avoid the static interrupt delay. (Still keeping things pretty simple):
Load "Mode"
-------------------
Lo 1) RxIntDelay=0
Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
Hi 3) Consecutive polling. No RX interrupts.
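A rough user-space rendering of that three-mode policy (the thresholds
and names below are invented for illustration; the real logic lives in
the tulip NAPI driver):

#include <stdio.h>

/* Illustrative thresholds only -- a real driver derives these from the
 * RX ring state, not from magic numbers. */
#define MID_LOAD_PKTS	8
#define HI_LOAD_PKTS	32

/* Return the rx interrupt delay (in usec) to program, or -1 to stay in
 * polling mode with RX interrupts left disabled. */
static int pick_rx_mitigation(int pkts_on_rx_ring)
{
	if (pkts_on_rx_ring >= HI_LOAD_PKTS)
		return -1;		/* Hi: keep polling, no RX irqs     */
	if (pkts_on_rx_ring >= MID_LOAD_PKTS)
		return 64;		/* Mid: fixed RxIntDelay            */
	return 0;			/* Lo: RxIntDelay=0, lowest latency */
}

int main(void)
{
	int samples[] = { 0, 4, 12, 40 };
	int i;

	for (i = 0; i < 4; i++)
		printf("%2d pkts on ring -> delay %d\n",
		       samples[i], pick_rx_mitigation(samples[i]));
	return 0;
}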
Is it worth the effort?
For SMP w/o affinity the delay could eventually reduce the cache bouncing,
since the packets become more "batched", at the cost of latency of course.
We use RxIntDelay=0 in production. (IP-forwarding on UP)
Cheers.
--ro
"Martin J. Bligh" <[email protected]> writes:
> > Ie. the headers that don't need to go across the bus are the critical
> > resource saved by TSO.
>
> I'm not sure that's entirely true in this case - the Netfinity
> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> there's 4 - 8 gigabit ethernet cards in this beast spread around
> different buses (Troy - are we still just using 4? ... and what's
> the raw bandwidth of data we're pushing? ... it's not huge).
>
> I think we're CPU limited (there's no idle time on this machine),
> which is odd for an 8 CPU 900MHz P3 Xeon,
Quite possibly. The P3 has roughly an 800MB/s FSB bandwidth, which must
be used for both I/O and memory accesses. So just driving a gige card at
wire speed takes a considerable portion of the cpu's capacity.
On analyzing this kind of thing I usually find it quite helpful to
compute what the hardware can do theoretically, to get a feel for where the
bottlenecks should be.
Eric
>> > Ie. the headers that don't need to go across the bus are the critical
>> > resource saved by TSO.
>>
>> I'm not sure that's entirely true in this case - the Netfinity
>> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
>> there's 4 - 8 gigabit ethernet cards in this beast spread around
>> different buses (Troy - are we still just using 4? ... and what's
>> the raw bandwidth of data we're pushing? ... it's not huge).
>>
>> I think we're CPU limited (there's no idle time on this machine),
>> which is odd for an 8 CPU 900MHz P3 Xeon,
>
> Quite possibly. The P3 has roughly an 800MB/s FSB bandwidth, that must
> be used for both I/O and memory accesses. So just driving a gige card at
> wire speed takes a considerable portion of the cpus capacity.
>
> On analyzing this kind of thing I usually find it quite helpful to
> compute what the hardware can theoretically to get a feel where the
> bottlenecks should be.
We can push about 420MB/s of IO out of this thing (out of that
theoretical 800MB/s). Specweb is only pushing about 120MB/s of
total data through it, so it's not bus limited in this case.
Of course, I should have given you that data to start with,
but ... ;-)
M.
PS. This thing actually has 3 system buses, 1 for each of the two
sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
are joined by an interconnect in the middle. But all the IO goes
through 1 of those buses, so for the purposes of this discussion,
it makes no difference whatsoever ;-)
"Martin J. Bligh" <[email protected]> writes:
> >> > Ie. the headers that don't need to go across the bus are the critical
> >> > resource saved by TSO.
> >>
> >> I'm not sure that's entirely true in this case - the Netfinity
> >> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> >> there's 4 - 8 gigabit ethernet cards in this beast spread around
> >> different buses (Troy - are we still just using 4? ... and what's
> >> the raw bandwidth of data we're pushing? ... it's not huge).
> >>
> >> I think we're CPU limited (there's no idle time on this machine),
> >> which is odd for an 8 CPU 900MHz P3 Xeon,
> >
> > Quite possibly. The P3 has roughly an 800MB/s FSB bandwidth, that must
> > be used for both I/O and memory accesses. So just driving a gige card at
> > wire speed takes a considerable portion of the cpus capacity.
> >
> > On analyzing this kind of thing I usually find it quite helpful to
> > compute what the hardware can theoretically to get a feel where the
> > bottlenecks should be.
>
> We can push about 420MB/s of IO out of this thing (out of that
> theoretical 800Mb/s).
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop. Is
that 420MB/sec of IO on this test?
> Specweb is only pushing about 120MB/s of
> total data through it, so it's not bus limited in this case.
Not quite. But you suck at least 240MB/s of your memory bandwidth with
DMA from disk, and then DMA to the nic. Unless there is a highly
cached component. So I doubt you can effectively use more than 1 gige
card, maybe 2. And you have 8?
> Of course, I should have given you that data to start with,
> but ... ;-)
>
> PS. This thing actually has 3 system buses, 1 for each of the two
> sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
> are joined by an interconnect in the middle. But all the IO goes
> through 1 of those buses, so for the purposes of this discussion,
> it makes no difference whatsoever ;-)
Wow the hardware designers really believed in over-subscription.
If the busses are just running 64bit/33MHz you are oversubscribed.
And at 64bit/66MHz the pci busses can easily swamp the system
533*4 ~= 2128MB/s.
What kind of memory bandwidth does the system have, and on which
bus are the memory controllers? I'm just curious
Eric
From: [email protected] (Eric W. Biederman)
Date: 11 Sep 2002 09:06:36 -0600
"Martin J. Bligh" <[email protected]> writes:
> We can push about 420MB/s of IO out of this thing (out of that
> theoretical 800Mb/s).
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop.
You pushed that over the PCI bus of your P3? Just to RAM
doesn't count, lots of cpus can do that.
That's what makes his number interesting.
> Sounds about average for a P3. I have pushed the full 800MiB/s out of
> a P3 processor to memory but it was a very optimized loop. Is
> that 420MB/sec of IO on this test?
Yup, Fibre channel disks. So we know we can push at least that.
> Note quite. But you suck at least 240MB/s of your memory bandwidth with
> DMA from disk, and then DMA to the nic. Unless there is a highly
> cached component. So I doubt you can effectively use more than 1 gige
> card, maybe 2. And you have 8?
Nope, it's operating totally out of pagecache, there's no real disk
IO to speak of.
> Wow the hardware designers really believed in over-subscription.
> If the busses are just running 64bit/33Mhz you are oversubscribed.
> And at 64bit/66Mhz the pci busses can easily swamp the system
> 533*4 ~= 2128MB/s.
Two 32bit buses (or maybe it was just one) and two 64bit buses,
all at 66MHz. Yes, the PCI buses can push more than the backplane,
but things are never perfectly balanced in reality, so I'd prefer
it that way around ... it's not a perfect system, but hey, it's
Intel hardware - this is high volume market, not real high end ;-)
> What kind of memory bandwidth does the system have, and on which
> bus are the memory controllers? I'm just curious
Memory controllers are hung off the interconnect, slightly difficult
to describe. Look for docs on the Intel Profusion chipset, or I can
send you a powerpoint (yeah, yeah) presentation when I get into work
later today if you can't find it. Theoretical mem bandwidth should
be 1600MB/s if you're balanced across the CPUs; in practice I'd
expect to be able to push somewhat over 800MB/s.
M.
"David S. Miller" <[email protected]> writes:
> From: [email protected] (Eric W. Biederman)
> Date: 11 Sep 2002 09:06:36 -0600
>
> "Martin J. Bligh" <[email protected]> writes:
>
> > We can push about 420MB/s of IO out of this thing (out of that
> > theoretical 800Mb/s).
>
> Sounds about average for a P3. I have pushed the full 800MiB/s out of
> a P3 processor to memory but it was a very optimized loop.
>
> You pushed that over the PCI bus of your P3? Just to RAM
> doesn't count, lots of cpu's can do that.
>
> That's what makes his number interesting.
I agree. Getting 420MB/s to the pci bus is nice, especially with a P3.
The 800MB/s to memory was just the test I happened to conduct about 2 years
ago when I was still messing with slow P3 systems. It was a proof of
concept test to see if we could plug in an I/O card into a memory
slot.
On a current P4 system with the E7500 chipset this kind of thing is
easy. I have gotten roughly 450MB/s to a single myrinet card. And there
is enough theoretical bandwidth to do 4 times that. I haven't had a
chance to get it working in practice. When I attempted to run two gige
cards simultaneously I had some weird problem (probably interrupt
related) where adding additional pci cards did not deliver any extra
performance.
On a P3, to get writes from the cpu to hit 800MB/s, you use the special
cpu instructions that bypass the cache.
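For reference, the cache-bypassing instructions in question are the SSE
non-temporal stores (movntps and friends); a minimal copy loop using
them might look like the sketch below. The alignment assumptions are
mine, and Eric's 800MB/s figure comes from his own tuned loop, not from
this.

#include <stddef.h>
#include <xmmintrin.h>	/* SSE intrinsics; available on the P3 */

/*
 * Copy 'bytes' (assumed to be a multiple of 16, with both pointers
 * 16-byte aligned) using non-temporal stores, so the destination never
 * pollutes the cache and the writes leave the cpu as write-combined
 * bursts.  A sketch of the technique only.
 */
static void stream_copy(float *dst, const float *src, size_t bytes)
{
	size_t i;

	for (i = 0; i < bytes / 16; i++) {
		__m128 v = _mm_load_ps(src + 4 * i);	/* normal cached load */
		_mm_stream_ps(dst + 4 * i, v);		/* movntps: NT store  */
	}
	_mm_sfence();	/* flush the write-combining buffers */
}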
My point was that I have tested the P3 bus in question and I achieved
a real world 800MB/s over it. So I expect that on the system in
question unless another bottleneck is hit, it should be possible to
achieve a real world 800MB/s of I/O. There are enough pci busses
to support that kind of traffic.
Unless the memory controller is carefully placed on the system, though,
doing 400+MB/s could easily eat up most of the available memory
bandwidth and reduce the system to doing some very slow cache line fills.
Eric
folx,
sorry for the late reply. catching up on kernel mail.
so all this TSO stuff looks v. v. similar to the IP-only fragmentation
that patricia gilfeather and i implemented on alteon acenics a couple of
years ago (see http://www.cs.unm.edu/~maccabe/SSL/frag/FragPaper1/ for a
general overview). it's exciting to see someone else take a stab on
different hardware and approaching some of the tcp-specific issues.
the main difference, though, is that general purpose kernel development
still focussed on the improvements in *sending* speed. for real high
performance networking, the improvements are necessary in *receiving* cpu
utilization, in our estimation. (see our analysis of interrupt overhead
and the effect on receivers at gigabit speeds--i hope that this has become
common understanding by now)
i guess i can't disagree with david miller that the improvements in TSO are
due entirely to header retransmission for sending, but that's only because
sending wasn't CPU-intensive in the first place. we were able to get a
significant reduction in receiver cpu-utilization by reassembling IP
fragments on the receiver side (sort of a standards-based interrupt
mitigation strategy that has the benefit of not increasing latency the way
interrupt coalescing does).
anyway, nice work,
t.
On Thu, 5 Sep 2002, David S. Miller wrote:
> It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> is limited by the DMA throughput of the PCI host controller. In
> particular some controllers are limited to smaller DMA bursts to
> work around hardware bugs.
>
> Ie. the headers that don't need to go across the bus are the critical
> resource saved by TSO.
>
> I think I've said this a million times, perhaps the next person who
> tries to figure out where the gains come from can just reply with
> a pointer to a URL of this email I'm typing right now :-)
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
Good work. This is the first time i have seen someone say Linux's way of
reverse order is a GoodThing(tm). It was also great to see the de-mything
of some of the old assumptions of the world.
BTW, TSO is not as intelligent as what you are suggesting.
If i am not mistaken, you are not only suggesting fragmentation and
assembly at that level, you are also suggesting retransmits at the NIC.
This could be dangerous for practical reasons (changes in TCP congestion
control algorithms etc). TSO, as was pointed out in earlier emails, is just
a dumb sender of packets. I think even fragmentation is a misnomer.
Essentially you shove a huge buffer to the NIC and it breaks it into
MTU-sized packets for you and sends them.
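In other words, on the wire the card does little more than the loop
below (a conceptual sketch only, not driver or firmware code; checksum
and IP header fixups are glossed over):

#include <stdio.h>
#include <stddef.h>

/* Stub standing in for whatever the NIC does to put one MTU-sized frame
 * on the wire from a header template plus a payload slice. */
static void xmit_frame(unsigned int tcp_seq, size_t len)
{
	printf("frame: seq=%u len=%zu\n", tcp_seq, len);
}

/* What TSO amounts to: the host hands the NIC one big buffer and a header
 * template; the NIC slices it into MSS-sized segments, bumping the TCP
 * sequence number for each.  No retransmit or congestion logic lives here. */
static void tso_send(unsigned int start_seq, size_t len, size_t mss)
{
	size_t off;

	for (off = 0; off < len; off += mss) {
		size_t chunk = (len - off < mss) ? len - off : mss;

		xmit_frame(start_seq + (unsigned int)off, chunk);
	}
}

int main(void)
{
	tso_send(1000, 64 * 1024, 1460);   /* 64KB buffer, 1460-byte MSS */
	return 0;
}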
In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting rid of the biggest
offender -- interrupt overload. Also, with NAPI having got rid of
intermediate queues to the socket level, facilitating zero-copy receive
should be relatively easy to add, but there are no capable NICs in
existence (well, ok, not counting the TIGONII/acenic that you can hack,
and the fact that the tigon 2 is EOL doesn't help other than just for
experiments). I don't think there's any NIC that can offload reassembly;
that might not be such a bad idea.
Are you still continuing work on this?
cheers,
jamal
jamal,
> Good work. The first time i have seen someone say Linux's way of
> reverse order is a GoodThing(tm). It was also great to see de-mything
> some of the old assumption of the world.
thanks. although i'd love to take credit, i don't think that the
reverse-order fragmentation appreciation is all that original: who
wouldn't want their data structure size determined up-front? :-) (not to
mention getting header-overwriting for free as part of the single copy.)
> BTW, TSO is not a intelligent as what you are suggesting.
> If i am not mistaken you are not only suggesting fragmentation and
> assembly at that level you are also suggesting retransmits at the NIC.
> This could be dangerous for practical reasons (changes in TCP congestion
> control algorithms etc). TSO as was pointed in earlier emails is just a
> dumb sender of packets. I think even fragmentation is a misnomer.
> Essentially you shove a huge buffer to the NIC and it breaks it into MTU
> sized packets for you and sends them.
the biggest problem with our approach is that it is extremely difficult to
mix two very different kinds of workloads together: the regular
server-on-the-internet workload (SOI) and the large-cluster-member
workload (LCM). in the former case, SOI, you get dropped packets,
fragments, no fragments, out of order fragments, etc. in the LCM case you
basically never get any of that stuff--you're on a closed network with
1000-10000 of your closest cluster friends and that's just what you're
doing. no fragments (unless you put them there), no out of order
fragments (unless you send them) and basically no dropped packets ever.
obviously, if you can assume conditions like that, you can do things like:
only reassemble fragments in reverse order since you know you'll only send
them that way, e.g.
> In regards to the receive side CPU utilization improvements: I think
> that NAPI does a good job at least in getting ridding of the biggest
> offender -- interupt overload. Also with NAPI also having got rid of
> intermidiate queues to the socket level, facilitating of zero copy receive
> should be relatively easy to add but there are no capable NICs in
> existence (well, ok not counting the TIGONII/acenic that you can hack
> and the fact that the tigon 2 is EOL doesnt help other than just for
> experiments). I dont think theres any NIC that can offload reassembly;
> that might not be such a bad idea.
i've done some reading about NAPI just recently (somehow i missed the
splash when it came out). the two things i like about it are the hardware
independent interrupt mitigation technique and using the DMA buffers as a
receive backlog. i'm concerned about the numbers posted by ibm folx
recently showing a slowdown under some conditions using NAPI and need to
read the rest of that discussion.
we are definitely aware of the fact that the more you want to put on the
NIC, the more the NIC will have to do (and the more expensive it will have
to be). right now the NICs that people are developing on are the
TigonII/III and, even more closed/proprietary, the Myrinet NICs. i would
love to have a <$200 NIC with open firmware and a CPU/memory so that we
could offload some more of this functionality (where it makes sense).
>
> Are you still continuing work on this?
>
definitely! we were just talking about some of these issues yesterday
(and trying to find hardware spec info on the web for the e1000 platform
to see what else they might be able to do). patricia gilfeather is working
on finding parts of TCP that are separable from the rest of TCP, but the
problems you raise are serious: it would have to be on an
application-specific and socket-specific basis, so that the app would
*know* that functionality (like acks for synchronization packets or
whatever) was being offloaded.
the biggest difference in our perspective, versus the common kernel
developers, is that we're still looking for ways to get the OS out of the
way of the applications. if we can do large data transfers (with
pre-posted receives and pre-posted memory allocation, obviously) directly
from the nic into application memory and have a clean, relatively simple
and standard api to do that, we avoid all of the interrupt mitigation
techniques and save hugely on context switching overhead.
this may now be off-topic for linux-kernel and i'd be happy to chat
further in private email if others are getting bored :-).
> cheers,
> jamal
t.
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
On Thu, 2002-09-12 at 14:57, Todd Underwood wrote:
> thanks. although i'd love to take credit, i don't think that the
> reverse-order fragmentation appreciation is all that original: who
> wouldn't want their data sctructure size determined up-front? :-) (not to
> mention getting header-overwriting for-free as part of the single copy.
As far as I am aware it was original when Linux first did it (and we
broke cisco pix, some boot proms, some sco in the process). Credit goes
to Arnt Gulbrandsen, probably better known nowadays for his work on Qt.
alan,
good to know. it's a nice piece of engineering. it's useful to note that
linux has such a long and rich history of breaking de-facto standards in
order to make things work better.
t.
On 12 Sep 2002, Alan Cox wrote:
> On Thu, 2002-09-12 at 14:57, Todd Underwood wrote:
> > thanks. although i'd love to take credit, i don't think that the
> > reverse-order fragmentation appreciation is all that original: who
> > wouldn't want their data sctructure size determined up-front? :-) (not to
> > mention getting header-overwriting for-free as part of the single copy.
>
> As far as I am aware it was original when Linux first did it (and we
> broke cisco pix, some boot proms, some sco in the process). Credit goes
> to Arnt Gulbrandsen probably better known nowdays for his work on Qt
>
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
Quoting Todd Underwood <[email protected]>:
> sorry for the late reply. catching up on kernel mail.
> the main different, though, is that general purpose kernel
> development still focussed on the improvements in *sending* speed.
> for real high performance networking, the improvements are necessary
> in *receiving* cpu utilization, in our estimation.
> (see our analysis of interrupt overhead and the effect on receivers
> at gigabit speeds--i hope that this has become common understanding
> by now)
Some of that may be a byproduct of the "all the world's a webserver"
mindset - we are primarily focussed on the server side (aka the
money side ;)), and there is some amount of automatic thinking that
this means we're going to be sending data and receiving small packets
(mostly acks) in return. There is much less emphasis given to solving
the problems on the other side (active connection scalability, for
instance), or other issues that manifest themselves as
client-side bottlenecks for most applications.
thanks,
Nivedita
From: jamal <[email protected]>
Date: Thu, 12 Sep 2002 08:30:44 -0400 (EDT)
In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting ridding of the biggest
offender -- interupt overload.
I disagree, at least for bulk receivers. We have no way currently to
get rid of the data copy. We desperately need sys_receivefile() and
appropriate ops all the way into the networking, then the necessary
driver level support to handle the cards that can do this.
Once 10gbit cards start hitting the shelves this will convert from a
nice perf improvement into a must have.
dave, all,
On Thu, 12 Sep 2002, David S. Miller wrote:
> I disagree, at least for bulk receivers. We have no way currently to
> get rid of the data copy. We desperately need sys_receivefile() and
> appropriate ops all the way into the networking, then the necessary
> driver level support to handle the cards that can do this.
not sure i understand what you're proposing, but while we're at it, why
not also make the api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into? it seems likely
notification that the buffer was used would have to travel through the
kernel, but it would be nice to save the interrupts altogether.
this may be exactly what you were saying.
>
> Once 10gbit cards start hitting the shelves this will convert from a
> nice perf improvement into a must have.
totally agreed. this is a must for high-performance computing now (since
who wants to waste 80-100% of their CPU just running the network?)
t.
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
Quoting [email protected]:
> dave, all,
>
> not sure i understand what you're proposing, but while we're at it,
> why not also make the api for apps to allocate a buffer in userland
> that (for nics that support it) the nic can dma directly into? it
I believe that's exactly what David was referring to - a
reverse-direction sendfile(), so to speak.
> seems likely notification that the buffer was used would have to
> travel through the kernel, but it would be nice to save the
> interrupts altogether.
However, I don't think what you're saving is interrupts so
much as the extra copy, but I could be wrong.
thanks,
Nivedita
From: [email protected]
Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)
not sure i understand what you're proposing
Cards in the future at 10gbit and faster are going to provide
facilities by which:
1) You register an IPv4 src_addr/dst_addr TCP src_port/dst_port cookie
with the hardware when TCP connections are opened.
2) The card scans arriving TCP packets; if the cookie matches, it
accumulates received data to fill full pages and wakes up the
networking when either:
a) a full page has accumulated for a connection
b) connection cookie mismatch
c) configurable timer has expired
3) TCP ends up getting receive packets with skb->shinfo() fraglist
containing the data portion in full struct page *'s
This can be placed directly into the page cache via sys_receivefile
generic code in mm/filemap.c or f.e. NFSD/NFS receive side
processing.
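Purely as a thumbnail of steps 1-3 (the structure and function below are
invented for illustration; no such card or kernel API existed at the
time of writing, and the timer case is not modeled):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-connection cookie the host would hand to the card
 * when the TCP connection is opened (step 1 above). */
struct flow_cookie {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint32_t expected_seq;	/* next in-sequence byte for this flow */
	uint32_t bytes_pending;	/* data accumulated toward a full page */
};

/* Step 2, roughly: for each arriving TCP segment the card compares the
 * cookie; in-sequence data is coalesced until a full page has built up
 * (or a mismatch forces a flush), and only then is the host woken.
 * Returns true when the host should be interrupted. */
static bool card_rx_segment(struct flow_cookie *fc,
			    uint32_t saddr, uint32_t daddr,
			    uint16_t sport, uint16_t dport,
			    uint32_t seq, uint32_t len, uint32_t page_size)
{
	if (saddr != fc->saddr || daddr != fc->daddr ||
	    sport != fc->sport || dport != fc->dport ||
	    seq != fc->expected_seq)
		return true;			/* cookie mismatch: flush */

	fc->expected_seq += len;
	fc->bytes_pending += len;
	if (fc->bytes_pending >= page_size) {	/* a full page is ready  */
		fc->bytes_pending -= page_size;
		return true;
	}
	return false;				/* keep coalescing       */
}

int main(void)
{
	struct flow_cookie fc = { 1, 2, 80, 1024, 1000, 0 };
	uint32_t seq = 1000;
	int i;

	for (i = 0; i < 4; i++, seq += 1460)
		printf("segment %d -> wake host: %d\n", i,
		       card_rx_segment(&fc, 1, 2, 80, 1024, seq, 1460, 4096));
	return 0;
}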
not also make the api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into? it seems likely
notification that the buffer was used would have to travel through the
kernel, but it would be nice to save the interrupts altogether.
This is already doable with sys_sendfile() for send today. The user
just does the following:
1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_sendfile() to send the data over the socket from that file
3) uses socket write space monitoring to determine if the portions of
the shared area are reclaimable for new writes
BTW Apache could make use of this; I doubt it does currently.
The corollary with sys_receivefile would be that the user:
1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_receivefile() to pull in the data from the socket to that file
There is no need to poll the receive socket space as the successful
return from sys_receivefile() is the "data got received successfully"
event.
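A minimal user-space sketch of the send-side recipe above (steps 1 and
2 only; the write-space monitoring of step 3 and all error handling are
left out, and 'sock' is assumed to be a connected TCP socket set up
elsewhere):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Write user data into a MAP_SHARED file mapping, then push it out of
 * the page cache with sendfile(). */
static int send_buffer(int sock, const char *buf, size_t len)
{
	int fd = open("/tmp/sendbuf", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *map;
	off_t off = 0;
	ssize_t sent;

	if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
		return -1;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	memcpy(map, buf, len);			/* step 1: fill the mapping */
	munmap(map, len);

	sent = sendfile(sock, fd, &off, len);	/* step 2: zero-copy send   */
	close(fd);
	return sent == (ssize_t)len ? 0 : -1;
}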
totally agreed. this is a must for high-performance computing now (since
who wants to waste 80-100% of their CPU just running the network)?
If send side is your bottleneck and you think zerocopy sends of
user anonymous data might help, see the above since we can do it
today and you are free to experiment.
Franks a lot,
David S. Miller
[email protected]
10 gige becomes more of an interesting beast. Not sure if we would see
servers with 10gige real soon now. Your proposal does make sense, although
compute power would still be a player. I think the key would be
parallelization.
Now if it wasn't for the stupid way TCP options were designed,
you could easily do remote DMA instead. It would be relatively easy to add
NIC support for that. Maybe SCTP would save us ;-> However, if history
could be used to predict the future, i think TCP will continue to be
"hacked" to fit the throughput requirements, so no chance for SCTP to be
a big player, i am afraid.
cheers,
jamal
On Fri, 13 Sep 2002, David S. Miller wrote:
> From: [email protected]
> Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)
>
> not sure i understand what you're proposing
>
> Cards in the future at 10gbit and faster are going to provide
> facilities by which:
>
> 1) You register a IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
> with the hardware when TCP connections are openned.
>
[..]
From: jamal <[email protected]>
Date: Sun, 15 Sep 2002 16:16:13 -0400 (EDT)
Your proposal does make sense although compute power would still be
a player. I think the key would be parallelization;
Oh I forgot to mention that some of these cards also compute a cookie
for you on receive packets, and you're meant to point the input
processing for that packet to a cpu whose number is derived from that
cookie it gives you.
Lockless per-cpu packet input queues make this sort of hard for us
to implement currently.
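The steering itself is trivial once the card hands over a cookie;
conceptually it is just the sketch below (illustrative only; the hard
part is the per-cpu input queue issue mentioned above):

#include <stdint.h>
#include <stdio.h>

/* Given the card-computed flow cookie, pick the cpu whose input queue
 * should process the packet, so one flow always lands on one cpu. */
static unsigned int cookie_to_cpu(uint32_t cookie, unsigned int nr_cpus)
{
	return cookie % nr_cpus;
}

int main(void)
{
	uint32_t cookies[] = { 0x1234, 0xbeef, 0xcafe };
	int i;

	for (i = 0; i < 3; i++)
		printf("cookie %#x -> cpu %u\n", (unsigned int)cookies[i],
		       cookie_to_cpu(cookies[i], 4));
	return 0;
}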
david,
comments/questions below...
On Fri, 13 Sep 2002, David S. Miller wrote:
> 1) You register a IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
> with the hardware when TCP connections are openned.
intriguing architecture. are there any standards in progress to support
this? basically, people doing high performance computing have been
customizing non-commodity nics (acenic, myrinet, quadrics, etc.) to do
some of this cookie registration/scanning. it would be nice if there were
a standard API/hardware capability that took care of at least this piece.
(frankly, it would also be nice if customizable, almost-commodity nics
based on processor/memory/firmware architecture rather than just asics
(like the acenic) continued to exist).
> not also make the api for apps to allocate a buffer in userland that (for
> nics that support it) the nic can dma directly into? it seems likely
> notification that the buffer was used would have to travel through the
> kernel, but it would be nice to save the interrupts altogether.
>
> This is already doable with sys_sendfile() for send today. The user
> just does the following:
>
> 1) mmap()'s a file with MAP_SHARED to write the data
> 2) uses sys_sendfile() to send the data over the socket from that file
> 3) uses socket write space monitoring to determine if the portions of
> the shared area are reclaimable for new writes
>
> BTW Apache could make this, I doubt it does currently.
>
> The corrolary with sys_receivefile would be that the use:
>
> 1) mmap()'s a file with MAP_SHARED to write the data
> 2) uses sys_receivefile() to pull in the data from the socket to that file
>
> There is no need to poll the receive socket space as the successful
> return from sys_receivefile() is the "data got received successfully"
> event.
the send case has been well described and seems to work well for the people
for whom that is the bottleneck. that has not been the case in HPC, since
sends are relatively cheaper (in terms of cpu) than receives.
who is working on this architecture for receives? i know quite a few
people who would be interested in working on it and willing to prototype
as well.
> totally agreed. this is a must for high-performance computing now (since
> who wants to waste 80-100% of their CPU just running the network)?
>
> If send side is your bottleneck and you think zerocopy sends of
> user anonymous data might help, see the above since we can do it
> today and you are free to experiment.
for many of the applications that i care about, receive is the bottleneck,
so zerocopy sends are somewhat of a non-issue (not that they're not nice,
they just don't solve the primary waste of processor resources).
is there a beginning implementation yet of zerocopy receives as you
describe above, or would you be interested in entertaining implementations
that work on existing (1Gig-e) cards?
what i'm thinking is something that prototypes the api to the nic that you
are proposing and implements the NIC-side functionality in firmware on the
acenic-2's (which have available firmware in at least two
implementations--the alteon version and pete wyckoff's version (which may
be less license-encumbered)).
this is obviously only feasible if there already exists some consensus on
what the os-to-hardware API should look like (or there is willingness to
try to build a consensus around that now).
t.
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
From: [email protected]
Date: Mon, 16 Sep 2002 08:16:47 -0600 (MDT)
are there any standards in progress to support this.
Your question makes no sense, it is a hardware optimization
of an existing standard. The chip merely is told what flows
exist and it concatenates TCP data from consecutive packets
for that flow if they arrive in sequence.
who is working on this architecture for receives?
Once cards with the feature exist, probably Alexey and myself
will work on it.
Basically, whoever isn't busy with something else once the technology
appears.
is there a beginning implementation yet of zerocopy receives
No.
Franks a lot,
David S. Miller
[email protected]
folx,
perhaps i was insufficiently clear.
On Mon, 16 Sep 2002, David S. Miller wrote:
> are there any standards in progress to support this.
>
> Your question makes no sense, it is a hardware optimization
> of an existing standard. The chip merely is told what flows
> exist and it concatenates TCP data from consequetive packets
> for that flow if they arrive in sequence.
hardware optimizations can be standardized. in fact, when they are, it is
substantially easier to implement to them.
my assumption (perhaps incorrect) is that some core set of functionality
is necessary for a card to support zero-copy receives (in particular, the
ability to register cookies of expected data flows and the memory location
to which they are to be sent). what 'existing standard' is this
kernel<->api a standardization of?
> who is working on this architecture for receives?
>
> Once cards with the feature exist, probably Alexey and myself
> will work on it.
>
> Basically, who ever isn't busy with something else once the technology
> appears.
so if we wrote and distributed firmware for alteon acenics that supported
this today, you would be willing to incorporate the new system calls into
the networking code (along with the new firmware for the card, provided we
could talk jes into accepting the changes, assuming he's still the
maintainer of the driver)? that's great.
>
> is there a beginning implementation yet of zerocopy receives
>
> No.
thanks for your feedback.
t.
--
todd underwood, vp & cto
oso grande technologies, inc.
[email protected]
"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
On Mon, 16 Sep 2002, David S. Miller wrote:
> From: [email protected]
> Date: Mon, 16 Sep 2002 08:16:47 -0600 (MDT)
>
> are there any standards in progress to support this.
>
> Your question makes no sense, it is a hardware optimization
> of an existing standard. The chip merely is told what flows
> exist and it concatenates TCP data from consequetive packets
> for that flow if they arrive in sequence.
>
Hrm. Again, the big Q:
How "thmart" is this NIC going to be (think congestion control and
the du-jour flavor).
cheers,
jamal