I am working on a very high-speed, packet-based interface, but we are having
severe problems trading off bandwidth against CPU horsepower. Enclosed is part
of a summary. PLEASE cc responses directly to [email protected]
Thanks!!!
--
"Who would be stupid enough to quote a fictitious character?"
-- Don Quixote
On Wed, Feb 21, 2001 at 11:58:23AM +0000, Alan Cox wrote:
> Dropping packets under load will make TCP do the right thing. You don't need
> complex mathematical models, since dropping frames under load is just another
> form of congestion, and TCP handles it pretty sanely.
Alan: thanks for your response...
This is exactly what I would expect to see, but we are seeing something
else.
Under HEAVY load we are seeing approximately 20Mbit of TCP throughput. If
we "shape" the presented load (I use the term loosely; we don't actually have
a real shaper, we just load down the transmitting CPU), we can
get 60-70Mbit. I'm not quite sure why this is. My first guess was
that because the kernel was getting 99% of the CPU, the application was
getting very little, and thus the read wasn't happening fast enough, and
the socket was blocking. In that case you would expect the system to reach
a nice equilibrium: if the app stopped reading, the kernel would
stop ACKing, and the transmitter would back off, eventually to a point
where the app could start reading again because the kernel load had dropped.
This is NOT what I'm seeing at all... the kernel load appears to be
pegged at 100% (or very close to it), the user-space app is getting
enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
appears to be ACKing ALL the traffic, which I don't understand at all
(i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
With UDP, we can get the full 300Mbit throughput, but only if we shape
the load to 300Mbit. If we increase the load past 300Mbit, the received
frames (at the user-space UDP app) drop to 10-20Mbit, again due to
user-space application scheduling problems.
-nye
> that because the kernel was getting 99% of the CPU, the application was
> getting very little, and thus the read wasn't happening fast enough, and
Seems reasonable
> This is NOT what I'm seeing at all... the kernel load appears to be
> pegged at 100% (or very close to it), the user-space app is getting
> enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> appears to be ACKing ALL the traffic, which I don't understand at all
> (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
TCP _requires_ the remote end ack every 2nd frame regardless of progress.
> With UDP, we can get the full 300Mbit throughput, but only if we shape
> the load to 300Mbit. If we increase the load past 300Mbit, the received
> frames (at the user-space UDP app) drop to 10-20Mbit, again due to
> user-space application scheduling problems.
How is your incoming traffic handled architecturally - an IRQ per packet, or
some kind of ring buffer with IRQ mitigation? Do you know where the CPU
load is - is it mostly the IRQ servicing, or mostly the network stack?
On Wed, Feb 21, 2001 at 10:07:32PM +0000, Alan Cox wrote:
> > that because the kernel was getting 99% of the CPU, the application was
> > getting very little, and thus the read wasn't happening fast enough, and
>
> Seems reasonable
>
> > This is NOT what I'm seeing at all... the kernel load appears to be
> > pegged at 100% (or very close to it), the user-space app is getting
> > enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKing ALL the traffic, which I don't understand at all
> > (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of progress.
>
> > With UDP, we can get the full 300Mbit throughput, but only if we shape
> > the load to 300Mbit. If we increase the load past 300Mbit, the received
> > frames (at the user-space UDP app) drop to 10-20Mbit, again due to
> > user-space application scheduling problems.
>
> How is your incoming traffic handled architecturally - an IRQ per packet, or
> some kind of ring buffer with IRQ mitigation? Do you know where the CPU
> load is - is it mostly the IRQ servicing, or mostly the network stack?
>
>
Alan: thanks again for your prompt response!
It's a bus-mastered DMA ring buffer. As to the load, I'm not quite sure... we
were using a fairly large ring buffer, but increasing/decreasing its size
didn't seem to affect the number of packets per interrupt. I added a
little watermarking code, and it seems that we do (at peak) about 30-35
packets per interrupt. That is STILL a heck of a lot of interrupts! I
can't quite figure out why the driver refuses to go deeper.
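For concreteness, the watermarking amounts to something like the sketch
below, assuming a 2.4-style driver; my_dev_ring_next() and rx_per_irq_peak
are hypothetical names standing in for the real driver's ring accessor and
counter:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>

static struct sk_buff *my_dev_ring_next(struct net_device *dev); /* hypothetical */

static unsigned int rx_per_irq_peak;            /* high-water mark */

static void my_dev_rx_interrupt(struct net_device *dev)
{
        struct sk_buff *skb;
        unsigned int n = 0;

        /* Drain everything the DMA engine has completed since the last
         * interrupt, counting frames as we go. */
        while ((skb = my_dev_ring_next(dev)) != NULL) {
                skb->protocol = eth_type_trans(skb, dev);
                netif_rx(skb);
                n++;
        }
        if (n > rx_per_irq_peak)
                rx_per_irq_peak = n;            /* peaks at ~30-35 here */
}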
I can think of a couple of possible solutions. Our interface has a HUGE
amount of hardware buffering, so I can easily just stop reading for
a short time if we detect congestion... can you suggest a nice clean
mechanism for this?
Any other ideas?
> I can think of a couple of possible solutions. Our interface has a HUGE
> amount of hardware buffering, so I can easily just stop reading for
> a short time if we detect congestion... can you suggest a nice clean
> mechanism for this?
If you have a lot of buffers, you can try one thing to see if it's IRQ load:
turn the IRQ off, set a fast timer running, and hook the buffer handling to
the timer IRQ.
The next obvious step would be to use the timer-based IRQ handling to limit
the number of buffers you call netif_rx() on, and discard any others.
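A rough sketch of both suggestions against the 2.4-era driver API; the
my_dev_* helpers and the budget of 32 are hypothetical, not from any real
driver:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/timer.h>

#define RX_BUDGET 32    /* frames handed to netif_rx() per tick (arbitrary) */

static struct sk_buff *my_dev_ring_next(struct net_device *dev);  /* hypothetical */
static void my_dev_mask_rx_irq(struct net_device *dev);           /* hypothetical */

static struct timer_list rx_timer;

static void my_dev_rx_poll(unsigned long data)
{
        struct net_device *dev = (struct net_device *)data;
        struct sk_buff *skb;
        unsigned int handed = 0;

        /* Drain the DMA ring from timer context instead of the NIC's own
         * interrupt: hand at most RX_BUDGET frames to the stack and
         * deliberately drop the rest, so RX work per tick stays bounded. */
        while ((skb = my_dev_ring_next(dev)) != NULL) {
                if (handed++ < RX_BUDGET)
                        netif_rx(skb);
                else
                        dev_kfree_skb(skb);     /* early, cheap drop */
        }
        mod_timer(&rx_timer, jiffies + 1);      /* poll again next tick */
}

static void my_dev_start_polling(struct net_device *dev)
{
        my_dev_mask_rx_irq(dev);                /* silence the RX interrupt */
        init_timer(&rx_timer);
        rx_timer.function = my_dev_rx_poll;
        rx_timer.data = (unsigned long)dev;
        mod_timer(&rx_timer, jiffies + 1);
}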
Finally, don't rule out memory bandwidth: if the RAM is main memory, then the
DMA engine could be pretty much driving the CPU off the bus at high data
rates.
Alan
On Wed, Feb 21, 2001 at 02:00:55PM -0800, Nye Liu wrote:
[snip]
> This is NOT what I'm seeing at all... the kernel load appears to be
> pegged at 100% (or very close to it), the user-space app is getting
> enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> appears to be ACKing ALL the traffic, which I don't understand at all
> (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
>
> With UDP, we can get the full 300Mbit throughput, but only if we shape
> the load to 300Mbit. If we increase the load past 300Mbit, the received
> frames (at the user-space UDP app) drop to 10-20Mbit, again due to
> user-space application scheduling problems.
Perhaps excess context switches are thrashing the system?
On Wed, Feb 21, 2001 at 10:07:32PM +0000, Alan Cox wrote:
> > that because the kernel was getting 99% of the CPU, the application was
> > getting very little, and thus the read wasn't happening fast enough, and
>
> Seems reasonable
>
> > This is NOT what I'm seeing at all... the kernel load appears to be
> > pegged at 100% (or very close to it), the user-space app is getting
> > enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKing ALL the traffic, which I don't understand at all
> > (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of progress.
YIPES. I didn't realize this was the case... how is end-to-end application
flow control handled when the bottleneck is user-space-bound and not
bandwidth-bound? E.g. if I write a test app that does
while (1) {
        sleep(5);
        read(sock, buf, 1);
}
and the transmitter is unrestricted, what happens?
Does it have to do with TCP_FORMAL_WINDOW (e.g. automatically reducing the
window size to zero when the queue backs up)?
Or is it only a CPU loading problem? (I.e. is there a difference in queuing
behavior between 1) the user process doesn't get cycles and 2) the user
process simply fails to read?)
Also, I have been reading up on CONFIG_HW_FLOWCONTROL... what is the
recommended way for the driver to stop receiving? In the sample tulip
code I see you can register an xon callback, but I can't tell if there
is a way to see the backlog from the driver.
-nye
Alan Cox wrote:
>
> > that because the kernel was getting 99% of the CPU, the application was
> > getting very little, and thus the read wasn't happening fast enough, and
>
> Seems reasonable
>
> > This is NOT what I'm seeing at all... the kernel load appears to be
> > pegged at 100% (or very close to it), the user-space app is getting
> > enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKing ALL the traffic, which I don't understand at all
> > (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of progress.
um, I thought the spec says that ACKing every 2nd segment is a SHOULD,
not a MUST?
rick jones
--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...
> > > This is NOT what I'm seeing at all... the kernel load appears to be
> > > pegged at 100% (or very close to it), the user-space app is getting
> > > enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > > appears to be ACKing ALL the traffic, which I don't understand at all
> > > (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
> >
> > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
>
> YIPES. I didn't realize this was the case... how is end-to-end application
> flow control handled when the bottleneck is user-space-bound and not
> bandwidth-bound? E.g. if I write a test app that does
If the app is not reading from the socket buffer, the receiving TCP is
supposed to stop sending window updates, and the sender is supposed to
stop sending data when it runs out of window.
If a TCP ACKs data, it really should (must?) not then later drop it on
the floor without aborting the connection. If a TCP is ACKing data, and
that data is then dropped before it is given to the application, and the
connection is not being reset, that is probably a bug.
A TCP _is_ free to drop data prior to sending an ACK - it simply drops
it and does not ACK it.
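A minimal user-space sketch of the flow control described above, mirroring
the while(1) { sleep; read; } test from the question; loopback and port 5555
are arbitrary choices, and error checking is omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in a;
        int ls, c;
        char b;

        ls = socket(AF_INET, SOCK_STREAM, 0);
        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        a.sin_port = htons(5555);       /* arbitrary test port */
        bind(ls, (struct sockaddr *)&a, sizeof(a));
        listen(ls, 1);

        if (fork() == 0) {
                /* child: unrestricted sender, like the 300Mbit blaster */
                static char buf[65536];
                int s = socket(AF_INET, SOCK_STREAM, 0);

                memset(buf, 'x', sizeof(buf));
                connect(s, (struct sockaddr *)&a, sizeof(a));
                for (;;) {
                        /* Once the receiver's socket buffer and our own
                         * send buffer fill, the advertised window reaches
                         * 0 and this write() blocks: flow control. */
                        write(s, buf, sizeof(buf));
                }
        }

        /* parent: the slow reader from the question */
        c = accept(ls, NULL, NULL);
        for (;;) {
                sleep(5);
                read(c, &b, 1); /* 1 byte per 5s keeps the window near 0 */
        }
}

Watching this with tcpdump on the loopback interface should show the
receiver advertising smaller and smaller windows down to 0, and the sender
then probing the zero window, rather than data being ACKed and discarded.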
rick jones
--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...
> and the transmitter is unrestricted, what happens?
> Does it have to do with TCP_FORMAL_WINDOW (e.g. automatically reducing the
> window size to zero when the queue backs up)?
Read RFC1122. Basically your guess is right. The sender sends data and gets
back ACKs saying 'window 0'. It will then do exponential backoff, polling
the zero window as it backs off (ACKs being unreliable).
> > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
>
> um, I thought the spec says that ACKing every 2nd segment is a SHOULD,
> not a MUST?
Yes, it's a SHOULD in RFC1122, but in any normal environment it's pretty
much a must, and I know of no stack significantly violating it.
RFC1122 also requires that your protocol stack SHOULD be able to leap tall
buildings in a single bound, of course...
Alan Cox wrote:
>
> > > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
> >
> > um, I thought the spec says that ACKing every 2nd segment is a SHOULD,
> > not a MUST?
>
> Yes, it's a SHOULD in RFC1122, but in any normal environment it's pretty
> much a must, and I know of no stack significantly violating it.
I didn't know there was such a thing as a normal environment :)
> RFC1122 also requires that your protocol stack SHOULD be able to leap tall
> buildings in a single bound, of course...
And, of course, my protocol stack does :) It is also a floor wax AND a
dessert topping!-)
rick jones
--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...
Hi!
> > This is NOT what I'm seeing at all... the kernel load appears to be
> > pegged at 100% (or very close to it), the user-space app is getting
> > enough CPU time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKing ALL the traffic, which I don't understand at all
> > (i.e. the transmitter is simply blasting 300Mbit of TCP, unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of
> progress.
Should not TCP advertise a window of 0 to stop the sender?
Where does the kernel put all that data in the TCP case? I do not understand
that. The transmitter blasts at 300Mbit, userspace gets 20Mbit. There's a
280Mbit data stream going _somewhere_. It should be eating memory at
35MB/second; unless you have a gig of RAM, something interesting should
happen after a minute or so...
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]
Hello!
> > Yes, it's a SHOULD in RFC1122, but in any normal environment it's pretty
> > much a must, and I know of no stack significantly violating it.
>
> I didn't know there was such a thing as a normal environment :)
Joking apart, such "normal" environments are rare today.
From tcpdump traces it is clear that win2000 does not ACK every other MSS.
It can ACK once per window at high load. I have seen the same behaviour
from Solaris. FreeBSD 4.x surely does not ACK each second MSS
(that is evident from the source code), which is probably a bug (at least,
it stops ACKing entirely as soon as MSG_WAITALL is used. 8))
ACKing each second MSS is required to make slow start proceed reasonably
quickly. Once the window is full, those extra ACKs are useless, so win2000
is fully right and, in fact, optimal.
Alexey