Please Cc: me in your responses.
The story so far:
I've been continuing to muck around with the stack, trying both to improve
overall performance, and specifically to improve rx relative to tx
performance, primarily in gig-and-beyond (e.g., Quadrics) environments.
To this end, I have begun by profiling and analyzing the RX side stack.
The profiling is being done as I write, and the analysis is what prompts
me to write.
The direct question:
How many times is data copied between the time that it is received at the
NIC and when the user's call to read() returns the data?
The reason for the question:
I could've sworn I heard the stack was single-copy on both the TX and RX
sides. But, it doesn't look to me like it is. Rather, it looks like there
is one copy in tcp_rcv_established() (via tcp_copy_to_iovec()), and a
second copy in tcp_recvmsg() (which is called when the user calls read()).
Both of these copies are, I believe, done by skb_copy_datagram_iovec().
The ancillary questions:
If I am wrong about this- does anyone care to publicly humiliate me by
telling me how/why (and possibly calling me stupid)?
If I am right about this- is there a specific reason that it is
implemented this way? Are there any thoughts on changing it? Our specific
inclination is to keep the skbs around until the user calls read(), at
which point we do an iovec memcopy to the userspace buffer, eliminating a
copy. The danger is that if the user doesn't read from the socket, this
might needlessly tie up skbs. To avoid this, we could implement either a
watermark or a timeout for skb consolidation: if the user doesn't call
read() soon enough, or before too many skbs are in use, we copy the skbs
to a socket buffer as normal.
Cheers,
--Gus
> How many times is data copied between the time that it is received at the
> NIC and when the user's call to read() returns the data?
Optimal case for TCP:
    NIC->buffer (DMA)
    buffer->user (CPU)
IFF the TCP checksum can be done by the card.
On Tue, Jul 09, 2002 at 04:29:35PM -0600, Hurwitz Justin W. wrote:
> Please Cc: me in your responses.
>
> The story so far:
>
> I've been continuing to muck around with the stack, trying both to improve
> overall performance, and specifically to improve rx relative to tx
> performance, primarily in gig-and-beyond (e.g., Quadrics) environments.
...
> The direct question:
>
> How many times is data copied between the time that it is received at the
> NIC and when the user's call to read() returns the data?
> The reason for the question:
>
> I could've sworn I heard the stack was single-copy on both the TX and RX
> sides. But, it doesn't look to me like it is. Rather, it looks like there
> is one copy in tcp_rcv_established() (via tcp_copy_to_iovec()), and a
> second copy in tcp_recvmsg() (which is called when the user calls read()).
> Both of these copies are, I believe, done by skb_copy_datagram_iovec().
I suspect that in many cases there is a third copy right in the network
card driver, to realign the data so that the TCP frame begins at a 32-bit
boundary. Perhaps that is only for RISC CPU systems (e.g. Alpha, primarily.)
Can the GigE cards do ethernet-frame reception pre-alignment so that
after the 14-byte ethernet header, the TCP frame begins at a 32-bit
boundary?
...
> Cheers,
> --Gus
/Matti Aarnio
> I could've sworn I heard the stack was single-copy
> on both the TX and RX sides. But, it doesn't look to
> me like it is. Rather, it looks like there is one copy
> in tcp_rcv_established() (via tcp_copy_to_iovec()), and a
> second copy in tcp_recvmsg() (which is called when the
> user calls read()). Both of these copies are, I believe,
> done by skb_copy_datagram_iovec().
tcp_recvmsg() only does the copy from the receive_queue
or the backlog queue. tcp_rcv_established() does the copy
directly into the iovec or queues it onto the receive_queue
or backlog queue for tcp_recvmsg() to complete the work. So
there aren't two copies of the same data happening, just a
question of one or the other function doing the work, depending
on whether there is currently a process doing a read or not.
hth,
thanks,
Nivedita
So, to make sure I have this right:
When the data is processed from the NIC
    tcp_rcv_established() is called in processing it
        if a user process is waiting on the socket
            iovec copy data to the user
        else
            copy it to receive_queue or backlog_queue
When the user tries to read (in any way) from a socket
    iovec copy from receive_queue or backlog_queue
E.g., if the user is ready for the data, dump it straight from SKBs. Else,
don't waste SKBs on a lazy (or busy) user and copy the data to a queue.
If this is right, I'm happy :) If it's wrong, please correct.
Thx,
--Gus
On Wed, 10 Jul 2002 [email protected] wrote:
>
> > I could've sworn I heard the stack was single-copy
> > on both the TX and RX sides. But, it doesn't look to
> > me like it is. Rather, it looks like there is one copy
> > in tcp_rcv_established() (via tcp_copy_to_iovec()), and a
> > second copy in tcp_recvmsg() (which is called when the
> > user calls read()). Both of these copies are, I believe,
> > done by skb_copy_datagram_iovec().
>
> tcp_recvmsg() only does the copy from the receive_queue
> or the backlog queue. tcp_rcv_established() does the copy
> directly into the iovec or queues it onto the receive_queue
> or backlog queue for tcp_recvmsg() to complete the work. So
> there aren't two copies of the same data happening, just a
> question of one or the other function doing the work, depending
> on whether there is currently a process doing a read or not.
>
> hth,
>
> thanks,
> Nivedita
>
>
> So, to make sure I have this right:
>
> When the data is processed from the NIC
>     tcp_rcv_established() is called in processing it
>         if a user process is waiting on the socket
>             iovec copy data to the user
>         else
>             copy it to receive_queue or backlog_queue
Well, we append the skb to the tail of the queue.
This is not a copy operation (just a few instructions).
> When the user tries to read (in any way) from a socket
>     iovec copy from receive_queue or backlog_queue
>
>
> E.g., if the user is ready for the data, dump it straight from SKBs. Else,
> don't waste SKBs on a lazy (or busy) user and copy the data to a queue.
yep.
> If this is right, I'm happy :) If it's wrong, please correct.
>
> Thx,
> --Gus
I should add that my reading of the code is hardly
authoritative :). caveat emptor...
thanks,
Nivedita
   From: Matti Aarnio <[email protected]>
   Date: Wed, 10 Jul 2002 11:29:16 +0300
   I suspect that in many cases there is a third copy right in the network
   card driver to realign data so that TCP frame begins at a 32-bit boundary.
   Perhaps that is only for RISC CPU systems (e.g. Alpha, primarily.)
   Can the GigE cards do ethernet-frame reception pre-alignment so that
   after the 14 byte ethernet header, the TCP frame begins at 32-bit
   boundary ?
All gigabit chips allow the receive DMA buffer to start on a 2-byte
aligned boundary. The exception is the ns83820. Andi Kleen had some
ideas of how to deal with even the ns83820-type chips without copying
anything more than the headers (i.e. not the data portion).