> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 4:07 AM
> To: Felix Marti
> Cc: David Miller; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCP ports from the host TCP port space.
>
> "Felix Marti" <[email protected]> writes:
> > > avoidance gains of TSO and LRO are still a very worthwhile
> > > savings.
> > So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
> > 20), 864B, when moving ~64KB of payload - looks like very much in the
> > noise to me.
>
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
>
> > an option to get 'high performance'
>
> Shouldn't you qualify that?
>
> It is unlikely you really duplicated all the tuning for corner cases
> that went over many years into good software TCP stacks in your
> hardware. So e.g. for wide area networks with occasional packet loss
> the software might well perform better.
Yes, it used to be sufficient to submit performance data to show that a
technology makes 'sense'. In fact, I believe it was Alan Cox who once
said that Linux will have a look at offload once an offload device holds
the land speed record (probably assuming that the day never comes ;).
For the last few years it has been Chelsio offload devices that have
been improving on their own land speed records (as IO bus speeds have
been increasing).
It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW
and the fine-grained (per packet and not per TSO-burst) packet scheduler
in the offload device played a crucial part in pushing performance to
the limits of what OC-192 can do. Most other customers use our offload
products in low-latency cluster environments.
The problem with offload devices is that they are not all born equal, and
there have been a lot of poor implementations giving the technology a bad
name. I can only speak for Chelsio, and I do claim that we have a solid
implementation that scales from low-latency cluster environments to LFNs.
Andi, I could present performance numbers, i.e. throughput and CPU
utilization as a function of IO size, number of connections, ... in a
back-to-back environment and/or in a cluster environment... but what
will it get me? I'd still get hit by the 'not integrated' hammer :(
>
> -Andi