2005-04-17 08:18:48

by Andreas Hartmann

Subject: More performance for the TCP stack by using additional hardware chip on NIC

Hello!

Alacritech has developed a new chip for NICs
(http://www.alacritech.com/html/tech_review.html) which makes it possible
to move the TCP stack off the host CPU. According to Alacritech, this
leaves the host CPU more capacity for applications.

This sounds interesting.

Unfortunately, this solution is covered by two patents.

Now I'm wondering whether it is possible to implement any support for these
chips in the Linux kernel. If this hardware really does have the advantages
Alacritech describes, it would be a pity if Linux couldn't use it.

What do you think about that?



Kind regards,
Andreas Hartmann


2005-04-17 09:14:44

by Arjan van de Ven

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sun, 2005-04-17 at 10:17 +0200, Andreas Hartmann wrote:
> Hello!
>
> Alacritech developed a new chip for NIC's
> (http://www.alacritech.com/html/tech_review.html), which makes it possible
> to take away the TCP stack from the host CPU. Therefore, the host CPU has
> more performance for the applications according Alacritech.

There are many good reasons why this is not the right solution for Linux,
including the fact that the Linux TCP/IP stack is already quite fast, so
the "gains" achieved aren't as stellar as the gains you get when comparing
to Windows.

Also, solutions of this type always add quite a bit of overhead to
connection setup/teardown, which actually makes them a *loss* for "many
short connections" workloads. Now guess which kind certain benchmarks
use, and guess what real-world servers do :)



2005-04-17 10:29:21

by Avi Kivity

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sun, 2005-04-17 at 12:07, Arjan van de Ven wrote:
> On Sun, 2005-04-17 at 10:17 +0200, Andreas Hartmann wrote:
> > Hello!
> >
> > Alacritech developed a new chip for NIC's
> > (http://www.alacritech.com/html/tech_review.html), which makes it possible
> > to take away the TCP stack from the host CPU. Therefore, the host CPU has
> > more performance for the applications according Alacritech.
>
> there are very many good reasons why this for linux is not the right
> solution, including the fact that the linux tcp/ip stack already is
> quite fast so the "gains" achieved aren't that stellar as the gains you
> get when comparing to windows.
>

TOEs can remove the data copy on receive. In some applications (notably
storage), where the application does not touch most of the data, this is
a significant advantage that cannot be achieved in a software-only
solution.


> Also these types of solution always add quite a bit of overhead to
> connection setup/teardown making it actually a *loss* for the "many
> short connections" types of workloads. Now guess which things certain
> benchmarks use, and guess what real world servers do :)
>

again, this depends on the application.

a copyless solution is probably necessary to achieve 10Gb/s speeds.

Avi

2005-04-17 10:57:38

by Arjan van de Ven

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC


>
> TOEs can remove the data copy on receive. In some applications (notably
> storage), where the application does not touch most of the data, this is
> a significant advantage that cannot be achieved in a software-only
> solution.

Other solutions can too. Search the archives for posts from Dave Miller
and Jeff Garzik on these issues. Note that TOEs per se don't do this;
specific traits of the interface to a TOE *may* allow it. The interesting
part is that the parts of the interface that would allow this can be
implemented just as well without TOE (and without all the downsides of
full TOE, such as bypassing firewall rules).


> > Also these types of solution always add quite a bit of overhead to
> > connection setup/teardown making it actually a *loss* for the "many
> > short connections" types of workloads. Now guess which things certain
> > benchmarks use, and guess what real world servers do :)
> >
>
> again, this depends on the application.
>
> a copyless solution is probably necessary to achieve 10Gb/s speeds.

I've heard the same said about 100Mbit and 1Gbit, and in neither case has
it proven true. Don't get me wrong, avoiding copies is always nice, and on
sending Linux already enables that (depending on the application's
capabilities). But I personally find it hard to accept that full
copyless operation is a strict requirement to achieve 10Gb/s.
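
A minimal sketch of copy avoidance on the send side, in Python: os.sendfile()
wraps the sendfile(2) system call, which asks the kernel to move file data to
a socket without passing it through a user-space buffer. The helper names
send_file_zero_copy and demo are illustrative assumptions, not something from
this thread, and how many copies remain further down the path depends on the
NIC and driver.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected socket using sendfile(2).

    The payload is never read into a user-space buffer by this process.
    Returns the number of bytes sent.
    """
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # os.sendfile(out_fd, in_fd, offset, count) returns bytes written.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:  # unexpected end of file
                break
            sent += n
    return sent


def demo():
    """Round-trip a small payload through a local socket pair (hypothetical test rig)."""
    payload = b"hello from the page cache\n" * 4
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        a, b = socket.socketpair()
        with a, b:
            sent = send_file_zero_copy(a, tmp.name)
            data = b""
            # A stream socket may deliver the bytes in several pieces.
            while len(data) < sent:
                chunk = b.recv(sent - len(data))
                if not chunk:
                    break
                data += chunk
            return data
```

The receive side in demo() still copies into user space, which matches the
point made in this thread that copyless send is the easier half.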

What will surely be required to achieve efficient 10Gb/s performance is a
whole lot of tuning in the network stack, and potentially even in the
TCP/IP layer, to allow for bigger buffers etc. But I'm pretty sure that
effort is underway already or will be soon...
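
As a rough illustration of the application-side half of "bigger buffers",
here is a small Python sketch that asks for larger per-socket buffers via
SO_RCVBUF and SO_SNDBUF. The function name request_buffers and the 1 MiB
figure are assumptions for the example. On Linux the kernel may cap the
request at the net.core.rmem_max / net.core.wmem_max sysctls and reports back
a doubled value to account for bookkeeping overhead, so the returned numbers
will generally differ from what was asked for.

```python
import socket


def request_buffers(sock, size):
    """Ask the kernel for larger socket buffers and report what it granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    # Read the values back: the kernel is free to adjust or cap the request.
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(request_buffers(s, 1 << 20))  # values depend on system limits
    s.close()
```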

2005-04-17 11:34:16

by Willy Tarreau

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

Hello !

On Sun, Apr 17, 2005 at 01:29:14PM +0300, Avi Kivity wrote:
> On Sun, 2005-04-17 at 12:07, Arjan van de Ven wrote:
> > On Sun, 2005-04-17 at 10:17 +0200, Andreas Hartmann wrote:
> > > Hello!
> > >
> > > Alacritech developed a new chip for NIC's
> > > (http://www.alacritech.com/html/tech_review.html), which makes it possible
> > > to take away the TCP stack from the host CPU. Therefore, the host CPU has
> > > more performance for the applications according Alacritech.
> >
> > there are very many good reasons why this for linux is not the right
> > solution, including the fact that the linux tcp/ip stack already is
> > quite fast so the "gains" achieved aren't that stellar as the gains you
> > get when comparing to windows.
> >
>
> TOEs can remove the data copy on receive. In some applications (notably
> storage), where the application does not touch most of the data, this is
> a significant advantage that cannot be achieved in a software-only
> solution.

Well, if the application does not touch most of the data, either it
is acting as a relay, in which case the data will at least have to be
copied, or it is acting as a client or server which reads from/writes
to disk, and in that case I wonder how the NIC will send its writes
directly to the disk controller without some help.

What worries me with those NICs is that you have no control over the
TCP stack. You often have to disable the acceleration when you
want to insert even 1 firewall rule, use policy routing or even
do a simple anti-spoofing check. It is exactly like the routers
which do many things in hardware at wire speed, but drop to snail
speed when you enable any advanced feature.

> > Also these types of solution always add quite a bit of overhead to
> > connection setup/teardown making it actually a *loss* for the "many
> > short connections" types of workloads. Now guess which things certain
> > benchmarks use, and guess what real world servers do :)
> >
>
> again, this depends on the application.

The speed itself depends on the application. An application whose
goal is to achieve 10 Gbps needs to be written with this goal in
mind from the start, and needs careful use of the kernel internals, and
sometimes even good knowledge of the hardware itself. At the moment,
a non-blocking application needs one copy because the final data
position in memory is unknown. Probably soon we'll see new prefetch
syscalls (like in CPUs) which will allow the application to tell
the system that it expects to fetch some data to a particular place.
Then a very simple TOE card would be able to wake the system up to
send only TCP headers first, and the system will say "send the
data there", then wake the application once the data has been copied
and checksummed. This stays compatible with firewalls and other
mechanisms.

> a copyless solution is probably necessary to achieve 10Gb/s speeds.

That was said for 100 Mbps then Gbps years ago, and the fact is that
software has improved a lot (zero-copy, epoll, etc...) and at the
moment, it's relatively easy to drain multi-gigabit from cheap
hardware. For example, I could fetch 3.2 Gbps of HTTP traffic on
a $3000 opteron 2GHz with a 4-port intel gigabit NIC, and a non-
optimized HTTP client which still uses select().

Memory and I/O buses are becoming very fast, e.g. 8 Gbps for
PCI-X 133 and multiple gigabytes/s between memory and the CPU, so the
hardware bottleneck for 10 Gbps is already on the NIC side
and not between the CPU and the memory. Once you lift this
limit, you'll notice that the application needs very large buffers
(e.g. 12.5 MB to absorb a 10 ms scheduling latency at 10 Gbps) and
good general design (10 Gbps is 125000 open/read/send/close of
10 kB files every second).
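
The two figures in the paragraph above follow from simple arithmetic, which
this small Python sketch reproduces. The function names are illustrative
assumptions; "10 kB" is taken as 10,000 bytes and protocol headers are
ignored.

```python
def buffer_bytes(link_bps, latency_s):
    """Bytes that arrive on the link while the application is not scheduled."""
    return round(link_bps * latency_s / 8)


def requests_per_second(link_bps, object_bytes):
    """How many objects of a given size per second fill the link."""
    return round(link_bps / (8 * object_bytes))


if __name__ == "__main__":
    link = 10_000_000_000  # 10 Gbps
    print(buffer_bytes(link, 0.010))          # 10 ms of data, in bytes
    print(requests_per_second(link, 10_000))  # 10 kB files per second
```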

Regards,
Willy

2005-04-17 12:15:29

by Avi Kivity

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sun, 2005-04-17 at 14:30, Willy Tarreau wrote:

> > TOEs can remove the data copy on receive. In some applications (notably
> > storage), where the application does not touch most of the data, this is
> > a significant advantage that cannot be achieved in a software-only
> > solution.
>
> Well, if the application does not touch most of the data, either it
> is playing as a relay, and the data will at least have to be copied,

it might use copyless send. indeed, copyless send is much easier than
copyless receive.

> or it will play as a client or server which reads from/writes to disk,
> and in this case, I wonder how the NIC will send its writes directly
> to the disk controller without some help.

the TOE dma's data to the application, the disk controller dma's same
data to disk.

but the processor does not touch the data.

>
> What worries me with those NICs is that you have no control on the
> TCP stack. You often have to disable the acceleration when you
> want to insert even 1 firewall rule, use policy routing or even
> do a simple anti-spoofing check. It is exactly like the routers
> which do many things in hardware at wire speed, but jump to snail
> speed when you enable any advanced feature.

this is a very valid concern, which I hadn't thought of. I guess it is
a disadvantage of the solution that we will have to live with.

maybe one day you would be able to offload your firewall and policy
router too :)

>
> > > Also these types of solution always add quite a bit of overhead to
> > > connection setup/teardown making it actually a *loss* for the "many
> > > short connections" types of workloads. Now guess which things certain
> > > benchmarks use, and guess what real world servers do :)
> > >
> >
> > again, this depends on the application.
>
> The speed itself depends on the application. An application which
> goal is to achieve 10 Gbps needs to be written with this goal in
> mind from start, and needs fine usage of the kernel internals, and
> even sometimes good knowledge of the hardware itself. At the moment,
> a non-blocking application needs one copy because the final data
> position in memory is unknown. Probably soon we'll see new prefetch
> syscalls (like in CPUs) which will allow the application to tell
> the system that it expects to fetch some data to a particular place.

aio does this very nicely. in io_submit() you tell the system where you
want your data, in io_getevents() the system tells you you have it.

> Then a very simple TOE card would be able to wake the system up to
> send only TCP headers first, and the system will say "send the
> data there", then wake the application once the data has been copied
> and checksummed. This keeps compatible with firewalls and other
> mechanisms.
>

neat. this would work very well with aio. it's a pity aio development
appears to have stagnated.

> > a copyless solution is probably necessary to achieve 10Gb/s speeds.
>
> That was said for 100 Mbps then Gbps years ago, and the fact is that
> software has improved a lot (zero-copy, epoll, etc...) and at the
> moment, it's relatively easy to drain multi-gigabit from cheap
> hardware. For example, I could fetch 3.2 Gbps of HTTP traffic on
> a $3000 opteron 2GHz with a 4-port intel gigabit NIC, and a non-
> optimized HTTP client which still uses select().
>
> Memory and I/O busses are becoming very large, eg: 8 Gbps for the
> PCI-X 133, multi-gigabytes/s between memory and the CPU, so the
> hardware bottleneck for the 10 Gbps is already at the NIC side
> and not between the CPU and the memory. When you leverage this
> limit, you'll notice that the application needs very large buffers
> (eg: 12.5 MB to support a 10ms scheduling latency on 10 Gbps) and
> good general design (10 Gbps is 125000 open/read/send/close of
> 10 kB files every second).

the aio api is remarkably well suited to such applications, allowing
batching of requests and responses. add that to a
one-process-per-processor design (to avoid scheduling latencies) and you
have most of the solution.

Avi

2005-04-17 12:32:57

by Avi Kivity

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sun, 2005-04-17 at 13:57, Arjan van de Ven wrote:
> >
> > TOEs can remove the data copy on receive. In some applications (notably
> > storage), where the application does not touch most of the data, this is
> > a significant advantage that cannot be achieved in a software-only
> > solution.
>
> other solutions can too. Search the archives for posts from Dave Miller
> and Jeff Garzik on these issues. Note that TOEs per se don't do this,
> specific traits of interfaces to TOE *may* allow this. The interesting
> part is that the parts of the interface that would allow this can be
> implemented without TOE (and all the downsides of full TOE such as
> bypassing firewall rules etc etc) just as well.
>

I see. if you are referring to Willy's trick in the other post, then I
agree. it still has more overhead than full offload, so only
measurements can tell if it is enough (and, of course, we need to wait
for the hardware to materialize).


> > a copyless solution is probably necessary to achieve 10Gb/s speeds.
>
> I've heard the same said about 100Mbit and 1Gbit. And neither has been
> proven true. Don't get me wrong, avoiding copies is always nice, and on
> sending linux already enables that (depending on the applications
> capabilities). But I personally find it hard to accept that full
> copyless operation is a strict requirement to achieve 10Gb/s.
>
> What sure will be required to achieve efficient 10Gb/s performance is a
> whole lot of tuning in the network stack and potentially even in the
> tcp/ip layer to allow for bigger buffers etc. But I'm pretty sure that
> effort is underway already or will be soon...
>

amen.

Avi

2005-04-17 19:05:46

by Andreas Hartmann

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

Willy Tarreau wrote:
> Hello !
>
> On Sun, Apr 17, 2005 at 01:29:14PM +0300, Avi Kivity wrote:
>> On Sun, 2005-04-17 at 12:07, Arjan van de Ven wrote:
>> > On Sun, 2005-04-17 at 10:17 +0200, Andreas Hartmann wrote:
>> > > Hello!
>> > >
>> > > Alacritech developed a new chip for NIC's
>> > > (http://www.alacritech.com/html/tech_review.html), which makes it possible
>> > > to take away the TCP stack from the host CPU. Therefore, the host CPU has
>> > > more performance for the applications according Alacritech.
>> >
>> > there are very many good reasons why this for linux is not the right
>> > solution, including the fact that the linux tcp/ip stack already is
>> > quite fast so the "gains" achieved aren't that stellar as the gains you
>> > get when comparing to windows.
>> >
>>
>> TOEs can remove the data copy on receive. In some applications (notably
>> storage), where the application does not touch most of the data, this is
>> a significant advantage that cannot be achieved in a software-only
>> solution.
>
> Well, if the application does not touch most of the data, either it
> is playing as a relay, and the data will at least have to be copied,
> or it will play as a client or server which reads from/writes to disk,
> and in this case, I wonder how the NIC will send its writes directly
> to the disk controller without some help.
>
> What worries me with those NICs is that you have no control on the
> TCP stack. You often have to disable the acceleration when you
> want to insert even 1 firewall rule, use policy routing or even
> do a simple anti-spoofing check. It is exactly like the routers
> which do many things in hardware at wire speed, but jump to snail
> speed when you enable any advanced feature.
>
>> > Also these types of solution always add quite a bit of overhead to
>> > connection setup/teardown making it actually a *loss* for the "many
>> > short connections" types of workloads. Now guess which things certain
>> > benchmarks use, and guess what real world servers do :)
>> >
>>
>> again, this depends on the application.
>
> The speed itself depends on the application. An application which
> goal is to achieve 10 Gbps needs to be written with this goal in
> mind from start, and needs fine usage of the kernel internals, and
> even sometimes good knowledge of the hardware itself.

Alacritech says the hardware solution would make things very easy for
applications, because _every_ application would gain without having to
take the hardware it runs on into account. These are things CEOs like to
hear, because they think they could save time and money during
development of the application.


I don't think it has to be a problem that the hardware TCP stack doesn't
run any filters or other additional functions, because machines (often
clusters) with high workloads usually run as dedicated servers with
separate dedicated firewall machines in front of them.


I think it would be good to support this hardware, because users can
then decide (after testing) which is the best choice for their specific
application and workload.



Kind regards,
Andreas Hartmann

2005-04-17 19:43:19

by Bernd Eckenfels

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

In article <[email protected]> you wrote:
> maybe one day you would be able to offload your firewall and policy
> router too :)

There are quite a few filtering NICs out there.

Greetings
Bernd

2005-04-17 20:49:38

by David Miller

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sun, 17 Apr 2005 13:29:14 +0300
Avi Kivity wrote:

> TOEs can remove the data copy on receive. In some applications (notably
> storage), where the application does not touch most of the data, this is
> a significant advantage that cannot be achieved in a software-only
> solution.

You don't need to offload the TCP stack to make this case get
zero-copy behavior.

2005-04-18 01:12:32

by Horst H. von Brand

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

Andreas Hartmann said:
> Alacritech developed a new chip for NIC's
> (http://www.alacritech.com/html/tech_review.html), which makes it possible
> to take away the TCP stack from the host CPU. Therefore, the host CPU has
> more performance for the applications according Alacritech.
>
> This sounds interesting.

This idea has been discussed around here a couple of times, and the
consensus is that it is a bad idea: IP (and upper protocol) processing
is not expensive if done right, so this really doesn't buy much; it
forces a particular interface to networking into the kernel, and losing
flexibility that way is always bad; there is no access for futzing
around in between (for example, for firewalling and such); and if the
"hardware implementation" has bugs, you are screwed.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-04-18 04:09:18

by Kyle Moffett

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Apr 17, 2005, at 19:37, Horst von Brand wrote:
> Andreas Hartmann said:
>> Alacritech developed a new chip for NIC's
>> (http://www.alacritech.com/html/tech_review.html), which makes it possible
>> to take away the TCP stack from the host CPU. Therefore, the host CPU has
>> more performance for the applications according Alacritech.
>>
>> This sounds interesting.
>
> This idea has been discussed around here a couple of times, and the
> consensus is that it is a bad idea: IP (and upper protocol) processing
> is not expensive, if done right, so this really doesn't buy much; this
> forces a particular interface to networking into the kernel, losing
> flexibility that way is always bad; there is no access to futzing
> around in between (for example, for firewalling and such); and if the
> "hardware implementation" has bugs, you are screwed.

What I think would be _much_ more useful is a generic low-power
multi-proc MIPS/PPC system on a PCI card with a certain amount of
RAM, etc that could be programmed at runtime by the master CPU.
Then you lose none of the flexibility, it can be run in the same
endian-mode as the host CPU, and it would allow you to program
it for much more complicated DMA. You could do anything from linux
software RAID, audio processing, encryption, TCP/IP stack acceleration,
extra scatter-gather for your disk controller, etc.
If it was low-cost, IE: cheaper than adding extra full-speed CPUs to the
system, and using a decent bi-endian, vector-capable CPU (Like PPC), you
might find that people will buy them for the flexibility. Such a thing
might also be useful for the prezero folks, it could be used (when not
otherwise occupied) for zeroing unused pages.

Personally, I think I'd buy one or two just to tinker with them :-D.

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------


2005-04-18 04:28:25

by Willy Tarreau

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Mon, Apr 18, 2005 at 12:08:41AM -0400, Kyle Moffett wrote:
(...)
> What I think would be _much_ more useful is a generic low-power
> multi-proc MIPS/PPC system on a PCI card with a certain amount of
> RAM, etc that could be programmed at runtime by the master CPU.
> Then you lose none of the flexibility, it can be run in the same
> endian-mode as the host CPU, and it would allow you to program
> it for much more complicated DMA.

it would be really interesting, it would be a sort of I/O coprocessor,
but unfortunately, it would halve the PCI bandwidth (which is already a
problem with 10 Gbps) because the data would have to go from the NIC
to the copro, then from the copro to system RAM.

Or if this copro contains large amounts of RAM, then the applications
can manipulate data directly on the card (and the copro could provide
remote memcpy, memmove, etc...), thus eliminating copies. But in this
case, it would require many modifications on both the kernel and the
application.

> You could do anything from linux software RAID, audio processing,
> encryption, TCP/IP stack acceleration, extra scatter-gather for your
> disk controller, etc.
> If it was low-cost, IE: cheaper than adding extra full-speed CPUs to the
> system, and using a decent bi-endian, vector-capable CPU (Like PPC), you
> might find that people will buy them for the flexibility. Such a thing
> might also be useful for the prezero folks, it could be used (when not
> otherwise occupied) for zeroing unused pages.
>
> Personally, I think I'd buy one or two just to tinker with them :-D.

Then you should take a look at some hardware RAID controllers or even
some special intel NICs, both of which often come with an i960 or PPC
onboard. It might be a good start, and if you can show something
interesting, we already know there's one guy here who can build the
full-speed CPU from the specs :-)

Cheers,
Willy

2005-04-18 05:35:24

by Avi Kivity

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

David S. Miller wrote:

>On Sun, 17 Apr 2005 13:29:14 +0300
>Avi Kivity wrote:
>
>
>
>>TOEs can remove the data copy on receive. In some applications (notably
>>storage), where the application does not touch most of the data, this is
>>a significant advantage that cannot be achieved in a software-only
>>solution.
>>
>>
>
>You don't need to offload the TCP stack to make this case get
>zero-copy behavior.
>
>
yes, Willy Tarreau outlined how buffering on the nic and splitting the
dma can achieve zero copy.

are there any adapters out there which work this way?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-04-18 08:28:32

by Denis Vlasenko

Subject: Re: More performance for the TCP stack by using additional hardware chip on NIC

On Sunday 17 April 2005 22:04, Andreas Hartmann wrote:
> Willy Tarreau wrote:
> > Well, if the application does not touch most of the data, either it
> > is playing as a relay, and the data will at least have to be copied,
> > or it will play as a client or server which reads from/writes to disk,
> > and in this case, I wonder how the NIC will send its writes directly
> > to the disk controller without some help.

If both the NIC and the disk are clever enough, they can both use DMA:
NIC ==dma==> RAM ==dma==> DISK
without the CPU ever needing to touch the bulk of the data.

> > What worries me with those NICs is that you have no control on the
> > TCP stack. You often have to disable the acceleration when you
> > want to insert even 1 firewall rule, use policy routing or even
> > do a simple anti-spoofing check. It is exactly like the routers
> > which do many things in hardware at wire speed, but jump to snail
> > speed when you enable any advanced feature.

Yes. This is why TCP offload is mostly a buzzword.
Does anybody have real experience with this?

> Alacritech says, the hardware solution would make it very easy for the
> application, because _every_ application would gain, without considering
> the hardware it runs on itself. These are things which CEO's like to hear
> - because they think, they could save time and money during development of
> the application.

Most probably marketspeak.

> I don't think that it must be a problem, that on the hardware TCP stack
> doesn't run any filter or other additional functions, because machines
> (often clusters) with high workloads usually run on dedicated servers with
> other dedicated firewall machines in front of.

If you put a firewall machine in front of your 10GigE server, you
are killing its performance.

> I think it would be good to support this hardware, because the user can
> decide afterwards (after testing), which is the best choice for his
> specific application and workload.

Are specs available?
--
vda