2004-09-15 19:34:10

by Jeff Garzik

Subject: The ultimate TOE design


(reply-to set to netdev)

Every now and then people ask on the lists about TOE, TCP assist, and
that sort of thing. Ignoring the issue of TCP hardware assist, I wanted
to describe what I feel is an optimal method to _fully offload_ the
Linux TCP stack.


Put simply, the "ultimate TOE card" would be a card with network ports,
a generic CPU (ARM, MIPS, whatever), some RAM, and some flash. This
card's "firmware" is the Linux kernel, configured to run as a _totally
independent network node_, with IP address(es) all its own.

Then, your host system OS will communicate with the Linux kernel running
on the card across the PCI bus, using IP packets (64K fixed MTU).

This effectively:

1) Moves fragment processing, IPsec, and other services onto the card.

2) Lets you use huge card<->host MTUs, which makes sendfile(2) faster
with _zero_ kernel changes.

3) Lets the PCI card do 100% of the checksum processing/generation, so
the network connection across the PCI bus can be treated as
CHECKSUM_UNNECESSARY.

4) With enough RAM and CPU cycles, you can even offload complex services
like Web serving: the PCI card runs Apache, and fetches files across
the network (your PCI bus!) from the host system.

5) Does not require _any_ modification of the Linux network stack.
Interfacing with the card merely requires a simple DMA interface to copy
IP (not ethernet) packets across the PCI bus, and that fits within the
existing Linux net driver API (see the sketch after this list).

6) Ensures that the TOE "firmware" [the Linux kernel] can be easily
updated in the event of new features or (more importantly) security
problems.

7) Linux is the most RFC-compliant net stack in the world. Why
re-create (or license) an inferior one?

8) Long-term maintenance of TOE firmware is a BIG problem with existing
full-TOE systems. Under this design, sysadmins would update and patch
their PCI card with security updates just like any other system on their
network. This is added work, yes, but it's a known quantity and a task
they are already doing for other systems.

9) The design is both portable [tons of embedded CPUs, with and without
MMUs, can run Linux] and scalable.
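
To make point 5 concrete, here is a minimal, hypothetical sketch of how
such an IP-over-PCI device might slot into the 2.6-era net driver API.
toe_dma_post_tx() and the caller of toe_rx() stand in for whatever DMA
interface a real card would expose; nothing here is a real driver.

/* Hypothetical sketch: an IP-over-PCI net device, 2.6-era API. */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_arp.h>
#include <linux/if_ether.h>

#define TOE_MTU (64 * 1024 - 1)	/* the "64K fixed MTU" across PCI */

/* Assumed hook into the card's DMA engine -- not a real interface. */
extern void toe_dma_post_tx(struct net_device *dev, void *buf,
			    unsigned int len);

static int toe_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Hand the raw IP datagram to the card; no L2 framing at all. */
	toe_dma_post_tx(dev, skb->data, skb->len);
	dev_kfree_skb(skb);
	return 0;
}

/* Called from the card's RX interrupt with a complete IP datagram. */
static void toe_rx(struct net_device *dev, void *buf, unsigned int len)
{
	struct sk_buff *skb = dev_alloc_skb(len);

	if (!skb)
		return;
	memcpy(skb_put(skb, len), buf, len);
	skb->dev = dev;
	skb->protocol = htons(ETH_P_IP);	/* IP, not ethernet */
	skb->ip_summed = CHECKSUM_UNNECESSARY;	/* card did the checksums */
	netif_rx(skb);
}

static void toe_setup(struct net_device *dev)
{
	dev->mtu = TOE_MTU;
	dev->hard_start_xmit = toe_xmit;
	dev->type = ARPHRD_NONE;	/* point-to-point IP link, no L2 */
	dev->flags = IFF_POINTOPOINT | IFF_NOARP;
	dev->hard_header_len = 0;
}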



My dream is that some vendor will come along and implement such a
design, and sell it in enough volume that it's US$100 or less. There
are a few cards on the market already where implementing this design
_may_ be possible, but they are all fairly expensive. You just need
enough resources on the PCI card to be able to run Linux as a
router/firewall/iSCSI/web-proxy gadget.

And I'm not aware of anybody doing a direct IP-over-PCI thing, either.

But I'll keep on dreaming... ;-)

Jeff




2004-09-15 20:07:05

by Paul Jakma

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004, Jeff Garzik wrote:

> Put simply, the "ultimate TOE card" would be a card with network ports, a
> generic CPU (ARM, MIPS, whatever), some RAM, and some flash. This card's
> "firmware" is the Linux kernel, configured to run as a _totally independent
> network node_, with IP address(es) all its own.
>
> Then, your host system OS will communicate with the Linux kernel running on
> the card across the PCI bus, using IP packets (64K fixed MTU).

> My dream is that some vendor will come along and implement such a
> design, and sell it in enough volume that it's US$100 or less.
> There are a few cards on the market already where implementing this
> design _may_ be possible, but they are all fairly expensive.

The Intel IXPs are like the above: XScale+extra-bits host-on-a-PCI
card running Linux. Or is that what you were referring to with
"<cards exist> but they are all fairly expensive"?

> Jeff

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
There is nothing so easy but that it becomes difficult when you do it
reluctantly.
-- Publius Terentius Afer (Terence)

2004-09-15 20:19:56

by Alan

Subject: Re: The ultimate TOE design

On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> card running Linux. Or is that what you were referring to with
> "<cards exist> but they are all fairly expensive."?

Last time I checked, 2GHz accelerators for Intel and AMD were quite cheap,
and also had the advantage that they ran user mode code when idle from
network processing.


2004-09-15 20:16:27

by David Stevens

Subject: Re: The ultimate TOE design

I've never understood why people are so interested in off-loading
networking. Isn't that just a multi-processor system where you can't
use any of the network processor cycles for anything else? And, of
course, to be cheap, the network processor will be slower, and its
software much harder to debug and update.

If the PCI bus is too slow, or MTUs too small, wouldn't
it be better to fix those directly and use a fast host processor that can
also do other things when not needed for networking? And why have
memory on a NIC that can't be used by other things?

Why don't we off-load filesystems to disks instead? Or a graphics
card that implements X ? :-) I'd rather have shared system resources--
more flexible. :-)

+-DLS

2004-09-15 20:28:43

by Jeff Garzik

Subject: Re: The ultimate TOE design

David Stevens wrote:
> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

Well, I do agree there is a strong don't-bother-with-TOE argument:
Moore's law means the CPUs (manufactured in vast quantities) will usually
catch up with, and then outpace, any dedicated offload engine.


However, there are companies that Just Gotta Do TOE... and I am not
inclined to assist in any effort that compromises Linux's RFC compliance
or security. Current TOE efforts seem to be of the "shove your data
through this black box" variety, which is rather disheartening.

Even non-TOE NICs these days have ever-more-complex firmware. The tg3
chip, for example, is driven by a MIPS-based engine.


> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the
Internet will be small as long as ethernet enjoys continued success.
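
For a rough sense of that gap, with 2004-era parts (illustrative
numbers, not measurements):

   PCI, 32-bit/33 MHz:     ~133 MB/s, shared by every device on the bus
   PCI-X, 64-bit/133 MHz:  ~1 GB/s
   DDR400, one channel:    ~3.2 GB/s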

Jeff

2004-09-15 20:44:57

by Jeff Garzik

Subject: Re: The ultimate TOE design

Alan Cox wrote:
> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
>
>>The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>>card running Linux. Or is that what you were referring to with
>>"<cards exist> but they are all fairly expensive."?
>
>
> Last time I checked 2Ghz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.


The point was more to show people who are doing TOE _anyway_ to a decent
design.

As I said in another post, "just don't bother with TOE" is a very valid
answer with today's CPUs.

Jeff


2004-09-15 20:58:06

by Neil Horman

Subject: Re: The ultimate TOE design

Jeff Garzik wrote:
> David Stevens wrote:
>
>> I've never understood why people are so interested in off-loading
>> networking. Isn't that just a multi-processor system where you can't
>> use any of the network processor cycles for anything else? And, of
>> course, to be cheap, the network processor will be slower, and much
>> harder to debug and update software.
>
>
> Well, I do agree there is a strong don't-bother-with-TOE argument:
> Moore's law means the CPUs (manufactured in vast quantities) will usually
> catch up with, and then outpace, any dedicated offload engine.
>
>
> However, there are companies that Just Gotta Do TOE... and I am not
> inclined to assist in any effort that compromises Linux's RFC compliance
> or security. Current TOE efforts seem to be of the "shove your data
> through this black box" variety, which is rather disheartening.
>
> Even non-TOE NICs these days have ever-more-complex firmware. The tg3
> chip, for example, is driven by a MIPS-based engine.
>
>
>> If the PCI bus is too slow, or MTU's too small, wouldn't
>> it be better to fix those directly and use a fast host processor that can
>> also do other things when not needed for networking? And why have
>> memory on a NIC that can't be used by other things?
>
>
> PCI bus tends to be slower than DRAM<->CPU speed, and MTUs across the
> Internet will be small as long as ethernet enjoys continued success.
>
> Jeff

There is also something to be said for the embedded market here.
Offload chips are fairly useful when building switches and routers.
Dave M., in a thread just a few weeks ago, provided some metrics for how
much bandwidth a PCI-X bus and a some-odd-gigahertz processor could
handle. It worked out that a PC with the right components could
theoretically handle about 4 gigabit NICs running traffic full duplex
at line rate. That's great, but it doesn't come close to what you need
for a 24 port gigabit L3 switch, nor does it approach the correct price
point. Most of these designs use a less expensive processor running at
a slower speed, and an offload chip (that incorporates tx/rx logic and a
switching fabric) to perform most of the routing and switching. For
cost-conscious network equipment manufacturers, they are really the way
to go. Unfortunately, many of them don't actually run as a
co-processor, and so don't enable Jeff's idea very well (yet :))
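
A rough back-of-envelope for those numbers, assuming the bus in question
was 64-bit/133 MHz PCI-X and counting each full-duplex GbE port as
2 Gbit/s:

   PCI-X, 64 bit @ 133 MHz:  64 x 133M ~ 8.5 Gbit/s raw bus bandwidth
   4 x GbE, full duplex:     4 x 2G    = 8 Gbit/s   (just about fits)
   24-port GbE L3 switch:    24 x 2G   = 48 Gbit/s  (~6x the whole bus)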

Neil

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-09-15 20:58:08

by David Miller

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 20:14:22 +0100
Alan Cox <[email protected]> wrote:

> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
> > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> > card running Linux. Or is that what you were referring to with
> > "<cards exist> but they are all fairly expensive."?
>
> Last time I checked 2Ghz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.

ROFL, and this is my position on this topic as well.

There are absolutely no justified economics in these
TOE engines. By the time you deploy them, the cpus
and memory catch up and what's more those are general
purpose and not just for networking as David Stevens
and others have said.

TOE is just junk, and we'll reject any attempt to put
that garbage into the kernel.

2004-09-15 21:05:28

by Wes Felter

Subject: Re: The ultimate TOE design

Neil Horman wrote:
> Paul Jakma wrote:
>
>> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>>
>>> Put simply, the "ultimate TOE card" would be a card with network
>>> ports, a generic CPU (ARM, MIPS, whatever), some RAM, and some
>>> flash. This card's "firmware" is the Linux kernel, configured to run
>>> as a _totally independent network node_, with IP address(es) all its own.
>>>
>>> Then, your host system OS will communicate with the Linux kernel
>>> running on the card across the PCI bus, using IP packets (64K fixed
>>> MTU).

>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>> card running Linux. Or is that what you were referring to with "<cards
>> exist> but they are all fairly expensive."?

> IBM's PowerNP chip was also very similar (a PowerPC core with lots of
> hardware assists for DMA and packet inspection in the extended register
> area). Don't know if they still sell it, but at one time I had heard
> they had booted Linux on it.

An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core
and PowerNP's PowerPC core are way too slow to do any significant
processing; they are intended for control tasks like updating the
routing tables. All the work in the IXP or PowerNP is done by the
microengines, which have weird, non-Linux-compatible architectures.

To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10
GHz processor on the card? Sounds expensive.

A 440GX or BCM1250 on a cheap PCI card would be fun to play with, though.

Wes Felter - [email protected] - http://felter.org/wesley/

2004-09-15 21:10:26

by Jeff Garzik

Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 02:01:23PM -0700, David S. Miller wrote:
> On Wed, 15 Sep 2004 16:41:51 -0400
> Jeff Garzik <[email protected]> wrote:
>
> > The point was more to show people who are doing TOE _anyway_ to a decent
> > design.
>
> We shouldn't be forced to refine people's non-sensible ideas which
> we'll not support anyways.

I just described a design that -we already support-.

It's a generic, scalable model that has applications outside the acronym
"TOE". Did you read my message, or just see 'TOE' and nothing else?

Sun used this model with their x86 cards. Total MP did something
similar with their 4-processor PowerPC cards.

There's nothing inherently wrong with sticking a computer running
Linux inside another computer ;-)

Jeff



2004-09-15 21:14:54

by David Lang

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004, Alan Cox wrote:

> On Mer, 2004-09-15 at 21:04, Paul Jakma wrote:
>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>> card running Linux. Or is that what you were referring to with
>> "<cards exist> but they are all fairly expensive."?
>
> Last time I checked 2Ghz accelerators for intel and AMD were quite cheap
> and also had the advantage they ran user mode code when idle from
> network processing.

That depends on how many of these accelerators you already have in the
system. If you have 4 of them and they are heavily used, so that you want
to offload them, it definitely isn't cheap to add a 5th (you usually have
to go up to 8 or so, and the difference between 4 and 8 is frequently 2x-4x
the cost of the 4 processor box).

Now if you start with a single CPU system then yes, adding a second one is
cheap. But these are usually not the people who really need TOE (they may
think that they do, but that's a different story :-)

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2004-09-15 21:06:05

by David Miller

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 16:41:51 -0400
Jeff Garzik <[email protected]> wrote:

> The point was more to show people who are doing TOE _anyway_ to a decent
> design.

We shouldn't be forced to refine people's non-sensible ideas which
we'll not support anyways.

If TOE is supported on Windows only, I happily welcome that.

2004-09-15 21:23:13

by Michael Richardson

Subject: Re: The ultimate TOE design



>>>>> "David" == David S Miller <[email protected]> writes:
>> The point was more to show people who are doing TOE _anyway_ to a decent
>> design.

David> We shouldn't be forced to refine people's non-sensible ideas which
David> we'll not support anyways.

David> If TOE is supported on Windows only, I happily welcome that.

Ha. Too hard to do :-)

The TOEs and L7 content switches that I know of are supported...
UNDER LINUX ONLY.

The one that I'm most familiar with (Seaway's SW5000/NCA2000)
provides a new socket family to the host, which corresponds to streams
that terminate on the NCA2000. The host can request things like having
two TCP streams be cross-connected, even adding/subtracting SSL along
the way.

This code does not interact with the Linux IP stack at all --- so it
isn't exactly a TOE. You have to, at a minimum, recompile applications.
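
For flavor, a hypothetical sketch of what such a socket family looks
like to an application; AF_NCA and SIOCNCA_SPLICE are invented
placeholders (Seaway's real constants aren't reproduced here), but the
shape shows why recompiling is the minimum price of admission:

#include <sys/socket.h>
#include <sys/ioctl.h>

#define AF_NCA         27	/* assumption: vendor-assigned family */
#define SIOCNCA_SPLICE 0x8901	/* assumption: cross-connect request  */

int nca_splice_pair(void)
{
	int a = socket(AF_NCA, SOCK_STREAM, 0);	/* terminates on card */
	int b = socket(AF_NCA, SOCK_STREAM, 0);

	/* ...connect/accept as usual, then ask the card to splice the
	 * two streams, optionally adding/subtracting SSL in between. */
	return ioctl(a, SIOCNCA_SPLICE, &b);
}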

--
] "Elmo went to the wrong fundraiser" - The Simpson | firewalls [
] Michael Richardson, Xelerance Corporation, Ottawa, ON |net architect[
] [email protected] http://www.sandelman.ottawa.on.ca/mcr/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [

2004-09-15 21:23:12

by David Miller

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 17:08:18 -0400
Jeff Garzik <[email protected]> wrote:

> There's nothing inherently wrong with sticking a computer running
> Linux inside another computer ;-)

And we already support that :-)

Plus we have things like TSO too, but that doesn't require a full Linux
instance to realize on a networking port.
Simple silicon implements this already.
I don't see how that differs from your "big MTU" ideas.
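
For comparison, the host-visible side of TSO in the 2.6 API of the day
really is tiny. A sketch -- example_*() and toe_hw_queue_tso() are
invented names; the feature flags and tso_size field are the real
interface of that era:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Assumed hook that queues one oversized segment to the silicon. */
extern void toe_hw_queue_tso(struct net_device *dev,
			     struct sk_buff *skb, unsigned int mss);

static void example_probe_flags(struct net_device *dev)
{
	/* Advertise scatter-gather, hw checksumming and TSO. */
	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_TSO;
}

static int example_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* MSS the hardware should segment to; 0 for ordinary frames. */
	unsigned int mss = skb_shinfo(skb)->tso_size;

	toe_hw_queue_tso(dev, skb, mss);
	return 0;
}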

2004-09-15 21:45:51

by Jeff Garzik

Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 04:35:31PM -0500, Wes Felter wrote:
> Jeff Garzik wrote:
>
> >On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote:
> >
> >>To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10
> >>GHz processor on the card? Sounds expensive.
> >
> >
> >Do you need a 5-10 Ghz Intel server to handle 10 Gbps ethernet?
>
> Yes. (Or a 4-way ~2GHz server.)

It was a rhetorical question.

No, you don't.

Jeff



2004-09-15 22:01:39

by Tony Lee

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 21:04:38 +0100 (IST), Paul Jakma <[email protected]> wrote:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>
> > Put simply, the "ultimate TOE card" would be a card with network ports, a
> > generic CPU (ARM, MIPS, whatever), some RAM, and some flash. This card's
> > "firmware" is the Linux kernel, configured to run as a _totally independent
> > network node_, with IP address(es) all its own.
> >
> > Then, your host system OS will communicate with the Linux kernel running on
> > the card across the PCI bus, using IP packets (64K fixed MTU).
>
> > My dream is that some vendor will come along and implement such a
> > design, and sell it in enough volume that it's US$100 or less.
> > There are a few cards on the market already where implementing this
> > design _may_ be possible, but they are all fairly expensive.
>
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> card running Linux. Or is that what you were referring to with
> "<cards exist> but they are all fairly expensive."?
>
> > Jeff
>
> regards,
> --
> Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A



I believe the Broadcom 5704 (570x) chip/NIC comes with 2 MIPS CPUs (133
MHz), one each for the Tx and Rx data paths. The GigE NIC cards cost
< $50 a couple of years ago.

Too bad the software SDK for them is closed (quoted at $96K a couple of
years ago).

Otherwise, there could be some interesting applications for that
extremely inexpensive chip/NIC.

RDMA over TCP/UDP with that chip over GigE could be very interesting;
so could an SSL proxy, SSH tunnel, etc.

With the right distributed-processing design, it might even be possible
to offload SMB and NFS to the "right" NIC.


-Tony
--
Having fun with Xilinx Virtex Pro II reconfigurable HW + integrated PPC + Linux

2004-09-15 21:45:50

by Joel Jaeggli

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004, David Stevens wrote:

> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

I'd like to amplify this: adding more general purpose CPU to a machine
strikes me as the right design choice, since it's simply more generally
useful than dedicated CPUs. Look at Linux software RAID compared to the
alternatives; frankly I haven't seen a hardware controller that can touch
it for performance given a similar number of disks and interfaces...
Currently graphics cards have substantially more memory bandwidth and
pipelines than most general purpose CPUs, but eventually that won't be
the case. As it is, GPUs still represent the biggest chunk of independent
computational power in a system, and at least on the server side we don't
even use them.

> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

Between HyperTransport tunnels, PCI-X, PCI Express and InfiniBand, the
bottlenecks between the CPU core and the peripherals and memory are
falling away at a rapid clip even as CPUs get faster. We're in a much
better position to build balanced systems than we were 2 years ago.

> Why don't we off-load filesystems to disks instead? Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)
>
> +-DLS

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting [email protected]
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2

2004-09-15 22:38:16

by Jeff Garzik

Subject: Re: The ultimate TOE design

David S. Miller wrote:
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.


WRT MTU: if the card is a buffering endpoint, rather than a
passthrough, the card deals with Path MTU and fragmentation, leaving the
card<->host MTU at 64K, getting nice big fat frames.
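
A rough sense of the win, assuming ~1448 bytes of TCP payload per
standard 1500-byte wire MTU:

   64KB / 1448B ~ 45 wire frames per card<->host frame
   => one DMA transaction and one host-side interrupt stand in for
      ~45 packets' worth of per-packet overhead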

Jeff


2004-09-15 23:05:44

by Paul Jakma

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004, Deepak Saxena wrote:

> Unfortunately all the SW that lets one make use of the interesting
> features of the IXPs (microEngines, crypto, etc) is a pile of
> proprietary code.

My vague understanding is that while Intel's microengine code is
proprietary, they do provide the docs to the microengines to let you
write your own, no?

> ~Deepak

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
Better tried by twelve than carried by six.
-- Jeff Cooper

2004-09-15 23:09:20

by Paul Jakma

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004, Alan Cox wrote:

> Last time I checked 2Ghz accelerators for intel and AMD were quite
> cheap and also had the advantage they ran user mode code when idle
> from network processing.

Indeed.

Unfortunately though, my vague understanding is that the interesting bits
on the IXP, the microengines, are integrated with the XScale on one ASIC.

I agree it's silly to stick a general purpose CPU in there, but you
get it for "free" anyway.

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
War is an equal opportunity destroyer.

2004-09-15 23:37:42

by Leonid Grossman

Subject: RE: The ultimate TOE design

I think Jeff's "ultimate TOE card" based upon a generic embedded CPU is
doable at GbE, but we may not see such a product because it's too late
for it to succeed.

TOE is a pretty questionable product in itself; one of the main reasons
people build TOE cards is to put RDMA on top of it and end up with an RNIC
(NIC+TOE+RDMA) Ethernet card.
The hope is to eventually run all three types of server traffic (network,
storage, IPC) over an RNIC, and get rid of the two other HBAs in a system.

For this "fabric convergence" over Ethernet to happen, it has to be at
10GbE, not GbE, since storage (Fibre Channel) is already at 4Gb.
And at 10GbE, embedded CPUs just don't cut it - it has to be a custom ASIC
(granted, with some means to simplify debugging and reduce the risk of hw
bugs and TCP changes).

On some other points on the thread:

WRT the TOE price, I suspect that when RNICs come out they will command
little premium over conventional NICs - it will be just a technology
upgrade.

WRT larger MTUs - going to bigger MTUs helps a lot, but it will be years
before the infrastructure moves beyond a 9600 byte MTU. Even right now,
usage of 9600 byte jumbo frames is not universal.

WRT TSO - for applications that don't require RDMA, TSO indeed helps a lot
on the transmit side for 1500 MTU; 10GbE cards are inevitably CPU bound,
and we are seeing ~3x throughput improvement with normal frames.

This leaves receive offload schemes as the biggest improvement (short
of supporting TOE) left to make in Linux.
It would be great to see such receive schemes defined and implemented; as I
stated in an earlier thread, we will be willing to participate in such work
and put the support in the S2io 10GbE ASIC and drivers.




> -----Original Message-----
> From: David S. Miller [mailto:[email protected]]
> Sent: Wednesday, September 15, 2004 2:29 PM
> To: Jeff Garzik
> Cc: [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: Re: The ultimate TOE design
>
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <[email protected]> wrote:
>
> > The typical definition of TOE is "offload 90+% of the net
> stack", as
> > opposed to "TCP assist", which is stuff like TSO.
>
> I think a better goal is "offload 90+% of the net stack cost"
> which is effectively what TSO does on the send side.
>
> This is why these discussions are so circular.
>
> If we want to discuss something specific, like receive
> offload schemes, that is a very different matter. And I'm
> sure folks like Rusty have a lot to contribute in this area :-)
>

2004-09-15 23:46:14

by Jeff Garzik

Subject: Re: The ultimate TOE design

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:23:49 -0400
> Jeff Garzik <[email protected]> wrote:
>
>
>>The typical definition of TOE is "offload 90+% of the net stack", as
>>opposed to "TCP assist", which is stuff like TSO.
>
>
> I think a better goal is "offload 90+% of the net stack cost" which
> is effectively what TSO does on the send side.


A better goal is to not bother with TOE at all, and just get multi-core
processors with huge memory bandwidth :)

Again, the point of my message is to have something _positive_ to tell
people when they specifically ask about TOE. Rather than "no, we'll
never do TOE", we have "it's possible, but there are better questions you
should be asking".

Jeff


2004-09-15 21:40:03

by David Miller

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 17:23:49 -0400
Jeff Garzik <[email protected]> wrote:

> The typical definition of TOE is "offload 90+% of the net stack", as
> opposed to "TCP assist", which is stuff like TSO.

I think a better goal is "offload 90+% of the net stack cost" which
is effectively what TSO does on the send side.

This is why these discussions are so circular.

If we want to discuss something specific, like receive offload
schemes, that is a very different matter. And I'm sure folks
like Rusty have a lot to contribute in this area :-)

2004-09-15 21:40:03

by Deepak Saxena

Subject: Re: The ultimate TOE design

On Sep 15 2004, at 21:04, Paul Jakma was caught saying:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>
> >Put simply, the "ultimate TOE card" would be a card with network ports, a
> >generic CPU (ARM, MIPS, whatever), some RAM, and some flash. This card's
> >"firmware" is the Linux kernel, configured to run as a _totally independent
> >network node_, with IP address(es) all its own.
> >
> >Then, your host system OS will communicate with the Linux kernel running
> >on the card across the PCI bus, using IP packets (64K fixed MTU).
>
> >My dream is that some vendor will come along and implement such a
> >design, and sell it in enough volume that it's US$100 or less.
> >There are a few cards on the market already where implementing this
> >design _may_ be possible, but they are all fairly expensive.
>
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> card running Linux. Or is that what you were referring to with
> "<cards exist> but they are all fairly expensive."?

Unfortunately all the SW that lets one make use of the interesting
features of the IXPs (microEngines, crypto, etc) is a pile of
proprietary code.

~Deepak


--
Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/

"Unlike me, many of you have accepted the situation of your imprisonment
and will die here like rotten cabbages." - Number 6

2004-09-16 00:17:54

by Imran Badr

Subject: Re: The ultimate TOE design

Please see:

"Cavium Networks Introduces OCTEON(TM) Family of Integrated Network Services
Processors With up to 16 MIPS64(R)-Based Cores for Internet Services,
Content and Security Processing"

http://www.linuxelectrons.com/article.php?story=20040913082030668&mode=print





2004-09-15 21:31:32

by Jeff Garzik

Subject: Re: The ultimate TOE design

David S. Miller wrote:
> On Wed, 15 Sep 2004 17:08:18 -0400
> Jeff Garzik <[email protected]> wrote:
>
>
>>There's nothing inherently wrong with sticking a computer running
>>Linux inside another computer ;-)
>
>
> And we already support that :-)
>
> Plus we have things like TSO too but that doesn't require a full Linux
> instance to realize on a networking port.
> Simple silicon implements this already.
> I don't see how that differs from your "big MTU" ideas.


Part of this is about how to talk to business people.... marketing.

The typical definition of TOE is "offload 90+% of the net stack", as
opposed to "TCP assist", which is stuff like TSO.

If people ask about how to support TOE in Linux, you can say "sure, we
_already_ support TOE, just stick Linux on a PCI card" rather than "no
we don't support it."

And wha-la, we support TOE with zero code changes ;-)

Jeff, who would love to have a bunch of Athlons
on PCI cards to play with.


2004-09-15 21:23:12

by Jeff Garzik

Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 04:03:57PM -0500, Wes Felter wrote:
> To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10
> GHz processor on the card? Sounds expensive.

Do you need a 5-10 GHz Intel server to handle 10 Gbps ethernet?

Jeff



2004-09-15 20:43:09

by David Schwartz

Subject: Re: The ultimate TOE design


David Stevens wrote:

> I've never understood why people are so interested in off-loading
> networking. Isn't that just a multi-processor system where you can't
> use any of the network processor cycles for anything else? And, of
> course, to be cheap, the network processor will be slower, and much
> harder to debug and update software.

The issue of debugging the network processor software and maintaining it is
certainly a legitimate one. However, nothing stops you from using the extra
network processor cycles for other purposes.

> If the PCI bus is too slow, or MTU's too small, wouldn't
> it be better to fix those directly and use a fast host processor that
> can
> also do other things when not needed for networking? And why have
> memory on a NIC that can't be used by other things?

This isn't an either-or. Processors are cheap. Memory is cheap.

> Why don't we off-load filesystems to disks instead? Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)

It's not one or the other. If, for example, your network card, graphics
card, and hard drive controller all use a common instruction set and are all
interconnected by a fast bus, code can be fairly mobile and run wherever
it's the most efficient. Nothing stops the OS from offloading internal tasks
to these processors as well.

The only real stumbling blocks have been cost/volume considerations and the
fact that the central processor(s) can be so fast, and the I/O so slow in
comparison, that there's not much to gain.

DS


2004-09-16 01:05:35

by jamal

Subject: Re: The ultimate TOE design

Jeff,
You are only allowed to start a TOE thread every six months ;->

On a serious note, I think that PCI Express (if it lives up to its
expectations) will demolish the dreams behind a lot of these TOE
investments. Our problem right now is NOT the CPU (80% idle processing
450Kpps forwarding); bus and memory distance/latency are. If Intel would
get rid of the big conspiracy in the form of the chipset division and
just integrate the memory controller like AMD does, we'll be on our way
to killing TOE and a lot of the network processors (like the IXP). Dang,
running Linux is more exciting than microcoding things to fit into a
2Kword program store.

I rest my canadiana $.02

cheers,
jamal

2004-09-16 01:13:49

by Andrea Arcangeli

Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 01:53:08PM -0700, David S. Miller wrote:
> There are absolutely no justified economics in these
> TOE engines. By the time you deploy them, the cpus
> and memory catch up and what's more those are general
> purpose and not just for networking as David Stevens
> and others have said.

I'm not sure economics are the worst part of what is being shipped;
to me the worst part is security. I'd never trust such a
non-open-source TCP stack for anything critical, even if it were
much cheaper and more performant. Even my PDA is using the Linux TCP
stack, and my cell phone only speaks UDP with the WAP server anyway.
TCP segmentation offload, OTOH, doesn't involve much "intelligence" in
the NIC, and it's very reasonable to trust it, especially because all the
incoming packets (the real potential offenders) are still processed by
the Linux TCP stack.

2004-09-16 01:17:55

by Neil Horman

Subject: Re: The ultimate TOE design

Paul Jakma wrote:
> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>
>> Put simply, the "ultimate TOE card" would be a card with network
>> ports, a generic CPU (ARM, MIPS, whatever), some RAM, and some
>> flash. This card's "firmware" is the Linux kernel, configured to run
>> as a _totally independent network node_, with IP address(es) all its own.
>>
>> Then, your host system OS will communicate with the Linux kernel
>> running on the card across the PCI bus, using IP packets (64K fixed MTU).
>
>
>> My dream is that some vendor will come along and implement such a
>> design, and sell it in enough volume that it's US$100 or less. There
>> are a few cards on the market already where implementing this design
>> _may_ be possible, but they are all fairly expensive.
>
>
> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI card
> running Linux. Or is that what you were referring to with "<cards exist>
> but they are all fairly expensive."?
>
>> Jeff
>
>
> regards,

IBM's PowerNP chip was also very similar (a PowerPC core with lots of
hardware assists for DMA and packet inspection in the extended register
area). Don't know if they still sell it, but at one time I had heard
they had booted Linux on it.
Neil

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-09-16 05:27:15

by Leonid Grossman

Subject: RE: The ultimate TOE design



> -----Original Message-----
> From: jamal [mailto:[email protected]]
> Sent: Wednesday, September 15, 2004 5:58 PM
> To: Jeff Garzik
> Cc: David S. Miller; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: Re: The ultimate TOE design
>
> Jeff,
> You are only allowed to start a TOE thread only every six months ;->
>
> On a serious note, I think that PCI Express (if it lives up to its
> expectations) will demolish the dreams behind a lot of these TOE
> investments. Our problem right now is NOT the CPU (80% idle processing
> 450Kpps forwarding); bus and memory distance/latency are.

In servers, both bottlenecks are there - if you look at the cost of TCP
and filesystem processing at 10GbE, CPU is a huge problem (and will be
for the foreseeable future), even for the fastest 64-bit systems.
I agree though that bus and memory are bigger issues; this is exactly the
reason for all these RDMA over Ethernet investments :-)
Anyways, I did not mean to start an argument - with all the new CPU, bus
and HBA technologies coming to the market it will be another 18-24 months
before we know what works and what doesn't...
Leonid



2004-09-16 05:51:51

by Matt Porter

Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 04:26:09PM -0400, Neil Horman wrote:
> IBM's PowerNP chip was also very similar (a PowerPC core with lots of
> hardware assists for DMA and packet inspection in the extended register
> area). Don't know if they still sell it, but at one time I had heard
> they had booted Linux on it.

Well, yes, PowerNP support has been in the kernel for years, and embedded
Linux distros like MontaVista support them. It's no longer an IBM chip,
though. AMCC purchased the PPC4xx network processors (PowerNP) from
IBM and later purchased the entire standard SoC PPC4xx product line
from IBM. That is, except for the PPC4xx STB chips like those found in
the Hauppauge MediaMVP; IBM retained those. AMCC pretty much owns all
of the PPC4xx line, and the PowerNP 405H/L are still available.

-Matt

2004-09-16 09:03:43

by Lars Marowsky-Bree

Subject: Re: The ultimate TOE design

On 2004-09-15T15:33:47,
Jeff Garzik <[email protected]> said:

> Then, your host system OS will communicate with the Linux kernel running
> on the card across the PCI bus, using IP packets (64K fixed MTU).
>
> This effectively:

Actually, given that there's almost no reason to offload TCP/IP
processing for speed (the money is better spent on CPU / memory for the
main system), I like the idea of this for security: off-load the packet
filtering to create an additional security barrier. (Different CPU
architecture and all that.)

(With two cards, one could even use the conntrack fail-over internally.
- A Linux-running NIC with builtin firewalling; sell it to all the windows
weenies... ;)
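
Concretely, and purely hypothetically -- assuming the card names its
wire port eth0 and the 64K-MTU link to the host pci0 -- plain 2004-era
iptables on the card would be all it takes:

   # default-deny forwarding between the wire and the host link
   iptables -P FORWARD DROP
   # let the host talk outwards
   iptables -A FORWARD -i pci0 -o eth0 -j ACCEPT
   # and only let replies back in
   iptables -A FORWARD -i eth0 -o pci0 \
            -m state --state ESTABLISHED,RELATED -j ACCEPT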

With dedicated processors, maybe an IPsec accelerator would also be
cool, but I'd think a crypto accelerator for the main system would again
be saner here (unless, of course, the argument of security domain
isolation is applied again).

Admittedly, one can solve all these differently, but it still might be
cool. ;-)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-09-16 09:29:50

by Lincoln Dale

Subject: Re: The ultimate TOE design

Not that I disagree with the general idea and rationale, but reality is
what it is today for some reasons:

At 07:23 AM 16/09/2004, Jeff Garzik wrote:
> Jeff, who would love to have a bunch of Athlons
> on PCI cards to play with.

. . . this ignores the realities of the power restrictions of PCI today . . .
sure, one could create a PCI card that takes a power connector, but that
doesn't scale so well either . . .

At 07:29 AM 16/09/2004, David S. Miller wrote:
>I think a better goal is "offload 90+% of the net stack cost" which
>is effectively what TSO does on the send side.
>
>This is why these discussions are so circular.

TSO works in LAN-like environments (near-zero latency, minimal drop); it
doesn't work so well across the internet . . .

I believe that there are better alternatives to TSO, but they involve NICs
having decent scatter-gather DMA engines and being able to handle
multiple transactions (packets/frames) at once.
In theory, NICs like tg2/tg3 should be capable of implementing something
like this -- if one could get at the ucode on the embedded cores.


At least with PCI Express, the general architecture of a PC starts to have
a hope of keeping up with Moore's law.
The same couldn't be said prior to DDR SDRAM and higher front-side-bus
frequencies.


cheers,

lincoln.

2004-09-16 09:58:18

by jamal

Subject: RE: The ultimate TOE design

On Thu, 2004-09-16 at 01:25, Leonid Grossman wrote:
>
> > -----Original Message-----
> > From: jamal [mailto:[email protected]]

> > On a serious note, I think that PCI Express (if it lives up to its
> > expectations) will demolish the dreams behind a lot of these TOE
> > investments. Our problem right now is NOT the CPU (80% idle processing
> > 450Kpps forwarding); bus and memory distance/latency are.
>
> In servers, both bottlenecks are there - if you look at the cost of TCP and
> filesystem processing at 10GbE, CPU is a huge problem (and will be for
> foreseeable future), even for fastest 64-bit systems.

True, but with bus contention being a non-issue you've got more of that
Xeon available for use (let's say if I can use 50% more of its capacity,
then I can do more). IOW, it becomes a compute capacity problem mostly -
one that you should in theory be able to throw more CPU at. SMT (the way
POWER5 and some of the network processors do it [1]) should go a long
way to addressing both additional compute and hardware threading to work
around memory latencies. With PCI Express, compute power in
mini-clustering in the form of AS (http://www.asi-sig.org/home) is being
plotted as we speak.
To summarize: the problem to solve in 24 months may be 100GigE.

> I agree though that bus and memory are bigger issues, this is exactly the
> reason for all these RDMA over Ethernet investments :-)

And AS does a damn good job of speccing all those RDMA requirements; my
view is that Intel is going to build the chips - so it can be done on a
$5 board off the Pacific rim. This takes most of the small players out
of the market.

> Anyways, did not mean to start an argument - with all the new CPU, bus and
> HBA technologies coming to the market it will be another 18-24 months before
> we know what works and what doesn't...

Agreed. But would you like to invest in something that will be obsoleted
in 18-24 months? Or even if not obsoleted, that carries that uncertainty?
I think that's the risk you face when you are in the offload business.

Here are results for a Hifn 7956 reference board on a 2.6GHz P4 (HT)
system, kernel 2.6.6 SMP, as compared to a software-only setup on the
same machine. [Name of tester withheld to protect privacy.]

First column - algo, second - packet size, third - time in us spent by
hw crypto, fourth - time in us spent by sw crypto:

des 64: 28 3
des 128: 29 6
des 192: 33 9
des 256: 33 12
des 320: 37 15
des 384: 38 18
des 448: 41 21
des 512: 42 23
des 576: 45 26
des 640: 46 29
des 704: 49 33
des 768: 50 35
des 832: 53 38
des 896: 54 41
des 960: 57 44
des 1024: 58 47
des 1088: 61 50
des 1152: 62 53
des 1216: 66 56
des 1280: 66 59
des 1344: 70 62
des 1408: 71 65
des 1472: 74 68
des3_ede 64: 28 6
des3_ede 128: 30 13
des3_ede 192: 34 20
des3_ede 256: 43 26
des3_ede 320: 38 33
des3_ede 384: 48 40
des3_ede 448: 44 45
des3_ede 512: 54 53
des3_ede 576: 50 60
des3_ede 640: 59 67
des3_ede 704: 55 74
des3_ede 768: 66 78
des3_ede 832: 61 85
des3_ede 896: 72 94
des3_ede 960: 67 100
des3_ede 1024: 77 107
des3_ede 1088: 73 114
des3_ede 1152: 82 121
des3_ede 1216: 79 127
des3_ede 1280: 88 128
des3_ede 1344: 84 135
des3_ede 1408: 94 147
des3_ede 1472: 90 153
aes 64: 28 2
aes 192: 33 6
aes 320: 37 10
aes 448: 46 15
aes 576: 53 19
aes 704: 53 23
aes 832: 65 28
aes 960: 66 32
aes 1088: 71 37
aes 1216: 80 41
aes 1344: 83 45
aes 1472: 92 50

Moral of the data above: the 2.6GHz P4 is already showing signs of
obsoleting the hifn crypto offloader [2]. I think it took less than a
year for that to happen.
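
Reading the table (lower is better, and assuming the per-size pairs are
directly comparable):

   des:       hw never wins  (74us hw vs 68us sw even at 1472B)
   des3_ede:  hw starts winning around the ~500B sizes (50 vs 60 at 576B)
   aes:       hw never wins  (92us hw vs 50us sw at 1472B)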

cheers,
jamal

[1] I also like the MIPS.com approach to SMT

[2] There are actually issues with some of the crypto offloading in
Linux; however this does serve as a good example.

2004-09-16 11:42:08

by Neil Horman

Subject: Re: The ultimate TOE design

Wes Felter wrote:
> Neil Horman wrote:
>
>> Paul Jakma wrote:
>>
>>> On Wed, 15 Sep 2004, Jeff Garzik wrote:
>>>
>>>> Put simply, the "ultimate TOE card" would be a card with network
>>>> ports, a generic CPU (ARM, MIPS, whatever), some RAM, and some
>>>> flash. This card's "firmware" is the Linux kernel, configured to
>>>> run as a _totally independent network node_, with IP address(es) all
>>>> its own.
>>>>
>>>> Then, your host system OS will communicate with the Linux kernel
>>>> running on the card across the PCI bus, using IP packets (64K fixed
>>>> MTU).
>
>
>>> The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
>>> card running Linux. Or is that what you were referring to with
>>> "<cards exist> but they are all fairly expensive."?
>
>
>> IBM's PowerNP chip was also very similar (a PowerPC core with lots of
>> hardware assists for DMA and packet inspection in the extended
>> register area). Don't know if they still sell it, but at one time I
>> had heard they had booted Linux on it.
>
>
> An IXP or PowerNP wouldn't work for Jeff's idea. The IXP's XScale core
> and PowerNP's PowerPC core are way too slow to do any significant
> processing; they are intended for control tasks like updating the
> routing tables. All the work in the IXP or PowerNP is done by the
> microengines, which have weird, non-Linux-compatible architectures.
>
I didn't say the assist hardware wouldn't need an extra driver. It's not
100% free, as Jeff proposes, but the CPU portion of these designs is
_sufficient_ to run Linux, and a driver can be written to drive the
remainder of these chips. That combination is what network device
manufacturers design to today: a specialized chip to do L3/L2 forwarding
at line rate over a large number of ports, and just enough general
purpose CPU to manage the user interface, the forwarding hardware and
any overflow forwarding that the forwarding hardware can't deal with
quickly.
> To do 10 Gbps Ethernet with Jeff's approach, wouldn't you need a 5-10
> GHz processor on the card? Sounds expensive.
>
To handle port densities that are competitive in the market today? Yes,
which as I mentioned earlier would price designs like this out of the
market. Jeff's idea is a nice one, but it doesn't really fit well with
the hardware that networking equipment manufacturers are building today.
Take a look at Broadcom's StrataSwitch/StrataXGS lines, or SwitchCore's
Xpeedium processors. These are the sorts of things we have to work with.
They provide network stack offload in competitive port densities, but
they aren't also general purpose processors. They need a driver to
massage their behavior into something more Linux friendly. If we could
develop an infrastructure that made these chips easy to integrate into a
platform running Linux, Linux could quickly come to dominate a large
portion of the network device space.

Neil

> Wes Felter - [email protected] - http://felter.org/wesley/


--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-09-16 13:15:41

by Valdis Klētnieks

Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 14:11:04 MDT, David Stevens said:

> Why don't we off-load filesystems to disks instead? Or a graphics
> card that implements X ? :-) I'd rather have shared system resources--
> more flexible. :-)

All depends where in the "cycle of reincarnation" we are at the moment. Way
back in 1964, IBM released this monster called System/360 - and one of the
things it did was push a *lot* of the disk processing off on the channel and
disk controller using a count-key-data format rather than the fixed-block that
Linux uses. So out on the platters, the disk format would say things like "This
is a 400 byte record, the first 56 bytes of which are a search key". A lot
of stuff, both userspace and OS, used things like 'Search Key Equal',
letting the disk do all the searching.

There was also this terminal beast called the 3270, which had a local
controller for the terminals, and only interrupted the CPU on 'page send' type
events.

Back then, the ideas made sense - it wasn't at all unreasonable for a single
S/360-65 to drive 3,000+ concurrent terminals in an airline reservation system or
similar (and we're talking about a box that had literally only half the
hamsters of a VAX780).

But today, the 3270 isn't seen much anymore, and currently IBM emulates the CKD
format on fixed-block systems for their z/Series boxes running z/OS or whatever MVS is
called now....



2004-09-16 13:24:06

by Alan

Subject: Re: The ultimate TOE design

On Iau, 2004-09-16 at 10:29, Lincoln Dale wrote:
> . . . this ignore the realities of power restrictions of PCI today . . .
> sure, one could create a PCI card that takes a power-connector, but that
> don't scale so well either . . .

At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
controller. I'm sure it's no coincidence that PowerPC shows up on such
boards a lot more than x86, however.


2004-09-16 13:34:10

by Andi Kleen

Subject: Re: The ultimate TOE design

On Thu, Sep 16, 2004 at 01:19:21PM +0100, Alan Cox wrote:
> On Iau, 2004-09-16 at 10:29, Lincoln Dale wrote:
> > . . . this ignore the realities of power restrictions of PCI today . . .
> > sure, one could create a PCI card that takes a power-connector, but that
> > don't scale so well either . . .
>
> At 1Ghz the Athlon Geode NX draws about 6W. Thats less than my SCSI

Are you sure that's worst case, not average? Worst case is usually
much worse on a big CPU like an Athlon, but the power supply
has to be sized for it.

-Andi

2004-09-16 14:04:11

by Alan

Subject: Re: The ultimate TOE design

On Iau, 2004-09-16 at 14:33, Andi Kleen wrote:
> > At 1Ghz the Athlon Geode NX draws about 6W. Thats less than my SCSI
>
> Are you sure that's worst case, not average? Worst case is usually
> much worse on a big CPU like an Athlon, but the power supply
> has to be sized for it.

You are correct - 6W average, 9W TDP; still less than my SCSI controller
8)

2004-09-16 15:07:16

by Leonid Grossman

Subject: RE: The ultimate TOE design



> -----Original Message-----
> From: jamal [mailto:[email protected]]
> Sent: Thursday, September 16, 2004 2:58 AM
> To: Leonid Grossman
> Cc: 'Jeff Garzik'; 'David S. Miller';
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: RE: The ultimate TOE design
>
> On Thu, 2004-09-16 at 01:25, Leonid Grossman wrote:
> >
> > > -----Original Message-----
> > > From: jamal [mailto:[email protected]]
>
> > > On a serious note, I think that PCI-express (if it lives upto its
> > > expectation) will demolish dreams of a lot of these TOE
> investments.
> > > Our problem is NOT the CPU right now (80% idle processing 450Kpps
> > > forwarding). Bus and memory distance/latency are.
> >
> > In servers, both bottlenecks are there - if you look at the cost of
> > TCP and filesystem processing at 10GbE, CPU is a huge problem (and
> > will be for foreseeable future), even for fastest 64-bit systems.
>
> True, but with the bus contention being a non-issue you got
> more of that xeon being available for use (lets say i can use
> 50% more of its capacity then i can do more). IOW, it becomes
> a compute capacity problem mostly - one that you should in
> theory be able to throw more CPU at. SMT (the way power5 and
> some of the network processors do it[1]) should go a long way
> to address both additional compute and hardware threading to
> work around memory latencies. With PCI-express, compute power
> in mini-clustering in the form of AS
> (http://www.asi-sig.org/home) is being plotted as we speak.
> To sumarize: The problem to solve in 24 months maybe 100Gige.
>
> > I agree though that bus and memory are bigger issues, this
> is exactly
> > the reason for all these RDMA over Ethernet investments :-)
>
> And AS does a damn good job at specing all those RDMA
> requirements; my view is that intel is going to build them
> chips - so it can be done on a
> $5 board off the pacific rim. This takes most of the small
> players out of the market.
>
> > Anyways, did not mean to start an argument - with all the
> new CPU, bus
> > and HBA technologies coming to the market it will be another 18-24
> > months before we know what works and what doesn't...
>
> Agreed. Would you like to invest on something that will obsoleted in
> 18-24 months though? OR even not obsoleted, but holds that
> uncertainty?
> I think thats the risk facing you when you are in the offload
> bussiness.

Well... Any business has risks; this one doesn't seem to be riskier than
others :-)
I view the 18-24 mo timeframe as the start of the offload mass-adoption,
not the end of it.

In our tests, the bus contention and the %cpu are mostly orthogonal
problems; PCI-X DDR and PCI Express will help, but only to a point.
(BTW this is all related to the higher end systems - 2-4 way and above,
running 10GbE NICs. Client is a different story; cpu is mostly "free"
there.)
My sense is that (unlike in previous cycles) the "slow host, fast network"
scenario is here to stay for a long while, and will have to be addressed
one way or another - whether it is a full TOE+RDMA offload in the longer
run, or an improvement to "static" offloads.
In server space, applications will never be happy with less than 80% cpu.

Leonid

>
> Here are results for Hifn 7956 ref board on 2.6GHz P4 (HT)
> system, kernel 2.6.6 SMP as compared to a s/ware only setup
> on same machine.
> [Name of tester withheld to protect privacy].
>
> first column - algo, second - packet size, third - time in us
> spend by hw crypto, forth - time in us spent by sw crypto:
>
> des 64: 28 3
> des 128: 29 6
> des 192: 33 9
> des 256: 33 12
> des 320: 37 15
> des 384: 38 18
> des 448: 41 21
> des 512: 42 23
> des 576: 45 26
> des 640: 46 29
> des 704: 49 33
> des 768: 50 35
> des 832: 53 38
> des 896: 54 41
> des 960: 57 44
> des 1024: 58 47
> des 1088: 61 50
> des 1152: 62 53
> des 1216: 66 56
> des 1280: 66 59
> des 1344: 70 62
> des 1408: 71 65
> des 1472: 74 68
> des3_ede 64: 28 6
> des3_ede 128: 30 13
> des3_ede 192: 34 20
> des3_ede 256: 43 26
> des3_ede 320: 38 33
> des3_ede 384: 48 40
> des3_ede 448: 44 45
> des3_ede 512: 54 53
> des3_ede 576: 50 60
> des3_ede 640: 59 67
> des3_ede 704: 55 74
> des3_ede 768: 66 78
> des3_ede 832: 61 85
> des3_ede 896: 72 94
> des3_ede 960: 67 100
> des3_ede 1024: 77 107
> des3_ede 1088: 73 114
> des3_ede 1152: 82 121
> des3_ede 1216: 79 127
> des3_ede 1280: 88 128
> des3_ede 1344: 84 135
> des3_ede 1408: 94 147
> des3_ede 1472: 90 153
> aes 64: 28 2
> aes 192: 33 6
> aes 320: 37 10
> aes 448: 46 15
> aes 576: 53 19
> aes 704: 53 23
> aes 832: 65 28
> aes 960: 66 32
> aes 1088: 71 37
> aes 1216: 80 41
> aes 1344: 83 45
> aes 1472: 92 50
>
> Moral of the data above: the 2.6GHz P4 is already showing signs
> of obsoleting the hifn crypto offloader[2]. I think it took
> less than a year for that to happen.
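>
> Back-of-the-envelope, from the largest DES packets above (rough
> numbers only; note that for des3_ede the card still wins at the
> large sizes):
>
>     sw: 1472 bytes / 68 us  ~= 21.6 MB/s
>     hw: 1472 bytes / 74 us  ~= 19.9 MB/s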
>
> cheers,
> jamal
>
> [1] I also like the MIPS.com approach to SMT
>
> [2] There are actually issues with some of the crypto
> offloading in Linux; however this does serve as a good example.
>

2004-09-16 22:38:21

by Lincoln Dale

[permalink] [raw]
Subject: Re: The ultimate TOE design

Hi Alan,

At 10:57 PM 16/09/2004, Alan Cox wrote:
>On Iau, 2004-09-16 at 14:33, Andi Kleen wrote:
> > > At 1GHz the Athlon Geode NX draws about 6W. That's less than my SCSI
> >
> > Are you sure that's worst case, not average? Worst case is usually
> > much worse on a big CPU like an Athlon, but the power supply
> > has to be sized for it.
>
>You are correct - 6W average, 9W TDP, still less than my SCSI controller
>8)

sure -- ok -- that gets you the main processor.
now add to that a Northbridge (perhaps AMD doesn't need that, but I'm sure it
still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY, ...

couple that with the voltage of PCI, where you're likely to need
step-up/step-down circuits (which aren't 100% efficient themselves), and
you're still going to get very close to the limit, if not over it.
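
To put rough numbers on that (the component figures below are guesses;
the 25W figure is the conventional PCI slot maximum):

    CPU (TDP)              ~9 W
    north + south bridge   ~5 W
    DDR SDRAM              ~3 W
    MAC + PHY              ~3 W
    regulator losses       ~15% on top
    ----------------------------------
    total                  ~23 W of a ~25 W slot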

... and after all that, the Geode really is designed to be an embedded
processor.
Jeff was implying the use of garden-variety processors, which tend to have
large heatsinks, not to mention cooling fans and quite significant heat
generation.

we're not _quite_ at the stage of being able to take garden-variety
processors and build-your-own-blade-server using PCI _just_ yet. :-)


cheers,

lincoln.

2004-09-17 06:47:02

by Eric D. Mudama

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <[email protected]> wrote:
> Why don't we off-load filesystems to disks instead?

Disks have had file systems on them since close to the beginning...

2004-09-17 13:39:53

by Jörn Engel

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, 17 September 2004 08:37:17 +1000, Lincoln Dale wrote:
>
> sure -- ok -- that gets you the main processor.
> now add to that a Northbridge (perhaps AMD doesn't need that, but I'm sure it
> still does), Southbridge, DDR-SDRAM, ancillary chips for doing MAC, PHY,
> ...
>
> couple that with the voltage of PCI where you're likely to need
> step-up/step-down circuits (which aren't 100% efficient themselves), you're
> still going to get very close to the limit, if not over it.
>
> ... and after all that, the Geode is really designed to be an embedded
> processor.
> Jeff was implying using garden-variety processors which seem to have large
> heatsinks, not to mention cooling fans, not to mention quite significant
> heat generation.
>
> we're not _quite_ at the stage of being able to take garden-variety
> processors and build-your-own-blade-server using PCI _just_ yet. :-)

FWIW, I've already been working with complete systems that suck their
power from PCI. They do exist, just not in the grocery store next
door.

Jörn

--
The exciting thing about writing is creating an order where none
existed before.
-- Doris Lessing

2004-09-17 15:34:32

by Alan

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Gwe, 2004-09-17 at 07:46, Eric Mudama wrote:
> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <[email protected]> wrote:
> > Why don't we off-load filesystems to disks instead?
>
> Disks have had file systems on them since close to the beginning...

This is essentially the path Lustre is taking, although it seems you
don't want a "full" file system on the disk, since you lose too much
flexibility; instead you want the ability to allocate by handle,
giving hints about locality and use.

People have also tried full file system offload - Intel for example
prototyped an I2O file system class, and Adaptec clearly were trying
this out in aacraid development, judging from the public headers.

Alan

2004-09-17 20:27:47

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, 17 Sep 2004 00:46:59 MDT, Eric Mudama said:
> On Wed, 15 Sep 2004 14:11:04 -0600, David Stevens <[email protected]> wrote:
> > Why don't we off-load filesystems to disks instead?
>
> Disks have had file systems on them since close to the beginning...

No, he means "offload the processing of the filesystem to the disk itself".

IBM's MVS systems basically did that - they used the disk's "Search Key" I/O
opcodes to get the equivalent of doing namei() out on the disk itself
(MVS did this for system catalog and PDS directory searches from the beginning,
and added "indexed VTOC" support in the mid-80s). So you'd send out a CCW
(channel command word) stream that said "Find me the dataset
USER3.ACCTING.TESTJOBS", and when the I/O completed, you'd have the DSCB (the
moral equivalent of an inode) ready to go.
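
For those who never touched channel programs, here is a very rough C
sketch of the shape of the thing. The field layout and command
mnemonics are from memory and purely illustrative - consult the real
ECKD documentation before trusting any of it:

    /* One channel command word (format approximate, for illustration). */
    struct ccw {
            unsigned char  cmd;      /* channel command code */
            unsigned char  addr[3];  /* 24-bit host data address */
            unsigned char  flags;    /* e.g. the command-chaining bit */
            unsigned char  unused;
            unsigned short count;    /* byte count */
    };

    /*
     * Conceptual channel program for the lookup:
     *
     *   SEARCH KEY EQUAL   key = "USER3.ACCTING.TESTJOBS"
     *   TIC *-8            branch back: keep searching until a hit
     *   READ DATA          on the hit, fetch the DSCB into host memory
     *
     * The control unit walks the records on its own; the host just
     * sleeps until the I/O interrupt arrives with the DSCB in place.
     */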



2004-09-17 20:39:30

by David Lang

[permalink] [raw]
Subject: Re: The ultimate TOE design

Actually, the sector-based access made to modern drives is a very
primitive filesystem. If you go back to the days of the MFM and RLL
drives, the computer sent the raw bitstreams to the drive, but with
SCSI and IDE this stopped: you instead send a higher-level logical
block address to the drive, and it deals with the details of getting
the data to and from the platter.

David Lang


--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2004-09-17 23:21:06

by Tony Lee

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, 17 Sep 2004 13:36:14 -0700 (PDT), David Lang
<[email protected]> wrote:
> Actually, the sector-based access made to modern drives is a very
> primitive filesystem. If you go back to the days of the MFM and RLL
> drives, the computer sent the raw bitstreams to the drive, but with
> SCSI and IDE this stopped: you instead send a higher-level logical
> block address to the drive, and it deals with the details of getting
> the data to and from the platter.
>
> David Lang
>

Maybe the next evolutionary step is to put the VFS layer directly on top
of RDMA -> PCI Express/latest serial I/O, etc.
Similar to accessing files through NFS/SMB, just over a faster,
standardized (RDMA) transport.


On the networking front, instead of TOE it should be services offload,
similar to a web load balancer: offload services based on src/dest
address, port, and protocol (TCP/UDP).
NSO (Network Service Offload) - kind of like Apache's reverse proxy
with URL rewriting, but maybe for other applications.
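
Something as simple as a 5-tuple match could decide which flows the
card claims. A minimal sketch in C - every name here is invented for
illustration, no such driver API exists:

    #include <stdint.h>

    /* Hypothetical steering rule for an NSO card. */
    struct nso_rule {
            uint32_t saddr, daddr;  /* IPv4 src/dst, network byte order */
            uint16_t sport, dport;  /* TCP/UDP ports, network byte order */
            uint8_t  proto;         /* IPPROTO_TCP or IPPROTO_UDP */
            uint8_t  action;        /* e.g. serve on card vs. pass to host */
    };

    /* Return nonzero if a packet's 5-tuple matches the rule. */
    static int nso_match(const struct nso_rule *r,
                         uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint8_t proto)
    {
            return saddr == r->saddr && daddr == r->daddr &&
                   proto == r->proto &&
                   sport == r->sport && dport == r->dport;
    }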



Question for Leonid of S2io.com: your company has an interesting card.
I think it must have some kind of embedded CPU. Care to tell us what
kind of CPU it uses?


--
-Tony
Having a lot of fun with Xilinx Virtex Pro II reconfigurable HW + ppc + Linux

2004-09-17 23:36:50

by Leonid Grossman

[permalink] [raw]
Subject: RE: The ultimate TOE design



> -----Original Message-----
> From: Tony Lee [mailto:[email protected]]
> Sent: Friday, September 17, 2004 4:21 PM
Skipped...

> Question for Leonid of S2io.com: Your company has an
> interesting card.
> I think it must have some kind of embedded CPU. Care to tell
> us what kind of CPU are they?

Hi Tony,
For the 10GbE card, we designed our own ASIC - embedded CPUs don't cut it at
10GbE...
Leonid


> --
> -Tony
> Having a lot of fun with Xilinx Virtex Pro II reconfigurable
> HW + ppc + Linux
>

2004-09-22 20:23:32

by Nivedita Singhvi

[permalink] [raw]
Subject: Re: The ultimate TOE design

Leonid Grossman wrote:

>>From: Nivedita Singhvi [mailto:[email protected]]
>>Sent: Thursday, September 16, 2004 9:19 AM
>>To: Leonid Grossman
>>Cc: 'Andi Kleen'; 'David S. Miller'; 'John Heffner';
>>[email protected]
>>Subject: Re: The ultimate TOE design
>>
>>Leonid Grossman wrote:
>>
>>
>>>We can dream about benefits of huge MTUs, but the reality is that
>>>moving beyond 9k MTU is years away. Reasons - mainly infrastructure,
>>>plus MTU above ~10k may lose checksum protection (granted, this
>>>depends whether the errors are simple or complex, and also this may
>>>not be a showstopper for some people).
>>>Even 9k MTU is very far from being universally accepted,
>>>eight years after our Alteon spec went out :-).
>>
>>One other factor is TCP congestion control, and congestion
>>windows we obey. Most of the time, you just can't send that much.
>
>
> It's a bit painful to set up, but in general with 9k jumbos and TSO we were
> able to get close to the PCI-X 133 limit - both in LAN and WAN tests.
> Leonid

Cool, but a very specific environment, no? ;)

What concerns me about all this is that it seems
a very host-centric design. Wouldn't it be nice if
we had a little more network-centric worldview
when designing network infrastructure?

It isn't just a matter of how hard we can push stuff
out; it also matters how much the network can take.
Blasting tens of gigs into the ether seems all very
exciting, sexy, and cool, but it is suited for dedicated links
or network-attached storage channels, not general-purpose
networking on the Internet or intranets.

And if that is the case, we're talking about a much
smaller market (but perhaps a more profitable
one ;))...

thanks,
Nivedita



2004-09-22 23:25:41

by Eric D. Mudama

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, 17 Sep 2004 16:27:31 -0400, [email protected]
<[email protected]> wrote:
> No, he means "offload the processing of the filesystem to the disk itself".

I know what was meant.

I'm not saying the filesystem on the drive is very advanced, but it's
still a filesystem. Our "Record ID" is the LBA identifier, and all
records are 1 block in size. We can handle defects, reallocations,
and other issues, with some success.
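
Seen from the host, that "filesystem" has about the smallest API
imaginable. A userspace sketch, purely for illustration (a real
initiator speaks SCSI/ATA, and the drive handles remapping
internally):

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SECTOR 512  /* bytes per record, keyed by LBA */

    /* Read one fixed-size record. */
    ssize_t read_record(int fd, uint64_t lba, void *buf)
    {
            return pread(fd, buf, SECTOR, (off_t)lba * SECTOR);
    }

    /* Write one record; the drive may transparently remap a bad block. */
    ssize_t write_record(int fd, uint64_t lba, const void *buf)
    {
            return pwrite(fd, buf, SECTOR, (off_t)lba * SECTOR);
    }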

2004-09-23 04:48:05

by Leonid Grossman

[permalink] [raw]
Subject: RE: The ultimate TOE design


> >
> > It's a bit painful to set up, but in general with 9k jumbos
> > and TSO we were able to get close to the PCI-X 133 limit -
> > both in LAN and WAN tests.
> > Leonid
>
> Cool, but a very specific environment, no? ;)

Define specific environment :-). We are running common TCP benchmarks like
nttcp or iperf or Chariot, or filesystem applications, on very generic white
boxes with generic OS/settings.
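
(The ceiling there is easy to work out - numbers approximate:

    PCI-X 133, 64-bit:  133.3 MHz * 8 bytes ~= 1066 MB/s => ~8.5 Gbit/s raw

so "close to the limit" means the bus, not the 10GbE wire, is the
bottleneck.)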

>
> What concerns me about all this is that it seems a very
> host-centric design. Wouldn't it be nice if we had a little
> more network-centric worldview when designing network
> infrastructure?
>
> It isn't just a matter of how hard we can push stuff out; it
> also matters how much the network can take.
> Blasting tens of gigs into the ether seems all very exciting,
> sexy, and cool, but it is suited for dedicated links or network
> attached storage channels, not general-purpose networking on
> the Internet or intranets.

This is somewhat different from the IB or FC "miniature networks";
some/most 10GbE testing runs in existing datacenters or over
existing long-haul links - see for example
http://sravot.home.cern.ch/sravot/Networking/10GbE/LSR_041504.htm

Cheers, Leonid

>
> And if that is the case, we're talking about a much smaller
> market (but perhaps a more profitable one ;))...
>
> thanks,
> Nivedita
>
>
>

2004-09-24 13:07:40

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:

> And at 10GbE, embedded CPUs just don't cut it - it has to be custom ASIC
> (granted, with some means to simplify debugging and reduce the risk of hw
> bugs and TCP changes).

Intel's IXP2800 can do 10GbE.

http://www.intel.com/design/network/products/npfamily/ixp2800.htm


--L

2004-09-24 13:13:51

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Wed, Sep 15, 2004 at 02:36:00PM -0700, Deepak Saxena wrote:

> > The intel IXP's are like the above, XScale+extra-bits host-on-a-PCI
> > card running Linux. Or is that what you were referring to with
> > "<cards exist> but they are all fairly expensive."?
>
> Unfortunately all the SW that lets one make use of the interesting
> features of the IXPs (microEngines, crypto, etc) is a pile of
> proprietary code.

I'm working on open source microengine code for the IXP line, which
should be available Real Soon Now(TM).


--L

2004-09-24 13:23:09

by Leonid Grossman

[permalink] [raw]
Subject: RE: The ultimate TOE design



> -----Original Message-----
> From: Lennert Buytenhek [mailto:[email protected]]
> Sent: Friday, September 24, 2004 6:08 AM
> To: Leonid Grossman
> Cc: 'David S. Miller'; 'Jeff Garzik';
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: The ultimate TOE design
>
> On Wed, Sep 15, 2004 at 04:29:45PM -0700, Leonid Grossman wrote:
>
> > And at 10GbE, embedded CPUs just don't cut it - it has to be custom
> > ASIC (granted, with some means to simplify debugging and reduce the
> > risk of hw bugs and TCP changes).
>
> Intel's IXP2800 can do 10GbE.

Hi Lennert,
I was referring to the server side.
One can certainly build a 10GbE box based on the IXP2800 (or some other
parts), but at 17-25W it is not usable in NICs, since the entire PCI card
power budget is less than that - nothing left for the 10GbE PHY, memory, etc.
Leonid

>
> http://www.intel.com/design/network/products/npfamily/ixp2800.htm
>
>
> --L
>

2004-09-24 18:13:16

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, Sep 24, 2004 at 06:21:35AM -0700, Leonid Grossman wrote:

> > > And at 10GbE, embedded CPUs just don't cut it - it has to be custom
> > > ASIC (granted, with some means to simplify debugging and reduce the
> > > risk of hw bugs and TCP changes).
> >
> > Intel's IXP2800 can do 10GbE.
>
> Hi Lennert,

Hello,


> I was referring to the server side.
> One can certainly build a 10GbE box based on the IXP2800 (or some other
> parts), but at 17-25W it is not usable in NICs, since the entire PCI card
> power budget is less than that - nothing left for the 10GbE PHY, memory, etc.

Ah, ok, that makes sense. As someone else also noted, the IXP2800
only has a 64/66 PCI interface anyway, so it wouldn't really be
suitable for the task you were referring to.


cheers,
Lennert

2004-09-24 19:39:27

by Joel Jaeggli

[permalink] [raw]
Subject: Re: The ultimate TOE design

On Fri, 24 Sep 2004, Lennert Buytenhek wrote:

>
>> I was referring to the server side.
>> One can certainly build a 10GbE box based on the IXP2800 (or some other
>> parts), but at 17-25W it is not usable in NICs, since the entire PCI card
>> power budget is less than that - nothing left for the 10GbE PHY, memory, etc.

I have a graphics card which requires two four-pin molex power connectors;
going back in time, there have always been certain peripheral cards which
required external (non-bus-supplied) power sources for whatever reason
(hard drive on a card, sparc on a card, pc on a card, early-90s hardware
mpeg encoder, data collection device, remote management card, graphics card
in a modern mac, etc.). It's obviously not a general solution, but it's been
done frequently enough that it shouldn't just be discarded out of hand.

> Ah, ok, that makes sense. As someone else also noted, the IXP2800
> only has a 64/66 PCI interface anyway, so it wouldn't really be
> suitable for the task you were referring to.
>
>
> cheers,
> Lennert

--
--------------------------------------------------------------------------
Joel Jaeggli Unix Consulting [email protected]
GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2