2003-08-02 17:04:59

by Werner Almesberger

Subject: TOE brain dump

At OLS, there was a bit of discussion on (true and false *) TOEs
(TCP Offload Engines). In the course of this discussion, I've
suggested what might be a novel approach, so in case this is a
good idea, I'd like to dump my thoughts on it, before someone
tries to patent my ideas. (Most likely, some of this has already
been done or tried elsewhere, but it can't hurt to try to err on
the safe side.)

(*) The InfiniBand people unfortunately also call their TCP/IP
bypass "TOE" (for which they promptly get shouted down,
every time they use that word). This is misleading, because
there is no TCP that's getting offloaded, but TCP is simply
never done. I would consider it to be more accurate to view
this as a separate networking technology, with semantics
different from TCP/IP, similar to ATM and AAL5.

While I'm not entirely convinced about the usefulness of TOE in
all the cases it's been suggested for, I can see value in certain
areas, e.g. when TCP per-packet overhead becomes an issue.

However, I consider the approach of putting a new or heavily
modified stack, which duplicates a considerable amount of the
functionality in the main kernel, on a separate piece of hardware
questionable at best. Some of the issues:

- if this stack is closed source or generally hard to modify,
security fixes will be slowed down

- if this stack is closed source or generally hard to modify,
TOE will not be available to projects modifying the stack,
e.g. any of the research projects trying to make TCP work at
gigabit speeds

- this stack either needs to implement all administrative
interfaces of the regular kernel, or such a system would have
non-uniform configuration/monitoring across interfaces

- in some cases, administrative interfaces will require a
NIC/TOE-specific switch in the kernel (netlink helps here)

- route changes on multi-homed hosts (or any similar kind of
failover) are difficult if the state of TCP connections is
tied to specific NICs (I've discussed some issues when
"migrating" TCP connections in the documentation of tcpcp,
http://www.almesberger.net/tcpcp/)

- new kernel features will always lag behind on this kind of
TOE, and different kernels will require different "firmware"

- last but not least, keeping TOE firmware up to date with the
TCP/IP stack in the mainstream kernel will require - for each
such TOE device - a significant and continuous effort over a
long period of time

In short, I think such a solution is either a pain to use, or
unmaintainable, or - most likely - both.

So, how to do better ? Easy: use the Source, Luke. Here's my
idea:

- instead of putting a different stack on the TOE, a
general-purpose processor (probably with some enhancements,
and certainly with optimized data paths) is added to the NIC

- that processor runs the same Linux kernel image as the host,
acting like a NUMA system

- a selectable part of TCP/IP is handled on the NIC, and the
rest of the system runs on the host processor

- instrumentation is added to the mainstream kernel to ensure
that as little data as possible is shared between the main
CPU and such peripheral CPUs. Note that such instrumentation
would be generic, outlining possible boundaries, and not tied
to a specific TOE design.

- depending on hardware details (cache coherence, etc.), the
instrumentation mentioned above may even be necessary for
correctness. This would have the unfortunate effect of making
the design very fragile with respect to changes in the
mainstream kernel. (Performance loss in the case of imperfect
instrumentation would be preferable.)

- further instrumentation may be needed to let the kernel switch
CPUs (i.e. host to NIC, and vice versa) at the right time

- since the NIC would probably use a CPU design different from
the host CPU, we'd need "fat" kernel binaries:

- data structures are the same, i.e. word sizes, byte order,
bit numbering, etc. are compatible, and alignments are
chosen such that all CPUs involved are reasonably happy

- kernels live in the same address space

- function pointers become arrays, with one pointer per
architecture. When comparing pointers, the first element is
used. (There is a small sketch of this after the list.)

- if one should choose to also run parts of user space on the
NIC, fat binaries would also be needed for this (along with
other complications)
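
To make the function pointer item above a bit more concrete,
here is a minimal sketch in C. Everything in it (struct fatptr,
NR_KERNEL_ARCHS, handler_fn) is invented for illustration and
exists nowhere in the kernel:

  /* hypothetical sketch, not real kernel code */
  #define NR_KERNEL_ARCHS 2             /* e.g. 0 = host CPU, 1 = NIC CPU */

  struct fatptr {
          void *fn[NR_KERNEL_ARCHS];    /* one code pointer per architecture */
  };

  typedef int (*handler_fn)(void *data);

  /* each CPU calls through its own slot ... */
  static inline int fat_call(const struct fatptr *fp, int arch, void *data)
  {
          return ((handler_fn)fp->fn[arch])(data);
  }

  /* ... but identity checks always use element 0, so pointer
   * comparisons give the same answer on every architecture */
  static inline int fat_eq(const struct fatptr *a, const struct fatptr *b)
  {
          return a->fn[0] == b->fn[0];
  }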

Benefits:

- putting the CPU next to the NIC keeps data paths short, and
allows for all kinds of optimizations (e.g. a pipelined
memory architecture)

- the design is fairly generic, and would equally apply to
other areas of the kernel than TCP/IP

- using the same kernel image eliminates most maintenance
problems, and encourages experimenting with the stack

- using the same kernel image (and compatible data structures)
guarantees that administrative interfaces are uniform in the
entire system

- such a design is likely to be able to allow TCP state to be
moved to a different NIC, if necessary

Possible problems, that may kill this idea:

- it may be too hard to achieve correctness

- it may be too hard to switch CPUs properly

- it may not be possible to express copy operations efficiently
in such a context

- there may be no way to avoid sharing of hardware-specific
data structures, such as page tables, or to emulate their use

- people may consider the instrumentation required for this,
although fairly generic, too intrusive

- all this instrumentation may eat too much performance

- nobody may be interested in building hardware for this

- nobody may be patient enough to pursue such long-termish
development, with uncertain outcome

- something I haven't thought of

I lack the resources (hardware, financial, and otherwise) to
actually do something with these ideas, so please feel free to
put them to some use.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/


2003-08-02 17:32:55

by Nivedita Singhvi

Subject: Re: TOE brain dump

Werner Almesberger wrote:

> (*) The InfiniBand people unfortunately call also their TCP/IP
> bypass "TOE" (for which they promptly get shouted down,
> every time they use that word). This is misleading, because

Thank you! Yes! All in favor say Aye..AYE!!! Motion passes,
the infiniband people don't get to call it TOE anymore..

> While I'm not entirely convinced about the usefulness of TOE in
> all the cases it's been suggested for, I can see value in certain
> areas, e.g. when TCP per-packet overhead becomes an issue.

Ditto, but I see it being used to roll out the idea and process,
rather than anything of value now, and the lessons are being
learned for the future, when we reach 20Gb, 40Gb, even faster
networks of tomorrow. The processors might keep up, but nothing
else will, for sure.

> However, I consider the approach of putting a new or heavily
> modified stack, which duplicates a considerable amount of the
> functionality in the main kernel, on a separate piece of hardware
> questionable at best. Some of the issues:
>
> - if this stack is closed source or generally hard to modify,
> security fixes will be slowed down

as will bug fixes, and debugging becomes a right royal pain.

Also, most profiles of networking applications show the
largest blip is essentially the user<->kernel transfer, and
that would still remain the unaddressed bottleneck.

> So, how to do better ? Easy: use the Source, Luke. Here's my
> idea:
>
> - instead of putting a different stack on the TOE, a
> general-purpose processor (probably with some enhancements,
> and certainly with optimized data paths) is added to the NIC

The thing is, all the TOE efforts are proprietary ones, to
my limited knowledge. Thus all the design is occurring in
confidential, vendor-internal forums. How will they/we
come up with the really needed, _common_ design approach?

Or is this not so needed?

thanks,
Nivedita

2003-08-02 18:06:12

by Werner Almesberger

Subject: Re: TOE brain dump

Nivedita Singhvi wrote:
> Also, most profiles of networking applications show the
> largest blip is essentially the user<->kernel transfer, and
> that would still remain the unaddressed bottleneck.

I have some hope that sendfile plus a NUMA-like approach will be
sufficient for keeping transfers away from buses and memory they
don't need to touch.

> The thing is, all the TOE efforts are propietary ones, to
> my limited knowledge.

Many companies default to "closed" designs if they're not given a
convincing reason for going "open". The approach I've described
may provide that reason.

There are also historical reasons, e.g. if you want to interface
with the stack of Windows, or any proprietary Unix, you probably
need to obtain some of their source under NDA, and use some of
that information in your own drivers or firmware. Of course, none
of this is an issue here.

Since we're talking about 1-2 years of development time anyway,
legacy hardware (i.e. hardware choices influenced by information
obtained under an NDA) will be quite obsolete by then and doesn't
matter.

> Or is this not so needed?

Exactly. The "NUMA" approach would avoid the "common TOE design"
problem.

All you need is a reasonably well documented "general-purpose"
CPU (that doesn't mean it has to be an off-the-shelf design, but
most likely, the core would be an off-the-shelf one), plus some
NIC hardware. Now, if that NIC in turn has some hidden secrets,
this isn't an issue as long as one can still write a GPLed driver
for it.

Of course, there would be elements in such a system that vendors
would like to keep secret. But then, there always are, and so far,
we've found reasonable compromises most of the time, so I don't
see why this couldn't happen here, too.

Also, if "classical TOE" patches keep getting rejected, but an
open and maintainable approach makes it into the mainstream
kernel, the business aspects should also become fairly clear.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-02 19:09:14

by Jeff Garzik

Subject: Re: TOE brain dump

My own brain dump:

If one wants to go straight from disk to network, why is anyone
bothering to involve the host CPU and host memory bus at all? Memory
bandwidth and PCI bus bandwidth are still bottlenecks, no matter how much
of the net stack you offload.


Regardless of how fast your network zooms packets, you've gotta keep
that pipeline full to make use of it. And you've gotta do something
intelligent with it, which in TCP's case involves the host CPU quite a
bit. TCP is sufficiently complex, for a reason. It has to handle all
manner of disturbingly slow and disturbingly fast net connections, all
jabbering at the same time. TCP is a "one size fits all" solution, but
it doesn't work well for everyone.

The "TCP Offload Everything" people really need to look at what data
your users want to push, at such high speeds. It's obviously not over a
WAN... so steer users away from TCP, to an IP protocol that is tuned
for your LAN needs, and more friendly to some sort of h/w offloading
solution.

A "foo over ipv6" protocol that was designed for h/w offloading from the
start, would be a far better idea than full TCP offload will ever be.

In any case, when you approach these high speeds, you really must take a
good look at the other end of the pipeline: what are you serving at
10Gb/s, 20Gb/s, 40Gb/s? For some time, I think the answer will be
"highly specialized stuff". At some point, Intel networking gear will be
able to transfer more bits per second than there exist atoms on planet
Earth :) Garbage in, garbage out.

So, fix the other end of the pipeline too, otherwise this fast network
stuff is flashy but pointless. If you want to serve up data from disk,
then start creating PCI cards that have both Serial ATA and ethernet
connectors on them :) Cut out the middleman of the host CPU and host
memory bus instead of offloading portions of TCP that do not need to be
offloaded.

Jeff



2003-08-02 21:01:48

by Alan

Subject: Re: TOE brain dump

On Sad, 2003-08-02 at 18:04, Werner Almesberger wrote:
> - last but not least, keeping TOE firmware up to date with the
> TCP/IP stack in the mainstream kernel will require - for each
> such TOE device - a significant and continuous effort over a
> long period of time

or even the protocol and protocol refinements..

> - instead of putting a different stack on the TOE, a
> general-purpose processor (probably with some enhancements,
> and certainly with optimized data paths) is added to the NIC

Like say an opteron in the 2nd socket on the motherboard

> Benefits:
>
> - putting the CPU next to the NIC keeps data paths short, and
> allows for all kinds of optimizations (e.g. a pipelined
> memory architecture)

It moves the cost, it doesn't make it vanish.

If I read you right you are arguing for a second processor running
Linux with its own independent memory bus. AMD make those already;
it's called AMD64. I don't know anyone thinking at that level about
partitioning one as an I/O processor.


2003-08-02 21:49:27

by Werner Almesberger

Subject: Re: TOE brain dump

Jeff Garzik wrote:
> jabbering at the same time. TCP is a "one size fits all" solution, but
> it doesn't work well for everyone.

But then, ten "optimized xxPs" that work well in two different
scenarios each, but not so good in the 98 others, wouldn't be
much fun either.

It's been tried a number of times. Usually, real life sneaks
in at one point or another, leaving behind a complex mess.
When they've sorted out these problems, regular TCP has caught
up with the great optimized transport protocols. At that point,
they return to their niche, sometimes tail between legs and
muttering curses, sometimes shaking their fist and boldly
proclaiming how badly they'll rub TCP in the dirt in the next
round. Maybe they shed off some of the complexity, and trade it
for even more aggressive optimization, which puts them into
their niche even more firmly. Eventually, they fade away.

There are cases where TCP doesn't work well, like a path of
badly mismatched link layers, but such paths don't treat any
protocol following the end-to-end principle kindly.

Another problem of TCP is that it has grown a bit too many
knobs you need to turn before it works over your really fast
really long pipe. (In one of the OLS after dinner speeches,
this was quite appropriately called the "wizard gap".)
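
As one concrete example of that knob-turning: the socket buffers
have to be sized to the bandwidth-delay product of the pipe, or
the window never opens up. A rough userspace sketch, with
made-up numbers, and ignoring that the kernel clamps such
requests to its rmem_max/wmem_max sysctls (which are the other
half of the knobs):

  #include <stdio.h>
  #include <sys/socket.h>

  int tune_for_long_fat_pipe(int sock)
  {
          const double gbit_per_s = 2.5;          /* link speed */
          const double rtt_s = 0.2;               /* ~14,000 km and back */
          int bdp = (int)(gbit_per_s * 1e9 / 8 * rtt_s);  /* bytes in flight */

          if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0 ||
              setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp)) < 0)
                  return -1;
          printf("requested %d byte socket buffers\n", bdp);
          return 0;
  }

That's already several knobs for a single connection, which is
the point.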

> It's obviously not over a WAN...

That's why NFS turned off UDP checksums ;-) As soon as you put
it on IP, it will crawl to distances you didn't imagine in your
wildest dreams. It always does.

> So, fix the other end of the pipeline too, otherwise this fast network
> stuff is flashly but pointless. If you want to serve up data from disk,
> then start creating PCI cards that have both Serial ATA and ethernet
> connectors on them :) Cut out the middleman of the host CPU and host
> memory bus instead of offloading portions of TCP that do not need to be
> offloaded.

That's a good point. A hierarchical memory structure can help
here. Moving one end closer to the hardware, and letting it
know (e.g. through sendfile) that the other end is also close
(or can be reached more directly than through some hopelessly
crowded main bus) may help too.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-02 22:14:20

by Werner Almesberger

Subject: Re: TOE brain dump

Alan Cox wrote:
> It moves the cost it doesnt make it vanish

I don't think it really can. What it can do is reduce the
overhead (which usually translates to latency and burstiness)
and the sharing.

> If I read you right you are arguing for a second processor running
> Linux.with its own independant memory bus. AMD make those already its
> called AMD64. I don't know anyone thinking at that level about
> partitioning one as an I/O processor.

That's taking this idea to an extreme, yes. I'd think of
using something as big as an amd64 for this as "too
expensive", but perhaps it's cheap enough in the long run,
compared to some "optimized" design.

It would certainly have the advantage of already solving
various consistency and compatibility issues. (That is, if
your host CPUs is/are also amd64.)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-03 04:01:49

by Ben Greear

Subject: Re: TOE brain dump

Jeff Garzik wrote:

> So, fix the other end of the pipeline too, otherwise this fast network
> stuff is flashly but pointless. If you want to serve up data from disk,
> then start creating PCI cards that have both Serial ATA and ethernet
> connectors on them :) Cut out the middleman of the host CPU and host

I for one would love to see something like this, and not just Serial ATA..
but maybe 8x Serial ATA and RAID :)

Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2003-08-03 06:23:10

by Alan Shih

Subject: RE: TOE brain dump

A DMA xfer that fills the NIC pipe with IDE source. That's not very hard...
need a lot of buffering/FIFO though. May require large modification to the
file serving applications?

Alan



2003-08-03 06:40:48

by Jeff Garzik

Subject: Re: TOE brain dump

Werner Almesberger wrote:
> Jeff Garzik wrote:
>
>>jabbering at the same time. TCP is a "one size fits all" solution, but
>>it doesn't work well for everyone.
>
>
> But then, ten "optimized xxPs" that work well in two different
> scenarios each, but not so good in the 98 others, wouldn't be
> much fun either.
>
> It's been tried a number of times. Usually, real life sneaks
> in at one point or another, leaving behind a complex mess.
> When they've sorted out these problems, regular TCP has caught
> up with the great optimized transport protocols. At that point,
> they return to their niche, sometimes tail between legs and
> muttering curses, sometimes shaking their fist and boldly
> proclaiming how badly they'll rub TCP in the dirt in the next
> round. Maybe they shed off some of the complexity, and trade it
> for even more aggressive optimization, which puts them into
> their niche even more firmly. Eventually, they fade away.
>
> There are cases where TCP doesn't work well, like a path of
> badly mismatched link layers, but such paths don't treat any
> protocol following the end-to-end principle kindly.
>
> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)
>
>
>>It's obviously not over a WAN...
>
>
> That's why NFS turned off UDP checksums ;-) As soon as you put
> it on IP, it will crawl to distances you didn't imagine in your
> wildest dreams. It always does.

Really fast, really long pipes in practice don't exist for 99.9% of all
Internet users.


When you approach traffic levels that push you to want to offload most of
the TCP net stack, then TCP isn't the right solution for you anymore,
all things considered.


The Linux net stack just isn't built to be offloaded. TOE engines will
either need to (1) fall back to Linux software for all-but-the-common
case (otherwise netfilter, etc. break), or, (2) will need to be
hideously complex beasts themselves. And I can't see ASIC and firmware
designers being excited about implementing netfilter on a PCI card :)

Unfortunately some vendors seem to be choosing TOE option #3: TCP offload
which introduces many limitations (connection limits, netfilter not
supported, etc.) which Linux never had before. Vendors don't seem to
realize TOE has real potential to damage the "good network neighbor"
image the net stack has. The Linux net stack's behavior is known,
documented, predictable. TOE changes all that.

There is one interesting TOE solution, that I have yet to see created:
run Linux on an embedded processor, on the NIC. This stripped-down
Linux kernel would perform all the header parsing, checksumming, etc.
into the NIC's local RAM. The Linux OS driver interface becomes a
virtual interface with a large MTU, that communicates from host CPU to
NIC across the PCI bus using jumbo-ethernet-like data frames.
Management frames would control the ethernet interface on the other side
of the PCI bus "tunnel".
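
The closest existing analogy, purely to illustrate the "virtual
interface with a large MTU" part (it has nothing to do with any
real TOE hardware, and the interface name and MTU below are
arbitrary), is a tap device with its MTU cranked up:

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <net/if.h>
  #include <linux/if_tun.h>
  #include <unistd.h>

  int make_fat_virtual_if(void)
  {
          struct ifreq ifr;
          int tun = open("/dev/net/tun", O_RDWR);
          int ctl = socket(AF_INET, SOCK_DGRAM, 0);

          if (tun < 0 || ctl < 0)
                  return -1;

          memset(&ifr, 0, sizeof(ifr));
          ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* ethernet-like frames */
          strncpy(ifr.ifr_name, "toe0", IFNAMSIZ);
          if (ioctl(tun, TUNSETIFF, &ifr) < 0)    /* create "toe0" */
                  return -1;

          ifr.ifr_mtu = 65280;    /* jumbo-like MTU; the kernel may refuse */
          if (ioctl(ctl, SIOCSIFMTU, &ifr) < 0)
                  return -1;

          return tun;     /* frames read/written here cross the "tunnel" */
  }

A real driver would do the equivalent on the kernel side, but
the shape it presents to the rest of the stack is the same.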


>>So, fix the other end of the pipeline too, otherwise this fast network
>>stuff is flashly but pointless. If you want to serve up data from disk,
>>then start creating PCI cards that have both Serial ATA and ethernet
>>connectors on them :) Cut out the middleman of the host CPU and host
>>memory bus instead of offloading portions of TCP that do not need to be
>>offloaded.
>
>
> That's a good point. A hierarchical memory structure can help
> here. Moving one end closer to the hardware, and letting it
> know (e.g. through sendfile) that also the other end is close
> (or can be reached more directly that through some hopelessly
> crowded main bus) may help too.

Definitely.

Jeff



2003-08-03 06:41:49

by Jeff Garzik

Subject: Re: TOE brain dump

Alan Shih wrote:
> A DMA xfer that fills the NIC pipe with IDE source. That's not very hard...
> need a lot of bufferring/FIFO though. May require large modification to the
> file serving applications?


Nope, that's using the existing sendfile(2) facility.
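
For reference, the whole "fill the NIC pipe from disk" loop is
just this (error handling trimmed); the kernel moves the data
from the page cache to the socket without it ever visiting user
space:

  #include <sys/sendfile.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  int serve_file(int sock, const char *path)
  {
          struct stat st;
          off_t off = 0;
          int fd = open(path, O_RDONLY);

          if (fd < 0 || fstat(fd, &st) < 0)
                  return -1;

          while (off < st.st_size) {
                  ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
                  if (n <= 0)
                          break;          /* error or peer went away */
          }
          close(fd);
          return off == st.st_size ? 0 : -1;
  }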

Jeff



2003-08-03 08:27:37

by David Lang

Subject: RE: TOE brain dump

do you really want the processor on the card to be running
apache/NFS/Samba/etc ?

putting enough linux on the card to act as a router (which would include
the netfilter stuff) is one thing. putting the userspace code that
interfaces with the outside world for file transfers is something else.

if you really want the disk connected to your network card you are just
talking a low-end linux box. forget all this stuff about it being on a
card and just use a full box (economys of scale will make this cheaper)

making a firewall that's a core system with a dozen slave systems attached
to it (the network cards) sounds like the type of clustering that Linux
has been used for for compute nodes. complicated to setup, but extremely
powerful and scalable once configured.

if you want more than a router on the card then Alan Cox is right, just
add another processor to the system, it's easier and cheaper.

David Lang



2003-08-03 12:12:59

by Ihar 'Philips' Filipau

Subject: Re: TOE brain dump

Werner Almesberger wrote:
>
> - instead of putting a different stack on the TOE, a
> general-purpose processor (probably with some enhancements,
> and certainly with optimized data paths) is added to the NIC
>

Modern NPUs generally do this.
You need to have something like this to handle e.g. routing of GE
traffic.

Check for example:
http://www.vitesse.com/products/categories.cfm?family_id=5&category_id=16

2003-08-03 18:05:58

by Werner Almesberger

Subject: Re: TOE brain dump

David Lang wrote:
> do you really want the processor on the card to be tunning
> apache/NFS/Samba/etc ?

If it runs a Linux kernel, that's not a problem. Whether you
actually want to do this or not, becomes an entirely separate
issue.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-03 17:57:59

by Werner Almesberger

Subject: Re: TOE brain dump

Jeff Garzik wrote:
> Really fast, really long pipes in practice don't exist for 99.9% of all
> Internet users.

It matters to some right now, i.e. the ones who are interested
in TOE in the first place. (And there are also those who try to
tweak TCP to actually work over such links. Right now, its
congestion control doesn't scale that well.) Also, IT has been
good at making all that elitist high-performance gear
available to the common people rather quickly, and I don't see
that changing. The Crisis just alters the pace a little.

> When you approach traffic levels that push you want to offload most of
> the TCP net stack, then TCP isn't the right solution for you anymore,
> all things considered.

No. Ironically, TCP is almost always the right solution.
Sometimes people try to use something else. Eventually, their
protocol wants to go over WANs or something that looks
suspiciously like a WAN (MAN or such). At that point, they
usually realize that TCP provides exactly the functionality
they need.

In theory, one could implement the same functionality in other
protocols. There was even talk at IETF to support a generic
congestion control manager for this purpose. That was many
years ago, and I haven't seen anything come out of this.

So it seems that, by the time your protocol grows up to want
to play in the real world, it wants to be so much like TCP
that you're better off using TCP.

The amusing bit here is to watch all the "competitors" pop
up, grow, fail, and eventually die.

> The Linux net stack just isn't built to be offloaded.

Yes ! And that's not a flaw of the stack, but it's simply a
fact of life. I think that no "real life" stack can be
offloaded (in the traditional sense).

> And I can't see ASIC and firmware
> designers being excited about implementing netfilter on a PCI card :)

And when they're done with netfilter, you can throw IPsec,
IPv6, or traffic control at them. Eventually, you'll wear
them down ;-)

> Unfortunately some vendors seem to choosing TOE option #3: TCP offload
> which introduces many limitations (connection limits, netfilter not
> supported, etc.) which Linux never had before.

That's when that little word "no" comes into play, i.e.
when their modifications to the stack show up on netdev
or linux-kernel. Dave Miller seems to be pretty good at
saying "no". I hope he keeps on being good at this ;-)

> There is one interesting TOE solution, that I have yet to see created:
> run Linux on an embedded processor, on the NIC.

That's basically what I've been talking about all the
while :-)

> The Linux OS driver interface becomes a virtual interface
> with a large MTU,

Probably not. I think you also want to push some
knowledge of where the data ultimately goes to the NIC.
This could be something like sendfile, something new, or
just a few bytes of user space code.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-03 18:10:16

by Werner Almesberger

Subject: Re: TOE brain dump

Ihar 'Philips' Filipau wrote:
> Modern NPUs generally do this.

Unfortunately, they don't - they run *some* code, but that
is rarely a Linux kernel, or a substantial part of it.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-03 18:28:01

by Erik Andersen

Subject: Re: TOE brain dump

On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > There is one interesting TOE solution, that I have yet to see created:
> > run Linux on an embedded processor, on the NIC.
>
> That's basically what I've been talking about all the
> while :-)

http://www.snapgear.com/pci630.html

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2003-08-03 19:24:36

by Eric W. Biederman

Subject: Re: TOE brain dump

Werner Almesberger <[email protected]> writes:

> Jeff Garzik wrote:
> > jabbering at the same time. TCP is a "one size fits all" solution, but
> > it doesn't work well for everyone.
>
> But then, ten "optimized xxPs" that work well in two different
> scenarios each, but not so good in the 98 others, wouldn't be
> much fun either.

The optimized-for-low-latency cases seem to have a strong
market in clusters. And they are currently keeping alive
quite a few technologies: Myrinet, Infiniband, Quadrics' Elan, etc.
Having low latency and switch technologies that scale is quite
rare currently.

> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)

Does anyone know which knobs to turn to make TCP go fast over
Infiniband. (A low latency high bandwidth network?) I get to
deal with them on a regular basis...

There is one place in low latency communications that I can think
of where TCP/IP is not the proper solution. For low latency
communication the checksum is at the wrong end of the packet.
IB gets this one correct and places the checksum at the tail end of
the packet. This allows the packet to start transmitting before
the checksum is computed, possibly even having the receive start
at the other end before the tail of the packet is transmitted.

Would it make any sense to do a low latency variation on TCP that
fixes that problem? For the IP header we are fine as the data
precedes the checksum. But the problem appears to affect all
of the upper level protocols that ride on IP, UDP, TCP, SCTP...
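
To make the cut-through point concrete, here is a toy sketch;
the frame layout, emit() and the additive checksum are all
invented, it only shows the shape of the two cases:

  #include <stdint.h>
  #include <stddef.h>

  extern void emit(const void *buf, size_t len);  /* "put bytes on the wire" */

  /* trailer style (IB-like): stream the payload as it arrives,
   * accumulate the checksum on the fly, append it last */
  void send_trailer_style(const uint8_t *data, size_t len)
  {
          uint32_t sum = 0;
          size_t i;

          for (i = 0; i < len; i++) {
                  sum += data[i];
                  emit(&data[i], 1);      /* byte can leave immediately */
          }
          emit(&sum, sizeof(sum));        /* checksum trails the payload */
  }

  /* header style (TCP/UDP-like): the whole payload has to be
   * scanned before the first byte of the packet can go out */
  void send_header_style(const uint8_t *data, size_t len)
  {
          uint32_t sum = 0;
          size_t i;

          for (i = 0; i < len; i++)
                  sum += data[i];         /* full pass first... */
          emit(&sum, sizeof(sum));        /* ...checksum leads */
          emit(data, len);
  }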

> > So, fix the other end of the pipeline too, otherwise this fast network
> > stuff is flashly but pointless. If you want to serve up data from disk,
> > then start creating PCI cards that have both Serial ATA and ethernet
> > connectors on them :) Cut out the middleman of the host CPU and host
> > memory bus instead of offloading portions of TCP that do not need to be
> > offloaded.
>
> That's a good point. A hierarchical memory structure can help
> here. Moving one end closer to the hardware, and letting it
> know (e.g. through sendfile) that also the other end is close
> (or can be reached more directly that through some hopelessly
> crowded main bus) may help too.

On that score it is worth noting that the next generation of
peripheral busses (Hypertransport, PCI Express, etc) are all switched.
Which means that device to device communication may be more
reasonable. Going from a bussed interconnect to a switched
interconnect is certainly a dramatic change in infrastructure. How
that will affect the tradeoffs I don't know.

Eric

2003-08-03 19:40:37

by Larry McVoy

Subject: Re: TOE brain dump

On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote:
> On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > > There is one interesting TOE solution, that I have yet to see created:
> > > run Linux on an embedded processor, on the NIC.
> >
> > That's basically what I've been talking about all the
> > while :-)
>
> http://www.snapgear.com/pci630.html

ipcop plus a new PC for $200 is way higher performance and does more.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-03 20:15:12

by David Lang

Subject: Re: TOE brain dump

On Sun, 3 Aug 2003, Larry McVoy wrote:

> On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote:
> > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > > > There is one interesting TOE solution, that I have yet to see created:
> > > > run Linux on an embedded processor, on the NIC.
> > >
> > > That's basically what I've been talking about all the
> > > while :-)
> >
> > http://www.snapgear.com/pci630.html
>
> ipcop plus a new PC for $200 is way higher performance and does more.

however I can see situations where you would put multiple cards in one box
and there could be an advantage to using PCI (or PCI-X) for your
communications between the different 'nodes' of your routing cluster
instead of gig ethernet.

if this is the approach that the networking guys really want to encourage,
how about defining an API that you would be willing to support? you can
even implement it, and then any card that is produced would be supported
from day 1.

this interface would not have to cover the configuration of the card (that
can be done with userspace tools that talk to the card over the 'network');
it just needs to cover the ability to do what is effectively IP over PCI.
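
to make that concrete, something like the following invented
descriptor (none of these names exist anywhere), passed through
a ring in shared memory plus a doorbell, is about all the host
side of such an API would need:

  #include <stdint.h>

  struct ipp_desc {                /* one frame crossing the PCI bus */
          uint64_t dma_addr;       /* host-physical address of the frame */
          uint32_t len;            /* frame length in bytes */
          uint16_t dir;            /* 0 = to card, 1 = from card */
          uint16_t flags;          /* e.g. "checksum already verified" */
  };

  struct ipp_ring {
          struct ipp_desc desc[256];
          uint32_t head;           /* producer index (host) */
          uint32_t tail;           /* consumer index (card) */
  };

everything else (routes, firewall rules, addresses) stays out of
band, handled by the userspace tools talking to the card over
the 'network' as above.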

Linus has commented that in many ways Linux is not designed for any
existing CPU, it's designed for a virtual CPU that implements all the
features we want, and those features that aren't implemented in the chips
get emulated as needed (obviously what is actually implemented and the
speed of emulation are serious considerations for performance). why doesn't
the network team define what they think the ideal NIC interface would be?
I can see three categories of 'ideal' cards:

1. cards that are directly driven by the kernel IP stack (similar to what
we support now, but an ideal version)

2. router nodes that have access to main memory (PCI card running linux
acting as a router/firewall/VPN to offload the main CPU's)

3. router nodes that don't have access to main memory (things like
USB/fibrechannel/infiniband/etc versions of #2, the node can run linux and
deal with the outside world, only sending the data that is needed to/from
the host)

even if nobody makes hardware that supports all the desired features
directly, having a 'this is the ideal driver' reference should improve
future drivers by letting them use this as the core and implementing code
to simulate the features not in hardware.

they claim they need this sort of performance, you say 'not that way, do it
sanely', so why not give them a sane way to do it?

David Lang

2003-08-03 20:31:07

by Larry McVoy

Subject: Re: TOE brain dump

On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote:
> 2. router nodes that have access to main memory (PCI card running linux
> acting as a router/firewall/VPN to offload the main CPU's)

I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
cdrom, floppy, onboard enet, extra net card for routing, for $250 or less,
quantity 1, shipped to my door.

Why would I want to spend money on some silly offload card when I can get
the whole PC for less than the card?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-03 20:35:43

by jamal

Subject: Re: TOE brain dump

On Sun, 2003-08-03 at 15:40, Larry McVoy wrote:
> On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote:
> > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > > > There is one interesting TOE solution, that I have yet to see created:
> > > > run Linux on an embedded processor, on the NIC.
> > >
> > > That's basically what I've been talking about all the
> > > while :-)
> >
> > http://www.snapgear.com/pci630.html
>
> ipcop plus a new PC for $200 is way higher performance and does more.

;-> Actually this proves that putting the whole stack on the NIC is the
wrong way to go ;-> That poor piece of NIC was obsoleted before it was
born, on pricing alone, not just the compute power it was supposed to
liberate us from.

I think the idea of hierarchical memories and computation is certainly
interesting. Put a CPU and memory on the NIC, but not to do the work that
Linux already does. Instead think of the NIC and its memory + CPU as an
L1 data and code cache for TCP processing. The idea posed by Davem is
intriguing:
The only thing the NIC should do is TCP fast path processing based on
cached control data generated from the main CPU stack. Any time the fast
path gets violated, the cache gets invalidated and it becomes an
exception to be handled by the main CPU stack.

IMO, the only time this will make sense is when the setup cost
(downloading the cache or cookies as Dave calls them) is amortized by
the data that follows. For example, it may not make sense to worry about an
HTTP 1.0 flow which has 3-4 packets after the SYN-ACK. Bulk transfers make
sense (storage, file serving). I don't remember the Mogul paper details
but I think this is what he was implying.
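
Roughly, the control flow on the NIC would look like this (all
the types and helpers are made up; the real "cookie" would be
whatever state the host stack decides to push down):

  struct tcp_cookie;      /* per-flow state pushed down by the host */
  struct pkt;

  extern struct tcp_cookie *cookie_lookup(const struct pkt *p);
  extern int is_fast_path(const struct tcp_cookie *c, const struct pkt *p);
  extern void nic_tcp_rx(struct tcp_cookie *c, struct pkt *p);  /* on-NIC */
  extern void cookie_invalidate(struct tcp_cookie *c);
  extern void punt_to_host(struct pkt *p);                      /* slow path */

  void nic_rx(struct pkt *p)
  {
          struct tcp_cookie *c = cookie_lookup(p);

          /* in-order segment of a flow the host delegated to us:
           * handle it entirely on the NIC */
          if (c && is_fast_path(c, p)) {
                  nic_tcp_rx(c, p);
                  return;
          }

          /* anything unexpected (out of order, RST, unknown flow,
           * strange option): drop the cached state and let the full
           * stack on the main CPU sort it out */
          if (c)
                  cookie_invalidate(c);
          punt_to_host(p);
  }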

cheers,
jamal

2003-08-03 20:55:32

by Alan

Subject: Re: TOE brain dump

On Sad, 2003-08-02 at 23:14, Werner Almesberger wrote:
> That's taking this idea to an extreme, yes. I'd think of
> using something as big as an amd64 for this as "too
> expensive", but perhaps it's cheap enough in the long run,
> compared to some "optimized" design.

Volume makes cheap. If you look at software v hardware raid controllers
the hardware people are permanently being killed by the low volume of
cards.

2003-08-03 20:58:23

by Alan

Subject: Re: TOE brain dump

On Sul, 2003-08-03 at 05:01, Ben Greear wrote:
> Jeff Garzik wrote:
>
> > So, fix the other end of the pipeline too, otherwise this fast network
> > stuff is flashly but pointless. If you want to serve up data from disk,
> > then start creating PCI cards that have both Serial ATA and ethernet
> > connectors on them :) Cut out the middleman of the host CPU and host
>
> I for one would love to see something like this, and not just Serial ATA..
> but maybe 8x Serial ATA and RAID :)

There is a protocol floating around for ATA over ethernet, no TCP layer
or nasty latency eating complexities in the middle

2003-08-03 21:23:09

by David Lang

Subject: Re: TOE brain dump

On Sun, 3 Aug 2003, Larry McVoy wrote:

> On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote:
> > 2. router nodes that have access to main memory (PCI card running linux
> > acting as a router/firewall/VPN to offload the main CPU's)
>
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
>
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?

you may want to do this for a database box where you want to dedicate your
main processing power to the database task; if you use a separate box you
still have to talk to that box over a network, while if you have it as a card
you can talk to the card much more efficiently than you can talk to the
separate machine.

if your 8-way opteron database box is already the bottleneck for your
system you will have to spend a LOT of money to get anything that gives
you more available processing power, so getting a card to offload any
processing from the main CPUs can be a win.

yes this is somewhat of a niche market, but as you point out adding more
and more processors in a SMP model is not the ideal way to go, either from
performance or from the cost point of view.

on the webserver front there are a lot of companies making a living by
selling cards and boxes to offload processing from the main CPUs of the
webservers (cards to do gzip compression are a relatively new addition, but
cards to do SSL handshakes have been around for a while). used properly
these can be a very worthwhile investment for high-volume webserver
companies.

also the cost of an extra box can be considerably higher than just the cost
of the hardware.

I know of one situation where between Linux OS license fees (redhat
advanced server) and security software (intrusion detection, auditing,
privilege management, etc) a company is looking at ~$4000 in licensing
fees for every box they put in their datacenter (and this is for boxes
just running apache; add something like an oracle or J2EE appserver
software and the cost goes up even more). at this point the fact that the
box only cost $200 doesn't really matter; spending an extra $500 each on 4
boxes to eliminate the need for a 5th is easily worth it. (and this
company is re-examining hardware raid controllers after having run
software raid for years because they are realizing that this is requiring
them to run more servers due to the load on the CPUs)

at the low end you are right, just add another box or add another CPU to
an existing box, but there are conditions that make adding specialized
cards to offload specific functionality a win (for that matter, even at
the low end people routinely offload graphics processing to specialized
cards, simply to make their games run faster)

David Lang

2003-08-03 21:58:16

by Jeff Garzik

Subject: Re: TOE brain dump

Larry McVoy wrote:
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
>
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?


Yep. I think we are entering the era of what I call RAIC (pronounced
"rake") -- redundant array of inexpensive computers. For organizations
that can handle the space/power/temperature load, a powerful cluster of
supercheap PCs, the "Wal-Mart Supercomputer", can be built for a
rock-bottom price.

2003-08-03 22:02:28

by Alan Shih

Subject: RE: TOE brain dump

On an embedded system, no processor will be fast enough to compete with
direct DMA xfer. So just provide sendfile hooks that allow the kernel to
initiate data filling from source to dest then allow TSO to take place.
Kernel still needs to take care of the TCP stack.

I don't see this as building extensive customization though.

Alan


2003-08-03 23:44:38

by Larry McVoy

Subject: Re: TOE brain dump

On Sun, Aug 03, 2003 at 02:21:12PM -0700, David Lang wrote:
> if your 8-way opteron database box is already the bottleneck for your
> system you will have to spend a LOT of money to get anything that gives
> you more available processing power, getting a card to offload any
> processing from the main CPU's can be a win.

I'd like to see data which supports this. CPUs have gotten so fast and
disk I/O still sucks. All the systems I've seen are CPU rich and I/O
starved. The smartest thing you could do would be to get a cheap box
with a GB of ram as a disk cache and make it be a SAN device. Make
N of those and you have tons of disk space and tons of cache and your
8 way opteron database box is going to work just fine.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-04 01:47:38

by Glen Turner

Subject: Re: TOE brain dump


> Really fast, really long pipes in practice don't exist for 99.9% of all
> Internet users.

Writing from Australia, I think you're out by at least
one order of magnitude and probably two. That is, I'd
expect about 10% of the net to be on long fast pipes.

Here every worthwhile fast pipe is a long fast pipe. 90% of
Australia's net traffic goes to the West Coast of the USA,
that's 14,000Km away.

Australia accounts for about 10% of current net traffic. About
30% of Australia's net traffic is from AARNet, typically
100Base-TX hosts.

So you're out by about an order of magnitude, just accounting
for one ISP in one small country. I'll leave the calculations
for the academic networks of China to others.

> There is one interesting TOE solution, that I have yet to see created:
> run Linux on an embedded processor, on the NIC. This stripped-down
> Linux kernel would perform all the header parsing, checksumming, etc.
> into the NIC's local RAM. The Linux OS driver interface becomes a
> virtual interface with a large MTU, that communicates from host CPU to
> NIC across the PCI bus using jumbo-ethernet-like data frames. Management
> frames would control the ethernet interface on the other side of the PCI
> bus "tunnel".

This assumes the offload processor is at least 100x faster at
processing the IP frames than the kernel. There is silicon where
that is true (eg, network processors), but good GCC support for
that silicon is unlikely (as good GCC support for popular silicon
is somewhat lacking).

Someone else wrote:
> It's been tried a number of times. Usually, real life sneaks
> in at one point or another, leaving behind a complex mess.
> When they've sorted out these problems, regular TCP has caught
> up with the great optimized transport protocols. At that point,
> they return to their niche, sometimes tail between legs and
> muttering curses, sometimes shaking their fist and boldly
> proclaiming how badly they'll rub TCP in the dirt in the next
> round. Maybe they shed off some of the complexity, and trade it
> for even more aggressive optimization, which puts them into
> their niche even more firmly. Eventually, they fade away.

This ignores the push-back of platform support onto protocol
design. The IETF iSCSI WG discussed using transport protocols
which allow out-of-order delivery of SCSI blocks, rather than
the head-of-queue blocking that happens using TCP, but it
was felt that iSCSI would never gain vendor support unless
it ran over TCP.

> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)

That's Matt Mathis's phrase. The Web100 project
<http://www.web100.org/> has a set of patches to the kernel
which go a long way to reducing the wizard gap. It would be
nice to see those patches eventually appear in the Linux
mainstream.

It's disturbing to see patches with a similar purpose (such
as those instrumenting UDP) being knocked back on grounds
of slowing the TCP/IP path. Which is a wonderful example
of suboptimisation.

> That's why NFS turned off UDP checksums ;-) As soon as you put
> it on IP, it will crawl to distances you didn't imagine in your
> wildest dreams. It always does.

I'll note that Sun turned UDP checksumming back on. Not
only is disk corruption forever, but Sun servers running
DNS servers were notorious for not checksumming DNS responses,
having the nasty effect of poisoning DNS caches.

The NANOG mailing list (a list of US ISP network engineers)
cooperated in finding all of these and getting those Classic
SunOS kernels patched to activate checksumming. We couldn't
do that nowadays; the net is just so much bigger.

Do the net a favour, don't stuff with UDP checksumming.
RFC1122 (Host Requirements) states that checksumming
MUST be on by default and that hosts MAY allow checksumming
to be turned off per *program* (ie, not across the entire
box). That requirement is born of bitter experience with
Classic SunOS's "no checksumming across the entire box by
default".

--
Glen Turner Tel: (08) 8303 3936 or +61 8 8303 3936
Network Engineer Email: [email protected]
Australian Academic & Research Network http://www.aarnet.edu.au
--
linux.conf.au 2004, Adelaide lca2004.linux.org.au
Main conference 14-17 January 2004 Miniconfs from 12 Jan

2003-08-04 03:49:05

by Larry McVoy

Subject: Re: TOE brain dump

On Mon, Aug 04, 2003 at 11:17:23AM +0930, Glen Turner wrote:
> >Really fast, really long pipes in practice don't exist for 99.9% of all
> >Internet users.
>
> Here every worthwhile fast pipe is a long fast pipe. 90% of
> Australia's net traffic goes to the West Coast of the USA,
> that's 14,000Km away.

I couldn't tell from your posting if you were arguing for an offload or not.
If you are, and you are using these stats as a reason, I'd like to know
the absolute numbers, router to router, that we are talking about. I have
a feeling that a few cheap PCs could handle all the load, but I'm willing
to be educated.

Even if I'm way off, a pair of routers between Australia and the US is
hardly a reason to muck about in the TCP stack. If we were talking
about millions of routers, well sure, that makes sense.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-04 08:54:52

by Ihar 'Philips' Filipau

Subject: Re: TOE brain dump

Werner Almesberger wrote:
> Ihar 'Philips' Filipau wrote:
>
>> Modern NPUs generally do this.
>
>
> Unfortunately, they don't - they run *some* code, but that
> is rarely a Linux kernel, or a substantial part of it.
>

The embedded CPU we are using is MIPS-based, and has a lot of specialized
instructions.
It does not make much sense to run a kernel (especially Linux) on a CPU
which is optimized for handling network packets (and which actually has
several co-processors to help with this task).
How much sense does it make to run a general-purpose OS (optimized for PCs
and servers) on a device which performs only a couple of functions? (and has
no MMU, btw)

That is the whole idea behind this kind of CPU - to do a few
functions, but to do them well.

If you start stretching CPUs like this to fit the Linux kernel, it
will generally just increase the price. Probably there are some markets
which can afford this.

Remember - "Small is beautiful" (c) - and the Linux kernel is far from it.
Our routing code which handles two GE interfaces (actually not pure
GE, but up to 2.5GB) fits into 3k. 3k of code - and that's it, not 650kb
of bzip-compressed bloat. And it handles two interfaces, handles the fast
data path from sibling interfaces, handles up to 1E6 routes. 3k of code,
not 650k of bzip.

2003-08-04 13:09:44

by Jesse Pollard

Subject: Re: TOE brain dump

On Monday 04 August 2003 03:55, Ihar 'Philips' Filipau wrote:
> Werner Almesberger wrote:
> > Ihar 'Philips' Filipau wrote:
> >> Modern NPUs generally do this.
> >
> > Unfortunately, they don't - they run *some* code, but that
> > is rarely a Linux kernel, or a substantial part of it.
>
> Embedded CPU we are using is based MIPS, and has a lot of specialized
> instructions.
> It makes not that much sense to run kernel (especially Linux) on CPU
> which is optimized for handling of network packets. (And has actually
> several co-processors to help in this task).
> How much sense it makes to run general purpose OS (optimized for PCs
> and servers) on devices which can make only couple of functions? (and no
> MMU btw)
> It is a whole idea behind this kind of CPUs - to do a few of
> functions - but to do them good.
>
> If you will start stretching CPUs like this to fit Linux kernel - it
> will generally just increase price. Probably there are some markets
> which can afford this.
>
> Remeber - "Small is beatiful" (c) - and linux kernel far from it.
> Our routing code which handles two GE interfaces (actually not pure
> GE, but up to 2.5GB) fits into 3k. 3k of code - and that's it. not 650kb
> of bzip compressed bloat. And it handles two interfaces, handles fast
> data path from siblign interfaces, handles up to 1E6 routes. 3k of code.
> not 650k of bzip.

And it handles ipfilter?
and LSM security hooks?
how about IPSec?
and IPv6?

I don't think so.

2003-08-04 14:15:07

by Ihar 'Philips' Filipau

[permalink] [raw]
Subject: Re: TOE brain dump

Jesse Pollard wrote:
>>3k of code.
>>not 650k of bzip.
>
> And it handles ipfilter?
> and LSM security hooks?
> how about IPSec?
> and IPv6?
>
> I don't think so.

The answer is "No".

I'm running an expensive workstation - and I'm _NOT_ using
LSM/IPSec/IPv6. I do not care what I _*can*_ do - I care about what I
_*need*_ to do.
The point here is that 3k of code is all we need. Not 'what everyone
needs', not the Linux kernel.

P.S.
printk() is absolutely redundant since there is no display at all ;-)
And can you imagine Linux without printk, bug_on & panic?-)))

2003-08-04 14:59:42

by Jesse Pollard

[permalink] [raw]
Subject: Re: TOE brain dump

On Monday 04 August 2003 09:15, Ihar 'Philips' Filipau wrote:
> Jesse Pollard wrote:
> >>3k of code.
> >>not 650k of bzip.
> >
> > And it handles ipfilter?
> > and LSM security hooks?
> > how about IPSec?
> > and IPv6?
> >
> > I don't think so.
>
> The answer is "No".
>
> I'm running an expensive workstation - and I'm _NOT_ using
> LSM/IPSec/IPv6. I do not care what I _*can*_ do - I care about what I
> _*need*_ to do.
> The point here is that 3k of code is all we need. Not 'what everyone
> needs', not the Linux kernel.

I'm on a workstation right now that needs IPv6 sometime in the next few
months. There have been several instances where IPSec would have resolved
internal problems (it's not easily available for Solaris yet.. soon).

So why should I buy another interface every time I need to change networks?

And who said it was a workstation target? If you are going to offload TCP/IP
in a TOE, it should be where it might be useful - large (and saturated)
compute servers, file servers. Not workstations. High bandwidth workstation
requirements are rare. And large servers will require IPSec eventually
(personally, I think it should be required already). And if the server
requires IPSec, then the workstation will too.

So you have programmed your way into a small market. And a likely shrinking
one at that.

>
> P.S.
> printk() is absolutely redundant since there is no display at all ;-)
> And can you imagine Linux without printk, bug_on & panic?-)))

So? It's called "embedded Linux". No MM, no printk (for production anyway).
Display not required.

2003-08-04 15:51:01

by Ihar 'Philips' Filipau

[permalink] [raw]
Subject: Re: TOE brain dump

Jesse Pollard wrote:
>
> And who said it was a workstation target? If you are going to offload TCP/IP
> in a TOE, it should be where it might be useful - large (and saturated)
> compute servers, file servers. Not workstations. High bandwidth workstation
> requirements are rare. And large servers will require IPSec eventually
> (personally, I think it should be required already). And if the server
> requires IPSec, then the workstation will too.
>

I gave my personal WS just as an example that not everyone
needs all the features, even when they have the capacity.

TOE for IPsec/IPv6/iptables/routing? Take a look at http://www.cisco.com,
these guys are doing exactly this with IOS. And then take a look at the prices.
They are not cheap.

Take a look at the prices of SMP/AMP systems. Multi-threaded software,
like Oracle or Sybase for example, which can fully utilize SMP/AMP
resources?
They are not cheap either.

If you try to make a piece of hardware to put Linux on, you will
simultaneously get the headaches of both the TOE designers/programmers and
the SMP/AMP designers/programmers.
This is not going to be simple or cheap.

But you are encouraged to try ;-)

> Not workstations. High bandwidth workstation
> requirements are rare. And large servers will require IPSec eventually

Rare? IMHO servers are rare too. Compare the number of PCs/devices
with the number of servers: 1000s to 1.
And the devices are all different, and for most of them plain IPv4 is more
than enough, since most of them don't even have the capacity to handle 10Mb
Ethernet. I'm not talking about IPv4/IPsec/LSM. So servers for servers?
Security for security?


I wanted to make a simple point: every piece should do a few
things, but should do them well. [1]
Putting an OS kernel into the device doesn't make much sense if you
can achieve the same with simple 3k firmware.


[1] "The Unix Philosophy in One Lesson",
http://www.catb.org/~esr/writings/taoup/html/ch01s07.html

P.S. Gone to offtopic. Sorry. Leaving.

2003-08-04 16:10:35

by Ingo Oeser

[permalink] [raw]
Subject: Re: TOE brain dump

Hi Jeff,

On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote:
> So, fix the other end of the pipeline too, otherwise this fast network
> stuff is flashy but pointless. If you want to serve up data from disk,
> then start creating PCI cards that have both Serial ATA and ethernet
> connectors on them :) Cut out the middleman of the host CPU and host
> memory bus instead of offloading portions of TCP that do not need to be
> offloaded.

Exactly what I suggested: sys_ioroute()

"Providing generic pipelines and io routing as Linux service"
Msg-ID: <[email protected]>

on linux-kernel and linux-fsdevel

Be my guest.

I know that you mean doing it in hardware, but you cannot
accelerate something which the kernel doesn't do ;-)

Regards

Ingo Oeser

2003-08-04 17:21:48

by Alan Shih

[permalink] [raw]
Subject: RE: TOE brain dump

Hmm,

So would the main processor still need a copy of the data for re-transmission?
Won't that defeat the purpose?

Alan

-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of Ingo Oeser
Sent: Monday, August 04, 2003 7:36 AM
To: Jeff Garzik
Cc: Nivedita Singhvi; Werner Almesberger; [email protected];
[email protected]
Subject: Re: TOE brain dump


Hi Jeff,

On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote:
> So, fix the other end of the pipeline too, otherwise this fast network
> stuff is flashy but pointless. If you want to serve up data from disk,
> then start creating PCI cards that have both Serial ATA and ethernet
> connectors on them :) Cut out the middleman of the host CPU and host
> memory bus instead of offloading portions of TCP that do not need to be
> offloaded.

Exactly what I suggested: sys_ioroute()

"Providing generic pipelines and io routing as Linux service"
Msg-ID: <[email protected]>

on linux-kernel and linux-fsdevel

Be my guest.

I know that you mean doing it in hardware, but you cannot
accelerate something which the kernel doesn't do ;-)

Regards

Ingo Oeser

2003-08-04 18:36:24

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: RE: TOE brain dump


> From: Larry McVoy [mailto:[email protected]]
>
> > 2. router nodes that have access to main memory (PCI card running linux
> > acting as a router/firewall/VPN to offload the main CPU's)
>
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
>
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?

Because you want to stack 200 of those together in a huge
data center interconnecting whatever you want to interconnect
and you don't want your maintenance costs to go up to the sky?

I see your point, though :)

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

2003-08-04 19:07:57

by Alan

[permalink] [raw]
Subject: RE: TOE brain dump

On Llu, 2003-08-04 at 19:36, Perez-Gonzalez, Inaky wrote:
> > Why would I want to spend money on some silly offload card when I can get
> > the whole PC for less than the card?
>
> Because you want to stack 200 of those together in a huge
> data center interconnecting whatever you want to interconnect
> and you don't want your maintenance costs to go up to the sky?

17cm square, fanless, network booting. It's not as big a cost as
you might think, and TOE cards fail too, the difference being that if
they are out of production by then you have a nasty mess on your hands.

2003-08-04 19:24:46

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

Eric W. Biederman wrote:
> The optimized for low latency cases seem to have a strong
> market in clusters.

Clusters have captive, no, _desperate_ customers ;-) And it
seems that people are just as happy putting MPI as their
transport on top of all those link-layer technologies.

> There is one place in low latency communications that I can think
> of where TCP/IP is not the proper solution. For low latency
> communication the checksum is at the wrong end of the packet.

That's one of the few things ATM's AAL5 got right. But in the end,
I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
you may well treat it as an atomic unit.
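
As a rough sanity check on those numbers, a minimal sketch in C; it
assumes a 1500-byte MTU and ignores preamble, Ethernet header and
inter-frame gap, so it lands slightly below the figures above:

/* Serialization time of an MTU-sized packet on the wire. */
#include <stdio.h>

int main(void)
{
        const double mtu_bits = 1500.0 * 8.0;   /* 1500-byte payload */
        const double rates_bps[] = { 1e9, 10e9 };
        int i;

        for (i = 0; i < 2; i++)
                printf("%2.0f Gbps: %4.1f us per packet\n",
                       rates_bps[i] / 1e9, mtu_bits / rates_bps[i] * 1e6);
        return 0;
}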

> On that score it is worth noting that the next generation of
> peripheral busses (Hypertransport, PCI Express, etc) are all switched.

And it's about time for that :-)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-04 19:33:09

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

Ihar 'Philips' Filipau wrote:
> It does not make much sense to run a kernel (especially Linux) on a CPU
> which is optimized for handling network packets (and which actually has
> several co-processors to help with this task).

All you need to do is to make the CPU capable of running the kernel
(well, some of it), but it doesn't have to be particularly good at
running anything but the TCP/IP code. And you can still benefit
from most of the features of NPUs, such as a specialized memory
architecture, parallel data paths, accelerated operations, etc.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-04 19:31:15

by David Miller

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, 4 Aug 2003 16:24:33 -0300
Werner Almesberger <[email protected]> wrote:

> Eric W. Biederman wrote:
> > There is one place in low latency communications that I can think
> > of where TCP/IP is not the proper solution. For low latency
> > communication the checksum is at the wrong end of the packet.
>
> That's one of the few things ATM's AAL5 got right.

Let's recall how long the IFF_TRAILERS hack from BSD lasted :-)

> But in the end, I think it doesn't really matter.

I tend to agree on this one.

And on the transmit side if you have more than 1 pending TX frame, you
can always be prefetching the next one into the fifo so that by the
time the medium is ready all the checksum bits have been done.

In fact I'd be surprised if current generation 1g/10g cards are not
doing something like this.

2003-08-04 19:50:11

by David Lang

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, 4 Aug 2003, Werner Almesberger wrote:

> Ihar 'Philips' Filipau wrote:
> > It does not make much sense to run a kernel (especially Linux) on a CPU
> > which is optimized for handling network packets (and which actually has
> > several co-processors to help with this task).
>
> All you need to do is to make the CPU capable of running the kernel
> (well, some of it), but it doesn't have to be particularly good at
> running anything but the TCP/IP code. And you can still benefit
> from most of the features of NPUs, such as a specialized memory
> architecture, parallel data paths, accelerated operations, etc.

also, how many of the standard kernel features could you turn off?
do you really need filesystems, for example?
could userspace be eliminated? (if you have some way to give the config
commands to the kernel on the NIC and get the log messages back to the
main kernel, what else do you need?)
a lot of the other IO buffer stuff can be trimmed back (as per
CONFIG_EMBEDDED)

what else could be done to use the kernel features that are wanted without
bringing extra baggage along?

David Lang

2003-08-04 19:56:55

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

David Lang wrote:
> also how many of the standard kernel features could you turn off?

You don't turn them off - you just don't run them. What I'm
suggesting is not a separate system that runs a stripped-down
Linux kernel, but rather a device that looks like another
node in a NUMA system.

There might be a point in completely excluding subsystems
that will never be used on that NIC anyway, but that's already
an optimization.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-04 20:04:58

by David Lang

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, 4 Aug 2003, Werner Almesberger wrote:

> David Lang wrote:
> > also how many of the standard kernel features could you turn off?
>
> You don't turn them off - you just don't run them. What I'm
> suggesting is not a separate system that runs a stripped-down
> Linux kernel, but rather a device that looks like another
> node in a NUMA system.
>
> There might be a point in completely excluding subsystems
> that will never be used on that NIC anyway, but that's already
> an optimization.

I would think that it's much more difficult to run NUMA across different
types of CPUs than it would be to run a separate kernel on the NIC.

I'm thinking clustering instead of single-system-image.

David Lang

2003-08-04 20:09:37

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

David Lang wrote:
> I would think that it's much more difficult to run NUMA across different
> types of CPU's

I'd view this as a new and interesting challenge :-) Besides,
if one would use Alan's idea, and just use an amd64, or such,
the CPUs wouldn't be all that different in the end.

One added benefit of using similar CPUs would be that bits of
user space (e.g. a copy loop) could also migrate to the
NIC.

> than it would be to run a separate kernel on the NIC.

Yes, but that separate kernel would need new administrative
interfaces, and things like route changes would be difficult
to handle. (That is, if you still want this to appear as a
single system to user space.) It would certainly be better
than running a completely proprietary solution, but you still
get a few nasty problems.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-04 20:32:15

by David Lang

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, 4 Aug 2003, Werner Almesberger wrote:

> David Lang wrote:
> > I would think that it's much more difficult to run NUMA across different
> > types of CPU's
>
> I'd view this as a new and interesting challenge :-) Besides,
> if one would use Alan's idea, and just use an amd64, or such,
> the CPUs wouldn't be all that different in the end.

you missed Alan's point, he was saying you don't do TOE on the NIC, you
just add another CPU to your main system and use non-TOE NIC's the way you
do today.

> > than it would be to run a separate kernel on the NIC.
>
> Yes, but that separate kernel would need new administrative
> interfaces, and things like route changes would be difficult
> to handle. (That is, if you still want this to appear as a
> single system to user space.) It would certainly be better
> than running a completely proprietary solution, but you still
> get a few nasty problems.

Any time you create a cluster of machines you want to create some nice
administrative interfaces for them to maintain your own sanity (you don't
think sysadmins log in to every machine on a 1000-node Beowulf cluster, do
you :-)

in trying to run a single kernel across different types of CPUs you run
into some really nasty problems (different machine code, and even if it's
the same family of processor it could require very different
optimizations; imagine the two processor types using different word
lengths or endian order)

Larry McVoy has the right general idea when he says buy another box to do
the job, he is just missing the idea that there are some advantages of
coupling the cluster more tightly than you can do with a separate box.

David Lang

2003-08-04 23:30:42

by Peter Chubb

[permalink] [raw]
Subject: Re: TOE brain dump


One thing that you could do *if* you cared to go to a SYSVr4
streams-like approach is just to push *some* of the TCP/IP stack onto
the card, as one or more streams modules.

Peter C

2003-08-05 01:38:14

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

David Lang wrote:
> you missed Alan's point, he was saying you don't do TOE on the NIC,

Only as far as "traditional TOE" is concerned. My idea is
precisely to avoid treating TOE as a special case.

> just add another CPU to your main system and use non-TOE NIC's the way you
> do today.

For a start, that may be good enough, even though you miss
a lot of nice hardware optimizations.

> Any time you create a cluster of machines you want to create some nice
> administrative interfaces for them to maintain your own sanity

You've got a point there. The question is whether these
interfaces really cover everything we need, and - more
importantly - whether they still have the same semantics.

> Larry McVoy has the right general idea when he says buy another box to do
> the job, he is just missing the idea that there are some advantages of
> coupling the cluster more tightly than you can do with a separate box.

Clusters are nice, but they don't help if your bottleneck
is per-packet processing overhead with a single NIC, or if
you can't properly distribute the applications.

I'm not saying that TOE, even if done in a maintainable way,
is always the right approach. E.g. if all you need is a fast
path to main memory, Dave's flow cache would be a much
cheaper solution. If you can distribute the workload, and
the extra hardware doesn't bother you, your clusters become
attractive.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-05 01:48:19

by David Lang

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, 4 Aug 2003, Werner Almesberger wrote:

> David Lang wrote:
> > you missed Alan's point, he was saying you don't do TOE on the NIC,
>
> Only as far as "traditional TOE" is concerned. My idea is
> precisely to avoid treating TOE as a special case.
>
> > just add another CPU to your main system and use non-TOE NIC's the way you
> > do today.
>
> For a start, that may be good enough, even though you miss
> a lot of nice hardware optimizations.

exactly, Alan is saying that the hardware optimizations aren't necessary.
putting an Opteron on a NIC card just to match the other processors in
your system seems like a huge amount of overkill. you aren't going to have
nearly the same access to memory, so that processor will be crippled, but
still cost full price (and then some, remember you have to supply the thing
with power and cool it)

> > Any time you create a cluster of machines you want to create some nice
> > administrative interfaces for them to maintain your own sanity
>
> You've got a point there. The question is whether these
> interfaces really cover everything we need, and - more
> importantly - whether they still have the same semantics.

as long as tools are written that have the same command-line semantics, the
rest of the complexity can be hidden. and even this isn't strictly
necessary; these are special-purpose cards, and a special procedure for
configuring them isn't unreasonable.

> > Larry McVoy has the right general idea when he says buy another box to do
> > the job, he is just missing the idea that there are some advantages of
> > coupling the cluster more tightly than you can do with a separate box.
>
> Clusters are nice, but they don't help if your bottleneck
> is per-packet processing overhead with a single NIC, or if
> you can't properly distribute the applications.
>
> I'm not saying that TOE, even if done in a maintainable way,
> is always the right approach. E.g. if all you need is a fast
> path to main memory, Dave's flow cache would be a much
> cheaper solution. If you can distribute the workload, and
> the extra hardware doesn't bother you, your clusters become
> attractive.

I'm saying treat the one machine with 10 of these specialty NICs in it as
an 11-machine cluster, one machine running your server software and 10
others running your networking.

David Lang

2003-08-05 01:57:18

by Larry McVoy

[permalink] [raw]
Subject: Re: TOE brain dump

I'd suggest that all of you look at the fact that all of these offload
card companies have ended up dying. I don't know of a single one that
made it to profitability. Doesn't that tell you something? What has
changed that makes this a good idea?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-08-05 02:30:51

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

Larry McVoy wrote:
> I'd suggest that all of you look at the fact that all of these offload
> card companies have ended up dying. I don't know of a single one that
> made it to profitability. Doesn't that tell you something? What has
> changed that makes this a good idea?

1) So far, most of the battle has been about data transfers.
Now, per-packet overhead is becoming an issue.

2) AFAIK, they all went for designs that isolated their code
from the main stack. That's one thing that, IMHO, has to
change.

Is this enough to make TOE succeed ? I don't know.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-05 03:04:43

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

David Lang wrote:
> exactly, Alan is saying that the hardware optimizations aren't necessary.

Eventually you'll want them, if only to lower the
chip or pin count.

> putting an Opteron on a NIC card just to match the other processors in
> your system seems like a huge amount of overkill. you aren't going to have
> nearly the same access to memory so that processor will be crippled, but
> still cost full price

You might be able to get them for free ;-) Just pick the
rejects where the FPU or such doesn't quite work. Call it
amd64sx :-)

But even if you get regular CPUs, they're not *that*
expensive. Particularly not for a first generation design.

> (and then some, remember you have to supply the thing
> with power and cool it)

Yes, this, chip count, and chip surface are what make me feel
queasy when thinking of somebody using something as powerful
as an amd64.

> as long as tools are written that have the same command line semantics the
> rest of the complexity can be hidden.

You want to be API and probably even ABI-compatible, so that
user-space daemons (routing, management, etc.) work, too.

> and even this isn't strictly
> necessary, these are special purpose cards and a special procedure for
> configuring them isn't unreasonable.

I'd think thrice before buying a card that requires me to
change my entire network management system - and change it
again, if I ever decide to switch brands, or if the next
generation of that special NIC gets a little more special.

> I'm saying treat the one machine with 10 of these specialty NICs in it as
> an 11-machine cluster, one machine running your server software and 10
> others running your networking.

You can probably afford rather fancy TOE hardware for the
price of ten cluster nodes, a high-speed LAN to connect
your cluster, and a switch that connects the high-speed
link to the ten not-quite-so-high-speed links.

Likewise for power, cooling, and space.

And that's still assuming you can actually distribute all
this.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-05 08:16:13

by Ingo Oeser

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, Aug 04, 2003 at 10:19:21AM -0700, Alan Shih wrote:
> So would main processor still need a copy of the data for re-transmission?
> Won't that defeat the purpose?

No, since I didn't state that a retransmission is done along the
pipe, because you cannot go back in a pipeline.

A retransmission can be done at the end of the pipe, where this
can also be done in hardware.

Regards

Ingo Oeser

2003-08-05 17:22:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: TOE brain dump

Werner Almesberger <[email protected]> writes:

> Eric W. Biederman wrote:
> > The optimized for low latency cases seem to have a strong
> > market in clusters.
>
> Clusters have captive, no, _desperate_ customers ;-) And it
> seems that people are just as happy putting MPI as their
> transport on top of all those link-layer technologies.

MPI is not a transport. It is an interface like the Berkeley sockets
layer. The semantics it wants right now are usually mapped to
TCP/IP when used on an IP network. Though I suspect SCTP might
be a better fit.

But right now nothing in the IP stack is a particularly good fit.

Right now there is a very strong feeling among most of the people
using and developing on clusters that by and large what they are doing
is not of interest to the general kernel community, and so has no
chance of going in. So you see hack piled on top of hack piled on
top of hack.

Mostly I think that is less true, at least if they can stand the
process of severe code review and cleaning up their code. If we can
put in code to scale the kernel to 64 processors, NIC drivers for
fast interconnects and a few similar tweaks can't hurt either.

But of course to get through the peer review process people need
to understand what they are doing.

> > There is one place in low latency communications that I can think
> > of where TCP/IP is not the proper solution. For low latency
> > communication the checksum is at the wrong end of the packet.
>
> That's one of the few things ATM's AAL5 got right. But in the end,
> I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
> flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
> you may well treat it as an atomic unit.

So store-and-forward of packets in a 3-layer switch hierarchy, at 1.3 us
per copy: 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the
second switch chip + 1.3us to the top-level switch chip + 1.3us to a middle-layer
switch chip + 1.3us to the receiving NIC + 1.3us to the receiver.

1.3us * 7 = 9.1us to deliver a packet to the other side. That is
still quite painful. Right now I can get better latencies over any of
the cluster interconnects. I think 5 us is the current low end, with
the high end being about 1 us.
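
Spelled out, a minimal sketch; it just multiplies the per-copy figure
from the paragraph above by the number of copies, ignoring wire
propagation and switching logic:

/* Cumulative store-and-forward delay: every hop must buffer the whole
 * frame before forwarding, so the per-copy time adds up once per copy.
 */
#include <stdio.h>

int main(void)
{
        const double us_per_copy = 1.3; /* MTU-sized frame at 10 Gbps */
        const int copies = 7;           /* as enumerated above */

        printf("one-way latency >= %.1f us\n", us_per_copy * copies);
        return 0;
}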

Quite often in MPI when a message is sent the program cannot continue
until the reply is received. Possibly this is a fundamental problem
with the application programming model, encouraging applications to
be latency sensitive. But it is a well established API and
programming paradigm so it has to be lived with.
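
For anyone not living in MPI land, the pattern being described is the
classic ping-pong below (a minimal sketch against the standard MPI C
API; the buffer size and output format are arbitrary). Rank 0 is
stalled in MPI_Recv() until the reply has made the full round trip, so
every microsecond of one-way latency is paid twice:

/* Minimal MPI ping-pong: rank 0 sends to rank 1 and blocks until the
 * reply comes back.  Build with mpicc, run with mpirun -np 2.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        char buf[64] = "ping";
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
                t0 = MPI_Wtime();
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                t1 = MPI_Wtime();
                printf("round trip: %.1f us\n", (t1 - t0) * 1e6);
        } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
}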

All of this is pretty much the reverse of the TOE case. Things are
latency sensitive because real work needs to be done. And the more
latency you have the slower that work gets done.

A lot of the NICs which are used for MPI tend to be smart for two
reasons. 1) So they can do source routing. 2) So they can safely
export some of their interface to user space, so in the fast path
they can bypass the kernel.

Eric


2003-08-05 19:15:39

by Timothy Miller

[permalink] [raw]
Subject: Re: TOE brain dump



Larry McVoy wrote:
> On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote:
>
>>2. router nodes that have access to main memory (PCI card running linux
>>acting as a router/firewall/VPN to offload the main CPU's)
>
>
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
>
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?


Physical space? Power usage? Heat dissipation? Optimization for the
specific task? Fast, low latency communication between CPU and device
(ie. local bus)? Maintenance?

Lots of reasons why one might pay more for the offload card. If you're
cheap, you'll just use the software stack and a $10 NIC and just live
with the corresponding CPU usage. If you're a performance freak, you'll
spend whatever you have to to squeeze out every last bit of performance
you can.

Mind you, another option, if you're dealing with the kind of load
that requires that much network performance, is to use redundant
servers, like Google. No one server is exceptionally fast, but if not
many people are using it, it's fast enough.

2003-08-06 01:52:38

by Valerie Henson

[permalink] [raw]
Subject: Re: TOE brain dump

On Mon, Aug 04, 2003 at 11:30:27PM -0300, Werner Almesberger wrote:
> Larry McVoy wrote:
> > I'd suggest that all of you look at the fact that all of these offload
> > card companies have ended up dying. I don't know of a single one that
> > made it to profitability. Doesn't that tell you something? What has
> > changed that makes this a good idea?
>
> 1) So far, most of the battle has been about data transfers.
> Now, per-packet overhead is becoming an issue.
>
> 2) AFAIK, they all went for designs that isolated their code
> from the main stack. That's one thing that, IMHO, has to
> change.
>
> Is this enough to make TOE succeed ? I don't know.

Jeff Mogul recently wrote an interesting paper called "TCP Offload is
a Dumb Idea Whose Time Has Come":

http://www.usenix.org/events/hotos03/tech/full_papers/mogul/mogul_html/

(It's 6 pages long and in HTML - easy to read.)

After explaining why TCP offload is a dumb idea, he goes on to argue
that *if* storage area networks are replaced with switched ethernet,
and RDMA becomes popular, TCP offload might make sense for sending
data to your disks.

This is a good, short paper to read if you are interested in TOE for
any reason.

-VAL

2003-08-06 05:13:17

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

Eric W. Biederman wrote:
> MPI is not a transport. It is an interface like the Berkeley sockets
> layer.

Hmm, but doesn't it also unify transport semantics (i.e. chop
TCP streams into messages), maybe add reliability to transports
that don't have it, and provide addressing ? Okay, perhaps you
wouldn't call this a transport in the OSI sense, but it still
seems to have considerably more functionality than just
providing an API.

> Mostly I think that is less true, at least if they can stand the
> process of severe code review and cleaning up their code.

Hmm, people putting dozens of millions into building clusters
can't afford to have what is probably their most essential
infrastructure code reviewed and cleaned up ? Oh dear.

> But of course to get through the peer review process people need
> to understand what they are doing.

A good point :-)

> So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us
> per copy.

But your switch could just do cut-through, no ? Or do they
need to recompute checksums ?

> A lot of the NICs which are used for MPI tend to be smart for two
> reasons. 1) So they can do source routing. 2) So they can safely
> export some of their interface to user space, so in the fast path
> they can bypass the kernel.

The second part could be interesting for TOE, too. Only that
the interface exported would just be the socket interface.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-06 07:23:11

by Andre Hedrick

[permalink] [raw]
Subject: Re: TOE brain dump



Jeff,

Do be sure to check that your data payload is correct.
Everyone knows that a router/gateway/switch with a sticky bit in its
memory will recompute the net crc16 checksum to ensure it passes it to the NIC
regardless. It is amazing how much data can be corrupted by a network
environment via all the NFS/NBD/etc wannabe storage products out there.

Just a chuckle for you to ponder.

--a

On Sun, 3 Aug 2003, Jeff Garzik wrote:

> Werner Almesberger wrote:
> > Jeff Garzik wrote:
> >
> >>jabbering at the same time. TCP is a "one size fits all" solution, but
> >>it doesn't work well for everyone.
> >
> >
> > But then, ten "optimized xxPs" that work well in two different
> > scenarios each, but not so good in the 98 others, wouldn't be
> > much fun either.
> >
> > It's been tried a number of times. Usually, real life sneaks
> > in at one point or another, leaving behind a complex mess.
> > When they've sorted out these problems, regular TCP has caught
> > up with the great optimized transport protocols. At that point,
> > they return to their niche, sometimes tail between legs and
> > muttering curses, sometimes shaking their fist and boldly
> > proclaiming how badly they'll rub TCP in the dirt in the next
> > round. Maybe they shed off some of the complexity, and trade it
> > for even more aggressive optimization, which puts them into
> > their niche even more firmly. Eventually, they fade away.
> >
> > There are cases where TCP doesn't work well, like a path of
> > badly mismatched link layers, but such paths don't treat any
> > protocol following the end-to-end principle kindly.
> >
> > Another problem of TCP is that it has grown a bit too many
> > knobs you need to turn before it works over your really fast
> > really long pipe. (In one of the OLS after dinner speeches,
> > this was quite appropriately called the "wizard gap".)
> >
> >
> >>It's obviously not over a WAN...
> >
> >
> > That's why NFS turned off UDP checksums ;-) As soon as you put
> > it on IP, it will crawl to distances you didn't imagine in your
> > wildest dreams. It always does.
>
> Really fast, really long pipes in practice don't exist for 99.9% of all
> Internet users.
>
>
> When you approach traffic levels that push you to want to offload most of
> the TCP net stack, then TCP isn't the right solution for you anymore,
> all things considered.
>
>
> The Linux net stack just isn't built to be offloaded. TOE engines will
> either need to (1) fall back to Linux software for all-but-the-common
> case (otherwise netfilter, etc. break), or, (2) will need to be
> hideously complex beasts themselves. And I can't see ASIC and firmware
> designers being excited about implementing netfilter on a PCI card :)
>
> Unfortunately some vendors seem to choosing TOE option #3: TCP offload
> which introduces many limitations (connection limits, netfilter not
> supported, etc.) which Linux never had before. Vendors don't seem to
> realize TOE has real potential to damage the "good network neighbor"
> image the net stack has. The Linux net stack's behavior is known,
> documented, predictable. TOE changes all that.
>
> There is one interesting TOE solution, that I have yet to see created:
> run Linux on an embedded processor, on the NIC. This stripped-down
> Linux kernel would perform all the header parsing, checksumming, etc.
> into the NIC's local RAM. The Linux OS driver interface becomes a
> virtual interface with a large MTU, that communicates from host CPU to
> NIC across the PCI bus using jumbo-ethernet-like data frames.
> Management frames would control the ethernet interface on the other side
> of the PCI bus "tunnel".
>
>
> >>So, fix the other end of the pipeline too, otherwise this fast network
> >>stuff is flashy but pointless. If you want to serve up data from disk,
> >>then start creating PCI cards that have both Serial ATA and ethernet
> >>connectors on them :) Cut out the middleman of the host CPU and host
> >>memory bus instead of offloading portions of TCP that do not need to be
> >>offloaded.
> >
> >
> > That's a good point. A hierarchical memory structure can help
> > here. Moving one end closer to the hardware, and letting it
> > know (e.g. through sendfile) that also the other end is close
> > (or can be reached more directly than through some hopelessly
> > crowded main bus) may help too.
>
> Definitely.
>
> Jeff
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2003-08-06 08:02:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: TOE brain dump

Werner Almesberger <[email protected]> writes:

> Eric W. Biederman wrote:
> > MPI is not a transport. It is an interface like the Berkeley sockets
> > layer.
>
> Hmm, but doesn't it also unify transport semantics (i.e. chop
> TCP streams into messages), maybe add reliability to transports
> that don't have it, and provide addressing ? Okay, perhaps you
> wouldn't call this a transport in the OSI sense, but it still
> seems to have considerably more functionality than just
> providing an API.

Those are all features of the MPI implementation. It is
not that MPI does not have an underlying transport. MPI has
a lot of underlying transports. And there is a different MPI
implementation for each transport. Although a lot of them start
with a common base.

> > Mostly I think that is less true, at least if they can stand the
> > process of severe code review and cleaning up their code.
>
> Hmm, people putting dozens of millions into building clusters
> can't afford to have what is probably their most essential
> infrastructure code reviewed and cleaned up ? Oh dear.

Afford it, they can. A lot of the users are researchers and
a lot of people doing the code are researchers. So corralling
them up and getting production quality code can be a challenge,
or getting them to take small enough steps that they don't
frighten the rest of the world.

Plus ten million dollars pretty much buys you a spot in the top 10 of
the top 500 supercomputers. The bulk of the clusters are a lot less
expensive than that.

> > But of course to get through the peer review process people need
> > to understand what they are doing.
>
> A good point :-)
>
> > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us
> > per copy.
>
> But your switch could just do cut-through, no ? Or do they
> need to recompute checksums ?

Correct, switches can and generally do implement cut-through in that
kind of environment. I was just showing that even at 10Gbps treating
a packet as an atomic unit has issues. Cut-through is necessary
to keep your latency down. Do any ethernet switches do cut-through?

> > A lot of the NICs which are used for MPI tend to be smart for two
> > reasons. 1) So they can do source routing. 2) So they can safely
> > export some of their interface to user space, so in the fast path
> > they can bypass the kernel.
>
> The second part could be interesting for TOE, too. Only that
> the interface exported would just be the socket interface.

Agreed.

Eric

2003-08-06 08:20:22

by Lincoln Dale

[permalink] [raw]
Subject: Re: TOE brain dump

At 05:12 PM 6/08/2003, Andre Hedrick wrote:
>Do be sure to check that your data payload is correct.
>Everyone knows that a router/gateway/switch with a sticky bit in its
>memory will recompute the net crc16 checksum to ensure it passes it to the NIC
>regardless. It is amazing how much data can be corrupted by a network
>environment via all the NFS/NBD/etc wannabe storage products out there.

Andre, you are wrong.

firstly, do you REALLY think that most router(s)/switch(es) out there
recompute IP checksums because they did an IP TTL decrement when routing an
IP packet or NATed an IP address?

no, they don't. just like netfilter or router-on-linux is smart enough to
re-code an IP checksum by unmasking and re-masking the old/new values in a
header, so is most router vendors' code.
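
for reference, that unmask/re-mask trick is the standard incremental
update from RFC 1624: the new checksum follows from the old checksum
plus the old and new values of the changed 16-bit header word, so the
payload is never touched. a minimal sketch (function name and types
are mine):

/* Incremental IP header checksum update per RFC 1624:
 *   HC' = ~(~HC + ~m + m')
 * HC = old checksum, m = old value of the changed 16-bit header word
 * (e.g. the word holding the TTL before a decrement), m' = new value.
 */
#include <stdint.h>

static uint16_t csum_update(uint16_t old_csum, uint16_t old_val,
                            uint16_t new_val)
{
        uint32_t sum = (uint16_t)~old_csum + (uint16_t)~old_val + new_val;

        sum = (sum & 0xffff) + (sum >> 16);     /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}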

secondly, why would a router or switch even be touching the data at layer-4
(TCP), let alone recalculating a CRC?

i know you really like your "we do ERL 2 in iSCSI" pitch, but let's stick to
the facts here, eh?


cheers,

lincoln.

2003-08-06 08:27:12

by David Miller

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, 06 Aug 2003 18:20:06 +1000
Lincoln Dale <[email protected]> wrote:

> secondly, why would a router or switch even be touching the data at layer-4
> (TCP), let alone recalculating a CRC?

To make sure emails about Falun Gong and other undesirable topics
don't make it into China.

2003-08-06 12:47:17

by Jesse Pollard

[permalink] [raw]
Subject: Re: TOE brain dump

On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote:
> Werner Almesberger <[email protected]> writes:
> > Eric W. Biederman wrote:
> > > The optimized for low latency cases seem to have a strong
> > > market in clusters.
> >
> > Clusters have captive, no, _desperate_ customers ;-) And it
> > seems that people are just as happy putting MPI as their
> > transport on top of all those link-layer technologies.
>
> MPI is not a transport. It is an interface like the Berkeley sockets
> layer. The semantics it wants right now are usually mapped to
> TCP/IP when used on an IP network. Though I suspect SCTP might
> be a better fit.
>
> But right now nothing in the IP stack is a particularly good fit.
>
> Right now there is a very strong feeling among most of the people
> using and developing on clusters that by and large what they are doing
> is not of interest to the general kernel community, and so has no
> chance of going in. So you see hack piled on top of hack piled on
> top of hack.
>
> Mostly I think that is less true, at least if they can stand the
> process of severe code review and cleaning up their code. If we can
> put in code to scale the kernel to 64 processors, NIC drivers for
> fast interconnects and a few similar tweaks can't hurt either.
>
> But of course to get through the peer review process people need
> to understand what they are doing.
>
> > > There is one place in low latency communications that I can think
> > > of where TCP/IP is not the proper solution. For low latency
> > > communication the checksum is at the wrong end of the packet.
> >
> > That's one of the few things ATM's AAL5 got right. But in the end,
> > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
> > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
> > you may well treat it as an atomic unit.
>
> So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us
> per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the
> second switch chip + 1.3us to the top level switch chip + 1.3us to a middle
> layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver.
>
> 1.3us * 7 = 9.1us to deliver a packet to the other side. That is
> still quite painful. Right now I can get better latencies over any of
> the cluster interconnects. I think 5 us is the current low end, with
> the high end being about 1 us.

I think you are off here, since the second and third layers should not recompute
checksums other than for the header (if they even do that). Most of the
switches I used (mind you, not configured) were wire speed. Only header checksums
were recomputed, and I understood that was only for routing.

> Quite often in MPI when a message is sent the program cannot continue
> until the reply is received. Possibly this is a fundamental problem
> with the application programming model, encouraging applications to
> be latency sensitive. But it is a well established API and
> programming paradigm so it has to be lived with.
>
> All of this is pretty much the reverse of the TOE case. Things are
> latency sensitive because real work needs to be done. And the more
> latency you have the slower that work gets done.
>
> A lot of the NICs which are used for MPI tend to be smart for two
> reasons. 1) So they can do source routing. 2) So they can safely
> export some of their interface to user space, so in the fast path
> they can bypass the kernel.

And bypass any security checks required. A single rogue MPI application
using such an interface can/will bring the cluster down.

Now this is not as much of a problem since many clusters use a standalone
internal network, AND are single application clusters. These clusters
tend to be relatively small (32-64 nodes? Perhaps 16-32 is better. The
clusters I've worked with have always been large, 128-300 nodes, so I'm
not a good judge of "small").

This is immediately broken when you schedule two or more batch jobs on
a cluster in parallel.

It is also broken if the two jobs require different security contexts.

2003-08-06 13:08:18

by Jesse Pollard

[permalink] [raw]
Subject: Re: TOE brain dump

On Wednesday 06 August 2003 03:22, David S. Miller wrote:
> On Wed, 06 Aug 2003 18:20:06 +1000
>
> Lincoln Dale <[email protected]> wrote:
> > secondly, why would a router or switch even be touching the data at
> > layer-4 (TCP), let alone recalculating a CRC?
>
> To make sure emails about Falun Gong and other undesirable topics
> don't make it into China.

That's not a router, or switch... It's a firewall :-)

2003-08-06 13:38:09

by Werner Almesberger

[permalink] [raw]
Subject: Re: TOE brain dump

Eric W. Biederman wrote:
> Afford, they can do. A lot of the users are researchers and
> a lot of people doing the code are researchers. So corralling
> them up and getting production quality code can be a challenge,

Ah, the joy of herding cats :-) But I guess you just need a
sufficiently competent and sufficiently well-funded group
that goes ahead and does it. There is usually little point
in directly involving everyone who may have an opinion.

> to keep your latency down. Do any ethernet switches do cut-through?

According to Google, many at least claim to do this.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2003-08-06 15:58:35

by Andy Isaacson

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, Aug 06, 2003 at 10:37:58AM -0300, Werner Almesberger wrote:
> Eric W. Biederman wrote:
> > to keep your latency down. Do any ethernet switches do cut-through?
>
> According to Google, many at least claim to do this.

Do you have any references for this claim? I have never seen one that
panned out (at least not since the high-end-10mbps days).

Just to be clear, I am asking for an example of a Gigabit Ethernet
switch that supports cut-through switching. I contend that there is no
such beast commercially available today.

(It would be even more interesting if it could switch 9000-octet jumbo
frames, too.)

I'm sure someone is going to point me to a $10,000/port monster, and
while that's not very feasible for my needs, it would still be
interesting.

-andy

2003-08-06 16:26:03

by Andy Isaacson

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, Aug 06, 2003 at 07:46:33AM -0500, Jesse Pollard wrote:
> On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote:
> > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us
> > per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the
> > second switch chip + 1.3us to the top level switch chip + 1.3us to a middle
> > layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver.
> >
> > 1.3us * 7 = 9.1us to deliver a packet to the other side. That is
> > still quite painful. Right now I can get better latencies over any of
> > the cluster interconnects. I think 5 us is the current low end, with
> > the high end being about 1 us.
>
> I think you are off here since the second and third layer should not recompute
> checksums other than for the header (if they even did that). Most of the
> switches I used (mind, not configured) were wire speed. Only header checksums
> had recomputes, and I understood it was only for routing.

The switches may be "wire speed" but that doesn't help the latency any.
AFAIK all GigE switches are store-and-forward, which automatically costs
you the full 1.3us for each link hop. (I didn't check Eric's numbers,
so I don't know that 1.3us is the right value, but it sounds right.)
Also I think you might be confused about what Eric meant by "3 layer
switch hierarchy"; he's referring to a tree topology network with
layer-one switches connecting hosts, layer-two switches connecting
layer-one switches, and layer-three switches connecting layer-two
switches. This means that your worst-case node-to-node latency has 6
wire hops with 7 "read the entire packet into memory" operations,
depending on how you count the initiating node's generation of the
packet.

[snip]
> > Quite often in MPI when a message is sent the program cannot continue
> > until the reply is received. Possibly this is a fundamental problem
> > with the application programming model, encouraging applications to
> > be latency sensitive. But it is a well established API and
> > programming paradigm so it has to be lived with.

This is true, in HPC. Some of the problem is the APIs encouraging such
behavior; another part of the problem is that sometimes, the problem has
fundamental latency dependencies that cannot be programmed around.

> > A lot of the NICs which are used for MPI tend to be smart for two
> > reasons. 1) So they can do source routing. 2) So they can safely
> > export some of their interface to user space, so in the fast path
> > they can bypass the kernel.
>
> And bypass any security checks required. A single rogue MPI application
> using such an interface can/will bring the cluster down.

This is just false. Kernel bypass (done properly) has no negative
effect on system stability, either on-node or on-network. By "done
properly" I mean that the NIC has mappings programmed into it by the
kernel at app-startup time, and properly bounds-checks all remote DMA,
and has a method for verifying that incoming packets are not rogue or
corrupt. (Of course a rogue *kernel* can probably interfere with other
*applications* on the network it's connected to, by inserting malicious
packets into the datastream, but even that is soluble with cookies or
routing checks. However, I don't believe any systems try to defend
against rogue nodes today.)
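
To make "done properly" concrete, the per-request check amounts to
something like the sketch below (the structure and function names are
hypothetical; on real hardware this lives in the NIC or its firmware,
checking against a region table the kernel programmed at application
start-up):

/* Sketch of the bounds check a kernel-bypass NIC applies before
 * honouring a remote DMA request against a kernel-registered region.
 * All names here are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

struct dma_region {     /* one entry per region registered by the kernel */
        uint64_t base;  /* bus address of the pinned buffer */
        uint64_t len;   /* length in bytes */
        uint32_t key;   /* capability handed out to the remote peer */
};

static bool rdma_request_ok(const struct dma_region *r, uint32_t key,
                            uint64_t offset, uint64_t len)
{
        if (key != r->key)              /* wrong or stale capability */
                return false;
        /* overflow-safe bounds check: [offset, offset + len) within region */
        if (offset > r->len || len > r->len - offset)
                return false;
        return true;
}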

I believe that Myrinet's hardware has the capability to meet the "kernel
bypass done properly" requirement I state above; I make no claim that
their GM implementation actually meets the requirement (although I think
it might). It's pretty likely that QSW's Elan hardware can, too, but I
know even less about that.

-andy

2003-08-06 16:28:19

by Chris Friesen

[permalink] [raw]
Subject: Re: TOE brain dump

Andy Isaacson wrote:
> On Wed, Aug 06, 2003 at 10:37:58AM -0300, Werner Almesberger wrote:
>
>>Eric W. Biederman wrote:
>>
>>>to keep your latency down. Do any ethernet switches do cut-through?
>>>
>>According to Google, many at least claim to do this.
>>
>
> Do you have any references for this claim? I have never seen one that
> panned out (at least not since the high-end-10mbps days).
>
> Just to be clear, I am asking for an example of a Gigabit Ethernet
> switch that supports cut-through switching. I contend that there is no
> such beast commercially available today.
>
> (It would be even more interesting if it could switch 9000-octet jumbo
> frames, too.)

A few seconds of googling shows that these claim it:
http://www.blackbox.com.mx/products/pdf/europdf/81055.pdf
http://www.directdial.com/dd2/images/pdf_specsheet/J4119AABA.pdf

I'm sure there are others...

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-08-06 17:01:50

by Andy Isaacson

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, Aug 06, 2003 at 12:27:17PM -0400, Chris Friesen wrote:
> Andy Isaacson wrote:
> > On Wed, Aug 06, 2003 at 10:37:58AM -0300, Werner Almesberger wrote:
> >>Eric W. Biederman wrote:
> >>>to keep your latency down. Do any ethernet switches do cut-through?
> >>According to Google, many at least claim to do this.
> >
> > Do you have any references for this claim? I have never seen one that
> > panned out (at least not since the high-end-10mbps days).
> >
> > Just to be clear, I am asking for an example of a Gigabit Ethernet
> > switch that supports cut-through switching. I contend that there is no
> > such beast commercially available today.
>
> A few seconds of googling shows that these claim it:
> http://www.blackbox.com.mx/products/pdf/europdf/81055.pdf

This is a 100mbit product, not gigabit.

> http://www.directdial.com/dd2/images/pdf_specsheet/J4119AABA.pdf

The products referred to here are the HP ProCurve Switch 8000M and the
HP ProCurve Switch 1600M. The 1600M has only one (optional) GigE port,
so it's disqualified. The 8000M has up to 10 GigE ports, so it could be
interesting. The PDF says "Cut-through Layer 3 switching", which is a
bit of marketese that I have trouble deciphering. I'll run some tests
on our 4000M and see if I can come to any conclusions... if anyone can
point to a whitepaper on the ProCurve chassis design I'd appreciate it.

> I'm sure there are others...

I'm still curious.

-andy

2003-08-06 17:56:01

by Matti Aarnio

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, Aug 06, 2003 at 12:01:45PM -0500, Andy Isaacson wrote:
> On Wed, Aug 06, 2003 at 12:27:17PM -0400, Chris Friesen wrote:
> > Andy Isaacson wrote:
> > > On Wed, Aug 06, 2003 at 10:37:58AM -0300, Werner Almesberger wrote:
> > >>Eric W. Biederman wrote:
> > >>>to keep your latency down. Do any ethernet switches do cut-through?
> > >>According to Google, many at least claim to do this.

Quite a while back (several years ago) several "cut-through" routing
schemes were introduced, primarily over ATMish core networks.

The idea ran essentially as: "if the header address lookup misses the
cache, run routing and form a VC to carry the rest of the flow; if a VC
is found in the cache, send the packet there"
(what the "VC" is in the end is not that important.)

NOTHING in those implementations (as I recall) said anything about
handling the packet before it was fully collected into the router's
local buffer memory.

In very high speed local networks (like the Cray T3 series switch fabric
with _routable_ packets) one can implement protocols which carry
destination node address selector bits in the header, and if the fabric
is e.g. a congestion-free one, delivery of the bits to the desired
destination is guaranteed to succeed. To make UDPish communication a bit
simpler, the relevant hardware got a signal back about "sent ok thru /
collision", so the sender hardware could automagically retry the xmit.

To a certain extent one could handle e.g. ethernet in a similar style
by fast-switching packets on cached destination MAC addresses.
When the destination MAC lookup points to some destination port in the local
hardware, an internal VC is formed (reserved at the output end, presuming
sufficient core bandwidth to handle everything), and the incoming enet
frame is sent piece by piece through the internal switch to the output
port. If the output port cannot be contacted immediately, a full frame
(possibly two or three) needs to be buffered at the receiver.

That way the switch-internal buffering delay would be -- let's see:
- preamble 7 bytes
- SFD 1 byte
- dest mac 6 bytes
plus processing delay, but that is the absolute minimum for 100BASE-T
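
Putting numbers on that, a minimal sketch (100BASE-T, a full-size
1518-octet frame, processing delay ignored):

/* Minimum buffering before a cut-through forwarding decision
 * (preamble + SFD + destination MAC = 14 octets) versus waiting for a
 * whole maximum-size frame, both at 100 Mbit/s.
 */
#include <stdio.h>

int main(void)
{
        const double bps = 100e6;
        const double cut_through_octets = 7 + 1 + 6;
        const double full_frame_octets = 1518;

        printf("cut-through:       %6.2f us\n",
               cut_through_octets * 8 / bps * 1e6);
        printf("store-and-forward: %6.2f us\n",
               full_frame_octets * 8 / bps * 1e6);
        return 0;
}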

Cheap cluster supercomputer makers are using ethernets and other
"off the shelf" stuff, but I don't see why semi-proprietary
high-performance "LANs" could not emerge for this market.
E.g. I would love to have cheapish (a mere 5 times the price of a Cu-GE card)
"LAN" cards for cluster binding, especially if I get direct memory
access to the other machine's memory.

A whole bundle of various cluster interconnects is mentioned
in this white paper from 2001:

http://www.dell.com/us/en/slg/topics/power_ps4q01-ctcinter.htm

VIA, VI-IP, SCI, FE, infiniband, etc...

/Matti Aarnio

2003-08-06 18:59:28

by Jesse Pollard

[permalink] [raw]
Subject: Re: TOE brain dump

On Wednesday 06 August 2003 11:25, Andy Isaacson wrote:
> On Wed, Aug 06, 2003 at 07:46:33AM -0500, Jesse Pollard wrote:
> > On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote:
> > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3
> > > us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us
> > > to the second switch chip + 1.3us to the top level switch chip + 1.3us
> > > to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us the
> > > receiver.
> > >
> > > 1.3us * 7 = 9.1us to deliver a packet to the other side. That is
> > > still quite painful. Right now I can get better latencies over any of
> > > the cluster interconnects. I think 5 us is the current low end, with
> > > the high end being about 1 us.
> >
> > I think you are off here since the second and third layer should not
> > recompute checksums other than for the header (if they even did that).
> > Most of the switches I used (mind, not configured) were wire speed. Only
> > header checksums had recomputes, and I understood it was only for
> > routing.
>
> The switches may be "wire speed" but that doesn't help the latency any.
> AFAIK all GigE switches are store-and-forward, which automatically costs
> you the full 1.3us for each link hop. (I didn't check Eric's numbers,
> so I don't know that 1.3us is the right value, but it sounds right.)
> Also I think you might be confused about what Eric meant by "3 layer
> switch hierarchy"; he's referring to a tree topology network with
> layer-one switches connecting hosts, layer-two switches connecting
> layer-one switches, and layer-three switches connecting layer-two
> switches. This means that your worst-case node-to-node latency has 6
> wire hops with 7 "read the entire packet into memory" operations,
> depending on how you count the initiating node's generation of the
> packet.

If it reads the packet into memory before starting transmission, it isn't
"wire speed". It is a router.

> [snip]
>
> > > Quite often in MPI when a message is sent the program cannot continue
> > > until the reply is received. Possibly this is a fundamental problem
> > > with the application programming model, encouraging applications to
> > > be latency sensitive. But it is a well established API and
> > > programming paradigm so it has to be lived with.
>
> This is true, in HPC. Some of the problem is the APIs encouraging such
> behavior; another part of the problem is that sometimes, the problem has
> fundamental latency dependencies that cannot be programmed around.
>
> > > A lot of the NICs which are used for MPI tend to be smart for two
> > > reasons. 1) So they can do source routing. 2) So they can safely
> > > export some of their interface to user space, so in the fast path
> > > they can bypass the kernel.
> >
> > And bypass any security checks required. A single rogue MPI application
> > using such an interface can/will bring the cluster down.
>
> This is just false. Kernel bypass (done properly) has no negative
> effect on system stability, either on-node or on-network. By "done
> properly" I mean that the NIC has mappings programmed into it by the
> kernel at app-startup time, and properly bounds-checks all remote DMA,
> and has a method for verifying that incoming packets are not rogue or
> corrupt. (Of course a rogue *kernel* can probably interfere with other
> *applications* on the network it's connected to, by inserting malicious
> packets into the datastream, but even that is soluble with cookies or
> routing checks. However, I don't believe any systems try to defend
> against rogue nodes today.)

Just because the packet gets transferred to a buffer correctly does not
mean that buffer is the one it should have been sent to. If this were not
a problem, then there would be no need for kernel TCP/IP involvement at
all: just open the ethernet device and start writing/reading. Oops: a
known security failure.

>
> I believe that Myrinet's hardware has the capability to meet the "kernel
> bypass done properly" requirement I state above; I make no claim that
> their GM implementation actually meets the requirement (although I think
> it might). It's pretty likely that QSW's Elan hardware can, too, but I
> know even less about that.

Since the routing is done in user mode, as part of the library, it can be
used to directly affect processes NOT owned by the user. This bypasses
the kernel security checks by definition. It is already known to happen
with raw Myrinet, so there is a kernel layer on top of it to shield it (or
at least try to). If there is no kernel involvement, then there can be
no restrictions on what gets passed down the line to the device. Now,
some of the modifications for Myrinet were to use normal TCP/IP to establish
the source/destination header information, then bypass any per-packet
handshake, but force EACH packet to include the pre-established
source/destination header info. This is equivalent to UDP, but without any
checksums, and it can sometimes bypass part of the kernel cache.
Unfortunately, it also means that sometimes incoming data is NOT destined
for the user, and must be erased/copied before it reaches its final
destination. This introduces leaks due to the race condition caused by the
transfer to the wrong buffer.

You can't DMA directly to a user's buffer, because you MUST verify the header
before the data... and you can't do that until the buffer is in memory...
So bypassing the kernel generates security failures.

This is already a problem with Fibre Channel devices, and with other
network devices. Anytime you bypass kernel security you also void any
restrictions on the network, and on any hosts it is attached to.

2003-08-06 19:40:10

by Andy Isaacson

[permalink] [raw]
Subject: Re: TOE brain dump

On Wed, Aug 06, 2003 at 01:58:59PM -0500, Jesse Pollard wrote:
> On Wednesday 06 August 2003 11:25, Andy Isaacson wrote:
> > The switches may be "wire speed" but that doesn't help the latency any.
> > AFAIK all GigE switches are store-and-forward, which automatically costs
> > you the full 1.3us for each link hop. (I didn't check Eric's numbers,
> > so I don't know that 1.3us is the right value, but it sounds right.)
> > Also I think you might be confused about what Eric meant by "3 layer
> > switch hierarchy"; he's referring to a tree topology network with
> > layer-one switches connecting hosts, layer-two switches connecting
> > layer-one switches, and layer-three switches connecting layer-two
> > switches. This means that your worst-case node-to-node latency has 6
> > wire hops with 7 "read the entire packet into memory" operations,
> > depending on how you count the initiating node's generation of the
> > packet.
>
> If it reads the packet into memory before starting transmission, it isn't
> "wire speed". It is a router.

[Please read an implied "I might be totally off base here, since I've
never designed an Ethernet switch" disclaimer into this paragraph.]

This statement is completely false. Ethernet switches *do* read the
packet into memory before starting transmission. This must be so,
because an Ethernet switch does not propagate runts, jabber frames, or
frames with an incorrect ethernet crc. If the switch starts
transmission before it's received the last bit, it is provably
impossible for it to avoid propagating crc-failing-frames; ergo,
switches must have the entire packet on hand before starting
transmission.
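
For a sense of scale, here is a standalone back-of-the-envelope sketch
(not anyone's measurement from this thread): the serialization delay
alone for one store-and-forward GigE hop is frame_bytes * 8 ns, so if
the ~1.3 us per copy quoted upthread were pure serialization it would
correspond to roughly 160-byte frames, while full-size frames cost far
more per hop.

/* Back-of-the-envelope only: per-hop serialization delay at 1 Gbit/s,
 * ignoring switch processing time.  Frame sizes are arbitrary examples. */
#include <stdio.h>

int main(void)
{
        const double ns_per_byte = 8.0;         /* 1 Gbit/s = 8 ns/byte */
        const int sizes[] = { 64, 256, 1518 };

        for (int i = 0; i < 3; i++)
                printf("%4d-byte frame: %6.2f us per store-and-forward hop\n",
                       sizes[i], sizes[i] * ns_per_byte / 1000.0);

        /* prints 0.51, 2.05 and 12.14 us; multiply by the number of
         * store-and-forward stages (7 in Eric's example) for the
         * end-to-end minimum */
        return 0;
}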

> > > > A lot of the NICs which are used for MPI tend to be smart for two
> > > > reasons. 1) So they can do source routing. 2) So they can safely
> > > > export some of their interface to user space, so in the fast path
> > > > they can bypass the kernel.
> > >
> > > And bypass any security checks required. A single rogue MPI application
> > > using such an interface can/will bring the cluster down.
> >
> > This is just false. Kernel bypass (done properly) has no negative
> > effect on system stability, either on-node or on-network. By "done
> > properly" I mean that the NIC has mappings programmed into it by the
> > kernel at app-startup time, and properly bounds-checks all remote DMA,
> > and has a method for verifying that incoming packets are not rogue or
> > corrupt. (Of course a rogue *kernel* can probably interfere with other
> > *applications* on the network it's connected to, by inserting malicious
> > packets into the datastream, but even that is soluble with cookies or
> > routing checks. However, I don't believe any systems try to defend
> > against rogue nodes today.)
>
> Just because the packet gets transfered to a buffer correctly does not
> mean that buffer is the one it should have been sent to. If it didn't
> have this problem, then there would be no kernel TCP/IP interaction. Just
> open the ethernet device and start writing/reading. Ooops. known security
> failure.

You're ignoring the fact that there's a complete, programmable RISC CPU
on the Myrinet card which is running code (the MCP, Myrinet Control
Program) installed into it by the kernel. The kernel tells the MCP to
allow access to a given app (by mapping a page of PCI IO addresses into
the user's virtual address space), and the MCP checks the user's DMA
requests for validity. The user cannot generate arbitrary Myrinet
routing requests, cannot write to arbitrary addresses, cannot send
messages to hosts not in his allowed lists, et cetera. We do know that
the buffer is the one it should have been sent to, because the MCP on
the sending end verified that it was an allowed destination host, and
the MCP on the receiving end verified that the destination address was
valid. Myrinet Inc even offers an SDK allowing you to write your own
MCP, if you so desire, and various research projects have done precisely
that.
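
To illustrate the kind of checking meant by "kernel bypass done
properly": the following is a hypothetical sketch, not the actual GM MCP
code; every structure and name below is invented for the example.

/*
 * Hypothetical NIC-firmware-side validation of a user DMA request.
 * NOT the real Myrinet MCP; it only illustrates the kind of checks a
 * kernel-bypass NIC can perform on requests coming from user space.
 */
#include <stdint.h>
#include <stdbool.h>

struct dma_region {             /* registered by the kernel at app startup */
        uint64_t base;          /* bus address of the registered region */
        uint64_t len;
};

struct port_ctx {               /* one per user process ("port") */
        struct dma_region region;
        const uint16_t *allowed_hosts;
        unsigned int nr_allowed_hosts;
};

struct dma_request {            /* written by the user into mapped PCI space */
        uint64_t offset;        /* offset into the registered region */
        uint32_t len;
        uint16_t dest_host;
};

static bool request_ok(const struct port_ctx *ctx, const struct dma_request *rq)
{
        /* bounds check: stay inside the kernel-registered region,
         * watching for overflow */
        if (rq->offset > ctx->region.len ||
            rq->len > ctx->region.len - rq->offset)
                return false;

        /* destination check: only hosts the kernel allowed at setup time */
        for (unsigned int i = 0; i < ctx->nr_allowed_hosts; i++)
                if (ctx->allowed_hosts[i] == rq->dest_host)
                        return true;

        return false;           /* unknown destination: drop the request */
}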

Demonstrating that dumb Ethernet cards cannot be smart does not
demonstrate that smart FooNet cards cannot be smart. (s/FooNet/$x/ as
desired.)

> > I believe that Myrinet's hardware has the capability to meet the "kernel
> > bypass done properly" requirement I state above; I make no claim that
> > their GM implementation actually meets the requirement (although I think
> > it might). It's pretty likely that QSW's Elan hardware can, too, but I
> > know even less about that.
>
> since the routing is done is user mode, as part of the library, it can be
> used to directly affect processes NOT owned by the user. This
> bypasses the kernel security checks by definition.

The routing is done on the MCP, not in a library. (Or at least, it
could be -- I don't know offhand how GM1 and GM2 work.) This is not an
insoluble problem.

> Already known to happen with raw myrinet, so there is a kernel layer
> on top of it to shield it (or at least try to).

Perhaps that's the case with GM1 (I don't know) but it is not a
fundamental flaw of the hardware or the network.

> If there is no kernel involvement, then there can be no restrictions
> on what can be passed down the line to the device.

The MCP provides the necessary checking.

> Now some of the modifications for myrinet were to use normal TCP/IP to
> establish source/destination header information, then bypass any
> packet handshake, but force EACH packet to include the pre-established
> source/destination header info.

I don't know what you're talking about here; perhaps this was some early
"TCP over Myrinet" thing. Currently on a host with GM1 running, the
myri0 interface shows up as an almost-normal Ethernet interface, and
most of the relevant networking ioctls work just fine. I can even
tcpdump it.

On a related topic, there is a Myrinet line card with a GigE port
available. I haven't looked into the software end deeply, but
apparently you just stick a standard Myrinet route to that switch port
on the front of the Myrinet frame, append an Ethernet frame, and your
Myrinet host can send GigE packets without bother. I don't know how
incoming ethernet packets are routed, alas -- presumably a Myrinet route
is encoded in the MAC somehow.
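
Purely as a guess at what that framing might look like (field names and
sizes invented, not taken from Myricom documentation):

/* Hypothetical layout of a GigE frame tunnelled over Myrinet, as guessed
 * at above: a source route to the GigE line card in front of an ordinary
 * Ethernet frame.  Invented for illustration only. */
#include <stdint.h>

#define MYRI_MAX_ROUTE 8                /* assumed maximum hop count */

struct myri_gige_frame {
        uint8_t  route_len;
        uint8_t  route[MYRI_MAX_ROUTE]; /* one route byte per crossbar hop */
        /* ordinary Ethernet frame follows */
        uint8_t  dst_mac[6];
        uint8_t  src_mac[6];
        uint16_t ethertype;             /* network byte order */
        uint8_t  payload[];             /* 46..1500 bytes, FCS not shown */
};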

> This is equivalent to UDP, but without any checksums, and sometimes
> can bypass part of the kernel cache. Unfortunately, it also means that
> sometimes incoming data is NOT destined for the user, and must be
> erased/copied before the final destination is achieved. This introduces leaks
> due to the race condition caused by the transfer to the wrong buffer.
>
> You can't DMA directly to a users buffer, because you MUST verify the header
> before the data... and you can't do that until the buffer is in memory...
> So bypassing the kernel generates security failures.

Again, the security problems are solved by having the MCP check the
necessary conditions. You bring up a good point WRT error resilience,
though -- I don't know how Myrinet handles media bit errors.

You *can* DMA directly to a user's buffer, because the necessary header
information was checked on the MCP before the bits even touch the PCI
bus.

> This is already a problem in fibre channel devices, and in other network
> devices. Anytime you bypass the kernel security you also void any
> restrictions on the network, and any hosts it is attached to.

Sufficiently advanced HBA hardware and software solve this problem.
Please pick another windmill to tilt at. (Like the error one; I need to
find out what the answer to that is.)

-andy

2003-08-06 21:13:52

by David Schwartz

[permalink] [raw]
Subject: RE: TOE brain dump


> This statement is completely false. Ethernet switches *do* read the
> packet into memory before starting transmission.

Some do. Some don't. Some are configurable.

> This must be so,
> because an Ethernet switch does not propagate runts, jabber frames, or
> frames with an incorrect ethernet crc.

If they use cut-through switching, they do. Some use adaptive switching,
which means they use cut-through switching but change to store and forward
if there are too many runts, jabber frames, bad CRCs, and so on.

Obviously, you can't always do a cut-through. If the target port is busy,
cut-through is impossible. If the ports are different speeds, cut-through is
impossible. The Intel 510T switch for my home network does adaptive
switching with configurable error thresholds. In fact, it's even smarter
than that, with an intermediate mode that suppresses runts without doing a
full store and forward. See:
http://www.intel.com/support/express/switches/23188.htm
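
A toy model of what "adaptive switching with configurable error
thresholds" could look like in firmware (invented names; this is not the
510T's actual implementation):

/* Start in cut-through mode and fall back to store-and-forward when too
 * many bad frames (runts, jabber, CRC errors) show up in a sampling
 * window.  Invented for illustration only. */
enum fwd_mode { CUT_THROUGH, STORE_AND_FORWARD };

struct port_stats {
        unsigned int frames;
        unsigned int bad_frames;        /* runts, jabber, CRC errors */
        unsigned int threshold_pct;     /* configurable error threshold */
        enum fwd_mode mode;
};

/* called once per sampling window */
static void adapt_mode(struct port_stats *p)
{
        unsigned int bad_pct;

        if (p->frames == 0)
                return;

        bad_pct = p->bad_frames * 100 / p->frames;

        if (bad_pct > p->threshold_pct)
                p->mode = STORE_AND_FORWARD;    /* too much garbage seen */
        else
                p->mode = CUT_THROUGH;          /* link looks clean again */

        p->frames = 0;                          /* start a new window */
        p->bad_frames = 0;
}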

> If the switch starts
> transmission before it's received the last bit, it is provably
> impossible for it to avoid propagating crc-failing-frames; ergo,
> switches must have the entire packet on hand before starting
> transmission.

Except not all switches always avoid propagating bad frames.

DS


2003-08-07 02:14:47

by Lincoln Dale

[permalink] [raw]
Subject: Re: TOE brain dump

At 03:01 AM 7/08/2003, Andy Isaacson wrote:
> > > Just to be clear, I am asking for an example of a Gigabit Ethernet
> > > switch that supports cut-through switching. I contend that there is no
> > > such beast commercially available today.

i concur.
"cut-through" is generally marketing these days.

there are some switches in the marketplace today which do cut-through
switching, but fall back to store-&-forward when:
- there is congestion in a port (i.e. output port is busy; queue frame)
- the sender & receiver are of mismatched speeds
- the receiver initiates gig-e flowcontrol

note that "cut-through switching" means that you lose the ability of the
switch to drop corrupted frames. i.e. how can it check the ethernet crc32
and validate it before the whole frame has arrived? in short, it cannot.
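
The structural reason, in a short sketch (ether_crc32() stands in for an
assumed CRC-32 helper, and the FCS byte-order convention is glossed
over): the FCS is the last four bytes of the frame, so the check can only
run once the final byte has arrived, by which time a cut-through switch
has already forwarded nearly everything.

/* Why cut-through cannot drop CRC-bad frames: the FCS occupies the last
 * four bytes, so this check can only run after the final byte arrives.
 * ether_crc32() is an assumed helper; byte order glossed over, the
 * point is purely structural. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

extern uint32_t ether_crc32(const uint8_t *data, size_t len);   /* assumed */

bool frame_crc_ok(const uint8_t *frame, size_t len)
{
        uint32_t computed, received;

        if (len < 4)
                return false;                   /* runt: no room for an FCS */

        computed = ether_crc32(frame, len - 4);
        memcpy(&received, frame + len - 4, 4);  /* FCS trails the payload */

        return computed == received;
}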

in practice, there are very few real-world traffic scenarios where
cut-through actually occurs.


cheers,

lincoln.