Has anyone worked on a standard interface between TOE and Linux (i.e.
something like Trapeze/Myrinet's GMS)?
Or is TOE a forbidden discussion? Has there been any effort at all to make
Linux the OS for TOE, even though Linux is a little too heavy for it?
Alan
On Sun, 13 Jul 2003 00:33:00 -0700
"Alan Shih" <[email protected]> wrote:
> Or is TOE a forbidden discussion?
TOE is evil, read this:
http://www.usenix.org/events/hotos03/tech/full_papers/mogul/mogul.pdf
TOE is suboptimal for exactly the things where performance
matters: high connection rates.
Your return on investment is also absolutely questionable. Servers "serve" data
and we offload all of the send side TCP processing that can
reasonably be done (segmentation, checksumming).
I've never seen an impartial benchmark showing that TCP send
side performance goes up as a result of using TOE vs. the usual
segmentation + checksum offloading offered today.
On receive side, clever RX buffer flipping tricks are the way
to go and require no protocol changes and nothing gross like
TOE or weird buffer ownership protocols like RDMA requires.
I've made postings showing how such a scheme can work using a limited
flow cache on the networking card. I don't have a reference handy,
but I suppose someone else does.
And finally, this discussion belongs on the "networking" lists.
Nearly all of the "networking" developers don't have time to sift
through linux-kernel every day.
David> I didn't say I agree with all of Mogul's ideas, just his
David> anti-TOE arguments. For example, I also think RDMA sucks
David> too yet he thinks it's a good idea.
Sure, he talks about some weaknesses of TOE, but his conclusion is
that the time has come for OS developers to start working on TCP
offload (for storage).
David> You obviously don't understand my ideas if you think
David> some relationship between the MTU and the system page
David> size is necessary for the scheme to work.
I was just quoting part of Mogul's paper that seemed to directly
contradict your original post. I also said it would be great to see
NIC hardware with support for flow classification.
Look, I pretty much agree with you about TOE hardware. Every chip
I've seen either requires a bunch of dedicated expensive memory
(including a giant CAM) or is just firmware running on a
low-performance embedded CPU. But I also think Mogul is right: iSCSI
HBAs are going to force OS designers to deal with TCP offload.
My whole point was just that it doesn't make much sense to dismiss the
whole idea by saying "TOE is evil" and then cite as support a paper
that explains why TOEs now make sense and need to be supported.
- Roland
On Sun, Jul 13, 2003 at 04:53:23PM -0700, David S. Miller wrote:
> On Sun, 13 Jul 2003 16:54:24 -0700
> Larry McVoy <[email protected]> wrote:
>
> > Every time I tried to push the page flip idea or offloading or any of
> > that crap, Andy Bechtolsheim would tell me "the CPUs will get faster faster
> > than you can make that work". He was right.
>
> I really don't see why receive is so much of a big deal
> compared to send, and we do a send side version of this
> stuff already with zero problems.
Hey, maybe it isn't, but could you please quantify the cost of the VM
operations? How hard is that?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Sun, 13 Jul 2003 17:22:00 -0700
Larry McVoy <[email protected]> wrote:
> Hey, maybe it isn't, but could you please quantify the cost of the VM
> operations? How hard is that?
Ok.
So the page is in a non-uptodate state, NFS would have it locked,
and anyone else trying to get at it would sleep.
The page we have currently is a "dummy" in that it is only a
placeholder, in case we don't get a full page from the networking.
We have all the infrastructure to do everything up to this point.
Next, if the networking gave us a full page, we'd "replace"
the dummy page with this one, which would involve:
1) delete the dummy page from the lookup, insert the networking's
page
2) arrange so that all sleepers on the dummy page will do a relookup
and find the new page
And when we're done with the operation we wake everyone up.
I can't see any part of this turning out to be expensive.
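For readers trying to picture the replacement step, here is a rough
sketch of what it could look like. This is pseudocode, not a patch:
replace_dummy_with_netpage() is invented for illustration, and the
helper names only approximate the page cache interfaces of the kernels
being discussed (exact signatures vary by version).

static void replace_dummy_with_netpage(struct address_space *mapping,
                                       struct page *dummy,
                                       struct page *netpage,
                                       unsigned long index)
{
        /* 1) delete the dummy page from the lookup and insert the
         *    page the networking layer handed us in its place     */
        remove_from_page_cache(dummy);
        add_to_page_cache(netpage, mapping, index);
        SetPageUptodate(netpage);

        /* 2) wake everyone sleeping on the (still locked) dummy;
         *    their wait loop re-looks up the index in the mapping
         *    and now finds the networking's page instead          */
        unlock_page(dummy);
        page_cache_release(dummy);
}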
On 13 Jul 2003 17:20:41 -0700
Roland Dreier <[email protected]> wrote:
> David> I didn't say I agree with all of Mogul's ideas, just his
> David> anti-TOE arguments. For example, I also think RDMA sucks
> David> too yet he thinks it's a good idea.
>
> Sure, he talks about some weaknesses of TOE, but his conclusion is
> that the time has come for OS developers to start working on TCP
> offload (for storage).
The bad assumption here is that this belongs in the OS.
Let me ask you this: how many modern SCSI drivers have to speak every
piece of the SCSI bus protocol? Or Fibre Channel? All of it is
done on the cards, and that is what I think the iSCSI people should
be doing instead of putting garbage into the OS.
And I've presented a solution to the problem at the OS level that
doesn't require broken things like TOE and RDMA yet arrives at
the same result.
> But I also think Mogul is right: iSCSI HBAs are going to force OS
> designers to deal with TCP offload.
You don't need to offload TCP; it's the segmentation and checksumming
that have the high cost, not the actual TCP logic in the operating
system.
RDMA and TOE both add unnecessary complications. My solution requires
no protocol changes, just smart hardware, which needs to be designed
for any of the presented ideas anyway.
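As a point of reference, the send-side offloads being contrasted with
TOE here are just feature bits a driver advertises at probe time. A
minimal sketch, using the flag names from the 2.4/2.5 kernels of this
era (the function name itself is made up for illustration):

#include <linux/netdevice.h>

static void example_enable_send_offloads(struct net_device *dev)
{
        dev->features |= NETIF_F_SG        /* scatter-gather DMA         */
                      |  NETIF_F_IP_CSUM   /* hardware TCP/UDP checksums */
                      |  NETIF_F_TSO;      /* TCP segmentation offload   */
}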
On Sun, 13 Jul 2003 16:35:03 -0700
Larry McVoy <[email protected]> wrote:
> On Sun, Jul 13, 2003 at 04:02:00PM -0700, David S. Miller wrote:
> > On send this doesn't matter, on receive you use my clever receive
> > buffer handling + flow cache idea to accumulate the data portion of
> > packets into page sized chunks for the networking to flip.
>
> Please don't. I think page flipping was a bad idea. I think you'd be
> better off to try and make the data flow up the stack in small enough
> windows that it all sits in the cache.
At 10Gb/sec nothing fits in the cache :-)
> One thing SGI taught me (not that they wanted to do so) is that infinitely
> large packets are infinitely stupid, for lots of reasons. One is that
> you have to buffer them somewhere and another is that the bigger they
> are the bigger your cache needs to be to go fast.
The whole point is to not touch any of this data.
The idea is to push the pages directly into the page cache
of the filesystem.
I'm not talking about doing this for normal userspace sys_recvmsg()
type reads; that's an entirely different topic. But if we ever did
all agree to do something like that, we'd already have the network
level infrastructure to do it.
On Sun, 13 Jul 2003 16:54:24 -0700
Larry McVoy <[email protected]> wrote:
> Every time I tried to push the page flip idea or offloading or any of
> that crap, Andy Bechtolsheim would tell me "the CPUs will get faster faster
> than you can make that work". He was right.
I really don't see why receive is so much of a big deal
compared to send, and we do a send side version of this
stuff already with zero problems.
The NFS code is already basically ready to handle a fragmented packet
(headers + pages), and could stick the page part into the page cache
easily on receive.
And it's not the CPUs that really limit us here, it's memory
bandwidth. It's one thing to have a PCI-X bus fast enough
to service 10Gb/sec rates; it's yet another thing to have
a memory bus and RAM underneath that which can handle moving
that data over it _twice_.
The infrastructure needed to support this on the networking side
helps us support other useful things, such as driver-local packet
buffer recycling.
> The whole point is to not touch any of this data.
>
> The idea is to push the pages directly into the page cache
> of the filesystem.
It doesn't work. Measure the cost of the VM operations before you go
down this path. Just set up a system call that swaps a page with a
kernel-allocated buffer and then see how many of those you can do a
second. Maybe Linux is so blindingly fast that this makes sense, but
IRIX certainly wasn't; the VM overhead hurt like crazy.
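For anyone who wants to try that measurement, a minimal userspace
harness might look like the following. The kernel half (the
page-swapping system call itself) still has to be written;
__NR_page_swap_test below is a made-up syscall number used purely for
illustration.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_page_swap_test  4242          /* hypothetical syscall number */
#define ITERS                1000000

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        void *buf;
        struct timeval t0, t1;
        double secs;
        int i;

        if (posix_memalign(&buf, page, page))
                return 1;
        memset(buf, 0, page);              /* make sure the page is mapped */

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++)
                syscall(__NR_page_swap_test, buf, page);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f page swaps/sec\n", ITERS / secs);
        return 0;
}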
Every time I tried to push the page flip idea or offloading or any of
that crap, Andy Bechtolsheim would tell me "the CPUs will get faster faster
than you can make that work". He was right.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Sun, Jul 13, 2003 at 04:02:00PM -0700, David S. Miller wrote:
> On send this doesn't matter, on receive you use my clever receive
> buffer handling + flow cache idea to accumulate the data portion of
> packets into page sized chunks for the networking to flip.
Please don't. I think page flipping was a bad idea. I think you'd be
better off to try and make the data flow up the stack in small enough
windows that it all sits in the cache.
One thing SGI taught me (not that they wanted to do so) is that infinitely
large packets are infinitely stupid, for lots of reasons. One is that
you have to buffer them somewhere and another is that the bigger they
are the bigger your cache needs to be to go fast.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On 13 Jul 2003 09:22:32 -0700
Roland Dreier <[email protected]> wrote:
> David> TOE is evil, read this:
>
> David> http://www.usenix.org/events/hotos03/tech/full_papers/mogul/mogul.pdf
> Your ideas are certainly very interesting, and I would be happy to see
> hardware that supports flow identification. But the Usenix paper
> you're citing completely disagrees with you!
I didn't say I agree with all of Mogul's ideas, just his anti-TOE
arguments. For example, I also think RDMA sucks too yet he thinks
it's a good idea.
> For example, Mogul writes:
>
> "Nevertheless, copy-avoidance designs have not been widely adopted,
> due to significant limitations. For example, when network maximum
> segment size (MSS) values are smaller than VM page sizes, which is
> often the case, page-remapping techniques are insufficient (and
> page-remapping often imposes overheads of its own.)"
On send this doesn't matter, on receive you use my clever receive
buffer handling + flow cache idea to accumulate the data portion of
packets into page sized chunks for the networking to flip.
You obviously don't understand my ideas if you think some
relationship between the MTU and the system page size is necessary
for the scheme to work.
Alan Cox wrote:
> Finally if you are streaming objects by non mapped references (eg
> sendfile or see LM's paper from long ago on splice()) then the problem
> goes away.
As an aside, I really like sendfile's semantics except for
* People occasionally want to add a receivefile(2). I disagree... the
sendfile(2) interface should really be considered a universal
"fdcopy" interface, regardless of what the 'to' and 'from' file
descriptors are attached to. File to socket. Socket to file. File to
file. Socket to socket. All should be supported, even if the fallback
is a stupid (but small!) in-kernel copy loop.
* Copy-until-EOF semantics are either undefined or unclear to me
personally.
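For concreteness, the one combination sendfile(2) has always covered is
file to socket; the point above is that the same call could reasonably
cover the other combinations too. A minimal usage sketch (error
handling trimmed, helper name invented):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t copy_file_to_socket(int sockfd, const char *path)
{
        struct stat st;
        off_t off = 0;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }

        /* the kernel moves the data; userspace never copies the payload */
        n = sendfile(sockfd, fd, &off, st.st_size);
        close(fd);
        return n;
}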
Jeff
Alan Cox wrote:
> Finally if you are streaming objects by non mapped references (eg
> sendfile or see LM's paper from long ago on splice()) then the problem
> goes away.
I had forgotten all about splice.
For interested readers, here is the link:
http://www.bitmover.com/lm/papers/splice.ps
Jeff
On Sul, 2003-07-13 at 17:22, Roland Dreier wrote:
> Your ideas are certainly very interesting, and I would be happy to see
> hardware that supports flow identification. But the Usenix paper
> you're citing completely disagrees with you! For example, Mogul writes:
Take a look at who holds the official Internet land speed record. It's
not a TOE-using system.
> "Nevertheless, copy-avoidance designs have not been widely adopted,
> due to significant limitations. For example, when network maximum
> segment size (MSS) values are smaller than VM page sizes, which is
> often the case, page-remapping techniques are insufficient (and
> page-remapping often imposes overheads of its own.)"
Page remapping is adequate for sending data when the MSS is below the
VM page size, since you don't have to send all of the page you pinned
or set COW/SOW (sleep on write).
For receive, if your hardware can demux on the TCP headers and the
expected sequence number, then page remapping isn't needed either.
Finally if you are streaming objects by non mapped references (eg
sendfile or see LM's paper from long ago on splice()) then the problem
goes away.
> In fact, his conclusion is:
>
> "However, as hardware trends change the feasibility and economics of
> network-based storage connections, RDMA will become a significant
> and appropriate justification for TOEs."
>
> - Roland
David> TOE is evil, read this:
David> http://www.usenix.org/events/hotos03/tech/full_papers/mogul/mogul.pdf
David> TOE is suboptimal for exactly the things where performance
David> matters: high connection rates.
David> Your return on investment is also absolutely questionable.
David> Servers "serve" data and we offload all of the send side
David> TCP processing that can reasonably be done (segmentation,
David> checksumming).
David> I've never seen an impartial benchmark showing that TCP
David> send side performance goes up as a result of using TOE
David> vs. the usual segmentation + checksum offloading offered
David> today.
David> On receive side, clever RX buffer flipping tricks are the
David> way to go and require no protocol changes and nothing gross
David> like TOE or weird buffer ownership protocols like RDMA
David> requires.
David> I've made postings showing how such a scheme can work using
David> a limited flow cache on the networking card. I don't have
David> a reference handy, but I suppose someone else does.
Your ideas are certainly very interesting, and I would be happy to see
hardware that supports flow identification. But the Usenix paper
you're citing completely disagrees with you! For example, Mogul writes:
"Nevertheless, copy-avoidance designs have not been widely adopted,
due to significant limitations. For example, when network maximum
segment size (MSS) values are smaller than VM page sizes, which is
often the case, page-remapping techniques are insufficient (and
page-remapping often imposes overheads of its own.)"
In fact, his conclusion is:
"However, as hardware trends change the feasibility and economics of
network-based storage connections, RDMA will become a significant
and appropriate justification for TOEs."
- Roland
On Sun, 13 Jul 2003 16:53:23 PDT, "David S. Miller" said:
> I really don't see why receive is so much of a big deal
> compared to send, and we do a send side version of this
> stuff already with zero problems.
Well... there are optimizations you can do on the send side...
> The NFS code is already basically ready to handle a fragmented packet
> (headers + pages), and could stick the page part into the page cache
> easily on receive.
For example, in this case, you know a priori what the IP header will look
like, so you can use tricks like scatter-gather to send the header from one
place and a page-aligned data buffer from another, or start the packet at
(page boundary - IP_hdr_len), or tricks of that sort. In 20 years, I've seen
a lot of vendors do a lot of ugly things to speed up their IP stack, often
based on the fact that they knew a lot about the packet before they started
assembling it.
It's hard to do tricks like that when you don't know (for instance) how
many IP option fields the packet has until you've already started sucking
the packet off the wire - at which point either the NIC itself has to be clever
(Hmm, there's that IP offload again) or you have literally about 30 CPU cycles
to do interrupt latency *and* decide what to do....
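Roughly what that send-side trick looks like when expressed at the
system-call level: hand the kernel the (known ahead of time) header and
a page-aligned payload as separate pieces and let scatter-gather put
them on the wire. This is only an illustration of the idea with
writev(2), not the in-kernel/NIC mechanism itself, and the helper name
is invented:

#include <sys/uio.h>
#include <unistd.h>

ssize_t send_hdr_and_page(int sockfd, const void *hdr, size_t hdrlen,
                          const void *payload, size_t paylen)
{
        struct iovec iov[2];

        iov[0].iov_base = (void *)hdr;      /* small, known-ahead header */
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = (void *)payload;  /* page-aligned data buffer  */
        iov[1].iov_len  = paylen;

        /* the kernel (and a capable NIC) can gather the two pieces
         * without the application ever assembling one flat packet   */
        return writev(sockfd, iov, 2);
}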
On Sun, Jul 13, 2003 at 05:24:14PM -0700, David S. Miller wrote:
> I can't see any part of this turning out to be expensive.
In theory, practice and theory are the same...
I think the point I'm trying to make is that the VM stuff costs something
and it shouldn't be that hard to dummy up a system call to measure it.
It was counterintuitive as hell at SGI that the VM stuff would cost that
much and the reasons are subtle. Part of the problem turned out to be
falling out of the instruction cache - the network stack and the VM system
didn't fit and that left no room at all for the app.
If you are trading instruction cache misses for data misses, err, dude,
I think that might be a problem. The point is to process all the data
with less, not more, cache misses, right? In fact, if we agree on that
then that leads you to considering the various ways you could do this
and maybe your way is the right way but maybe there is a less cache
intensive way.
If you're right you're right, so peace. But I'd like the definition of
"right" to be "less cache misses to do the same thing". In fact, if
I managed to communicate only one thing in my entire set of rants and
it was "pay attention to cache misses", hey, that'd be cool with me.
That's how you make things go fast and I like fast.
Think about it: a 3GHz machine has a .3ns clock cycle and the suckers are
super scalar and hyper threaded and all that crud. Memory is about 133ns
away. That's 400 clocks of stall for each cache miss. Lotta code can run
in 400 clocks on super scalar/hyper threaded/fully buzzword enabled processors.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Alan Shih wrote:
> Has anyone worked on a standard interface between TOE and Linux (i.e.
> something like Trapeze/Myrinet's GMS)?
>
> Or is TOE a forbidden discussion? Has there been any effort at all to make
> Linux the OS for TOE, even though Linux is a little too heavy for it?
I do not foresee there _ever_ being a TOE interface for Linux.
It's not a forbidden discussion, but the networking developers tend to
ignore people who mention TOE because it's been discussed to death,
no evidence has ever been presented to prove it has advantages where it
matters, and it has significant _dis_advantages from the get-go.
I really should write an LKML FAQ entry for TOE.
Jeff
On Sun, 13 Jul 2003 20:46:38 -0400
[email protected] wrote:
> On Sun, 13 Jul 2003 16:53:23 PDT, "David S. Miller" said:
>
> > I really don't see why receive is so much of a big deal
> > compared to send, and we do a send side version of this
> > stuff already with zero problems.
>
> Well... there are optimizations you can do on the send side...
I consider the send side completely covered already. We don't
touch any of the data portion, we only put together the
headers.
> It's hard to do tricks like that when you don't know (for instance) how
> many IP option fields the packet has until you've already started sucking
> the packet off the wire - at which point either the NIC itself has to be clever
> (Hmm, there's that IP offload again) or you have literally about 30 CPU cycles
> to do interrupt latency *and* decide what to do....
There are cards, both existing and in development, that have
very simple header parsing engines you can program to do stuff
like this; it isn't hard at all.
But this is only half of the problem, you need a flow cache and
clever RX buffer management as well to make the RX side zero-copy
stuff work.
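To make the "flow cache" half more concrete, this is roughly the
per-flow state such a card would have to keep. The structure and field
names are invented here; this is a conceptual model, not any shipping
NIC's programming interface:

#include <linux/types.h>

struct rx_flow_cache_entry {
        __u32 saddr, daddr;         /* IPv4 addresses identifying the flow */
        __u16 sport, dport;         /* TCP ports                           */
        __u32 expected_seq;         /* next in-order sequence number       */

        dma_addr_t page_dma;        /* accumulation page being filled      */
        __u16      page_offset;     /* bytes of payload landed so far      */
        __u16      flags;           /* valid / needs-flush / etc.          */
};

/* The header parsing engine matches the 4-tuple, checks that the
 * segment starts at expected_seq, and if so DMAs just the payload
 * into the current page at page_offset; full pages are what get
 * flipped into the host's page cache.                              */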
On Mon, 14 Jul 2003 22:42:55 -0700
"Jordi Ros" <[email protected]> wrote:
[ Please fix Outlook Express or whatever lame email client you
use to put newlines into the emails that you compose. These
excessively long lines make your emails nearly impossible to read. ]
> TCP offloading does not necessarily need to be the goal but a MUST
> if one wants to build a performance-scalable architecture. This
> vision is in fact introduced by Mogul in his paper. He writes:
> "Therefore, offloading the transport layer becomes valuable not for
> its own sake, but rather because that allows offloading of the RDMA
> [...]".
I totally disagree. It is not a MUST; in fact I have described
an alternative implementation that requires none of the complexity
of RDMA, and none of the stupidity of TOE.
Read my lips: "We do not need to offload TCP itself to get the
attributes you desire, therefore we are NOT going to do it."
You can choose to ignore my suggestions and likewise I will continue
to ignore the endless (and frankly, boring after reading it for the
100th time) spouting from people like you that we somehow "NEED" or
"MUST" have TOE, which is complete bullshit as exemplified by my
alternative example scheme.
You also ignore the points others have made that the systems HAVE
SCALED to evolving network technologies as they have become faster
and faster.
And when you ignore me, don't be surprised when other companies come
along, implement my scheme, it gets supported in Linux and
subsequently the stock of your company effectively becomes toilet
paper and TOE is an obscure piece of computing history gone wrong :-)
> TOE is believed to not provide performance. I may agree that TOE by
> itself may not, but TOE as a means to deliver some other technology
> (e.g. RDMA, encryption or Direct Path) it does optimize (in some
> instances dramatically) the overall performance. Let me show you the
> numbers in our Direct Path technology.
But our point is that you don't need any of this crap.
My RX page accumulation scheme handles all of the receive-side
problems of touching the data and getting it into the filesystem
and then out to the device. With my scheme you can receive the
data, go direct to the device, and the CPU never touches one byte.
> Note that Microsoft is considering TOE under its Scalable Networking
> Program. To keep linux competitive, I would encourage a healthy
> discussion on this matter
I actually welcome Microsoft falling into this rathole of a
technology. Let them have to support that crap and have to field bug
reports on it, having to wonder who created the packets. And let them
deal with the negative effects TOE has on connection rates and things
like that.
Linux will be competitive, especially if people develop the scheme I
have described several times into the hardware. There are vendors
doing this, will you choose to be different and ignore this?
OK, I've taken a look at your scheme and I have a few questions.
>From: "David S. Miller" <[email protected]>
>You also ignore the points others have made that the systems HAVE
>SCALED to evolving network technologies as they have become faster
>and faster.
>
This is not true in the embedded space. As I keep pointing out, typical
embedded processors don't have as many free cycles as server CPUs.
>
>
>My RX page accumulation scheme handles all of the receive-side
>problems of touching the data and getting it into the filesystem
>and then out to the device. With my scheme you can receive the
>data, go direct to the device, and the CPU never touches one byte.
>
RDDP tries to get around needing a large amount of RAM on the NIC to collect
all of this data before writing it to OS memory. Also, this store-and-forward
architecture you recommend adds latency while collecting all of this
data before moving it to the OS. Finally, I recall some resistance to page
flipping, which could also lead to walking page tables. More latency. After
some extremely large amount of time your receive data has made it to your
application. Do you have a suggestion on how we could get around all of
this store-and-forward without RDDP? Just avoiding the CPU copy is not the
only issue.
>
>I actually welcome Microsoft falling into this rathole of a
>technology. Let them have to support that crap and have to field bug
>reports on it, having to wonder who created the packets. And let them
>deal with the negative effects TOE has on connection rates and things
>like that.
>
Would it be a shame if they found a way around this "problem" you see and were
successful, while Linux failed because you felt the community was not able to
overcome these tough obstacles?
>
>Linux will be competitive, especially if people develop the scheme I
>have described several times into the hardware. There are vendors
>doing this, will you choose to be different and ignore this?
Your ideas are good, but they leave in place the store-and-forward issue I
mentioned. A good alternative would be one that kept things simple as you
suggested, but didn't introduce all of this latency.
On Tue, 2003-07-15 at 01:51, David S. Miller wrote:
> > Note that Microsoft is considering TOE under its Scalable Networking
> > Program. To keep linux competitive, I would encourage a healthy
> > discussion on this matter
>
> I actually welcome Microsoft falling into this rathole of a
> technology. Let them have to support that crap and have to field bug
> reports on it, having to wonder who created the packets. And let them
> deal with the negative effects TOE has on connection rates and things
> like that.
>
> Linux will be competitive, especially if people develop the scheme I
> have described several times into the hardware. There are vendors
> doing this, will you choose to be different and ignore this?
A friend of mine mentioned that the MS support may all be a big scam.
It makes it easy to kill TOE if they get involved ;->
Yes, there will be some MIS managers who will buy the M$ B$.
What about InfiniBand, which has all this built-in offloading? What
happened to VIA?
cheers,
jamal
jamal> What about InfiniBand, which has all this built-in
jamal> offloading?
We're seeing some pretty good numbers (well above 5 Gb/sec, basically
PCI-X 64bit/133MHz limited) with sockets direct (SDP) on top of
InfiniBand. This is running standard sockets applications, just using
the AIO patches for kernel 2.4. Latency is also much better than TCP
on top of ethernet, although this is mostly just due to the
underlying transport.
- Roland
On Sun, Jul 13, 2003 at 12:48:18AM -0700, David S. Miller wrote:
> On receive side, clever RX buffer flipping tricks are the way
> to go and require no protocol changes and nothing gross like
> TOE or weird buffer ownership protocols like RDMA requires.
>
> I've made postings showing how such a scheme can work using a limited
> flow cache on the networking card. I don't have a reference handy,
> but I suppose someone else does.
The following reference should be useful for those following along
at home and wondering what the hell this hardware flow cache scheme
is:
http://www.ussg.iu.edu/hypermail/linux/kernel/0306.2/0429.html
Regards,
--
Matt Porter
[email protected]
On Sun, Jul 13, 2003 at 05:42:42PM -0700, David S. Miller wrote:
> There are cards, both existing and in development, that have
> very simple header parsing engines you can program to do stuff
> like this; it isn't hard at all.
Do you have a reference to an existing card that implements
a header parsing engine like this (and has obtainable docs)?
Regards,
--
Matt Porter
[email protected]