Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S269954AbTHERWd (ORCPT ); Tue, 5 Aug 2003 13:22:33 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S270013AbTHERWa (ORCPT ); Tue, 5 Aug 2003 13:22:30 -0400 Received: from ebiederm.dsl.xmission.com ([166.70.28.69]:10560 "EHLO frodo.biederman.org") by vger.kernel.org with ESMTP id S269954AbTHERW2 (ORCPT ); Tue, 5 Aug 2003 13:22:28 -0400 To: Werner Almesberger Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> From: ebiederm@xmission.com (Eric W. Biederman) Date: 05 Aug 2003 11:19:09 -0600 In-Reply-To: <20030804162433.L5798@almesberger.net> Message-ID: User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3256 Lines: 68 Werner Almesberger writes: > Eric W. Biederman wrote: > > The optimized for low latency cases seem to have a strong > > market in clusters. > > Clusters have captive, no, _desperate_ customers ;-) And it > seems that people are just as happy putting MPI as their > transport on top of all those link-layer technologies. MPI is not a transport. It an interface like the Berkeley sockets layer. The semantics it wants right now are usually mapped to TCP/IP when used on an IP network. Though I suspect SCTP might be a better fit. But right now nothing in the IP stack is a particularly good fit. Right now there is a very strong feeling among most of the people using and developing on clusters that by and large what they are doing is not of interest to the general kernel community, and so has no chance of going in. So you see hack piled on top of hack piled on top of hack. Mostly I think the that is less true, at least if they can stand the process of severe code review and cleaning up their code. If we can put in code to scale the kernel to 64 processors. NIC drivers for fast interconnects and a few similar tweaks can't hurt either. But of course to get through the peer review process people need to understand what they are doing. > > There is one place in low latency communications that I can think > > of where TCP/IP is not the proper solution. For low latency > > communication the checksum is at the wrong end of the packet. > > That's one of the few things ATM's AAL5 got right. But in the end, > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, > you may well treat it as an atomic unit. So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the second switch chip + 1.3us to the top level switch chip + 1.3us to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver. 1.3us * 7 = 9.1us to deliver a packet to the other side. That is still quite painful. Right now I can get better latencies over any of the cluster interconnects. I think 5 us is the current low end, with the high end being about 1 us. Quite often in MPI when a message is sent the program cannot continue until the reply is received. Possibly this is a fundamental problem with the application programming model, encouraging applications to be latency sensitive. But it is a well established API and programming paradigm so it has to be lived with. All of this is pretty much the reverse of the TOE case. Things are latency sensitive because real work needs to be done. And the more latency you have the slower that work gets done. A lot of the NICs which are used for MPI tend to be smart for two reasons. 1) So they can do source routing. 2) So they can safely export some of their interface to user space, so in the fast path they can bypass the kernel. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/