From: Roland Dreier
To: Stephen Hemminger
Cc: Marc Aurele La France, Eric Dumazet, Ben Hutchings, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, "David S. Miller", Alexey Kuznetsov, "Pekka Savola \(ipv6\)", James Morris, Hideaki YOSHIFUJI, Patrick McHardy
Subject: Re: RFC: MTU for serving NFS on Infiniband
Date: Fri, 27 Aug 2010 09:20:15 -0700
In-Reply-To: <20100826165359.3b79b27d@nehalam> (Stephen Hemminger's message of "Thu, 26 Aug 2010 16:53:59 -0700")

 > Infiniband device driver needs to be fixed to do SG and checksum offload.
 > Otherwise it is insane to try and run large MTU over it. I even wonder if
 > the dev_change_mtu() function should reject > PAGESIZE mtu for devices
 > that don't do scatter/gather or at least a raise a warning.

It's not possible to "fix" the driver to do checksum offload, since the
underlying hardware does not support it for connected mode (more on that
below). Theoretically we could handle SG, but of course there's no point
in that without checksum offload.

I think there is some confusion about what IPoIB is in this thread, so
let me try to give some basic background to help the discussion.

There are two "modes" that an IPoIB interface can operate in: datagram
mode and connected mode.

In datagram mode, packets given to the IPoIB driver are sent as IB
unreliable datagram messages, which means each skb turns into one packet
on the wire -- very much like the ethernet case. In this mode, the MTU
is limited by the MTU on the IB side, which is typically either 2K or 4K
depending on the adapter and the switches involved. Modern IB adapters
do support checksum offload and large send offload for datagrams, so we
can and do enable SG and IP_CSUM.

In connected mode, the IPoIB driver actually makes a reliable connection
to each peer. For reliable connections, IB adapters can send messages
of up to 4GB, with the adapter handling all the segmentation, transport
level acks, etc. -- the host system simply queues one work request for
each message of any size. These work requests do support gather/scatter,
but no existing adapter supports checksum offload for messages on
reliable connections. However, since reliable connections support
arbitrary sized messages, in connected mode the IPoIB driver allows an
MTU up to roughly the maximum 64K IP message size. (I don't think
anyone has tried it with bigger IPv6 jumbograms ;)

It does seem that, even with all the horrible memory allocation problems
caused by requiring huge linear skbs, connected mode offers very good
performance for at least some real-world uses (although apparently NFS
is not one such use). In fact, as far as I know, connected mode with a
huge MTU continues to outperform datagram mode even with LSO and LRO
(although I don't have any particularly recent numbers).
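
To make the consequence of the suggested dev_change_mtu() rule concrete,
here is a toy Python model of the two modes as described above. It is
only a sketch, not kernel code: the names IpoibMode, datagram_mode,
connected_mode, fits_mode and proposed_check_rejects are made up for
illustration, PAGE_SIZE is assumed to be 4096, and 65520 is an assumed
stand-in for "roughly 64K".

```python
from dataclasses import dataclass

PAGE_SIZE = 4096     # assumption: 4 KiB pages
CM_MAX_MTU = 65520   # assumption: stands in for "roughly 64K"


@dataclass(frozen=True)
class IpoibMode:
    """Toy description of one IPoIB mode (hypothetical, not a kernel struct)."""
    name: str
    max_mtu: int   # largest MTU the mode allows
    sg: bool       # driver enables scatter/gather
    ip_csum: bool  # driver enables checksum offload


def datagram_mode(ib_mtu: int = 2048) -> IpoibMode:
    # MTU is capped by the IB-side MTU (typically 2K or 4K); modern
    # adapters offer checksum offload, so SG and IP_CSUM are enabled.
    return IpoibMode("datagram", ib_mtu, sg=True, ip_csum=True)


def connected_mode() -> IpoibMode:
    # No checksum offload on reliable connections, so SG is left off too.
    return IpoibMode("connected", CM_MAX_MTU, sg=False, ip_csum=False)


def fits_mode(mode: IpoibMode, mtu: int) -> bool:
    """True if the mode itself can carry this MTU."""
    return 0 < mtu <= mode.max_mtu


def proposed_check_rejects(mode: IpoibMode, mtu: int) -> bool:
    """The suggested rule: refuse MTU > PAGE_SIZE on devices without SG."""
    return mtu > PAGE_SIZE and not mode.sg


if __name__ == "__main__":
    for mode, mtu in [(datagram_mode(2048), 2048),
                      (datagram_mode(4096), 4096),
                      (connected_mode(), 4096),
                      (connected_mode(), CM_MAX_MTU)]:
        print(mode.name, mtu,
              "fits" if fits_mode(mode, mtu) else "too big",
              "REJECTED by proposed check"
              if proposed_check_rejects(mode, mtu) else "ok")
```

Under these assumptions the only case the rule refuses is connected mode
with its large MTU, which is exactly the configuration that performs
well for some real workloads.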

So I don't think we want to completely disallow such uses.

 - R.