From: "Dilger, Andreas"
To: NeilBrown
CC: Doug Oucharek, Andreas Dilger, devel@driverdev.osuosl.org, Christoph Hellwig, Greg Kroah-Hartman, Linux Kernel Mailing List, "Drokin, Oleg", selinux@tycho.nsa.gov, fsdevel, lustre-devel
Subject: Re:
[lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.
Date: Sun, 3 Jun 2018 20:34:52 +0000
Message-ID: <58123CDD-8424-4E1D-A11F-0F899970A49B@intel.com>
References: <20180601091133.GA27521@kroah.com> <20180601114151.GA25225@infradead.org> <29ACF5A8-7608-46BB-8191-E3FEB77D0F24@cray.com> <87h8mmrt6b.fsf@notabene.neil.brown.name>
In-Reply-To: <87h8mmrt6b.fsf@notabene.neil.brown.name>
List-ID: linux-kernel@vger.kernel.org

On Jun 1, 2018, at 17:19, NeilBrown wrote:
>
> On Fri, Jun 01 2018, Doug Oucharek wrote:
>
>> Would it make sense to land LNet and LNDs on their own first? Get
>> the networking house in order first before layering on the file
>> system?
>
> I'd like to turn that question on its head:
> Do we need LNet and LNDs? What value do they provide?
> (this is a genuine question, not being sarcastic).
>
> It is a while since I tried to understand LNet, and then it was a
> fairly superficial look, but I think it is an abstraction layer
> that provides packet-based send/receive with some numa-awareness
> and routing functionality. It sits over sockets (TCP) and IB and
> provides a uniform interface.

LNet is originally based on a high-performance networking stack called
Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
routing to allow cross-network bridging. A critical part of LNet is that
it is built for RDMA, not packet-based messages; everything in Lustre is
structured around RDMA.
Of course, RDMA is not possible with TCP, so over sockets LNet just does
send/receive under the covers, though it can do zero-copy data sends (and
at one time zero-copy receives, but those changes were rejected by the
kernel maintainers). It definitely does RDMA with IB, RoCE, and OPA in
the kernel, and with RDMA network types not in the kernel (e.g. Cray
Gemini/Aries, Atos/Bull BXI, and previously older network types no longer
supported). Even with TCP it has some performance improvements, such as
using separate sockets for sending and receiving large messages, as well
as a socket for small messages that has Nagle disabled so those packets
are not delayed for aggregation.

In addition to the RDMA support, the out-of-tree version has multi-rail
support, which we haven't been allowed to land, and which can aggregate
network bandwidth. While channel bonding exists for TCP connections,
nothing equivalent exists for IB or other RDMA networks.

> That is almost a description of the xprt layer in sunrpc. sunrpc
> doesn't have routing, but it does have some numa awareness (for the
> server side at least) and it definitely provides packet-based
> send/receive over various transports - tcp, udp, local (unix domain),
> and IB.
> So: can we use sunrpc/xprt in place of LNet?

No, that would totally kill the performance of Lustre.

> How much would we need to enhance sunrpc/xprt for this to work? What
> hooks would be needed to implement the routing as a separate layer.
>
> If LNet is, in some way, much better than sunrpc, then can we share that
> superior functionality with our NFS friends by adding it to sunrpc?

There was some discussion at NetApp about adding a Lustre/LNet transport
for pNFS, but I don't think it ever got beyond the proposal stage:
https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07

> Maybe the answer to this is "no", but I think LNet would be hard to sell
> without a clear statement of why that was the answer.
There are other users outside of the kernel tree that use LNet in
addition to just Lustre. The Cray "DVS" I/O forwarding service[*] uses
LNet, and another experimental filesystem named Zest[+] also used LNet.

[*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
[+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf

> One reason that I would like to see lustre stay in drivers/staging (so I
> do not support Greg's patch) is that this sort of transition of Lustre
> to using an improved sunrpc/xprt would be much easier if both were in
> the same tree. Certainly it would be easier for a larger community to
> be participating in the work.

I don't think the proposal to encapsulate all of the Lustre protocol into
pNFS made a lot of sense, since this would have only really been
available on Linux, at which point it would be better to use the native
Lustre client rather than funnel everything through pNFS. However,
_just_ using the LNet transport for (p)NFS might make sense. LNet is
largely independent from Lustre (it used to be a separate source tree)
and is very efficient over the network.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation