From: NeilBrown
To: "Dilger, Andreas"
Date: Mon, 04 Jun 2018 13:54:55 +1000
Cc: Doug Oucharek, Andreas Dilger, "devel@driverdev.osuosl.org",
    Christoph Hellwig, Greg Kroah-Hartman, "Linux Kernel Mailing List",
    "Drokin, Oleg", "selinux@tycho.nsa.gov", fsdevel, lustre-devel
Subject: Re: [lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.
In-Reply-To: <58123CDD-8424-4E1D-A11F-0F899970A49B@intel.com>
References: <20180601091133.GA27521@kroah.com>
    <20180601114151.GA25225@infradead.org>
    <29ACF5A8-7608-46BB-8191-E3FEB77D0F24@cray.com>
    <87h8mmrt6b.fsf@notabene.neil.brown.name>
    <58123CDD-8424-4E1D-A11F-0F899970A49B@intel.com>
Message-ID: <87h8mjp5o0.fsf@notabene.neil.brown.name>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jun 03 2018, Dilger, Andreas wrote:

> On Jun 1, 2018, at 17:19, NeilBrown wrote:
>>
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>>
>>> Would it make sense to land LNet and LNDs on their own first?  Get
>>> the networking house in order first before layering on the file
>>> system?
>>
>> I'd like to turn that question on its head:
>>   Do we need LNet and LNDs?  What value do they provide?
>> (this is a genuine question, not being sarcastic).
>>
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality.  It sits over sockets (TCP) and IB and
>> provides a uniform interface.
>
> LNet is originally based on a high-performance networking stack called
> Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
> routing to allow cross-network bridging.
>
> A critical part of LNet is that it is for RDMA and not packet-based
> messages.  Everything in Lustre is structured around RDMA.  Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).

Thanks!  That will probably help me understand it more easily next time
I dive in.

>
> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.

That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed.  I wonder if it could benefit from more explicit separation of
message sizes.

Thanks a lot for this background info!
NeilBrown

>
> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth.  While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>
>> That is almost a description of the xprt layer in sunrpc.  sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
>
> No, that would totally kill the performance of Lustre.
>
>> How much would we need to enhance sunrpc/xprt for this to work?  What
>> hooks would be needed to implement the routing as a separate layer.
>>
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
>
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>
> https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07
>
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
>
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre.  The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
>
> [*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
> [+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf
>
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree.  Certainly it would be easier for a larger community to
>> be participating in the work.
>
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
>
> However, _just_ using the LNet transport for (p)NFS might make sense.  LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
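
For readers who want something concrete, the TCP arrangement Andreas describes above (separate sockets for sending and receiving large messages, plus a small-message socket with Nagle disabled) might be sketched roughly as follows. This is a user-space Python illustration only, not LNet's kernel code. The names make_channels and pick_channel and the 4096-byte cut-off are assumptions made up for the example; the thread gives no such threshold.

```python
import socket

# Hypothetical cut-off between "small" and "large" messages.
# The thread does not state a number; 4096 is only a placeholder.
SMALL_MESSAGE_LIMIT = 4096


def make_channels():
    """Create three unconnected TCP sockets, one per traffic class."""
    bulk_send = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    bulk_recv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    small = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY turns off Nagle's algorithm, so small writes are sent
    # immediately instead of being held back for aggregation.
    small.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return {"bulk_send": bulk_send, "bulk_recv": bulk_recv, "small": small}


def pick_channel(channels, payload_len, outgoing=True):
    """Choose a socket by message size and direction."""
    if payload_len <= SMALL_MESSAGE_LIMIT:
        return channels["small"]
    return channels["bulk_send"] if outgoing else channels["bulk_recv"]
```

A caller would connect all three sockets to the same peer and then route each message through pick_channel(channels, len(payload)), so that short control messages never wait behind a bulk transfer or a Nagle delay.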