References: <20230710223304.1174642-1-almasrymina@google.com> <12393cd2-4b09-4956-fff0-93ef3929ee37@kernel.org> <20230718111508.6f0b9a83@kernel.org> <35f3ec37-11fe-19c8-9d6f-ae5a789843cb@kernel.org> <20230718112940.2c126677@kernel.org> <20230718154503.0421b4cd@kernel.org>
In-Reply-To: <20230718154503.0421b4cd@kernel.org>
From: Mina Almasry
Date: Wed, 19 Jul 2023 08:10:58 -0700
Subject: Re: [RFC PATCH 00/10] Device Memory TCP
To: Jakub Kicinski
Cc: David Ahern, Jason Gunthorpe, Andy Lutomirski, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, Sumit Semwal, Christian König, "David S.
    Miller", Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan

On Tue, Jul 18, 2023 at 3:45 PM Jakub Kicinski wrote:
>
> On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
> > I do not see how 1 RSS context (or more specifically a h/w Rx queue)
> > can be used properly with memory from different processes (or
> > dma-buf references).

Right, my experience with dma-bufs from GPUs is that they're allocated
from userspace and owned by the process that allocated the backing GPU
memory and generated the dma-buf from it. I.e., we're limited to 1
dma-buf per RX queue. If we enable binding multiple dma-bufs to the
same RX queue, we have a problem: AFAIU the NIC can't decide which
dma-buf to put the packet into (it hasn't parsed the packet's
destination yet).

> > When the process dies, that memory needs to be flushed from
> > the H/W queues. Queues with interlaced submissions make that more
> > complicated.

When the process dies, do we really want to flush the memory from the
hardware queues? The drivers I looked at don't seem to have a function
to flush the rx queues alone; they usually do an entire driver reset
to achieve that. Not sure if that's just convenience or there is some
technical limitation there. Do we really want to trigger a driver
reset at the event a userspace process crashes?

> Agreed, one process, one control path socket.
> FWIW the rtnetlink use of netlink is very basic. genetlink already has
> some infra which allows associating state with a user socket and
> cleaning it up when the socket gets closed. This needs some
> improvements. A bit of a chicken and egg problem, I can't make the
> improvements until there are families making use of it, and nobody
> will make use of it until it's in tree... But the basics are already
> in place and I can help with building it out.

I had this approach in mind (which doesn't need netlink improvements)
for the next POC. It's mostly inspired by the comments from the cover
letter of Jakub's memory-provider RFC, if I understood it correctly.
I'm sure there's going to be some iteration, but roughly:

1. A netlink CAP_NET_ADMIN API which binds the dma-buf to any number
   of rx queues on 1 NIC. It will do the dma_buf_attach() and
   dma_buf_map_attachment() and leave some indicator in the struct
   net_device to tell the NIC that it's bound to a dma-buf. The actual
   binding doesn't take effect until the next driver reset. The API, I
   guess, can cause a driver reset (or just a refill of the rx queues,
   if you think that's feasible) as well to streamline things a bit.
   The API returns a file handle to the user representing that binding.

2. On the driver reset, the driver notices that its struct net_device
   is bound to a dma-buf, and sets up the dma-buf memory provider
   instead of the basic one which provides host memory.

3. The user can close the file handle received in #1 to unbind the
   dma-buf from the rx queues. Or if the user crashes, the kernel
   closes the handle for us. The unbind doesn't take effect until the
   next flushing of the rx queues, or the next driver reset (not sure
   the former is feasible).

4. The dma-buf memory provider keeps the dma-buf mapping alive until
   the next driver reset, where all the dma-buf slices are freed, and
   the dma-buf attachment mapping can be unmapped.
I'm thinking the user sets up RSS and flow steering outside this API
using existing ethtool APIs, but things can be streamlined a bit by
doing some of these RSS/flow steering steps in conjunction with the
dma-buf binding/unbinding. The complication with setting up flow
steering alongside dma-buf bind/unbind is that the application may
start more connections after the bind, and it will need to install
flow steering rules for those too, and use the ethtool API for that.
May as well use the ethtool APIs for all of it...?

From Jakub's and David's comments, it sounds like (if I understood
correctly) you'd like to tie the dma-buf bind/unbind functions to the
lifetime of a netlink socket, rather than to a struct file as I was
thinking. That does sound cleaner, but I'm not sure how. Can you link
me to any existing code examples? Or rough pointers to any existing
code?

> > I guess the devil is in the details; I look forward to the
> > evolution of the patches.
>
> +1

--
Thanks,
Mina