Date: Sun, 16 Jul 2023 19:41:28 -0700
From: Andy Lutomirski
Subject: Re: [RFC PATCH 00/10] Device Memory TCP
To: Mina Almasry, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
    dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
    netdev@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-kselftest@vger.kernel.org
Cc: Sumit Semwal, Christian König, "David S. Miller", Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
    Arnd Bergmann, David Ahern, Willem de Bruijn, Shuah Khan, jgg@ziepe.ca
Message-ID: <12393cd2-4b09-4956-fff0-93ef3929ee37@kernel.org>
In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com>

On 7/10/23 15:32, Mina Almasry wrote:
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.

(I'm writing this as someone who might plausibly use this mechanism, but
I don't think I'm very likely to end up working on the kernel side,
unless I somehow feel extremely inspired to implement it for i40e.)

I looked at these patches and the GVE tree, and I'm trying to wrap my
head around the data path. As I understand it, for RX:

1. The GVE driver notices that the queue is programmed to use devmem,
   and it programs the NIC to copy packet payloads to the devmem that
   has been programmed.

2. The NIC receives the packet and copies the header to kernel memory
   and the payload to dma-buf memory.

3. The kernel tells userspace where in the dma-buf the data is.

4. Userspace does something with the data.

5. Userspace does DONTNEED to recycle the memory and make it available
   for new received packets.

Did I get this right?
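In userspace terms, I picture the RX loop looking roughly like the
sketch below. Every devmem-specific name in it (the cmsg payload
layout, the SO_DEVMEM_DONTNEED option and its value) is my paraphrase
of the cover letter, not necessarily what the patches actually define:

    #include <stdint.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Placeholder; the real option name/value come from the series. */
    #define SO_DEVMEM_DONTNEED 0

    /* Guessed layout -- the real one comes from the uapi in the series. */
    struct devmem_frag {
            uint64_t frag_offset;  /* where in the bound dma-buf the payload is */
            uint32_t frag_size;
            uint32_t frag_token;   /* handed back to recycle the frag */
    };

    /* Application-defined: consume payload bytes inside the dma-buf. */
    extern void consume_frag(uint64_t offset, uint32_t size);

    static void rx_loop(int sock)
    {
            char junk[1], ctrl[CMSG_SPACE(sizeof(struct devmem_frag))];
            struct iovec iov = { .iov_base = junk, .iov_len = sizeof(junk) };
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };

            while (recvmsg(sock, &msg, 0) >= 0) {
                    struct cmsghdr *cm;

                    for (cm = CMSG_FIRSTHDR(&msg); cm;
                         cm = CMSG_NXTHDR(&msg, cm)) {
                            struct devmem_frag *f = (void *)CMSG_DATA(cm);

                            /* Steps 3-4: the payload never touched host
                             * memory; we only learn where it landed. */
                            consume_frag(f->frag_offset, f->frag_size);

                            /* Step 5: DONTNEED the frag so the NIC can
                             * reuse that part of the dma-buf. */
                            setsockopt(sock, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                                       &f->frag_token, sizeof(f->frag_token));
                    }
                    msg.msg_controllen = sizeof(ctrl);
            }
    }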
This seems a bit awkward if there's any chance that packets not intended
for the target device end up in the rxq.

I'm wondering if a more capable, if somewhat higher-latency, model could
work where the NIC stores received packets in its own device memory.
Then userspace (or the kernel or a driver or whatever) could initiate a
separate DMA from the NIC to the final target *after* reading the
headers. Can the hardware support this?

Another way of putting this is: steering received data to a specific
device based on the *receive queue* forces the logic selecting a
destination device to be the same as the logic selecting the queue. RX
steering logic is pretty limited on most hardware (as far as I know --
certainly I've never had much luck doing anything especially intelligent
with RX flow steering, and I've tried on a couple of different brands of
supposedly fancy NICs). But Linux has very nice capabilities to direct
packets, in software, to where they are supposed to go, and it would be
nice if all that logic could just work, scalably, with device memory.
If Linux could examine headers *before* the payload gets DMAed to
wherever it goes, I think this could plausibly work quite nicely. One
could even have an easy-to-use interface in which one directs a *socket*
to a PCIe device. I expect, although I've never looked at the
datasheets, that the kernel could even efficiently make RX decisions
based on data in device memory on upcoming CXL NICs, where device memory
could participate in the host cache hierarchy.

My real ulterior motive is that I think it would be great to use an
ability like this for DPDK-like uses. Wouldn't it be nifty if I could
open a normal TCP socket, then, after it's open, ask the kernel to
kindly DMA the results directly to my application memory (via udmabuf,
perhaps)? Or have a whole VLAN or macvlan get directed to a userspace
queue, etc?
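(For concreteness: udmabuf is an existing interface that wraps ordinary
application memory in a dma-buf, so the memory-side plumbing for this
mostly exists already. A minimal sketch, minus error handling -- whether
a devmem-style binding would accept the resulting dma-buf is exactly my
question:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/udmabuf.h>

    /* Returns a dma-buf fd backed by plain anonymous memory, which is
     * also mapped into the process at *mem. size must be page-aligned. */
    static int memfd_as_dmabuf(size_t size, void **mem)
    {
            int memfd = memfd_create("rxbuf", MFD_ALLOW_SEALING);
            int dev, buf;

            ftruncate(memfd, size);
            /* udmabuf refuses memfds that can still shrink. */
            fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
            *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        memfd, 0);

            struct udmabuf_create create = {
                    .memfd = memfd, .offset = 0, .size = size,
            };
            dev = open("/dev/udmabuf", O_RDWR);
            buf = ioctl(dev, UDMABUF_CREATE, &create); /* new dma-buf fd */
            close(dev);
            return buf;
    }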
It also seems a bit odd to me that the binding from rxq to dma-buf is
established by programming the dma-buf. This makes the security model
(and the mental model) awkward -- conceptually, this binding is a
setting on the *queue*, not the dma-buf, and in a containerized or
privilege-separated system, a process could have enough privilege to
make a dma-buf somewhere but not have any privileges on the NIC. (And
may not even have the NIC present in its network namespace!) See [1]
below for a caricature of what I mean.

--Andy
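[1] Both declarations here are invented purely to show where the
authority would live; neither is the actual uapi proposed in the series:

    /* Shape A (the RFC, as I read it): the binding call hangs off the
     * dma-buf fd, so holding a dma-buf is what grants the ability to
     * attach it to a NIC queue. */
    int dmabuf_bind_rx_queue(int dmabuf_fd, int ifindex, int queue_id);

    /* Shape B (what I'd find less surprising): the binding call hangs
     * off the netdev/queue, which is naturally gated by network
     * namespace membership and netdev privileges. */
    int netdev_queue_bind_dmabuf(int ifindex, int queue_id, int dmabuf_fd);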