References: <20231106024413.2801438-1-almasrymina@google.com> <20231106024413.2801438-10-almasrymina@google.com> <19129763-6f74-4b04-8a5f-441255b76d34@kernel.org>
From: Stanislav Fomichev
Date: Mon, 6 Nov 2023 15:55:37 -0800
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
To: Mina Almasry
Cc: David Ahern, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
 linaro-mm-sig@lists.linaro.org, "David S.
 Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn,
 Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
 Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang

On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry wrote:
>
> On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern wrote:
> > > > > >
> > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > >> (accessible) in the same skb.
> > > > > > >>
> > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > >> are device memory frags or not.
> > > > > > >>
> > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > >>
> > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > >>
> > > > > > >> Signed-off-by: Willem de Bruijn
> > > > > > >> Signed-off-by: Kaiyuan Zhang
> > > > > > >> Signed-off-by: Mina Almasry
> > > > > > >>
> > > > > > >> ---
> > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > >>
> > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > >> --- a/include/linux/skbuff.h
> > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > >>   * @csum_level: indicates the number of consecutive checksums found in
> > > > > > >>   * the packet minus one that have been verified as
> > > > > > >>   * CHECKSUM_UNNECESSARY (max 3)
> > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > >> + * device memory.
> > > > > > >>   * @dst_pending_confirm: need to confirm neighbour
> > > > > > >>   * @decrypted: Decrypted SKB
> > > > > > >>   * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > >>          __u8 csum_not_inet:1;
> > > > > > >>  #endif
> > > > > > >> -
> > > > > > >> +        __u8 devmem:1;
> > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > >>          __u16 tc_index; /* traffic control index */
> > > > > > >>  #endif
> > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > >>          __skb_zcopy_downgrade_managed(skb);
> > > > > > >>  }
> > > > > > >>
> > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > >> +{
> > > > > > >> +        return skb->devmem;
> > > > > > >
> > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > future reason).
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > all of the frags are in the same memory type.
> > > >
> > > > David: maybe there should be such a requirement (that they all are
> > > > unreadable)? Might be easier to support initially; we can relax later
> > > > on.
> > > >
> > >
> > > Currently devmem == not_readable, and the restriction is that all the
> > > frags in the same skb must be either all readable or all unreadable
> > > (all devmem or all non-devmem).
> > >
> > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > >
> > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > + * device memory.
> > > > >
> > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > userspace.
> > > > >
> > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > I imagine the stack would implement:
> > > > >
> > > > > 1. new header flag: skb->newmem
> > > > > 2.
> > > > >
> > > > > static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > {
> > > > >         return skb->devmem || skb->newmem;
> > > > > }
> > > > >
> > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
> > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > >
> > > > You copy it to the userspace in a special way because your frags
> > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > just an optimization.
> > > >
> > > > For most of the core stack, it doesn't matter why your skb is not
> > > > readable. For a few places where it matters (recvmsg?), you can
> > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > >
> > >
> > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > something else.
> > > We can even assume not_readable == devmem because
> > > devmem is currently the only type of unreadable frag.
> > >
> > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > netlink or something).
> > >
> > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > any packets that land on that rx-queue are bound to that dma-buf,
> > > regardless of which socket that packet belongs to. So the association
> > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> >
> > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > Because otherwise, there is no way currently to tell, at recvmsg, which
> > dmabuf the received token belongs to.
> >
>
> Yes, but this 1 dma-buf to 1 socket association happens because the
> user binds the dma-buf to an rx-queue and configures flow steering of
> the socket to that rx-queue.

It's still fixed and won't change during the socket lifetime, right?
And the socket has to know this association; otherwise those tokens
are useless since they don't carry anything to identify the dmabuf.

I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
it somehow implies that I have an option of passing or not passing it
for an individual system call.
If we know that we're going to use dmabuf with the socket, maybe we
should move this flag to the socket() syscall?

fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);

?

> > So why not have a separate control channel action to say: this socket fd
> > is supposed to receive into this dmabuf fd? This action would put
> > the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
> > put some checks at the lower level to enforce this dmabuf
> > association. (to avoid any potential issues with flow steering)
> >
>
> setsockopt(SO_DEVMEM_ASSERT_DMA_BUF, dmabuf_fd)? Sounds interesting,
> but maybe a bit of a weird API to me.
> Because the API can't enforce
> the socket to receive packets on a dma-buf (rx-queue binding + flow
> steering does that), but the API can assert that incoming packets are
> received on said dma-buf. I guess it would check packets before they
> are acked and would drop packets that landed on the wrong queue.
>
> I'm a bit unsure about defensively programming features (and uapi no
> less) to 'avoid any potential issues with flow steering'. Flow
> steering is supposed to work.
>
> Also if we wanted to defensively program something to avoid flow
> steering issues, then I'd suggest adding to cmsg_devmem the dma-buf fd
> that the data is on, not this setsockopt() that asserts. IMO it's a
> weird API for the userspace to ask the kernel to assert some condition
> (at least I haven't seen it before or commonly).
>
> But again, in general, I'm a bit unsure about defensively designing
> uapi around a feature like flow steering that's supposed to work.
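
For illustration only, here is a minimal user-space sketch of the scheme
discussed in this thread. It assumes a single 'not_readable' hint on the skb,
the all-readable-or-all-unreadable restriction on frags, a per-frag check on
the receive path, and a token that carries an identifier for the dma-buf.
Every toy_* name, and the dmabuf_id field, is hypothetical. None of this is
kernel code or existing uapi; the is_iov field merely stands in for what
page_is_page_pool_iov() reports in the patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FRAGS 4

/* Hypothetical stand-in for one skb frag. */
struct toy_frag {
	bool is_iov;    /* true: device memory, the host must not read it */
	int dmabuf_id;  /* hypothetical identifier of the backing dma-buf */
};

/* Hypothetical stand-in for an skb: one hint bit plus its frags. */
struct toy_skb {
	bool not_readable;  /* hint: do not dereference the frags */
	size_t nr_frags;
	struct toy_frag frags[TOY_MAX_FRAGS];
};

/* Hypothetical token handed to userspace; carrying dmabuf_id is the
 * variant suggested in the thread, not an existing field. */
struct toy_cmsg {
	int dmabuf_id;
};

enum toy_path { TOY_PATH_COPY, TOY_PATH_DEVMEM, TOY_PATH_REJECT };

/* Add a frag and keep the hint bit in sync. Mixing readable and
 * unreadable frags in one skb is refused. */
static bool toy_skb_add_frag(struct toy_skb *skb, struct toy_frag frag)
{
	if (skb->nr_frags == TOY_MAX_FRAGS)
		return false;
	if (skb->nr_frags > 0 && skb->not_readable != frag.is_iov)
		return false;
	skb->not_readable = frag.is_iov;
	skb->frags[skb->nr_frags++] = frag;
	return true;
}

/* What most of the stack would consult: only the hint bit. */
static bool toy_skb_frags_not_readable(const struct toy_skb *skb)
{
	return skb->not_readable;
}

/* Receive path: the hint bit picks the slow path, then the frag itself
 * is inspected to decide how the data is handed to userspace. */
static enum toy_path toy_recv(const struct toy_skb *skb, struct toy_cmsg *out)
{
	if (!toy_skb_frags_not_readable(skb))
		return TOY_PATH_COPY;
	if (skb->nr_frags == 0 || !skb->frags[0].is_iov)
		return TOY_PATH_REJECT;  /* unreadable for an unknown reason */
	out->dmabuf_id = skb->frags[0].dmabuf_id;
	return TOY_PATH_DEVMEM;
}
```

In this model the core-stack helper never needs to know why the frags are
unreadable; only toy_recv() looks at the frag type. Because the token carries
dmabuf_id, the receiver can identify the buffer without a separate
socket-to-dmabuf assertion.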