References: <20231208005250.2910004-1-almasrymina@google.com> <20231208005250.2910004-7-almasrymina@google.com> <5752508c-f7bc-44ac-8778-c807b2ee5831@kernel.org> <279a2999-3c0a-4839-aa2e-602864197410@kernel.org>
In-Reply-To: <279a2999-3c0a-4839-aa2e-602864197410@kernel.org>
From: Mina Almasry
Date: Sun, 10 Dec 2023 18:19:29 -0800
Subject: Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice
To: David Ahern
Cc: Shailend Chand, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jonathan Corbet, Jeroen de Borst, Praveen Kaligineedi, Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König, Yunsheng Lin, Harshitha Ramamurthy, Shakeel Butt, Willem de Bruijn, Kaiyuan Zhang

On Sat, Dec 9, 2023 at 3:29 PM David Ahern wrote:
>
> On 12/8/23 12:22 PM, Mina Almasry wrote:
> > On Fri, Dec 8, 2023 at 9:48 AM David Ahern wrote:
> >>
> >> On 12/7/23 5:52 PM, Mina Almasry wrote:
> > ...
> >>> +
> >>> +        xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> >>> +                if (rxq->binding == binding) {
> >>> +                        /* We hold the rtnl_lock while binding/unbinding
> >>> +                         * dma-buf, so we can't race with another thread that
> >>> +                         * is also modifying this value. However, the driver
> >>> +                         * may read this config while it's creating its
> >>> +                         * rx-queues. WRITE_ONCE() here to match the
> >>> +                         * READ_ONCE() in the driver.
> >>> +                         */
> >>> +                        WRITE_ONCE(rxq->binding, NULL);
> >>> +
> >>> +                        rxq_idx = get_netdev_rx_queue_index(rxq);
> >>> +
> >>> +                        netdev_restart_rx_queue(binding->dev, rxq_idx);
> >>
> >> Blindly restarting a queue when binding a dmabuf is heavy handed. If the dmabuf
> >> has no outstanding references (ie., no references in the RxQ), then no
> >> restart is needed.
> >>
> >
> > I think I need to stop the queue while binding to a dmabuf for the
> > sake of concurrency, no? I.e. the softirq thread may be delivering a
> > packet, and in parallel a separate thread holds rtnl_lock and tries to
> > bind the dma-buf. At that point the page_pool recreation will race
> > with the driver doing page_pool_alloc_page(). I don't think I can
> > insert a lock to handle this in the rx fast path, no?
>
> I think it depends on the details of how entries are added and removed
> from the pool. I am behind on the pp details at this point, so I do need
> to do some homework.
>

I think it also depends on the details of how to invalidate buffers
posted to the rx queue of a particular driver. For GVE, as far as I
understand, when the queue is started it allocates a bunch of buffers
and posts them to the rx queue. Then it processes the completion
descriptors from the hardware and posts new buffers to replace the ones
consumed, so any started queue would have posted buffers in it. As far
as I know we also don't support invalidating posted buffers without
first stopping the queue, replacing the buffers, and starting again. But
I don't think these are limitations overly specific to GVE; I believe
non-RDMA NICs work similarly?

But I'd stress that what I'm proposing here should be extensible to the
capabilities of specific drivers. If one has a driver that allows
invalidating posted buffers on the fly, I imagine they can extend the
queue API to declare that support to the netstack in a generic way, and
the net stack can invalidate buffers from the previous page pool and
supply the new one.

> > Also, this sounds like it requires (lots of) more changes.
> > The page_pool + driver need to report how many pending references
> > there are (with locking so we don't race with incoming packets), and
> > have them reported via an ndo so that we can skip restarting the
> > queue. Implementing the changes isn't a huge issue, but handling the
> > concurrency may be a genuine blocker. Not sure it's worth the upside
> > of not restarting the single rx queue?
>
> It has to do with the usability of this overall solution. As I mentioned
> most ML use cases can (and will want to) use many memory allocations for
> receiving packets - e.g., allocations per message and receiving multiple
> messages per socket connection.
>

We support that by flow steering different flows to different RX queues.
Our NICs don't support smart choosing of which page_pool to place the
packet in (based on an ntuple rule or what not). So flows that must land
on a given dmabuf are flow steered to that dmabuf, and flows that need
to land in host memory are not flow steered and are RSS'd to the
non-dmabuf bound queues. This should also be extensible by folks that
have NICs with the appropriate support.

> >
> >>> +                }
> >>> +        }
> >>> +
> >>> +        xa_erase(&netdev_dmabuf_bindings, binding->id);
> >>> +
> >>> +        netdev_dmabuf_binding_put(binding);
> >>> +}
> >>> +
> >>> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >>> +                                struct netdev_dmabuf_binding *binding)
> >>> +{
> >>> +        struct netdev_rx_queue *rxq;
> >>> +        u32 xa_idx;
> >>> +        int err;
> >>> +
> >>> +        rxq = __netif_get_rx_queue(dev, rxq_idx);
> >>> +
> >>> +        if (rxq->binding)
> >>> +                return -EEXIST;
> >>> +
> >>> +        err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> >>> +                       GFP_KERNEL);
> >>> +        if (err)
> >>> +                return err;
> >>> +
> >>> +        /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> >>> +         * race with another thread that is also modifying this value. However,
> >>> +         * the driver may read this config while it's creating its rx-queues.
> >>> +         * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> >>> +         */
> >>> +        WRITE_ONCE(rxq->binding, binding);
> >>> +
> >>> +        err = netdev_restart_rx_queue(dev, rxq_idx);
> >>
> >> Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
> >> binding to add entries to the page pool for the queue.
> >
> > To be honest, I think maybe there's a slight disconnect between how
> > you think the page_pool works, and my primitive understanding of how
> > it works. Today, I see a 1:1 mapping between rx-queue and page_pool in
> > the code. I don't see 1:many or many:1 mappings.
>
> I am not referring to 1:N or N:1 for page pool and queues. I am
> referring to entries within a single page pool for a single Rx queue.
>

Thanks, glad to hear that. I was afraid there was a miscommunication here.

> > In theory mapping 1 rx-queue to n page_pools is trivial: the driver
> > can call page_pool_create() multiple times to generate n page_pools
> > and decide for incoming packets which one to use.
> >
> > However, mapping n rx-queues to 1 page_pool seems like a can of worms.
> > I see code in the page_pool that looks to me (and Willem) like it's
> > safe only because the page_pool is used from the same napi context.
> > With an n rx-queues : 1 page_pool mapping, that is no longer true, no?
> > There is a long tail of issues to resolve to be able to map 1 page_pool
> > to n queues as I understand, and even if resolved I'm not sure the
> > maintainers are interested in taking the code.
> >
> > So, per my humble understanding there is no such thing as "add entries
> > to the page pool for the (specific) queue"; the page_pool is always
> > used by 1 queue.
> >
> > Note that even though this limitation exists, we still support binding
> > 1 dma-buf to multiple queues, because multiple page pools can use the
> > same netdev_dmabuf_binding. I should add that to the docs.
> >
> >> If the pool was previously empty, then maybe the queue needs to be
> >> "started" in the sense of creating with h/w or just pushing buffers
> >> into the queue and moving the pidx.
> >>
> >>
> >
> > I don't think it's enough to add buffers to the page_pool, no? The
> > existing buffers in the page_pool (host mem) must be purged. I think
> > maybe the queue needs to be stopped as well so that we don't race with
> > incoming packets and end up with skbs with devmem and non-devmem frags
> > (unless you're thinking it becomes a requirement to support that, I
> > think things are complicated as-is and it's a good simplification).
> > When we already purge the existing buffers & restart the queue, it's
> > little effort to migrate this to become in line with Jakub's queue-api
> > that he also wants to use for per-queue configuration & ndo_stop/open.

-- 
Thanks,
Mina