From: Mina Almasry
Date: Sun, 10 Dec 2023 18:30:31 -0800
References: <20231208005250.2910004-1-almasrymina@google.com>
 <20231208005250.2910004-9-almasrymina@google.com>
Subject: Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory provider
To: Pavel Begunkov
Cc: Shailend Chand, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-kselftest@vger.kernel.org, bpf@vger.kernel.org,
 linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
 "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Jonathan Corbet, Jeroen de Borst, Praveen Kaligineedi,
 Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann, David Ahern,
 Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
 Yunsheng Lin, Harshitha Ramamurthy, Shakeel Butt, Willem de Bruijn,
 Kaiyuan Zhang

On Sat, Dec 9, 2023 at 7:05 PM Pavel Begunkov wrote:
>
> On 12/8/23 23:25, Mina Almasry wrote:
> > On Fri, Dec 8, 2023 at 2:56 PM Pavel Begunkov wrote:
> >>
> >> On 12/8/23 00:52, Mina Almasry wrote:
> > ...
> >>> +	if (pool->p.queue)
> >>> +		binding = READ_ONCE(pool->p.queue->binding);
> >>> +
> >>> +	if (binding) {
> >>> +		pool->mp_ops = &dmabuf_devmem_ops;
> >>> +		pool->mp_priv = binding;
> >>> +	}
> >>
> >> Hmm, I don't understand why we would replace a nice transparent
> >> api with the page pool relying on a queue having a devmem-specific
> >> pointer? It seemed more flexible and cleaner in the last RFC.
> >>
> >
> > Jakub requested this change and may chime in, but I suspect it's to
> > further abstract the devmem changes from the driver. In this iteration,
> > the driver grabs the netdev_rx_queue and passes it to the page_pool,
> > and any future configuration between the net stack and the page_pool
> > can be passed this way with the driver unbothered.
>
> Ok, that makes sense, but even if it's passed via an rx queue I'd
> at least hope it keeps abstract provider parameters, e.g. ops,
> rather than being hard coded with devmem-specific code.
>
> It might even be better done with a helper like
> create_page_pool_from_queue(), unless some deeper interaction
> b/w pp and rx queues is predicted.
>

Off hand I don't see the need for a new create_page_pool_from_queue().
page_pool_create() already takes a params argument that lets us pass in
the queue as well as any other parameters.
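Roughly like this from the driver side (sketch only, not lifted from the
actual driver patches in this series -- ring_size, pdev, priv, and rxq
are placeholders; .queue is the field this series adds to
page_pool_params, the rest are the existing fields):

	struct page_pool_params pp_params = {
		.order		= 0,
		.pool_size	= ring_size,
		.nid		= NUMA_NO_NODE,
		.dev		= &pdev->dev,
		.napi		= &priv->napi,
		.dma_dir	= DMA_FROM_DEVICE,
		.queue		= rxq,	/* struct netdev_rx_queue * for this ring */
	};

	pool = page_pool_create(&pp_params);
	if (IS_ERR(pool))
		return PTR_ERR(pool);

The driver only hands over the rx queue; whether a memory provider gets
attached is then decided inside the page pool init path from what is
hanging off that queue, as in the hunk quoted above.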
> >>> +
> >>>	if (pool->mp_ops) {
> >>>		err = pool->mp_ops->init(pool);
> >>>		if (err) {
> >>> @@ -1020,3 +1033,77 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>>	}
> >>>  }
> >>>  EXPORT_SYMBOL(page_pool_update_nid);
> >>> +
> >>> +void __page_pool_iov_free(struct page_pool_iov *ppiov)
> >>> +{
> >>> +	if (WARN_ON(ppiov->pp->mp_ops != &dmabuf_devmem_ops))
> >>> +		return;
> >>> +
> >>> +	netdev_free_dmabuf(ppiov);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(__page_pool_iov_free);
> >>
> >> I didn't look too deep, but I don't think I immediately follow
> >> the pp refcounting. It increments pages_state_hold_cnt on
> >> allocation, but IIUC doesn't mark skbs for recycle? Then they all
> >> will be put down via page_pool_iov_put_many(), bypassing
> >> page_pool_return_page() and friends. That will call
> >> netdev_free_dmabuf(), which doesn't bump pages_state_release_cnt.
> >>
> >> At least I couldn't make it work with io_uring, and for my purposes
> >> I forced all puts to go through page_pool_return_page(), which calls
> >> the ->release_page callback. The callback puts the reference and
> >> asks its page pool to account release_cnt. It also gets rid of
> >> __page_pool_iov_free(), as we'd otherwise need to add a hook there
> >> for customization.
> >>
> >> I didn't care about overhead, because the hot path for me is getting
> >> buffers from a ring, which is somewhat analogous to
> >> sock_devmem_dontneed(), but done on pp allocations under napi, and
> >> it's done separately.
> >>
> >> Completely untested with TCP devmem:
> >>
> >> https://github.com/isilence/linux/commit/14bd56605183dc80b540999e8058c79ac92ae2d8
> >>
> >
> > This was a mistake in the last RFC, which should be fixed in v1. In
> > the RFC I was not marking the skbs with skb_mark_for_recycle(), so the
> > unreffing path wasn't behaving as expected.
> >
> > In this iteration, that should be completely fixed. I suspect, since I
> > just posted this, that you're actually referring to the issue as tested
> > on the last RFC? Correct me if wrong.
>
> Right, it was with RFCv3
>
> > In this iteration, the reffing story is:
> >
> > - The memory provider allocs a ppiov and returns it to the page pool
> >   with ppiov->refcount == 1.
> > - The page_pool gives the page to the driver. The driver may
> >   obtain/release references with page_pool_page_[get|put]_many(), but
> >   the driver is likely not doing that unless it's doing its own page
> >   recycling.
> > - The net stack obtains references via skb_frag_ref() ->
> >   page_pool_page_get_many().
> > - The net stack drops references via skb_frag_unref() ->
> >   napi_pp_put_page() -> page_pool_return_page() and friends.
> >
> > Thus, the issue where the unref path was skipping
> > page_pool_return_page() and friends should be resolved in this
> > iteration. Let me know if you think otherwise, but I believe this was
> > an issue limited to the last RFC.
>
> Then page_pool_iov_put_many() should, and supposedly would, never be
> called by non-devmem code, because all puts must circle back into
> ->release_page. Why add it to page_pool_page_put_many()?
>
> @@ -731,6 +731,29 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> +	if (page_is_page_pool_iov(page)) {
> ...
> +		page_pool_page_put_many(page, 1);
> +		return NULL;
> +	}
>
> Well, I'm looking at this new branch from Patch 10. It can put
> the buffer, but what if we race and it's actually the final put?
> Looks like nobody is going to bump up pages_state_release_cnt.
>

Good catch, I think the release_cnt would indeed be incorrect in this
case. I think the race is benign in the sense that the ppiov will be
freed correctly and be available for allocation when the page_pool next
needs it; the issue is with the stats AFAICT.

> If you remove the branch, let it fall into ->release, and rely on
> refcounting there, then the callback could also fix up release_cnt or
> ask the pp to do it, like in the patch I linked above.
>

Sadly I don't think this is possible, for the reasons I mention in the
commit message of that patch: prematurely releasing ppiovs and not
having them be candidates for recycling shows me a 4-5x degradation in
performance.

What I could do here is detect that the refcount was dropped to 0 and
fix up the stats in that case.
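Something along these lines (very rough sketch of that idea only -- the
put-and-test helper and the iov accessor below are placeholders, not
names this series actually defines):

	/* in __page_pool_put_page() */
	if (page_is_page_pool_iov(page)) {
		struct page_pool_iov *ppiov = page_to_page_pool_iov(page);

		if (page_pool_iov_put_and_test(ppiov)) {
			/* Last reference: the buffer went back to the
			 * provider without passing through
			 * page_pool_return_page(), so account the release
			 * here to keep hold/release counters balanced.
			 */
			atomic_inc(&pool->pages_state_release_cnt);
		}
		return NULL;
	}

That would keep the fast recycle path while only paying for the
accounting when the reference count actually hits zero.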
--
Thanks,
Mina