X-Mailing-List: linux-kernel@vger.kernel.org
References: <20231218024024.3516870-1-almasrymina@google.com> <20231218024024.3516870-8-almasrymina@google.com> <3374356e-5f4b-4a6f-bb19-8cb7c56103bc@gmail.com>
In-Reply-To: <3374356e-5f4b-4a6f-bb19-8cb7c56103bc@gmail.com>
From: Mina Almasry
Date: Tue, 13 Feb 2024 13:11:28 -0800
Subject: Re: [RFC PATCH net-next v5 07/14] page_pool: devmem support
To: Pavel Begunkov
Cc: Mathieu Desnoyers, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller, Jesper Dangaard Brouer, Ilias Apalodimas, Steven Rostedt, Masami Hiramatsu, Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König, David Wei, Jason Gunthorpe, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy, Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi

On Tue, Feb 13, 2024 at 5:28 AM Pavel Begunkov wrote:
>
> On 12/18/23 02:40, Mina Almasry wrote:
> > Convert netmem to be a union of struct page and struct netmem. Overload
> > the LSB of struct netmem* to indicate that it's a net_iov, otherwise
> > it's a page.
> >
> > Currently these entries in struct page are rented by the page_pool and
> > used exclusively by the net stack:
> >
> > struct {
> >         unsigned long pp_magic;
> >         struct page_pool *pp;
> >         unsigned long _pp_mapping_pad;
> >         unsigned long dma_addr;
> >         atomic_long_t pp_ref_count;
> > };
> >
> > Mirror these (and only these) entries into struct net_iov and implement
> > netmem helpers that can access these common fields regardless of
> > whether the underlying type is page or net_iov.
> >
> > Implement checks for net_iov in netmem helpers which delegate to mm
> > APIs, to ensure net_iov are never passed to the mm stack.
> >
> > Signed-off-by: Mina Almasry
> >
> > ---
> >
> > RFCv5:
> > - Use netmem instead of page* with LSB set.
> > - Use pp_ref_count for refcounting net_iov.
> > - Removed many of the custom checks for netmem.
> >
> > v1:
> > - Disable fragmentation support for iov properly.
> > - fix napi_pp_put_page() path (Yunsheng).
> > - Use pp_frag_count for devmem refcounting.
> >
> > ---
> >  include/net/netmem.h            | 145 ++++++++++++++++++++++++++++++--
> >  include/net/page_pool/helpers.h |  25 +++---
> >  net/core/page_pool.c            |  26 +++---
> >  net/core/skbuff.c               |   9 +-
> >  4 files changed, 164 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 31f338f19da0..7557aecc0f78 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -12,11 +12,47 @@
> >
> >  /* net_iov */
> >
> > +DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
> > +
> > +/* We overload the LSB of the struct page pointer to indicate whether it's
> > + * a page or net_iov.
> > + */
> > +#define NET_IOV 0x01UL
> > +
> >  struct net_iov {
> > +       unsigned long __unused_padding;
> > +       unsigned long pp_magic;
> > +       struct page_pool *pp;
> >         struct dmabuf_genpool_chunk_owner *owner;
> >         unsigned long dma_addr;
> > +       atomic_long_t pp_ref_count;
> >  };
>
> I wonder if it would be better to extract a common sub-struct
> used in struct page, struct_group_tagged can help to avoid
> touching old code:
>
> struct page {
>         unsigned long flags;
>         union {
>                 ...
>                 struct_group_tagged(, ...,
>                         /**
>                          * @pp_magic: magic value to avoid recycling non
>                          * page_pool allocated pages.
>                          */
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
>                         unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr;
>                         atomic_long_t pp_ref_count;
>                 );
>         };
> }
>
> struct net_iov {
>         unsigned long pad;
>         struct p;
> };
>
> A bit of a churn with the padding and nesting net_iov but looks
> sturdier. No duplication, and you can just check positions of the
> structure instead of per-field NET_IOV_ASSERT_OFFSET, which you
> have to not forget to update e.g. when adding a new field. Also,

Yes, this is nicer. If possible I'll punt it to a minor cleanup as a
follow-up change. Logistically, I think it's better if this series
need not touch code outside of net/.

> with the change __netmem_clear_lsb can return a pointer to that
> structure, casting struct net_iov when it's a page is a bit iffy.
>
> And the next question would be whether it'd be a good idea to encode
> iov vs page not by setting a bit but via one of the fields in the
> structure, maybe pp_magic.
>

I will push back against this, for 2 reasons:

1. I think pp_magic's first 2 bits (and maybe more) are used by mm
code, and thus extending the usage of pp_magic in this series is a bit
iffy and I would like to avoid it. I just don't want to touch the
semantics of struct page if I don't have to.

2. I think this will be a measurable perf regression.
Currently we can tell if a pointer is a page or net_iov without
dereferencing the pointer and dirtying the cache line. Encoding the
type in a field like pp_magic would cause us to dereference the
pointer in places where we don't need to. I think an earlier version
of this code required a dereference to tell whether a page was devmem,
and Eric pointed out to me that it was a perf regression.

I also don't see any upside of using pp_magic, other than maybe making
the code slightly more readable.

> With that said I'm a bit concerned about the net_iov size. If each
> represents 4096 bytes and you're registering 10MB, then you need
> 30 pages worth of memory just for the iov array. Makes kvmalloc
> a must even for relatively small sizes.
>

This, I think, is an age-old challenge with pages: roughly 1.6% of
every machine's memory is 'wasted' because a struct page must be
allocated for each PAGE_SIZE region. We're running into the same issue
here: if we want to refer to PAGE_SIZE regions of memory, we need to
allocate some reference to each of them.

Note that net_iov can be extended relatively easily to support order-N
pages. Also note that in the devmem TCP use case this is not really an
issue; the minor increase in memory utilization is more than offset by
the saving in memory bandwidth compared to using host memory as a
bounce buffer.

All in all, I vote that this is something that can be tuned or
improved in the future if someone finds the extra memory usage a
hurdle to using devmem TCP or this net_iov infrastructure.

> And the final bit, I don't believe the overlay is necessary in
> this series. Optimisations are great, but this one is a bit more on
> the controversial side. Unless I missed something and it does make
> things easier, it might make sense to do it separately later.
>

I completely agree: the overlay is not necessary. I implemented the
overlay in response to Yunsheng's strong requests for more 'unified'
processing between page and devmem.
This is the most unification I can do, IMO, without violating the
requirements from Jason. I'm prepared to remove the overlay if it
turns out to be controversial, but so far I haven't seen any
complaints. Jason, please do take a look if you have not already.

> >
> > +/* These fields in struct page are used by the page_pool and net stack:
> > + *
> > + * struct {
> > + *     unsigned long pp_magic;
> > + *     struct page_pool *pp;
> > + *     unsigned long _pp_mapping_pad;
> > + *     unsigned long dma_addr;
> > + *     atomic_long_t pp_ref_count;
> > + * };
> > + *
> > + * We mirror the page_pool fields here so the page_pool can access these fields
> > + * without worrying whether the underlying fields belong to a page or net_iov.
> > + *
> > + * The non-net stack fields of struct page are private to the mm stack and must
> > + * never be mirrored to net_iov.
> > + */
> > +#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
> > +       static_assert(offsetof(struct page, pg) == \
> > +                     offsetof(struct net_iov, iov))
> > +NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
> > +NET_IOV_ASSERT_OFFSET(pp, pp);
> > +NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
> > +NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
> > +#undef NET_IOV_ASSERT_OFFSET
> > +
> >  static inline struct dmabuf_genpool_chunk_owner *
> >  net_iov_owner(const struct net_iov *niov)
> >  {
> > @@ -47,19 +83,25 @@ net_iov_binding(const struct net_iov *niov)
> >  struct netmem {
> >         union {
> >                 struct page page;
> > -
> > -               /* Stub to prevent compiler implicitly converting from page*
> > -                * to netmem_t* and vice versa.
> > -                *
> > -                * Other memory type(s) net stack would like to support
> > -                * can be added to this union.
> > -                */
> > -               void *addr;
> > +               struct net_iov niov;
> >         };
> >  };
> >
> ...
>
> --
> Pavel Begunkov

--
Thanks,
Mina