From: Alexander Duyck
Date: Fri, 16 Jun 2023 08:01:06 -0700
Subject: Re: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag
To: Yunsheng Lin
Cc: Jakub Kicinski, davem@davemloft.net, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Lorenzo Bianconi, Yisen Zhuang, Salil Mehta, Eric Dumazet, Sunil Goutham, Geetha sowjanya, Subbaraya Sundeep, hariprasad, Saeed Mahameed, Leon Romanovsky, Felix Fietkau, Ryder Lee, Shayne Chen, Sean Wang, Kalle Valo, Matthias Brugger, AngeloGioacchino Del Regno, Jesper Dangaard Brouer, Ilias Apalodimas, linux-rdma@vger.kernel.org, linux-wireless@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-mediatek@lists.infradead.org
In-Reply-To: <908b8b17-f942-f909-61e6-276df52a5ad5@huawei.com>
References: <20230612130256.4572-1-linyunsheng@huawei.com> <20230612130256.4572-5-linyunsheng@huawei.com> <20230614101954.30112d6e@kernel.org> <8c544cd9-00a3-2f17-bd04-13ca99136750@huawei.com> <20230615095100.35c5eb10@kernel.org> <908b8b17-f942-f909-61e6-276df52a5ad5@huawei.com>
X-Mailing-List: linux-wireless@vger.kernel.org

On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin wrote:
>
> On 2023/6/16 2:26, Alexander Duyck wrote:
> > On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski wrote:
> >>
> >> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
> >>>> Does hns3_page_order() set a good example for the users?
> >>>>
> >>>> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> >>>> {
> >>>> #if (PAGE_SIZE < 8192)
> >>>>         if (ring->buf_size > (PAGE_SIZE / 2))
> >>>>                 return 1;
> >>>> #endif
> >>>>         return 0;
> >>>> }
> >>>>
> >>>> Why allocate order 1 pages for buffers which would fit in a single page?
> >>>> I feel like this sort of heuristic should be built into the API itself.
> >>>
> >>> hns3 only supports a fixed buf size per desc of 512 bytes, 1024 bytes, 2048 bytes
> >>> or 4096 bytes, see hns3_buf_size2type(). I think the order 1 pages are for a buf
> >>> size of 4096 bytes with a system page size of 4K, as the hns3 driver still
> >>> supports the per-desc ping-pong way of page splitting when page_pool_enabled
> >>> is false.
> >>>
> >>> With page pool enabled, you are right that order 0 pages are enough, and I am
> >>> not sure about the exact reason we use the same order as the ping-pong way of
> >>> page splitting now.
> >>> 2048 bytes seems to be the default buf size, and I have not heard of anyone
> >>> changing it.
> >>> Also, it calculates the pool_size using something as below, so the
> >>> memory usage is almost the same for order 0 and order 1:
> >>>
> >>>         .pool_size = ring->desc_num * hns3_buf_size(ring) /
> >>>                         (PAGE_SIZE << hns3_page_order(ring)),
> >>>
> >>> I am not sure it is worth changing it; maybe just change it to set a good
> >>> example for the users :) Anyway, I need to discuss this with other colleagues
> >>> internally and do some testing before making the change.
> >>
> >> Right, I think this may be a leftover from the page flipping mode of
> >> operation. But AFAIU we should leave the recycling fully to the page
> >> pool now. If we make any improvements, try to make them at the page pool
> >> level.
>
> I checked: the per-desc buf of 4096 bytes for hns3 does not seem to
> be used, mainly because of the larger memory usage you mentioned below.
>
> >>
> >> I like your patches as they isolate the drivers from having to make the
> >> fragmentation decisions based on the system page size (4K vs 64K, but
> >> we're hearing more and more about ARM w/ 16K pages). For that use case
> >> this is great.
>
> Yes, that is my point. For the hw case, the page splitting in page pool is
> mainly to enable multi-descs to use the same page, as my understanding.
>
> >>
> >> What we don't want is drivers to start requesting larger page sizes
> >> because it looks good in iperf on a freshly booted, idle system :(
> >
> > Actually that would be a really good direction for this patch set to
> > look at going into. Rather than having us always allocate a "page" it
> > would make sense for most drivers to allocate a 4K fragment or the
> > like in the case that the base page size is larger than 4K. That might
> > be a good use case to justify doing away with the standard page pool
> > page and look at making them all fragmented.
>
> I am not sure if I understand the above. Isn't the frag API able to
> support allocating a 4K fragment when the base page size is larger than
> 4K, before or after this patch? What more do we need to do?

I'm not talking about the frag API. I am talking about the
non-fragmented case. Right now the standard page_pool will allocate an
order 0 page. So if a driver is using just pages and expecting 4K pages,
that isn't true on those ARM or PowerPC systems where the page size is
larger than 4K.

For a bit of historical reference, igb/ixgbe had a known issue where
they would potentially run a system out of memory when the page size was
larger than 4K. I had originally implemented things with just the
refcounting hack, and at the time it worked great on systems with 4K
pages. However, on PowerPC it would trigger OOM errors because it could
run with 64K pages. To fix that I started adding all the PAGE_SIZE
checks in the driver and moved over to a striping model for those cases,
which would free the page when it reached the end, in order to force it
to free the page and make better use of the available memory.

> >
> > In the case of the standard page size being 4K, a standard page would
> > just have to take on the CPU overhead of the atomic_set and
> > atomic_read for pp_ref_count (new name), which should be minimal as on
> > most sane systems those just end up being a memory write and read.
>
> If I understand you correctly, I think what you are trying to do
> may break some of Jesper's benchmarking :)
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

So? If it breaks an out-of-tree benchmark, the benchmark can always be
fixed. The point is enabling a use case that can add value across the
board instead of trying to force the community to support a niche use
case. Ideally we should get away from using the pages directly for most
cases in page pool.
In my mind the page pool should start operating more like __get_free_pages where what you get is a virtual address instead of the actual page. That way we could start abstracting it away and eventually get to something more like a true page_pool api instead of what feels like a set of add-ons for the page allocator. Although at the end of the day this still feels more like we are just reimplementing slab so it is hard for me to say this is necessarily the best solution either.