References: <20231208005250.2910004-1-almasrymina@google.com> <20231208005250.2910004-10-almasrymina@google.com> <32211cbf-3a4e-8a86-6214-4304ddb18a98@huawei.com> <92e30bd9-6df4-b72f-7bcd-f4fe5670eba2@huawei.com> <59e07233-24cb-7fb2-1aee-e1cf7eb72fa9@huawei.com>
In-Reply-To: <59e07233-24cb-7fb2-1aee-e1cf7eb72fa9@huawei.com>
From: Mina Almasry
Date: Mon, 11 Dec 2023 10:14:05 -0800
Subject: Re: [net-next v1 09/16] page_pool: device memory support
To: Yunsheng Lin
Cc: Shailend Chand, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org,
bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jonathan Corbet, Jeroen de Borst, Praveen Kaligineedi, Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König, Harshitha Ramamurthy, Shakeel Butt
Content-Type: text/plain; charset="UTF-8"

On Mon, Dec 11, 2023 at 3:51 AM Yunsheng Lin wrote:
>
> On 2023/12/11 12:04, Mina Almasry wrote:
> > On Sun, Dec 10, 2023 at 6:26 PM Mina Almasry wrote:
> >>
> >> On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin wrote:
> >>>
> >>> On 2023/12/9 0:05, Mina Almasry wrote:
> >>>> On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin wrote:
> >>>>>
> >>>>>
> >>>>> As mentioned before, it seems we need to have the above checking every
> >>>>> time we need to do some per-page handling in page_pool core. Is there
> >>>>> a plan in your mind for how to remove that kind of checking in the future?
> >>>>>
> >>>>
> >>>> I see 2 ways to remove the checking, both infeasible:
> >>>>
> >>>> 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> >>>>
> >>>> struct netmem {
> >>>>         /* common fields */
> >>>>         refcount_t refcount;
> >>>>         bool is_pfmemalloc;
> >>>>         int nid;
> >>>>         ...
> >>>>         union {
> >>>>                 struct dmabuf_genpool_chunk_owner *owner;
> >>>>                 struct page *page;
> >>>>         };
> >>>> };
> >>>>
> >>>> The page pool can then not care whether the underlying memory is iov or
> >>>> page. However, this introduces significant memory bloat, as this struct
> >>>> needs to be allocated for each page or ppiov, which I imagine is not
> >>>> acceptable for the upside of removing a few static_branch'd if
> >>>> statements with no performance cost.
> >>>>
> >>>> 2. Create a unified struct for page and dmabuf memory, which the mm
> >>>> folks have repeatedly nacked, and I imagine will repeatedly nack in
> >>>> the future.
> >>>>
> >>>> So I imagine the special handling of ppiov in some form is critical,
> >>>> and the checking may not be removable.
> >>>
> >>> If the above is true, perhaps devmem is not really supposed to be integrated
> >>> into page_pool.
> >>>
> >>> Adding a checking for every per-page handling in page_pool core is just too
> >>> hacky to really be considered a long-term solution.
> >>>
> >>
> >> The only other option is to implement another page_pool for ppiov and
> >> have the driver create a page_pool or ppiov_pool depending on the state
> >> of the netdev_rx_queue (or some helper in the net stack to do that for
> >> the driver). This introduces some code duplication. The ppiov_pool &
> >> page_pool would look similar in implementation.
>
> I think there is a design pattern already to deal with this kind of problem:
> refactoring common code used by both page_pool and ppiov into a library to
> avoid code duplication if most of them have similar implementations.
>

Code can be refactored if it's identical, not if it is merely similar.
I suspect the page_pools will be only similar, and if you're not
willing to take devmem handling into the page pool, then refactoring
page_pool code into helpers that do devmem handling may also not be an
option.
> >>
> >> But this was all discussed in detail in RFC v2, and the last response I
> >> heard from Jesper was in favor of this approach, if I understand
> >> correctly:
> >>
> >> https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
> >>
> >> Would love to have the maintainer weigh in here.
> >>
> >
> > I should note we may be able to remove some of the checking, but maybe not all.
> >
> > - Checks that disable page fragging for ppiov can be removed once
> > ppiov has frag support (in this series or a follow-up).
> >
> > - If we use page->pp_frag_count (or page->pp_ref_count) for
> > refcounting ppiov, we can remove the if checking in the refcounting.
>

I'm not sure this is actually possible in the short term. The
page_pool uses both page->_refcount and page->pp_frag_count for
refcounting, and I will not be able to remove the special handling
around page->_refcount, as I'm not allowed to call page_ref_*() APIs
on a non-struct page.

> > - We may be able to store the dma_addr of the ppiov in page->dma_addr,
> > but I'm unsure if that actually works, because the dma_buf dma_addr is
> > dma_addr_t (u32 or u64), but page->dma_addr is unsigned long (4 bytes
> > I think). But if it works for pages, I may be able to make it work for
> > ppiov as well.
> >
> > - Checks that obtain the page->pp can work with ppiov if we align the
> > offset of page->pp and ppiov->pp.
> >
> > - Checks around page->pp_magic can be removed if we also have an
> > offset-aligned ppiov->pp_magic.
> >
> > Sadly I don't see us removing the checking for these other cases:
> >
> > - page_is_pfmemalloc(): I'm not allowed to pass a non-struct page into
> > that helper.
>
> We can do a similar trick as above, since bit 1 of page->pp_magic is used to
> indicate whether it is a pfmemalloc page.
>

Likely yes.

> >
> > - page_to_nid(): I'm not allowed to pass a non-struct page into that helper.
>
> Yes, this one needs a special case.
>
> >
> > - page_pool_free_va(): ppiov have no va.
>
> Doesn't the skb_frags_readable() checking protect page_pool_free_va()
> from being called on devmem?
>

This function seems to be called only from veth, which doesn't support
devmem. I can remove the handling there.

> >
> > - page_pool_sync_for_dev/page_pool_dma_map: ppiov backed by dma-buf
> > fundamentally can't get mapped again.
>
> Can we just fail the page_pool creation with the PP_FLAG_DMA_MAP and
> DMA_ATTR_SKIP_CPU_SYNC flags for the devmem provider?
>

Jakub says PP_FLAG_DMA_MAP must be enabled for devmem, such that the
page_pool handles the dma mapping of the devmem and the driver doesn't
do the mapping on its own. We may fail creating the page pool on
PP_FLAG_DMA_SYNC_DEV, maybe, and remove the checking from
page_pool_sync_for_dev(), I think.

> >
> > Is the removal (or future removal) of these checks enough to resolve this?
>
> Yes, that is somewhat similar to my proposal; the biggest objection seems to
> be that we need safe type checking for it to work correctly.
>
> >
> >>> It is somewhat ironic that devmem is using static_branch to alleviate the
> >>> performance impact for normal memory at the possible cost of performance
> >>> degradation for devmem; does it not defeat some of the purpose of
> >>> integrating devmem into page_pool?
> >>>
> >>
> >> I don't see the issue. The static branch sets the non-ppiov path as
> >> the default if no memory providers are in use, and flips it when they are,
> >> making the default branch prediction ideal in both cases.
>
> You are assuming that we are not using the page pool for both normal memory
> and devmem at the same time. But a generic solution should not have that
> assumption, as I understand it.
>
> >>
> >>>>
> >>>>> Even though a static_branch check is added in page_is_page_pool_iov(), it
> >>>>> does not make much sense that a core has two different 'struct's for its
> >>>>> most basic data.
> >>>>>
> >>>>> IMHO, the ppiov for dmabuf is being force-fitted into page_pool without
> >>>>> much design consideration at this point.
> >>>>>
> >>>> ...
> >>>>>
> >>>>> For now, the above may work for the rx part, as it seems that you are
> >>>>> only enabling rx for dmabuf for now.
> >>>>>
> >>>>> What is the plan to enable tx for dmabuf? Will it also be integrated into
> >>>>> page_pool? There was an attempt to enable page_pool for tx; Eric seemed to
> >>>>> have some comment about this:
> >>>>> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
> >>>>>
> >>>>> If tx is not integrated into page_pool, do we need to create a new layer
> >>>>> for the tx dmabuf?
> >>>>>
> >>>>
> >>>> I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
> >>>> helpers, and page_pool_page_*() helpers, but will not need any core
> >>>> page_pool changes. This is because the TX path will have to piggyback
> >>>
> >>> We may need another bit/flags checking to demux between page_pool-owned
> >>> devmem and non-page_pool-owned devmem.
> >>>
> >>
> >> The way I'm imagining the support, I don't see the need for such
> >> flags. We'd be re-using generic helpers like
> >> page_pool_iov_get_dma_address() and whatnot that don't need that
> >> checking.
> >>
> >>> Also, calling page_pool_*() on non-page_pool-owned devmem is confusing
> >>> enough that we may need a thin layer handling non-page_pool-owned devmem
> >>> in the end.
> >>>
> >>
> >> The page_pool_page* & page_pool_iov* functions can be renamed if
> >> confusing. I would think that's no issue (note that the page_pool_*
>
> When you rename those functions, you will have a thin layer automatically.
>
> >> functions need not be called for the TX path).
> >>
> >>>> on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
> >>>> the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
> >>>> implementation based on dmabuf pages without page_pool involvement; I
> >>>> imagine I'll do something similar.
> >>>
> >>> It would be good to have a tx implementation for the next version, so
> >>> that we can have a whole picture of devmem.

--
Thanks,
Mina