From: Yunsheng Lin
Date: Sat, 29 Jul 2023 20:46:22 +0800
Subject: Re: [PATCH net-next 6/9] page_pool: avoid calling no-op externals when possible
To: Alexander Lobakin
CC: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Maciej Fijalkowski, Larysa Zaremba, Alexander Duyck,
 Jesper Dangaard Brouer, Ilias Apalodimas, Simon Horman
Message-ID: <799ebbaf-961d-860a-6071-b74e10360e29@huawei.com>
In-Reply-To: <604d4f6c-a6e7-e921-2d9a-45fe46ab9e79@intel.com>
References: <20230727144336.1646454-1-aleksander.lobakin@intel.com>
 <20230727144336.1646454-7-aleksander.lobakin@intel.com>
 <604d4f6c-a6e7-e921-2d9a-45fe46ab9e79@intel.com>

On 2023/7/28 22:14, Alexander Lobakin wrote:
> From: Yunsheng Lin
> Date: Fri, 28 Jul 2023 20:39:24 +0800
>
>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of
>>> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
>>> or swiotlb, just for the call ladder.
>>> Indeed, it's
>>>
>>> page_pool_put_page()
>>>   page_pool_put_defragged_page()            <- external
>>>     __page_pool_put_page()
>>>       page_pool_dma_sync_for_device()       <- non-inline
>>>         dma_sync_single_range_for_device()
>>>           dma_sync_single_for_device()      <- external
>>>             dma_direct_sync_single_for_device()
>>>               dev_is_dma_coherent()         <- exit
>>>
>>> For the inline functions, there are no guarantees the compiler won't
>>> uninline them (they're clearly not one-liners, and sometimes compilers
>>> uninline even 2 + 2). The first external call is necessary, but the
>>> remaining 2+ are done for nothing each time, plus a bunch of checks
>>> here and there.
>>> Since Page Pool mappings are long-term, and for one "device + addr"
>>> pair dma_need_sync() will always return the same value (basically,
>>> whether it belongs to an swiotlb pool), addresses can be tested once,
>>> right after they're obtained, and the result can be reused until the
>>> page is unmapped.
>>> Define a new PP DMA sync operation type, which will mean "do DMA syncs
>>> for the device, but only when needed", and turn it on by default when
>>> the driver asks to sync pages. When a page is mapped, check whether it
>>> needs syncs, and if so, flip that "sync when needed" back to "always
>>> do syncs" globally for the whole pool (better safe than sorry). As
>>> long as the pool has no pages requiring DMA syncs, this cuts off a
>>> good share of calls and checks. When at least one page requires it,
>>> the pool conservatively falls back to "always call sync functions",
>>> with no per-page verdicts. It's a fairly rare case anyway that only a
>>> few pages would require syncing.
>>> On my x86_64, this gives from 2% to 5% performance benefit with no
>>> negative impact for cases when the IOMMU is on and the shortcut can't
>>> be used.
>>>
>>
>> It seems other subsystems may have a similar problem to page_pool's.
>> Is it possible to implement this kind of trick in the DMA subsystem
>> instead of every subsystem inventing its own trick?
>
> In the ladder I described above, most of the overhead comes from jumping
> between Page Pool functions, not the generic DMA ones. Let's say I do
> this shortcut in dma_sync_single_range_for_device(); that is already too
> late to count on any good CPU saves.

We can force-inline the page_pool_dma_sync_for_device() function if it
is 'the good CPU saves' you mentioned above.

> Plus, the DMA sync API operates on dma_addr_t, not struct page. IOW it's
> not clear to me where to store this "we can shortcut" bit in that case.

It seems we only need one bit in 'struct device' to do the 'shortcut',
and there seems to be an available bit at the end of 'struct device'.
Is it possible to do something like this patch does directly in
dma_sync_single_range_for_device()?

One thing to note is that there may be multiple concurrent callers of
dma_sync_single_range_for_device(), which seems different from the
atomic context of page_pool_dma_map(), so the state change may need some
atomic operation if we want to implement it in a 'generic' way.
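For the read side, a rough sketch of that idea might look like the
below ('dma_skip_sync' is a made-up field name here, and the atomic
setting/clearing of it at map/unmap time is left out):

/* Sketch only: 'dma_skip_sync' is a hypothetical one-bit flag in
 * struct device, set once when dma_need_sync() says this device
 * never needs syncs, so the external sync call can be skipped.
 */
static inline void
dma_sync_single_range_for_device(struct device *dev, dma_addr_t addr,
				 unsigned long offset, size_t size,
				 enum dma_data_direction dir)
{
	if (READ_ONCE(dev->dma_skip_sync))
		return;

	dma_sync_single_for_device(dev, addr + offset, size, dir);
}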
> From "other subsystem" I remember only XDP sockets. There, they also
> avoid calling their own non-inline functions in the first place, not
> the generic DMA ones. So I'd say both cases (PP and XSk) can't be
> solved via some "generic" solution.

If PP and XSk both have a similar trick, isn't that a clearer sign that
it may be solvable via some "generic" solution?

Is there any reason there is no similar trick for sync-for-cpu in XSk,
as the below code indicates?

https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152
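I.e., something along the lines of the below, mirroring the
pool->dma_need_sync check that xp_dma_sync_for_device() already does
(a rough, untested sketch, assuming the xdp_buff_xsk can reach its
pool's flag):

/* Hypothetical CPU-side counterpart of the existing device-side
 * shortcut: skip the slow path when the pool never needs syncs.
 */
static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
{
	if (!xskb->pool->dma_need_sync)
		return;

	xp_dma_sync_for_cpu_slow(xskb);
}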