Subject: Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
To: David Ahern, ...
Cc: ...
References: <1629257542-36145-1-git-send-email-linyunsheng@huawei.com>
 <83b8bae8-d524-36a1-302e-59198410d9a9@gmail.com>
 <619b5ca5-a48b-49e9-2fef-a849811d62bb@gmail.com>
From: Yunsheng Lin <linyunsheng@huawei.com>
Message-ID: <744e88b6-7cb4-ea99-0523-4bfa5a23e15c@huawei.com>
Date: Mon, 23 Aug 2021 11:32:01 +0800
In-Reply-To: <619b5ca5-a48b-49e9-2fef-a849811d62bb@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2021/8/20 22:35, David Ahern wrote:
> On 8/19/21 2:18 AM, Yunsheng Lin wrote:
>> On 2021/8/19 6:05, David Ahern wrote:
>>> On 8/17/21 9:32 PM, Yunsheng Lin wrote:
>>>> This patchset adds the socket to netdev page frag recycling
>>>> support based on the busy polling and page pool infrastructure.
>>>>
>>>> The performance improves from 30Gbit to 41Gbit for one thread
>>>> iperf tcp flow, and the CPU usage decreases by about 20% for four
>>>> threads iperf flow with 100Gb line speed in IOMMU strict mode.
>>>>
>>>> The performance improves by about 2.5% for one thread iperf tcp
>>>> flow in IOMMU passthrough mode.
>>>
>>> Details about the test setup? cpu model, mtu, any other relevant
>>> changes / settings.
>>
>> CPU is arm64 Kunpeng 920, see:
>> https://www.hisilicon.com/en/products/Kunpeng/Huawei-Kunpeng-920
>>
>> MTU is 1500. The relevant changes/settings I can think of: the iperf
>> client runs on the same NUMA node as the NIC (which has one 100Gbit
>> port), and the driver has XPS enabled too.
>>
>>> How does that performance improvement compare with using the Tx ZC
>>> API? At 1500 MTU I see a CPU drop on the Tx side from 80% to 20%
>>> with the ZC API and ~10% increase in throughput. Bumping the MTU to
>>> 3300 and performance with the ZC API is 2x the current model with
>>> 1/2 the cpu.
>>
>> I added a sysctl node to decide whether the pfrag pool is used:
>> net.ipv4.tcp_use_pfrag_pool
>> [..]
>>
>>> Epyc 7502, ConnectX-6, IOMMU off.
>>>
>>> In short, it seems like improving the Tx ZC API is the better path
>>> forward than per-socket page pools.
>>
>> The main goal is to optimize the SMMU mapping/unmapping. If the cost
>> of memcpy is higher than the SMMU mapping/unmapping + page pinning,
>> then Tx ZC may be the better path, but at least that is not clear
>> for small packets?
>
> It's a CPU bound problem - either Rx or Tx is cpu bound depending on
> the test configuration. In my tests 3.3 to 3.5M pps is the limit (not
> using LRO in the NIC - that's a different solution with its own
> problems).

I assume "either Rx or Tx is cpu bound" means either Rx or Tx is the
bottleneck?

It seems iperf3 supports Tx ZC, so I retested using iperf3. The Rx
settings were not changed during the test, MTU is 1500:

IOMMU in strict mode:
1. Tx ZC case: 22Gbit with Tx being the bottleneck (cpu bound)
2. Tx non-ZC case with pfrag pool enabled: 40Gbit with Rx being the
   bottleneck (cpu bound)
3. Tx non-ZC case with pfrag pool disabled: 30Gbit; the bottleneck does
   not seem to be cpu bound, as neither the Rx nor the Tx side has a
   single CPU reaching about 100% usage.

> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
> on throughput since the Rx is 100% cpu.
As the above performance data shows, enabling ZC does not seem to help
when the IOMMU is involved: ZC is about 30% slower than the non-ZC case
with the pfrag pool disabled, and about 50% slower than the non-ZC case
with the pfrag pool enabled.

> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
> reduces Rx processing and lowers the CPU needed to process the
> incoming stream. Then using the Tx ZC API you lower the Tx overhead,
> allowing a single stream to go faster - sending more data, which in
> the end results in much higher pps and throughput. At the limit you
> are CPU bound (both ends in my testing, as the Rx side approaches the
> max pps and the Tx side continually tries to send data).
>
> Lowering CPU usage on the Tx side is a win regardless of whether there
> is a big increase in throughput at 1500 MTU, since that configuration
> is an Rx CPU bound problem. Hence my point that we have a good
> starting point for lowering CPU usage on the Tx side; we should
> improve it rather than add per-socket page pools.

Actually it is not a per-socket page pool; the page pool is still per
NAPI. This patchset adds a multi allocation context to the page pool,
so that Tx can reuse the same page pool as Rx, which is quite useful if
ARFS is enabled.

> You can stress the Tx side and emphasize its overhead by modifying the
> receiver to drop the data on Rx rather than copy it to userspace,
> which is a huge bottleneck (e.g., MSG_TRUNC on recv). This allows the
> single flow

As the frag page is supported in the page pool for Rx, the Rx side is
probably not a bottleneck any more, at least not for the IOMMU in
strict mode.

It seems iperf3 does not support MSG_TRUNC yet. Is there a testing tool
that supports MSG_TRUNC, or do I have to hack the kernel or the iperf3
tool to do that? (A minimal standalone sink doing this is sketched at
the end of this mail.)

> stream to go faster and emphasize Tx bottlenecks, as the pps at 3300
> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
> then gup_pgd_range.

When the IOMMU is in strict mode, the overhead of the IOMMU seems to be
much bigger than that of the spinlock (23% vs 10%).

Anyway, I still think ZC mostly benefits packets bigger than a certain
size and the case where the IOMMU is disabled.
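For completeness, here is a minimal sketch of such a receiver: a plain
TCP sink that accepts one connection and discards the payload with
recv(..., MSG_TRUNC), relying on the Linux-specific TCP behaviour
described in tcp(7) (the data is dropped in the kernel instead of being
copied to a user buffer). The port number, per-call read size and file
name below are arbitrary choices for illustration, not taken from the
setup discussed above:

/*
 * msg_trunc_sink.c - minimal TCP sink that discards received data with
 * MSG_TRUNC instead of copying it to userspace (Linux-specific TCP
 * behaviour, see tcp(7)).  Port and per-call size are arbitrary.
 *
 *   cc -O2 -o msg_trunc_sink msg_trunc_sink.c
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define PORT		5201
#define CHUNK_SIZE	(1 << 20)	/* let the kernel drop up to 1 MiB per call */

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
		.sin_port        = htons(PORT),
	};
	long long total = 0;
	int one = 1, lfd, cfd;
	ssize_t n;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0) {
		perror("socket");
		return 1;
	}
	setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 1) < 0) {
		perror("bind/listen");
		return 1;
	}

	cfd = accept(lfd, NULL, NULL);
	if (cfd < 0) {
		perror("accept");
		return 1;
	}

	/*
	 * With MSG_TRUNC on a TCP socket the kernel advances the receive
	 * queue without copying into the (NULL) user buffer, so the
	 * copy-to-userspace cost disappears from the Rx path.
	 */
	while ((n = recv(cfd, NULL, CHUNK_SIZE, MSG_TRUNC)) > 0)
		total += n;

	printf("discarded %lld bytes\n", total);
	close(cfd);
	close(lfd);
	return 0;
}

Any bulk TCP sender can then drive it (even something as crude as
"nc <host> 5201 < /dev/zero"); since the receiver no longer pays for
the copy to userspace, this should emphasize the Tx-side overhead in
roughly the way described above. Note this is a standalone sketch, not
an iperf3 mode.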