From: Felix Fietkau <nbd@nbd.name>
Date: Fri, 17 Feb 2023 16:26:37 +0100
To: Eric Dumazet
Cc: netdev@vger.kernel.org, "David S. Miller", Jakub Kicinski,
 Paolo Abeni, linux-kernel@vger.kernel.org
Subject: Re: [RFC v2] net/core: add optional threading for rps backlog processing

On 17.02.23 15:38, Eric Dumazet wrote:
> On Fri, Feb 17, 2023 at 2:40 PM Felix Fietkau wrote:
>>
>> On 17.02.23 13:57, Eric Dumazet wrote:
>> > On Fri, Feb 17, 2023 at 1:35 PM Felix Fietkau wrote:
>> >>
>> >> On 17.02.23 13:23, Eric Dumazet wrote:
>> >> > On Fri, Feb 17, 2023 at 11:06 AM Felix Fietkau wrote:
>> >> >>
>> >> >> When dealing with few flows or an imbalance in CPU utilization,
>> >> >> static RPS CPU assignment can be too inflexible. Add support for
>> >> >> enabling threaded NAPI for RPS backlog processing in order to
>> >> >> allow the scheduler to better balance processing. This helps
>> >> >> spread the load across idle CPUs.
>> >> >>
>> >> >> Signed-off-by: Felix Fietkau <nbd@nbd.name>
>> >> >> ---
>> >> >>
>> >> >> RFC v2:
>> >> >>  - fix rebase error in rps locking
>> >> >
>> >> > Why only deal with RPS?
>> >> >
>> >> > It seems you propose that the softnet_data backlog be processed by
>> >> > a thread, rather than from softirq?
>> >> Right. I originally wanted to mainly improve RPS, but my patch does
>> >> cover the backlog in general. I will update the description in the
>> >> next version. Does the approach in general make sense to you?
>> >
>> > I do not know, this seems to lack some (perf) numbers, and
>> > descriptions of added max latencies and stuff like that :)
>> I just ran some tests where I used an MT7621 device (dual-core 800 MHz
>> MIPS, 4 threads) as a router doing NAT without flow offloading.
>>
>> Using the flent RRUL test between 2 PCs connected through the router,
>> I get these results:
>>
>> rps_threaded=0: (combined CPU idle time around 27%)
>>                            avg      median    99th %      # data pts
>>  Ping (ms) ICMP   :      26.08       28.70     54.74 ms          199
>>  Ping (ms) UDP BE :       1.96       24.12     37.28 ms          200
>>  Ping (ms) UDP BK :       1.88       15.86     27.30 ms          200
>>  Ping (ms) UDP EF :       1.98       31.77     54.10 ms          200
>>  Ping (ms) avg    :       1.94         N/A       N/A ms          200
>>  TCP download BE  :      69.25       70.20    139.55 Mbits/s     200
>>  TCP download BK  :      95.15       92.51    163.93 Mbits/s     200
>>  TCP download CS5 :     133.64      129.10    292.46 Mbits/s     200
>>  TCP download EF  :     129.86      127.70    254.47 Mbits/s     200
>>  TCP download avg :     106.97         N/A       N/A Mbits/s     200
>>  TCP download sum :     427.90         N/A       N/A Mbits/s     200
>>  TCP totals       :     864.43         N/A       N/A Mbits/s     200
>>  TCP upload BE    :      97.54       96.67    163.99 Mbits/s     200
>>  TCP upload BK    :     139.76      143.88    190.37 Mbits/s     200
>>  TCP upload CS5   :      97.52       94.70    206.60 Mbits/s     200
>>  TCP upload EF    :     101.71      106.72    147.88 Mbits/s     200
>>  TCP upload avg   :     109.13         N/A       N/A Mbits/s     200
>>  TCP upload sum   :     436.53         N/A       N/A Mbits/s     200
>>
>> rps_threaded=1: (combined CPU idle time around 16%)
>>                            avg      median    99th %      # data pts
>>  Ping (ms) ICMP   :      13.70       16.10     27.60 ms          199
>>  Ping (ms) UDP BE :       2.03       18.35     24.16 ms          200
>>  Ping (ms) UDP BK :       2.03       18.36     29.13 ms          200
>>  Ping (ms) UDP EF :       2.36       25.20     41.50 ms          200
>>  Ping (ms) avg    :       2.14         N/A       N/A ms          200
>>  TCP download BE  :     118.69      120.94    160.12 Mbits/s     200
>>  TCP download BK  :     134.67      137.81    177.14 Mbits/s     200
>>  TCP download CS5 :     126.15      127.81    174.84 Mbits/s     200
>>  TCP download EF  :      78.36       79.41    143.31 Mbits/s     200
>>  TCP download avg :     114.47         N/A       N/A Mbits/s     200
>>  TCP download sum :     457.87         N/A       N/A Mbits/s     200
>>  TCP totals       :     918.19         N/A       N/A Mbits/s     200
>>  TCP upload BE    :     112.20      111.55    164.38 Mbits/s     200
>>  TCP upload BK    :     144.99      139.24    205.12 Mbits/s     200
>>  TCP upload CS5   :      93.09       95.50    132.39 Mbits/s     200
>>  TCP upload EF    :     110.04      108.21    207.00 Mbits/s     200
>>  TCP upload avg   :     115.08         N/A       N/A Mbits/s     200
>>  TCP upload sum   :     460.32         N/A       N/A Mbits/s     200
>>
>> As you can see, both throughput and latency improve because the load
>> can be spread across CPU cores more effectively.
>>
>
> What happens if user threads are competing with your kthreads?
>
> It seems you are adding another variant of ksoftirqd.
>
> More threads might look better in some cases, if we accept being at
> the mercy of process scheduling decisions.
>
> NUMA affinities matter; I do not see how you are dealing with this.
>
> Your patch assumes all cpus can participate in network processing,
> but RPS is fine grained:
> boxes with 128 or 256 cpus usually have a different rps_mask per
> receive queue, with 2 or 4 bits set in the per-rx-queue rps_mask.

I'm assuming that in cases where this matters, user space can set the
affinities and priority of the rps threads. The number in the rps-%d
process name matches the bit number for rps_mask, so user space can
control things in a fine-grained way (see the sketch below).

Would you prefer having support for selectively enabling threading on
individual backlogs with a mask?
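For illustration, pinning and prioritizing one backlog thread from user
space could look roughly like this (untested sketch; it assumes the
rps-%d kthread naming from this patch, with the thread's PID looked up
beforehand, e.g. from /proc):

/* pin_rps.c - untested sketch: pin an rps-N kthread to CPU N and give
 * it SCHED_FIFO priority. PID and CPU number come from the command
 * line; the CPU number is the bit index N from rps_mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        struct sched_param param = { .sched_priority = 1 };
        cpu_set_t mask;
        pid_t pid;
        int cpu;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <pid-of-rps-N> <cpu>\n", argv[0]);
                return 1;
        }
        pid = atoi(argv[1]);    /* PID of the rps-N kthread */
        cpu = atoi(argv[2]);    /* bit number N from rps_mask */

        /* restrict the kthread to one CPU ... */
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(pid, sizeof(mask), &mask))
                perror("sched_setaffinity");

        /* ... and raise its priority above regular user threads */
        if (sched_setscheduler(pid, SCHED_FIFO, &param))
                perror("sched_setscheduler");
        return 0;
}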
> Then, process_backlog() has been designed to run only from the cpu
> tied to the per-cpu data (softnet_data).
> There are multiple comments about this assumption, and various things
> that would need to be changed
> (e.g. sd_has_rps_ipi_waiting() would be wrong in its current
> implementation).
That's why I added the NAPI_STATE_THREADED check in napi_schedule_rps(),
so that sd_has_rps_ipi_waiting() always returns false for a threaded
backlog; the idea is sketched below. Or are you worried about a race
when enabling threading?
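Roughly like this (a reconstructed sketch of napi_schedule_rps() in
net/core/dev.c, not the literal hunk from this patch):

/* Sketch: if the target CPU's backlog NAPI is threaded, wake its
 * kthread directly instead of adding sd to rps_ipi_list, so
 * sd_has_rps_ipi_waiting() keeps returning false and no RPS IPI is
 * ever queued for it. */
static int napi_schedule_rps(struct softnet_data *sd)
{
        struct softnet_data *mysd = this_cpu_ptr(&softnet_data);

#ifdef CONFIG_RPS
        if (test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
                /* wakes the rps-N kthread via the threaded NAPI path */
                __napi_schedule_irqoff(&sd->backlog);
                return 0;
        }

        if (sd != mysd) {
                /* legacy path: queue an IPI to the remote CPU */
                sd->rps_ipi_next = mysd->rps_ipi_list;
                mysd->rps_ipi_list = sd;

                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                return 1;
        }
#endif /* CONFIG_RPS */
        __napi_schedule_irqoff(&mysd->backlog);
        return 0;
}

- Felix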