Date: Tue, 22 Feb 2022 13:27:22 +0000
From: Mel Gorman <mgorman@techsingularity.net>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: peterz@infradead.org, aubrey.li@linux.intel.com, efault@gmx.de,
	gautham.shenoy@amd.com, linux-kernel@vger.kernel.org, mingo@kernel.org,
	song.bao.hua@hisilicon.com, srikar@linux.vnet.ibm.com,
	valentin.schneider@arm.com, vincent.guittot@linaro.org
Subject: Re: [PATCH v5] sched/fair: Consider cpu affinity when allowing NUMA
 imbalance in find_idlest_group
Message-ID:
 <20220222132722.GC4423@techsingularity.net>
References: <20220222102133.2956-1-kprateek.nayak@amd.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20220222102133.2956-1-kprateek.nayak@amd.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Tue, Feb 22, 2022 at 03:51:33PM +0530, K Prateek Nayak wrote:
> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth-hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one Stream thread per LLC.
> To ensure this, users are known to pin such bandwidth-hungry tasks to
> a specified subset of the CPUs consisting of one CPU per LLC.
>
> Suppose we kickstart a multi-threaded task like Stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2-socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in the other socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC, and 4 of those LLCs
> are on one socket while the other 4 are on the other socket.
>
> Ideally we would prefer that each Stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the Stream tasks. The first four Stream threads
> will be placed in this socket.
> By the time the 5th thread is
> being placed, all the allowed CPUs from the local group (0,16,32,48)
> would have been taken.
>
> However, the current scheduler code simply checks whether the number
> of tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case, threshold = 32), but after v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs" was merged into sched-tip
> as commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs"), it is now equal to the number of
> LLCs in the NUMA domain for processors with multiple LLCs
> (in this case, threshold = 8).
>
> For this example, the number of tasks will always be within the
> threshold, so all 8 Stream threads will be woken up on the first
> socket, resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the threads start piling on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile-up on a small number of CPUs, the
> load-balancer does eventually kick in, but it takes a while to reach
> {4}{4}, and even {4}{4} is not stable: we observe ping-ponging between
> {4}{4} and {5}{3} before a stable state is reached much later
> (1 Stream thread per allowed CPU), after which no more migration is
> required.
>
> We can detect this piling and avoid it by checking whether the number
> of allowed CPUs in the local group is fewer than the number of tasks
> running in the local group, and use this information to spread the
> 5th task out onto the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
>
> We see that the threads use all of the allowed CPUs and there is
> no pile-up.
>
> No output is generated for the sched_migrate_task tracepoint with
> this patch, because the perfect initial placement removes the need
> for balancing later on, both across NUMA boundaries and within
> NUMA boundaries, for Stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual-socket Zen3 machine (2 x 64C/128T):
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
>              5.17.0-rc1             5.17.0-rc1             5.17.0-rc1
>           tip sched/core         tip sched/core         tip sched/core
>            (no pinning)             + pinning            + this-patch
>                                                           + pinning
>
> Copy:   97699.28 (0.00 pct)    95933.60 (-1.80 pct)  156578.91 (60.26 pct)
> Scale: 107754.15 (0.00 pct)   91869.88 (-14.74 pct)  149783.25 (39.00 pct)
> Add:   126383.29 (0.00 pct)  105730.86 (-16.34 pct)  186493.09 (47.56 pct)
> Triad: 124896.78 (0.00 pct)  106394.38 (-14.81 pct)  184733.48 (47.90 pct)
>
> Pinning currently hurts performance compared to the unbound case on
> tip/sched/core. With the addition of this patch, the pinned case
> outperforms tip/sched/core by a good margin.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual-socket Skylake machine (2 x 24C/48T):
>
> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>
>              5.17.0-rc1             5.17.0-rc1             5.17.0-rc1
>           tip sched/core         tip sched/core         tip sched/core
>            (no pinning)             + pinning            + this-patch
>                                                           + pinning
>
> Copy:  126620.67 (0.00 pct)   141062.10 (11.40 pct)  147615.44 (16.58 pct)
> Scale:  91313.51 (0.00 pct)   112879.61 (23.61 pct)  122591.28 (34.25 pct)
> Add:   102035.43 (0.00 pct)   125889.98 (23.37 pct)  138179.01 (35.42 pct)
> Triad: 102281.91 (0.00 pct)   123743.48 (20.98 pct)  138940.41 (35.84 pct)
>
> In the case of the Skylake machine, with a single LLC per socket, we
> see a good improvement brought about by pinning, which is further
> improved by this patch.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Only minor nit would be that the cpumask can be declared within the if
block to limit scope, but that is just being picky so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks!

-- 
Mel Gorman
SUSE Labs