Date: Thu, 17 Feb 2022 10:05:24 +0000
From: Mel Gorman <mgorman@techsingularity.net>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: peterz@infradead.org, aubrey.li@linux.intel.com, efault@gmx.de,
	gautham.shenoy@amd.com, linux-kernel@vger.kernel.org, mingo@kernel.org,
	song.bao.hua@hisilicon.com, srikar@linux.vnet.ibm.com,
	valentin.schneider@arm.com, vincent.guittot@linaro.org
Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group
Message-ID: <20220217100523.GV3366@techsingularity.net>
References: <20220217055408.28151-1-kprateek.nayak@amd.com>
In-Reply-To: <20220217055408.28151-1-kprateek.nayak@amd.com>

Thanks Prateek,

On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
> In AMD Zen-like systems, which contain multiple LLCs per socket,
> users want to spread bandwidth-hungry applications across multiple
> LLCs. Stream is one such representative workload, where the best
> performance is obtained by limiting one stream thread per LLC. To
> ensure this, users are known to pin the tasks to a subset of the
> CPUs consisting of one CPU per LLC while running such
> bandwidth-hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2-socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>

In this case the stream threads can use any CPU of the subset;
presumably this is parallelised with OpenMP without specifying spread
or bind directives.

> stream-5045    [032] d..2.   167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045    [032] d..2.   167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045    [032] d..2.   167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045    [032] d..2.   167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
>

Resulting in some stacking with the baseline.

> stream-4733    [032] d..2.   116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733    [032] d..2.   116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733    [032] d..2.   116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733    [032] d..2.   116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733    [032] d..2.   116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733    [032] d..2.   116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733    [032] d..2.   116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
>

And no stacking with your patch. So far so good.
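For anyone following along without the tree handy, the
allow_numa_imbalance() helper referenced in the hunk below is just a
threshold check, roughly along these lines (a paraphrased sketch, not a
verbatim copy; see kernel/sched/fair.c for the exact definition):

	/*
	 * Sketch: permit a newly cloned task to stay on the local
	 * (wakeup) node, rather than spreading to a remote node, as
	 * long as the number of running tasks stays within the
	 * domain's allowed NUMA imbalance (imb_numa_nr).
	 */
	static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
	{
		return running <= imb_numa_nr;
	}

Capping the imb_numa_nr argument, as the patch does, therefore caps how
many tasks may accumulate on the local group before find_idlest_group()
starts picking a remote group.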
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..6e875f1f34e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  
>  	case group_has_spare:
>  		if (sd->flags & SD_NUMA) {
> +			struct cpumask *cpus;
> +			int imb;
>  #ifdef CONFIG_NUMA_BALANCING
>  			int idlest_cpu;
>  			/*
> @@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * Otherwise, keep the task close to the wakeup source
>  			 * and improve locality if the number of running tasks
>  			 * would remain below threshold where an imbalance is
> -			 * allowed. If there is a real need of migration,
> -			 * periodic load balance will take care of it.
> +			 * allowed while accounting for the possibility the
> +			 * task is pinned to a subset of CPUs. If there is a
> +			 * real need of migration, periodic load balance will
> +			 * take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> +			cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> +			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>  				return NULL;

One concern I have is that we incur the cpumask setup and
cpumask_weight cost on every clone, whether a restricted CPU mask is
used or not. Peter, is it acceptable to avoid the cpumask check if
there are no restrictions on allowed CPUs, like this?

	imb = sd->imb_numa_nr;
	if (p->nr_cpus_allowed != num_online_cpus()) {
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

		cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
		imb = min(cpumask_weight(cpus), imb);
	}

It's not perfect, as a hotplug event could occur, but that would be a
fairly harmless race with limited impact (a race with hotplug during
clone may stack tasks for a short interval before load balancing
intervenes).

-- 
Mel Gorman
SUSE Labs