From: Tim Chen
To: Peter Zijlstra
Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
    "Ravi V . Shankar", Ben Segall, Daniel Bristot de Oliveira,
    Dietmar Eggemann, Len Brown, Mel Gorman, "Rafael J . Wysocki",
    Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
    Ionela Voinescu, x86@kernel.org, linux-kernel@vger.kernel.org,
    Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao@linux.vnet.ibm.com,
    Yicong Yang, Barry Song, Chen Yu, Hillf Danton
Subject: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
Date: Fri, 7 Jul 2023 15:57:00 -0700

From: Tim C Chen

On hybrid CPUs with scheduling clusters enabled, we need to consider
balancing between SMT CPU clusters and Atom core clusters.

The diagram below shows such a hybrid x86 CPU with 4 big cores and
8 Atom cores. Each scheduling cluster spans an L2 cache.
        --L2-- --L2-- --L2-- --L2-- ----L2----- ------L2-----
        [0, 1] [2, 3] [4, 5] [6, 7] [8 9 10 11] [12 13 14 15]
        Big    Big    Big    Big    Atom        Atom
        core   core   core   core   Module      Module

If the busiest group is a big core with both SMT CPUs busy, we should
actively load balance if the destination group has idle CPU cores. This
condition is considered by asym_active_balance() during load balancing,
but not when looking for the busiest group or when computing the load
imbalance. Add this consideration to find_busiest_group() and
calculate_imbalance().

In addition, update the logic that determines the busier group when
one group is SMT and the other is non-SMT, but both groups are
partially busy with idle CPUs. The busier group should be the group
with idle cores rather than the group with one busy SMT CPU. We do not
want to make the SMT group the busiest one, pull the only task off the
SMT CPU, and cause the whole core to go empty.

Otherwise, suppose that in the search for the busiest group we first
encounter an SMT group with 1 task and set it as the busiest. If the
destination group is an Atom cluster with 1 task and we next encounter
an Atom cluster group with 3 tasks, we will not pick this Atom cluster
over the SMT group, even though we should. As a result, we do not load
balance the busier Atom cluster (with 3 tasks) towards the local Atom
cluster (with 1 task). Nor does it make sense to pick the 1-task SMT
group as the busier group, because we should not pull the task off the
SMT core towards the 1-task Atom cluster and leave the SMT core
completely empty.

Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87317634fab2..f636d6c09dc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8279,6 +8279,11 @@ enum group_type {
 	 * more powerful CPU.
 	 */
 	group_misfit_task,
+	/*
+	 * Balance SMT group that's fully busy. Can benefit from migrating
+	 * a task on SMT with busy sibling to another CPU on an idle core.
+	 */
+	group_smt_balance,
 	/*
 	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
 	 * and the task should be migrated to it instead of running on the
@@ -8987,6 +8992,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+	unsigned int group_smt_balance;  /* Task on busy SMT to be moved */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (sgs->group_asym_packing)
 		return group_asym_packing;
 
+	if (sgs->group_smt_balance)
+		return group_smt_balance;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
@@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
 
+/* One group has more than one SMT CPU while the other group does not */
+static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
+					struct sched_group *sg2)
+{
+	if (!sg1 || !sg2)
+		return false;
+
+	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
+		(sg2->flags & SD_SHARE_CPUCAPACITY);
+}
+
+static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (env->idle == CPU_NOT_IDLE)
+		return false;
+
+	/*
+	 * For SMT source group, it is better to move a task
+	 * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
+	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
+	 * will not be on.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY &&
+	    sgs->sum_h_nr_running > 1)
+		return true;
+
+	return false;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_asym_packing = 1;
 	}
 
+	/* Check for loaded SMT group to be balanced to dst CPU */
+	if (!local_group && smt_balance(env, sgs, group))
+		sgs->group_smt_balance = 1;
+
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
@@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			return false;
 		break;
 
+	case group_smt_balance:
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_has_spare:
+		/*
+		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
+		 * as we do not want to pull the task off an SMT core with
+		 * one task and leave the core idle.
+		 */
+		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
+			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
+				return false;
+			else
+				return true;
+		}
+
 		/*
 		 * Select not overloaded group with lowest number of idle cpus
 		 * and highest number of running tasks. We could also compare
@@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those types are not used in the slow wakeup path */
 		return false;
 
@@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those type are not used in the slow wakeup path */
 		return NULL;
 
@@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+	if (busiest->group_type == group_smt_balance) {
+		/* Reduce number of tasks sharing CPU capacity */
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-		if (env->idle == CPU_NOT_IDLE)
+		if (env->idle == CPU_NOT_IDLE) {
 			/*
 			 * If the busiest group is not overloaded (and as a
 			 * result the local one too) but this CPU is already
 			 * busy, let another idle CPU try to pull task.
 			 */
 			goto out_balanced;
+		}
+
+		if (busiest->group_type == group_smt_balance &&
+		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
+			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
+			goto force_balance;
+		}
 
 		if (busiest->group_weight > 1 &&
-		    local->idle_cpus <= (busiest->idle_cpus + 1))
+		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
 			/*
 			 * If the busiest group is not overloaded
 			 * and there is no imbalance between this and busiest
@@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 			 * there is more than 1 CPU per group.
 			 */
 			goto out_balanced;
+		}
 
-		if (busiest->sum_h_nr_running == 1)
+		if (busiest->sum_h_nr_running == 1) {
 			/*
 			 * busiest doesn't have any tasks waiting to run
 			 */
 			goto out_balanced;
+		}
 	}
 
 force_balance:
-- 
2.32.0
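The SMT vs non-SMT rule added to the group_has_spare case of update_sd_pick_busiest() can also be modeled in isolation. This sketch is hypothetical. struct spare_group and the spare_* functions are illustrative names, and sum_nr_running and sum_h_nr_running are collapsed into a single nr_running. Only the idle-CPU and running-task tie-breaks described in the existing comment are modeled; the real function has further checks.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sched_group plus its sg_lb_stats. */
struct spare_group {
	bool smt;			/* group has SD_SHARE_CPUCAPACITY set */
	unsigned int idle_cpus;
	unsigned int nr_running;
};

/* Return true if @sg should replace @busiest (NULL: nothing picked yet). */
static bool spare_pick_busiest(const struct spare_group *busiest,
			       const struct spare_group *sg)
{
	/* Patch rule: mirrors smt_vs_nonsmt_groups() and its nested check. */
	if (busiest && busiest->smt != sg->smt)
		return !(sg->smt && sg->nr_running <= 1);

	/* Existing rule: fewest idle CPUs, then most running tasks. The
	 * kernel seeds the busiest stats with idle_cpus = UINT_MAX. */
	unsigned int busiest_idle = busiest ? busiest->idle_cpus : UINT_MAX;
	unsigned int busiest_running = busiest ? busiest->nr_running : 0;

	if (sg->idle_cpus > busiest_idle)
		return false;
	if (sg->idle_cpus == busiest_idle &&
	    sg->nr_running <= busiest_running)
		return false;
	return true;
}

/* Visit candidate groups in order, keeping the current busiest one. */
static const struct spare_group *
spare_find_busiest(const struct spare_group *groups, size_t n)
{
	const struct spare_group *busiest = NULL;

	assert(groups != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (spare_pick_busiest(busiest, &groups[i]))
			busiest = &groups[i];
	return busiest;
}
```

Two orderings show the effect of the rule:

- If an Atom cluster holding 2 tasks (2 idle CPUs) is visited first, a big core running a single task (1 idle CPU) is no longer picked, even though it has fewer idle CPUs. The balancer therefore does not empty that SMT core.
- If a 1-task big core is visited before a 3-task Atom cluster, the Atom cluster replaces it as the busiest group.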