Subject: Re: [RFC v2 1/2] sched/fair: filter out overloaded cpus in SIS
From: Josh Don
Date: Mon, 11 Apr 2022 18:23:20 -0700
To: Abel Wu
Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, linux-kernel
In-Reply-To: <20220409135104.3733193-2-wuyun.abel@bytedance.com>
References: <20220409135104.3733193-1-wuyun.abel@bytedance.com> <20220409135104.3733193-2-wuyun.abel@bytedance.com>

Hi Abel,

Thanks for the patch, a few comments:

> /*
> + * It would be very unlikely to find an unoccupied cpu when system is heavily
> + * overloaded. Even if we could, the cost might bury the benefit.
> + */
> +static inline bool sched_domain_overloaded(struct sched_domain *sd, int nr_overloaded)
> +{
> +        return nr_overloaded > sd->span_weight - (sd->span_weight >> 4);
> +}
> +
> +/*
>  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>  * average idle time for this rq (as found in rq->avg_idle).
> @@ -6291,7 +6300,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>  {
>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -        int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +        int i, cpu, idle_cpu = -1, nr = INT_MAX, nro;
>          struct rq *this_rq = this_rq();
>          int this = smp_processor_id();
>          struct sched_domain *this_sd;
> @@ -6301,7 +6310,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>          if (!this_sd)
>                  return -1;
>
> +        nro = atomic_read(&sd->shared->nr_overloaded);
> +        if (sched_domain_overloaded(sd, nro))
> +                return -1;

This early bail-out doesn't seem related to the main idea of your patch. Apart from having to decide the exact heuristic value for what counts as "too unlikely to find an idle cpu", this doesn't work well with tasks constrained by affinity; a task may have a small affinity mask containing idle cpus it may wake onto, even if most of the node is overloaded.

>          cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +        if (nro)
> +                cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));

To prevent us from exhausting our search attempts too quickly, this only needs to go under the sched_feat(SIS_PROP) && !has_idle_core case below. But by doing it unconditionally here, I guess your secondary goal is to reduce the total search cost in both cases. Just wondering, did you observe significant time spent here that you are trying to optimize? By shrinking the search space with the overload mask, it becomes important that the mask is relatively up to date, or else we could miss an opportunity to find an idle cpu.

>          if (sched_feat(SIS_PROP) && !has_idle_core) {
>                  u64 avg_cost, avg_idle, span_avg;
> @@ -7018,6 +7033,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
>          return newidle_balance(rq, rf) != 0;
>  }
> +
> +static inline bool cfs_rq_overloaded(struct rq *rq)
> +{
> +        return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
> +}

Why > 1 instead of > 0? If a cpu is running 1 non-idle task and any number of idle tasks, I'd think it is still "occupied" in the way you've defined it. We'd want to steer wakeups to cpus running 0 non-idle tasks.
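To make that concrete, the check I would have expected is something like this (just an untested sketch of the > 0 variant, same fields as in your patch):

static inline bool cfs_rq_overloaded(struct rq *rq)
{
        /* Any non-idle cfs task already makes this cpu a poor wakeup target. */
        return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 0;
}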
> +static void update_overload_status(struct rq *rq)
> +{
> +        struct sched_domain_shared *sds;
> +        bool overloaded = cfs_rq_overloaded(rq);
> +        int cpu = cpu_of(rq);
> +
> +        lockdep_assert_rq_held(rq);
> +
> +        if (rq->overloaded == overloaded)
> +                return;
> +
> +        rcu_read_lock();
> +        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +        if (unlikely(!sds))
> +                goto unlock;
> +
> +        if (overloaded) {
> +                cpumask_set_cpu(cpu, sdo_mask(sds));
> +                atomic_inc(&sds->nr_overloaded);
> +        } else {
> +                cpumask_clear_cpu(cpu, sdo_mask(sds));
> +                atomic_dec(&sds->nr_overloaded);
> +        }

Why are these cpu mask writes not atomic?

> +
> +        rq->overloaded = overloaded;
> +unlock:
> +        rcu_read_unlock();
> +}
> +
> +#else
> +
> +static inline void update_overload_status(struct rq *rq) { }
> +
>  #endif /* CONFIG_SMP */
>
>  static unsigned long wakeup_gran(struct sched_entity *se)
> @@ -7365,6 +7425,8 @@ done: __maybe_unused;
>          if (new_tasks > 0)
>                  goto again;
>
> +        update_overload_status(rq);
> +
>          /*
>           * rq is about to be idle, check if we need to update the
>           * lost_idle_time of clock_pelt
> @@ -11183,6 +11245,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>          if (static_branch_unlikely(&sched_numa_balancing))
>                  task_tick_numa(rq, curr);
>
> +        update_overload_status(rq);
>          update_misfit_status(curr, rq);
>          update_overutilized_status(task_rq(curr));

I'd caution against using task_tick and pick_next_task_fair as the places where we set and clear overload.

Some issues with task_tick:
- ticks may be disabled in NO_HZ_FULL (an issue if we define overload as > 0 non-idle tasks)
- most ticks will observe the same state, so the checking is somewhat redundant. Could use an edge-based trigger instead, such as enqueue/dequeue (somewhat similar to rq->rd->overload).

With pick_next_task_fair():
- there's a window between a thread dequeuing and the scheduler running through to the end of pick_next_task_fair(), during which we falsely observe the cpu as overloaded
- this breaks with core scheduling, since we may use pick_task_fair rather than pick_next_task_fair

Thanks,
Josh