Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
To: Vincent Guittot, Joel Fernandes
Cc: linux-kernel, Paul McKenney, Frederic Weisbecker, Dietmar Eggeman, Qais Yousef, Ben Segall, Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli, Mel Gorman, Peter Zijlstra, Steven Rostedt, "Uladzislau Rezki (Sony)", Neeraj upadhyay, Aubrey Li
References: <20210122154600.1722680-1-joel@joelfernandes.org> <20210129172727.GA30719@vingu-book>
From: Tim Chen
Message-ID: <274d8ae5-8f4d-7662-0e04-2fbc92b416fc@linux.intel.com>
Date: Tue, 23 Mar 2021 14:37:59 -0700
In-Reply-To: <20210129172727.GA30719@vingu-book>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/29/21 9:27 AM, Vincent Guittot wrote:
>
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().

On a well-known database workload, we also saw a lot of overhead from calling
update_blocked_averages() in newidle_balance(). So changes that reduce this
overhead are very welcome.

Turning on cgroup induces a 9% throughput degradation on a 2-socket Ice Lake
system with 40 cores per socket.

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance(), caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

	----__blkdev_direct_IO_simple
	    |
	    |----io_schedule_timeout
	    |    |
	    |     ----schedule_timeout
	    |         |
	    |          ----schedule
	    |              |
	    |               ----__schedule
	    |                   |
	    |                    ----pick_next_task_fair
	    |                        |
	    |                         ----newidle_balance
	    |                             |
	     ----update_blocked_averages

We found that update_blocked_averages() became the top consumer of CPU time,
eating up 2% of the CPU cycles once cgroup is turned on.

I hacked up Joel's original patch to rate limit the update of blocked
averages called from newidle_balance(). The 9% throughput degradation was
reduced to 5.4%. We'll be testing Vincent's change to see if it gives a
similar performance improvement.

In our test environment, though, sysctl_sched_migration_cost was kept
much lower (25000) than the default (500000), to encourage migrations
to idle CPUs and reduce latency.
As a result, we got quite a lot of direct calls to update_blocked_averages(),
followed by load_balance() attempts in newidle_balance(), instead of
relegating that responsibility to the idle load balancer. (See the code
snippet from newidle_balance() below.)

	if (this_rq->avg_idle < sysctl_sched_migration_cost ||	<----- sched_migration_cost check
	    !READ_ONCE(this_rq->rd->overload)) {

		rcu_read_lock();
		sd = rcu_dereference_check_sched_domain(this_rq->sd);
		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();

		goto out;	<--- invoke idle load balancer
	}

	raw_spin_unlock(&this_rq->lock);

	update_blocked_averages(this_cpu);

	.... followed by load balance code ---

So offloading update_blocked_averages() to the idle load balancer, as
Vincent's patch does, is less effective in this case with a small
sched_migration_cost.

Looking at the code a bit more, we don't actually load balance every time
in this code path; we only do so when our avg_idle time exceeds some
threshold. Calling update_blocked_averages() immediately is only needed if we
do call load_balance(). If we don't do any load balancing in this code path,
we can let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch? We haven't
really tested this change yet, but we want to see if it makes sense to you.
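To make the idea concrete before the patch itself, here is a small userspace sketch of the "update only if we actually balance" pattern. It is not kernel code: struct sim_domain, sim_newidle_balance(), the counters, and the SD_BALANCE_NEWIDLE value are all hypothetical stand-ins. The per-domain avg_idle check and the way curr_cost accumulates are assumptions made only to model the threshold mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SD_BALANCE_NEWIDLE 0x1	/* stand-in value, not the kernel's definition */

/* Hypothetical, simplified stand-in for a scheduling domain. */
struct sim_domain {
	unsigned int flags;
	uint64_t max_newidle_lb_cost;	/* assumed balancing cost, in ns */
};

int blocked_avg_updates;	/* how often the expensive update ran */
int load_balance_calls;		/* how often we tried to balance */

static void sim_update_blocked_averages(void) { blocked_avg_updates++; }
static int sim_load_balance(void) { load_balance_calls++; return 0; }

/* Walk the domains; pay for the blocked-average update at most once,
 * and only when a load balance is really about to happen. */
int sim_newidle_balance(uint64_t avg_idle, const struct sim_domain *sd, int nr)
{
	bool updated_blocked_avg = false;
	uint64_t curr_cost = 0;
	int pulled_task = 0;

	assert(sd != NULL || nr == 0);

	for (int i = 0; i < nr; i++) {
		/* Assumed threshold: not idle long enough to balance here. */
		if (avg_idle < curr_cost + sd[i].max_newidle_lb_cost)
			break;

		if (sd[i].flags & SD_BALANCE_NEWIDLE) {
			if (!updated_blocked_avg) {
				sim_update_blocked_averages();
				updated_blocked_avg = true;
			}
			pulled_task = sim_load_balance();
			/* Assumption: charge the domain's maximum cost. */
			curr_cost += sd[i].max_newidle_lb_cost;
		}

		if (pulled_task)
			break;
	}
	return pulled_task;
}
```

With two domains costing 1000 and 2000, a call with avg_idle = 500 bails out before any update, while a call with avg_idle = 10000 balances in both domains but updates the blocked averages only once.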
Tim

Signed-off-by: Tim Chen
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	u64 curr_cost = 0;
+	bool updated_blocked_avg = false;
 
 	update_misfit_status(NULL, this_rq);
 	/*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_unlock(&this_rq->lock);
 
-	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			if (!updated_blocked_avg) {
+				update_blocked_averages(this_cpu);
+				updated_blocked_avg = true;
+			}
+
 			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,