From: Vincent Guittot
Date: Thu, 12 Oct 2023 17:26:36 +0200
Subject: Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU
References: <20230929183350.239721-1-mathieu.desnoyers@efficios.com> <0f3cfff3-0df4-3cb7-95cb-ea378517e13b@efficios.com>
To: Chen Yu
Cc: Mathieu Desnoyers, Peter Zijlstra, linux-kernel@vger.kernel.org,
    Ingo Molnar, Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Tim Chen, K Prateek Nayak, "Gautham R . Shenoy", x86@kernel.org

On Wed, 11 Oct 2023 at 12:17, Chen Yu wrote:
>
> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote:
> > On 2023-10-09 01:14, Chen Yu wrote:
> > > On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
> > > > On 9/30/23 03:11, Chen Yu wrote:
> > > > > Hi Mathieu,
> > > > >
> > > > > On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
> > > > > > Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
> > > > > > select_task_rq towards the previous CPU if it was almost idle
> > > > > > (avg_load <= 0.1%).
> > > > >
> > > > > Yes, this is a promising direction IMO. One question is that,
> > > > > can cfs_rq->avg.load_avg be used for percentage comparison?
> > > > > If I understand correctly, load_avg reflects that more than
> > > > > 1 tasks could have been running this runqueue, and the
> > > > > load_avg is the direct proportion to the load_weight of that
> > > > > cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value
> > > > > that load_avg can reach, it is the sum of
> > > > > 1024 * (y + y^1 + y^2 ... )
> > > > >
> > > > > For example,
> > > > > taskset -c 1 nice -n -20 stress -c 1
> > > > > cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
> > > > > .load_avg : 88763
> > > > > .load_avg : 1024
> > > > >
> > > > > 88763 is higher than LOAD_AVG_MAX=47742
> > > > I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
> > > > but it appears that it does not happen in practice.
> > > > That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
> > > > does it really matter ?
> > > > > Maybe the util_avg can be used for percentage comparison I suppose?
> > > > [...]
> > > > > Or
> > > > > return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?
> > > >
> > > > Unfortunately using util_avg does not seem to work based on my testing.
> > > > Even at utilization thresholds at 0.1%, 1% and 10%.
> > > > Based on comments in fair.c:
> > > >
> > > >  * CPU utilization is the sum of running time of runnable tasks plus the
> > > >  * recent utilization of currently non-runnable tasks on that CPU.
> > > >
> > > > I think we don't want to include currently non-runnable tasks in the
> > > > statistics we use, because we are trying to figure out if the cpu is an
> > > > idle-enough target based on the tasks which are currently running, for the
> > > > purpose of runqueue selection when waking up a task which is considered at
> > > > that point in time a non-runnable task on that cpu, and which is about to
> > > > become runnable again.
> > > >
> > >
> > > Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find
> > > a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX
> > > based threshold is modified a little bit:
> > >
> > > The theory is, if there is only 1 task on the CPU, and that task has a nice
> > > of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost
> > > idle.
> > >
> > > The load_sum of the task is:
> > > 50 * (1 + y + y^2 + ... + y^n)
> > > The corresponding avg_load of the task is approximately
> > > NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
> > > So:
> > >
> > > /* which is close to LOAD_AVG_MAX/1000 = 47 */
> > > #define ALMOST_IDLE_CPU_LOAD 50
> > Sorry to be slow at understanding this concept, but this whole "load" value
> > is still somewhat magic to me.
> >
> > Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ?
> > Where is it documented that the load is a value in "us" out of a window of
> > 1000 us ?
> >
>
> My understanding is that, the load_sum of a single task is a value in "us" out
> of a window of 1000 us, while the load_avg of the task will multiply the weight

I'm not sure we can say this. We use a 1024us sampling rate for calculating
the weighted average, but load_sum is in the range [0:47742], so what would
47742us out of a window of 1000us mean?

Besides this, we have util_avg in the range [0:cpu capacity], which gives you
the average running time of the CPU.

> of the task. In this case a task with nice 0 is NICE_0_WEIGHT = 1024.
>
> __update_load_avg_se -> ___update_load_sum calculate the load_sum of a task (there
> is comments around ___update_load_sum to describe the pelt calculation),
> and ___update_load_avg() calculate the load_avg based on the task's weight.
>
> > And with this value "50", it would cover the case where there is only a
> > single task taking less than 50us per 1000us, and cases where the sum for
> > the set of tasks on the runqueue is taking less than 50us per 1000us
> > overall.
> >
> > >
> > > static bool
> > > almost_idle_cpu(int cpu, struct task_struct *p)
> > > {
> > >         if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
> > >                 return false;
> > >         return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
> > > }
> > >
> > > Tested this on Intel Xeon Platinum 8360Y, Ice Lake server, 36 core/package,
> > > total 72 core/144 CPUs. Slight improvement is observed in hackbench socket mode:
> > >
> > > socket mode:
> > > hackbench -g 16 -f 20 -l 480000 -s 100
> > >
> > > Before patch:
> > > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > > Each sender will pass 480000 messages of 100 bytes
> > > Time: 81.084
> > >
> > > After patch:
> > > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > > Each sender will pass 480000 messages of 100 bytes
> > > Time: 78.083
> > >
> > >
> > > pipe mode:
> > > hackbench -g 16 -f 20 --pipe -l 480000 -s 100
> > >
> > > Before patch:
> > > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > > Each sender will pass 480000 messages of 100 bytes
> > > Time: 38.219
> > >
> > > After patch:
> > > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > > Each sender will pass 480000 messages of 100 bytes
> > > Time: 38.348
> > >
> > > It suggests that, if the workload has larger working-set/cache footprint, waking up
> > > the task on its previous CPU could get more benefit.
> >
> > In those tests, what is the average % of idleness of your cpus ?
> >
>
> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle
> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle
>
> Then the CPUs in package 1 are offlined to get stable result when the group number is low.
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Some CPUs are busy, others are idle, and some are half-busy.
> Core    CPU     Busy%
> -       -       49.57
> 0       0       1.89
> 0       72      75.55
> 1       1       100.00
> 1       73      0.00
> 2       2       100.00
> 2       74      0.00
> 3       3       100.00
> 3       75      0.01
> 4       4       78.29
> 4       76      17.72
> 5       5       100.00
> 5       77      0.00
>
>
> hackbench -g 1 -f 20 -l 480000 -s 100
> Core    CPU     Busy%
> -       -       48.29
> 0       0       57.94
> 0       72      21.41
> 1       1       83.28
> 1       73      0.00
> 2       2       11.44
> 2       74      83.38
> 3       3       21.45
> 3       75      77.27
> 4       4       26.89
> 4       76      80.95
> 5       5       5.01
> 5       77      83.09
>
>
> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 480000 messages of 100 bytes
> Time: 9.434
>
> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 480000 messages of 100 bytes
> Time: 9.373
>
> thanks,
> Chenyu
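
As a rough illustration of the PELT arithmetic discussed in the thread, the
following standalone userspace sketch (not kernel code; it only assumes the
usual PELT constants of 1024us segments and a decay factor y with y^32 = 0.5)
reproduces the approximate LOAD_AVG_MAX value and the ~50 load_avg of a nice-0
task that runs 50us out of every 1024us segment:

/*
 * Standalone userspace sketch, not kernel code: it only assumes the
 * usual PELT constants (1024us segments, decay factor y with y^32 = 0.5)
 * and redoes the arithmetic from the thread in floating point.
 */
#include <stdio.h>
#include <math.h>

#define NICE_0_WEIGHT 1024.0

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */

        /*
         * Geometric series 1024 * (1 + y + y^2 + ...) = 1024 / (1 - y).
         * This lands near the kernel's LOAD_AVG_MAX = 47742, which is
         * computed with fixed-point rounding and is therefore a bit smaller.
         */
        double load_avg_max = 1024.0 / (1.0 - y);

        /*
         * A single nice-0 task running 50us out of every 1024us segment:
         * load_sum ~= 50 * (1 + y + y^2 + ...), and
         * load_avg  = NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX ~= 50,
         * i.e. right at the proposed ALMOST_IDLE_CPU_LOAD threshold.
         */
        double load_sum = 50.0 / (1.0 - y);
        double load_avg = NICE_0_WEIGHT * load_sum / load_avg_max;

        printf("1024 / (1 - y)          ~= %.0f\n", load_avg_max);
        printf("load_avg of such a task ~= %.1f\n", load_avg);
        return 0;
}

Saved as e.g. pelt_sketch.c and built with "cc pelt_sketch.c -lm", it prints
values close to 47742 and 50, which is the reasoning behind the
ALMOST_IDLE_CPU_LOAD = 50 threshold quoted above.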