Message-ID: <1d5cb61a-1ff4-4737-9de5-cb7f2204d9ab@efficios.com>
Date: Thu, 12 Oct 2023 12:05:55 -0400
Subject: Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU
From: Mathieu Desnoyers
To: Vincent Guittot
Cc: Chen Yu, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar,
 Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Juri Lelli, Swapnil Sapkal, Aaron Lu,
 Tim Chen, K Prateek Nayak, "Gautham R. Shenoy", x86@kernel.org
References: <20230929183350.239721-1-mathieu.desnoyers@efficios.com>
 <0f3cfff3-0df4-3cb7-95cb-ea378517e13b@efficios.com>
 <1ae6290c-843f-4e50-9c81-7146d3597ed3@efficios.com>
 <15be1d39-7901-4ffd-8f70-4be0e0f6339b@efficios.com>
In-Reply-To: <15be1d39-7901-4ffd-8f70-4be0e0f6339b@efficios.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023-10-12 11:56, Mathieu Desnoyers wrote:
> On 2023-10-12 11:01, Vincent Guittot wrote:
>> On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers wrote:
>>>
>>> On 2023-10-11 06:16, Chen Yu wrote:
>>>> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote:
>>>>> On 2023-10-09 01:14, Chen Yu wrote:
>>>>>> On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
>>>>>>> On 9/30/23 03:11, Chen Yu wrote:
>>>>>>>> Hi Mathieu,
>>>>>>>>
>>>>>>>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
>>>>>>>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
>>>>>>>>> select_task_rq towards the previous CPU if it was almost idle
>>>>>>>>> (avg_load <= 0.1%).
>>>>>>>>
>>>>>>>> Yes, this is a promising direction IMO. One question is: can
>>>>>>>> cfs_rq->avg.load_avg be used for a percentage comparison? If I
>>>>>>>> understand correctly, load_avg reflects that more than one task
>>>>>>>> could have been running on this runqueue, and load_avg is
>>>>>>>> directly proportional to the load_weight of that cfs_rq.
>>>>>>>> Besides, LOAD_AVG_MAX seems to not be the max value that
>>>>>>>> load_avg can reach; it is the sum of
>>>>>>>>   1024 * (y^0 + y^1 + y^2 + ...)
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> taskset -c 1 nice -n -20 stress -c 1
>>>>>>>> cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
>>>>>>>>       .load_avg                      : 88763
>>>>>>>>       .load_avg                      : 1024
>>>>>>>>
>>>>>>>> 88763 is higher than LOAD_AVG_MAX=47742.
>>>>>>>
>>>>>>> I would have expected the load_avg to be limited to LOAD_AVG_MAX
>>>>>>> somehow, but it appears that this does not happen in practice.
>>>>>>>
>>>>>>> That being said, if the cutoff is really at 0.1% or 0.2% of the
>>>>>>> real max, does it really matter?
>>>>>>>
>>>>>>>> Maybe the util_avg can be used for a percentage comparison, I suppose?
>>>>>>> [...]
>>>>>>>> Or:
>>>>>>>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?
>>>>>>>
>>>>>>> Unfortunately, using util_avg does not seem to work based on my
>>>>>>> testing, even at utilization thresholds of 0.1%, 1% and 10%.
>>>>>>>
>>>>>>> Based on comments in fair.c:
>>>>>>>
>>>>>>>  * CPU utilization is the sum of running time of runnable tasks
>>>>>>>  * plus the recent utilization of currently non-runnable tasks on
>>>>>>>  * that CPU.
>>>>>>>
>>>>>>> I think we don't want to include currently non-runnable tasks in
>>>>>>> the statistics we use, because we are trying to figure out if the
>>>>>>> cpu is an idle-enough target based on the tasks which are
>>>>>>> currently running, for the purpose of runqueue selection when
>>>>>>> waking up a task which is, at that point in time, a non-runnable
>>>>>>> task on that cpu, and which is about to become runnable again.
>>>>>>>
>>>>>>
>>>>>> Although LOAD_AVG_MAX is not the max possible load_avg, we still
>>>>>> want to find a proper threshold to decide if the CPU is almost
>>>>>> idle. The LOAD_AVG_MAX-based threshold is modified a little bit:
>>>>>>
>>>>>> The theory is: if there is only one task on the CPU, that task has
>>>>>> a nice value of 0, and the task runs 50 us every 1000 us, then
>>>>>> this CPU is regarded as almost idle.
>>>>>>
>>>>>> The load_sum of the task is:
>>>>>>   50 * (1 + y + y^2 + ... + y^n)
>>>>>> The corresponding load_avg of the task is approximately:
>>>>>>   NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50
>>>>>> So:
>>>>>>
>>>>>> /* which is close to LOAD_AVG_MAX/1000 = 47 */
>>>>>> #define ALMOST_IDLE_CPU_LOAD   50
>>>>>
>>>>> Sorry to be slow at understanding this concept, but this whole
>>>>> "load" value is still somewhat magic to me.
>>>>>
>>>>> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it
>>>>> independent? Where is it documented that the load is a value in
>>>>> "us" out of a window of 1000 us?
>>>>>
>>>>
>>>> My understanding is that the load_sum of a single task is a value in
>>>> "us" out of a window of 1000 us, while the load_avg of the task
>>>> multiplies that by the weight of the task.
>>>> In this case, a task with nice 0 has NICE_0_WEIGHT = 1024.
>>>>
>>>> __update_load_avg_se -> ___update_load_sum calculates the load_sum
>>>> of a task (there are comments around ___update_load_sum describing
>>>> the pelt calculation), and ___update_load_avg() calculates the
>>>> load_avg based on the task's weight.
>>>
>>> Thanks for your thorough explanation, now it makes sense.
>>>
>>> I understand as well that the cfs_rq->avg.load_sum is the result of
>>> summing each task's load_sum multiplied by its weight:
>>
>> Please don't use load_sum but only *_avg.
>> As already said, util_avg or runnable_avg are better metrics for you.
>
> I think I found out why using util_avg was not working for me.
>
> Considering this comment from cpu_util():
>
>  * CPU utilization is the sum of running time of runnable tasks plus the
>  * recent utilization of currently non-runnable tasks on that CPU.
>
> I don't want to include the recent utilization of currently non-runnable
> tasks on that CPU in order to choose that CPU to do task placement in a
> context where many tasks were recently running on that cpu (but are
> currently blocked). I do not want those blocked tasks to be part of the
> avg.
>
> So I think the issue here is that I was using the cpu_util() (and
> cpu_util_without()) helpers, which consider max(util, runnable) rather
> than just "util".

Actually, AFAIU the part of cpu_util() responsible for adding the
utilization of recently blocked tasks is the code under UTIL_EST.

Thanks,

Mathieu

>
> Based on your comments, just doing this to match an rq util_avg <= 1%
> (10 us of 1024 us) seems to work fine:
>
>   return cpu_rq(cpu)->cfs.avg.util_avg * 1024 <= 10 * capacity_of(cpu);
>
> Is this approach acceptable?
>
> Thanks!
>
> Mathieu
>
>>
>>> static inline void
>>> enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> {
>>>         cfs_rq->avg.load_avg += se->avg.load_avg;
>>>         cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
>>> }
>>>
>>> Therefore I think we need to multiply the load_sum value we aim for
>>> by get_pelt_divider(&cpu_rq(cpu)->cfs.avg) to compare it to an rq
>>> load_sum.
>>>
>>> I plan to compare the rq load_sum to
>>> "10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg)" to match runqueues
>>> which were previously idle (therefore with the prior periods'
>>> contribution to the rq->load_sum being pretty much zero), and which
>>> have a current-period rq load_sum below or equal to 10 us per 1024 us
>>> (<= 1%):
>>>
>>> static inline unsigned long cfs_rq_weighted_load_sum(struct cfs_rq *cfs_rq)
>>> {
>>>         return cfs_rq->avg.load_sum;
>>> }
>>>
>>> static unsigned long cpu_weighted_load_sum(struct rq *rq)
>>> {
>>>         return cfs_rq_weighted_load_sum(&rq->cfs);
>>> }
>>>
>>> /*
>>>  * A runqueue is considered almost idle if:
>>>  *
>>>  *   cfs_rq->avg.load_sum / get_pelt_divider(&cfs_rq->avg) / 1024 <= 1%
>>>  *
>>>  * This inequality is transformed as follows to minimize arithmetic:
>>>  *
>>>  *   cfs_rq->avg.load_sum <= get_pelt_divider(&cfs_rq->avg) * 10
>>>  */
>>> static bool
>>> almost_idle_cpu(int cpu, struct task_struct *p)
>>> {
>>>         if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
>>>                 return false;
>>>         return cpu_weighted_load_sum(cpu_rq(cpu)) <=
>>>                10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg);
>>> }
>>>
>>> Does it make sense?
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>>>
>>>>> And with this value "50", it would cover the case where there is
>>>>> only a single task taking less than 50 us per 1000 us, and cases
>>>>> where the sum for the set of tasks on the runqueue is taking less
>>>>> than 50 us per 1000 us overall.
>>>>>
>>>>>>
>>>>>> static bool
>>>>>> almost_idle_cpu(int cpu, struct task_struct *p)
>>>>>> {
>>>>>>         if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
>>>>>>                 return false;
>>>>>>         return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
>>>>>> }
>>>>>>
>>>>>> Tested this on an Intel Xeon Platinum 8360Y Ice Lake server, 36
>>>>>> cores/package, 72 cores/144 CPUs in total. A slight improvement is
>>>>>> observed in hackbench socket mode:
>>>>>>
>>>>>> socket mode:
>>>>>> hackbench -g 16 -f 20 -l 480000 -s 100
>>>>>>
>>>>>> Before patch:
>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>>>>> Each sender will pass 480000 messages of 100 bytes
>>>>>> Time: 81.084
>>>>>>
>>>>>> After patch:
>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>>>>> Each sender will pass 480000 messages of 100 bytes
>>>>>> Time: 78.083
>>>>>>
>>>>>>
>>>>>> pipe mode:
>>>>>> hackbench -g 16 -f 20 --pipe -l 480000 -s 100
>>>>>>
>>>>>> Before patch:
>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>>>>> Each sender will pass 480000 messages of 100 bytes
>>>>>> Time: 38.219
>>>>>>
>>>>>> After patch:
>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>>>>> Each sender will pass 480000 messages of 100 bytes
>>>>>> Time: 38.348
>>>>>>
>>>>>> It suggests that, if the workload has a larger working-set/cache
>>>>>> footprint, waking up the task on its previous CPU could bring more
>>>>>> benefit.
>>>>>
>>>>> In those tests, what is the average % of idleness of your cpus?
>>>>>
>>>>
>>>> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle.
>>>> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle.
>>>>
>>>> Then the CPUs in package 1 are offlined to get a stable result when
>>>> the group number is low.
>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
>>>> Some CPUs are busy, others are idle, and some are half-busy.
>>>> Core  CPU     Busy%
>>>> -     -       49.57
>>>> 0     0       1.89
>>>> 0     72      75.55
>>>> 1     1       100.00
>>>> 1     73      0.00
>>>> 2     2       100.00
>>>> 2     74      0.00
>>>> 3     3       100.00
>>>> 3     75      0.01
>>>> 4     4       78.29
>>>> 4     76      17.72
>>>> 5     5       100.00
>>>> 5     77      0.00
>>>>
>>>>
>>>> hackbench -g 1 -f 20 -l 480000 -s 100
>>>> Core  CPU     Busy%
>>>> -     -       48.29
>>>> 0     0       57.94
>>>> 0     72      21.41
>>>> 1     1       83.28
>>>> 1     73      0.00
>>>> 2     2       11.44
>>>> 2     74      83.38
>>>> 3     3       21.45
>>>> 3     75      77.27
>>>> 4     4       26.89
>>>> 4     76      80.95
>>>> 5     5       5.01
>>>> 5     77      83.09
>>>>
>>>>
>>>> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
>>>> Each sender will pass 480000 messages of 100 bytes
>>>> Time: 9.434
>>>>
>>>> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
>>>> Each sender will pass 480000 messages of 100 bytes
>>>> Time: 9.373
>>>>
>>>> thanks,
>>>> Chenyu
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com