From: Vincent Guittot
Date: Tue, 10 Oct 2023 17:11:04 +0200
Subject: Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU
To: Mathieu Desnoyers
Cc: Chen Yu, Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar,
 Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Juri Lelli, Swapnil Sapkal, Aaron Lu,
 Tim Chen, K Prateek Nayak, "Gautham R. Shenoy", x86@kernel.org

On Tue, 10 Oct 2023 at 15:49, Mathieu Desnoyers wrote:
>
> On 2023-10-09 01:14, Chen Yu wrote:
> > On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
> >> On 9/30/23 03:11, Chen Yu wrote:
> >>> Hi Mathieu,
> >>>
> >>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
> >>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
> >>>> select_task_rq towards the previous CPU if it was almost idle
> >>>> (avg_load <= 0.1%).
> >>>
> >>> Yes, this is a promising direction IMO. One question is:
> >>> can cfs_rq->avg.load_avg be used for a percentage comparison?
> >>> If I understand correctly, load_avg reflects that more than
> >>> 1 task could have been running on this runqueue, and the
> >>> load_avg is in direct proportion to the load_weight of that
> >>> cfs_rq. Besides, LOAD_AVG_MAX seems not to be the max value
> >>> that load_avg can reach; it is the sum of
> >>>   1024 * (y + y^1 + y^2 ...)
> >>>
> >>> For example:
> >>>   taskset -c 1 nice -n -20 stress -c 1
> >>>   cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
> >>>   .load_avg : 88763
> >>>   .load_avg : 1024
> >>>
> >>> 88763 is higher than LOAD_AVG_MAX=47742
> >>
> >> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
> >> but it appears that it does not happen in practice.
> >>
> >> That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
> >> does it really matter?
> >>
> >>> Maybe the util_avg can be used for percentage comparison, I suppose?
> >> [...]
> >>> Or
> >>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?
> >>
> >> Unfortunately, using util_avg does not seem to work based on my testing,
> >> even at utilization thresholds of 0.1%, 1% and 10%.
> >>
> >> Based on comments in fair.c:
> >>
> >>  * CPU utilization is the sum of running time of runnable tasks plus the
> >>  * recent utilization of currently non-runnable tasks on that CPU.
> >>
> >> I think we don't want to include currently non-runnable tasks in the
> >> statistics we use, because we are trying to figure out if the cpu is an
> >> idle-enough target based on the tasks which are currently running, for the
> >> purpose of runqueue selection when waking up a task which is considered at
> >> that point in time a non-runnable task on that cpu, and which is about to
> >> become runnable again.
> >>
> >
> > Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find
> > a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX-based
> > threshold is modified a little bit:
> >
> > The theory is: if there is only 1 task on the CPU, that task has a nice
> > value of 0, and the task runs for 50 us every 1000 us, then this CPU is
> > regarded as almost idle.
> >
> > The load_sum of the task is:
> >   50 * (1 + y + y^2 + ... + y^n)
> > The corresponding load_avg of the task is approximately
> >   NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
> > So:
> >
> > /* which is close to LOAD_AVG_MAX/1000 = 47 */
> > #define ALMOST_IDLE_CPU_LOAD 50
>
> Sorry to be slow at understanding this concept, but this whole "load"
> value is still somewhat magic to me.
>
> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it
> independent? Where is it documented that the load is a value in "us"
> out of a window of 1000 us?
Nowhere, because load_avg is not in usec. load_avg is the sum of the
entities' load_avg, which is based on the weight of each entity. The
weight of an entity is in the range [2:88761], and so, as a result, is
its load_avg. LOAD_AVG_MAX can be used with the *_sum fields, but not
with the *_avg fields of struct sched_avg.

If you want to evaluate the idleness of a CPU with a PELT signal, you
would be better off using util_avg or runnable_avg, which are unweighted
values in the range [0:1024].

> And with this value "50", it would cover the case where there is only a
> single task taking less than 50 us per 1000 us, and cases where the sum
> for the set of tasks on the runqueue is taking less than 50 us per
> 1000 us overall.
>
> >
> > static bool
> > almost_idle_cpu(int cpu, struct task_struct *p)
> > {
> >         if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
> >                 return false;
> >         return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
> > }
> >
> > Tested this on an Intel Xeon Platinum 8360Y (Ice Lake server, 36 cores
> > per package, 72 cores/144 CPUs in total).
> > A slight improvement is observed in hackbench socket mode:
> >
> > socket mode:
> > hackbench -g 16 -f 20 -l 480000 -s 100
> >
> > Before patch:
> > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > Each sender will pass 480000 messages of 100 bytes
> > Time: 81.084
> >
> > After patch:
> > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > Each sender will pass 480000 messages of 100 bytes
> > Time: 78.083
> >
> >
> > pipe mode:
> > hackbench -g 16 -f 20 --pipe -l 480000 -s 100
> >
> > Before patch:
> > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > Each sender will pass 480000 messages of 100 bytes
> > Time: 38.219
> >
> > After patch:
> > Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
> > Each sender will pass 480000 messages of 100 bytes
> > Time: 38.348
> >
> > It suggests that, if the workload has a larger working-set/cache footprint,
> > waking the task up on its previous CPU can bring more benefit.
>
> In those tests, what is the average % of idleness of your cpus?
>
> Thanks,
>
> Mathieu
>
> >
> > thanks,
> > Chenyu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>