From: Aaron Lu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar
Cc: Dietmar Eggemann, Mathieu Desnoyers, Gautham R. Shenoy, David Vernet,
 Nitin Tekchandani, Yu Chen, Daniel Jordan, Tim Chen, Swapnil Sapkal,
 linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg
Date: Tue, 12 Sep 2023 14:58:08 +0800
Message-ID: <20230912065808.2530-2-aaron.lu@intel.com>
In-Reply-To: <20230912065808.2530-1-aaron.lu@intel.com>
References: <20230912065808.2530-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance with sysbench's
nr_threads set to nr_cpu, it is observed that update_cfs_group() and
update_load_avg() at times show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

Annotation shows the cycles are mostly spent accessing tg->load_avg, with
update_load_avg() being the write side and update_cfs_group() being the read
side. tg->load_avg is per task group, so when different tasks of the same
task group running on different CPUs access it frequently, it can be heavily
contended.

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window there are 14 million wakeups and 11
million migrations. With each migration, the task's load transfers from the
source cfs_rq to the target cfs_rq, and each such change involves an update
to tg->load_avg. Since the workload can trigger that many wakeups and
migrations, the accesses (both reads and writes) to tg->load_avg are
unbounded, and as a result the two functions above show noticeable overhead.
With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s
window there are 21 million wakeups and 14 million migrations;
update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once per
ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests I've done. After this change, the cost
of accessing tg->load_avg is greatly reduced and performance improves.
Detailed test results below.
==============================
postgres_sysbench on SPR:
 25%  base:  42382±19.8%  patch:  50174±9.5%   (noise)
 50%  base:  67626±1.3%   patch:  67365±3.1%   (noise)
 75%  base: 100216±1.2%   patch: 112470±0.1%   +12.2%
100%  base:  93671±0.4%   patch: 113563±0.2%   +21.2%

==============================
hackbench on ICL:
group=1   base: 114912±5.2%   patch: 117857±2.5%   (noise)
group=4   base: 359902±1.6%   patch: 361685±2.7%   (noise)
group=8   base: 461070±0.8%   patch: 491713±0.3%   +6.6%
group=16  base: 309032±5.0%   patch: 378337±1.3%   +22.4%

=============================
hackbench on SPR:
group=1   base: 100768±2.9%   patch: 103134±2.9%   (noise)
group=4   base: 413830±12.5%  patch: 378660±16.6%  (noise)
group=8   base: 436124±0.6%   patch: 490787±3.2%   +12.5%
group=16  base: 457730±3.2%   patch: 680452±1.3%   +48.8%

============================
netperf/udp_rr on ICL:
 25%  base: 114413±0.1%  patch: 115111±0.0%  +0.6%
 50%  base:  86803±0.5%  patch:  86611±0.0%  (noise)
 75%  base:  35959±5.3%  patch:  49801±0.6%  +38.5%
100%  base:  61951±6.4%  patch:  70224±0.8%  +13.4%

===========================
netperf/udp_rr on SPR:
 25%  base: 104954±1.3%  patch: 107312±2.8%  (noise)
 50%  base:  55394±4.6%  patch:  54940±7.4%  (noise)
 75%  base:  13779±3.1%  patch:  36105±1.1%  +162%
100%  base:   9703±3.7%  patch:  28011±0.2%  +189%

==============================================
netperf/tcp_stream on ICL (all in noise range):
 25%  base: 43092±0.1%   patch: 42891±0.5%
 50%  base: 19278±14.9%  patch: 22369±7.2%
 75%  base: 16822±3.0%   patch: 17086±2.3%
100%  base: 18216±0.6%   patch: 18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range):
 25%  base: 34491±0.3%   patch: 34886±0.5%
 50%  base: 19278±14.9%  patch: 22369±7.2%
 75%  base: 16822±3.0%   patch: 17086±2.3%
100%  base: 18216±0.6%   patch: 18078±2.9%

Reported-by: Nitin Tekchandani
Suggested-by: Vincent Guittot
Signed-off-by: Aaron Lu
Reviewed-by: Vincent Guittot
Reviewed-by: Mathieu Desnoyers
Tested-by: Mathieu Desnoyers
Reviewed-by: David Vernet
Tested-by: Swapnil Sapkal
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..0dd6f44c8e02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3887,7 +3887,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3895,9 +3896,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3a01b7a2bf66..7df010ef9e5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0
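
[Editor's note] The once-per-ms rate-limit pattern in update_tg_load_avg()
above can be sketched as a small userspace C model. The names here
(struct group, struct rq_state, update_group_load()) are hypothetical
stand-ins for tg, cfs_rq and update_tg_load_avg(), and the timestamp is
passed in explicitly rather than read from sched_clock_cpu(); this is a
sketch of the technique, not kernel code.

```c
#include <stdint.h>
#include <stdlib.h>

/* 1 ms in ns, mirroring the kernel's NSEC_PER_MSEC. */
#define RATELIMIT_NS 1000000ULL

/* Hypothetical stand-ins: 'group' plays the role of the shared task
 * group sum (tg->load_avg); 'rq_state' the per-CPU cfs_rq fields. */
struct group {
	long load_avg;		/* shared, contended across CPUs */
};

struct rq_state {
	long load_avg;		/* freshly computed local load */
	long contrib;		/* tg_load_avg_contrib: last value propagated */
	uint64_t last_update_ns;	/* last_update_tg_load_avg */
};

/* Returns 1 if the shared sum was updated, 0 if the update was skipped. */
static int update_group_load(struct group *g, struct rq_state *rq,
			     uint64_t now_ns)
{
	long delta;

	/* Rate limit: at most one propagation per millisecond.
	 * Unsigned subtraction keeps the check correct across clock wrap. */
	if (now_ns - rq->last_update_ns < RATELIMIT_NS)
		return 0;

	delta = rq->load_avg - rq->contrib;
	/* Same 1/64 threshold as update_tg_load_avg(): skip tiny deltas. */
	if (labs(delta) > rq->contrib / 64) {
		g->load_avg += delta;	/* atomic_long_add() in the kernel */
		rq->contrib = rq->load_avg;
		rq->last_update_ns = now_ns;
		return 1;
	}
	return 0;
}
```

As in the patch, the timestamp only advances when a propagation actually
happens, so a delta that stays under the 1/64 threshold is re-evaluated on
the next call rather than being deferred for a full millisecond.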