From: Chen Yu
To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli
Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Mel Gorman, K Prateek Nayak, Gautham R. Shenoy, Chen Yu, linux-kernel@vger.kernel.org
Subject: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Date: Tue, 21 Nov 2023 15:40:14 +0800
Message-Id: <35e612eb2851693a52f7ed1ff9be5bc24011136f.1700548379.git.yu.c.chen@intel.com>

Problem statement:

When task p is woken up, the scheduler leverages select_idle_sibling() to find an idle CPU for it. p's previous CPU is usually preferred because it can improve cache locality.
However, in many cases the previous CPU has already been taken by other wakees, so p has to find another idle CPU.

Proposal:

Introduce SIS_CACHE. It considers the sleep time of the task for better task placement: based on the task's short sleeping history, tag p's previous CPU as cache-hot. Later, when p is woken up, it can choose its previous CPU in select_idle_sibling(). When another task is woken up, this cache-hot idle CPU is skipped when possible. SIS_CACHE still prefers to choose an idle CPU during task wakeup; the idea is to optimize the idle CPU scan sequence.

As pointed out by Prateek, this has the potential downside that all idle CPUs are cache-hot and get skipped. Mitigate this by returning the first cache-hot idle CPU as a fallback. Meanwhile, to reduce the time spent scanning, limit the maximum cache-hot CPU search depth to half of the number suggested by SIS_UTIL.

Tested on 2 x 60C/120T Xeon platforms:

netperf
=======
case            load          baseline(std%)  compare%( std%)
TCP_RR          60-threads     1.00 (  1.37)   +0.04 (  1.47)
TCP_RR          120-threads    1.00 (  1.77)   -1.03 (  1.31)
TCP_RR          180-threads    1.00 (  2.03)   +1.25 (  1.66)
TCP_RR          240-threads    1.00 ( 41.31)  +73.71 ( 22.02)
TCP_RR          300-threads    1.00 ( 12.79)   -0.11 ( 15.84)

tbench
======
case            load          baseline(std%)  compare%( std%)
loopback        60-threads     1.00 (  0.35)   +0.40 (  0.31)
loopback        120-threads    1.00 (  1.94)   -1.89 (  1.17)
loopback        180-threads    1.00 ( 13.59)   +9.97 (  0.93)
loopback        240-threads    1.00 ( 11.68)  +42.85 (  7.28)
loopback        300-threads    1.00 (  4.47)  +15.12 (  1.40)

hackbench
=========
case            load          baseline(std%)  compare%( std%)
process-pipe    1-groups       1.00 (  9.21)   -7.88 (  2.03)
process-pipe    2-groups       1.00 (  7.09)   +5.47 (  9.02)
process-pipe    4-groups       1.00 (  1.60)   +1.53 (  1.70)

schbench
========
case            load          baseline(std%)  compare%( std%)
normal          1-mthreads     1.00 (  0.98)   +0.26 (  0.37)
normal          2-mthreads     1.00 (  3.99)   -7.97 (  7.33)
normal          4-mthreads     1.00 (  3.07)   -1.55 (  3.27)

Also ran some experiments with an OLTP workload on a 112-core, 2-socket SPR machine.
The OLTP workload has a mixture of threads handling database updates on disk and handling transaction queries over the network. Around 0.7% improvement is observed, with less than 0.2% run-to-run variation.

Thanks to Madadi for testing SIS_CACHE on a Power system with 96 CPUs. The results showed a maximum of 29% improvement in hackbench, 13% improvement in the producer_consumer workload, and 2% improvement in a real-life workload named Daytrader.

Thanks to Prateek for running microbenchmarks on top of the latest patch on a 3rd Generation EPYC system:
- 2 sockets, each with 64C/128T
- NPS1 (each socket is a NUMA node)
- C2 disabled (POLL and C1 (MWAIT) remained enabled)
No consistent regression was observed in the v2 version.

Analysis:

Waking up a task on its previous CPU brings benefits because of fewer task migrations and better local resource locality. Take the netperf 240-threads case as an example: run the following script to track the number of migrations within 10 seconds, and use perf topdown to track the PMU events.
The task migration and stall cycles have been reduced a lot with SIS_CACHE:

kretfunc:select_task_rq_fair
{
	$p = (struct task_struct *)args->p;
	if ($p->comm == "netperf") {
		if ($p->thread_info.cpu != retval) {
			@wakeup_migrate_netperf = count();
		} else {
			@wakeup_prev_netperf = count();
		}
	}
}

NO_SIS_CACHE:
@wakeup_migrate_netperf: 57473509
@wakeup_prev_netperf: 14935964
RESOURCE_STALLS: 19.1% * 7.1% * 35.0%

SIS_CACHE:
@wakeup_migrate_netperf: 799
@wakeup_prev_netperf: 132937436
RESOURCE_STALLS: 5.4% * 7.5% * 39.8%

Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c309b3d203c0..d149eca74fca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7360,7 +7360,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  * Return true if the idle CPU is cache-hot for someone,
  * return false otherwise.
  */
-static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
+static bool cache_hot_cpu(int cpu, int *hot_cpu)
 {
 	if (!sched_feat(SIS_CACHE))
 		return false;
@@ -7383,7 +7383,7 @@ static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, nr_hot = 0, hot_cpu = -1;
 	struct sched_domain_shared *sd_share;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -7396,6 +7396,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 			/* overloaded LLC is unlikely to have idle cpu/core */
 			if (nr == 1)
 				return -1;
+
+			/* max number of cache-hot idle cpu check */
+			nr_hot = nr >> 1;
 		}
 	}
 
@@ -7426,18 +7429,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(i, &hot_cpu))
+					continue;
+
 				return i;
+			}
 		} else {
 			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
-			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+			if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(idle_cpu, &hot_cpu))
+					continue;
+
 				break;
+			}
 		}
 	}
 
+	/* pick the first cache-hot CPU as the last resort */
+	if (idle_cpu == -1 && hot_cpu != -1)
+		idle_cpu = hot_cpu;
+
 	if (has_idle_core)
 		set_idle_cores(target, false);
-- 
2.25.1