From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
Date: Fri, 27 May 2022 17:05:44 +0800
Message-Id: <20220527090544.527411-1-dtcccc@linux.alibaba.com>

The main idea of the wakelist is to avoid cache bouncing. However, commit
518cd6234178 ("sched: Only queue remote wakeups when crossing cache
boundaries") disabled queueing tasks on the wakelist when the CPUs share an
LLC. This is because, at that time, the scheduler had to send IPIs to do
ttwu_queue_wakelist. Nowadays ttwu_queue_wakelist also supports TIF_POLLING,
so this is no longer a problem when the wakee CPU is in idle polling.

Benefits:
Queuing the task on the idle CPU helps improve performance on the waker CPU
and utilization on the wakee CPU, and further improves locality because the
wakee CPU handles its own rq. This patch helps improve rt on our real Java
workloads where wakeups happen frequently.

Consider the normal condition (CPU0 and CPU1 share the same LLC).

Before this patch:

         CPU0                                     CPU1
    select_task_rq()                              idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)
                                                  resched()

After this patch:

         CPU0                                     CPU1
    select_task_rq()                              idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)
                                                  rq_lock(CPU1->rq)
                                                  enqueue_task(CPU1->rq)
                                                  resched()

We can see that CPU0 finishes its work earlier: it only needs to put the task
on the wakelist and return. CPU1 is idle anyway, so let it handle its own
runqueue data.

This patch makes no difference with respect to IPIs. It only takes effect
when the wakee CPU is:

  1) idle polling
  2) idle not polling

For 1), there is no IPI with or without this patch. For 2), there is always
exactly one IPI, before or after this patch:

  Before this patch: the waker CPU enqueues the task and checks preemption.
  Since "idle" is sure to be preempted, the waker CPU must send a resched IPI.

  After this patch: the waker CPU puts the task on the wakelist of the wakee
  CPU and sends an IPI.
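For clarity, the condition this patch ends up implementing can be condensed
into the following sketch. This is illustrative only, not the kernel code:
the boolean parameters are made-up stand-ins for cpu_active(),
cpus_share_cache(), available_idle_cpu() and the existing "task is
descheduling and is the only running task" check; see the diff below for the
actual change.

	#include <stdbool.h>

	/* Illustrative sketch of the post-patch wakelist decision. */
	bool use_wakelist(int this_cpu, int target_cpu,
			  bool target_active, bool shares_llc,
			  bool target_idle, bool target_descheduling_last_task)
	{
		/* Never queue to a CPU that is going down (cpu_active()). */
		if (!target_active)
			return false;

		/*
		 * The waker handles its own CPU directly; this check is
		 * moved here from ttwu_queue_wakelist() by this patch.
		 */
		if (target_cpu == this_cpu)
			return false;

		/*
		 * Crossing a cache boundary (cpus_share_cache()): queue
		 * remotely to avoid touching the remote rq.
		 */
		if (!shares_llc)
			return true;

		/*
		 * New in this patch: an idle wakee (available_idle_cpu())
		 * activates the task itself.
		 */
		if (target_idle)
			return true;

		/*
		 * Wakee is descheduling its only runnable task: offload the
		 * activation to the soon-to-be-idle CPU.
		 */
		return target_descheduling_last_task;
	}

With CPU0 as the waker and an idle CPU1 in the same LLC,
use_wakelist(0, 1, true, true, true, false) returns true, which is the new
behaviour shown in the "After this patch" timeline above.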
Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):

schbench -m 2 -t 8

  Latency percentiles (usec)            before        after
      50.0000th:                             8            6
      75.0000th:                            10            7
      90.0000th:                            11            8
      95.0000th:                            12            8
     *99.0000th:                            15           10
      99.5000th:                            16           11
      99.9000th:                            20           14

Unixbench with full threads (104)

                                              before         after
  Dhrystone 2 using register variables    3004614211    3004725417    0.00%
  Double-Precision Whetstone                616764.3      617355.9    0.10%
  Execl Throughput                           26449.2       26468.6    0.07%
  File Copy 1024 bufsize 2000 maxblocks     832763.3      824099.4   -1.04%
  File Copy 256 bufsize 500 maxblocks       210718.7      211775.1    0.50%
  File Copy 4096 bufsize 8000 maxblocks    2393528.2     2398755.4    0.22%
  Pipe Throughput                        144559102.7   144605068.8    0.03%
  Pipe-based Context Switching             3192192.9     3571238.1   11.87%
  Process Creation                           95270.5       98865.4    3.77%
  Shell Scripts (1 concurrent)              113780.6      113924.7    0.13%
  Shell Scripts (8 concurrent)               15557.2       15508.9   -0.31%
  System Call Overhead                     5359984.1     5356711.4   -0.06%

hackbench -g 1 -l 100000

                                              before         after
  Time                                          3.246         2.251

On arm64 (Ampere Altra):

schbench -m 2 -t 8

  Latency percentiles (usec)            before        after
      50.0000th:                            14           10
      75.0000th:                            19           14
      90.0000th:                            22           16
      95.0000th:                            23           16
     *99.0000th:                            24           17
      99.5000th:                            24           17
      99.9000th:                            31           25

Unixbench with full threads (80)

                                              before         after
  Dhrystone 2 using register variables    3536787968    3536476016   -0.01%
  Double-Precision Whetstone                629370.6      629333.3   -0.01%
  Execl Throughput                           66615.9       66288.8   -0.49%
  File Copy 1024 bufsize 2000 maxblocks    1038402.1     1050181.2    1.13%
  File Copy 256 bufsize 500 maxblocks       311054.2      310317.2   -0.24%
  File Copy 4096 bufsize 8000 maxblocks    2276795.6       2297703    0.92%
  Pipe Throughput                        130409359.9   130390848.7   -0.01%
  Pipe-based Context Switching             3148440.7     3383705.1    7.47%
  Process Creation                          111574.3      119728.6    7.31%
  Shell Scripts (1 concurrent)              122980.7      122657.4   -0.26%
  Shell Scripts (8 concurrent)               17482.8       17476.8   -0.03%
  System Call Overhead                     4424103.4     4430062.6    0.13%

hackbench -g 1 -l 100000

                                              before         after
  Time                                          4.217         2.916

Our patch improves schbench, hackbench, and the Pipe-based Context Switching
test of unixbench when idle CPUs exist, and shows no obvious regression on
the other unixbench tests. This can help improve rt in scenarios where
wakeups happen frequently.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2: Modify commit log to describe the key point in detail.
    Add more benchmark results on more archs.
v1: https://lore.kernel.org/all/20220513062427.2375743-1-dtcccc@linux.alibaba.com/
---
 kernel/sched/core.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..8764ad152f6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3817,6 +3817,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	if (!cpu_active(cpu))
 		return false;
 
+	if (cpu == smp_processor_id())
+		return false;
+
 	/*
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
@@ -3824,6 +3827,12 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	if (!cpus_share_cache(smp_processor_id(), cpu))
 		return true;
 
+	/*
+	 * If the CPU is idle, let itself do activation to improve utilization.
+	 */
+	if (available_idle_cpu(cpu))
+		return true;
+
 	/*
 	 * If the task is descheduling and the only running task on the
 	 * CPU then use the wakelist to offload the task activation to
@@ -3839,9 +3848,6 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
-		if (WARN_ON_ONCE(cpu == smp_processor_id()))
-			return false;
-
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
-- 
2.27.0