Message-ID: <3f87bc5c-1611-8e8d-0ab1-288b516530b2@linux.alibaba.com>
Date: Thu, 9 Jun 2022 07:30:54 +0800
Subject: Re: [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
Cc: linux-kernel@vger.kernel.org
References: <20220608163518.324276-1-dtcccc@linux.alibaba.com> <20220608163518.324276-3-dtcccc@linux.alibaba.com>
In-Reply-To: <20220608163518.324276-3-dtcccc@linux.alibaba.com>

On 2022/6/9 00:35, Tianchen Ding wrote:
> The wakelist can help avoid cache bouncing and offload overhead from the
> waker cpu. So far, using the wakelist within the same llc only happens
> for WF_ON_CPU, and this limitation can be removed to further improve
> wakeup performance.
>
> Commit 518cd6234178 ("sched: Only queue remote wakeups when crossing
> cache boundaries") disabled queuing tasks on the wakelist when the cpus
> share llc. This is because, at that time, the scheduler had to send an
> IPI to do ttwu_queue_wakelist. Nowadays ttwu_queue_wakelist also
> supports TIF_POLLING, so this is no longer a problem when the wakee cpu
> is idle polling.
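
To make the queueing condition concrete, here is a rough sketch of the idea
(illustrative only, not the actual diff; ttwu_queue_cond(), cpus_share_cache()
and cpu_rq() are existing helpers in kernel/sched/, but the exact checks below
are simplified):

        /* Simplified sketch: decide whether to queue the wakee on a remote wakelist. */
        static bool ttwu_queue_cond(int cpu, int wake_flags)
        {
                /* CPUs in different LLC domains: always queue on the remote wakelist. */
                if (!cpus_share_cache(smp_processor_id(), cpu))
                        return true;

                /*
                 * Same LLC: with this change, queue on the wakelist whenever the
                 * wakee cpu is idle (nr_running == 0), not only in the WF_ON_CPU
                 * case, so the idle wakee enqueues the task on its own rq.
                 */
                if (!cpu_rq(cpu)->nr_running)
                        return true;

                return false;
        }

The point is simply that sharing an llc no longer disables the wakelist when
the target cpu is idle.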
>
> Benefits:
> Queuing the task on an idle cpu helps improve performance on the waker
> cpu and utilization on the wakee cpu, and further improves locality
> because the wakee cpu handles its own rq. This patch helps improve rt on
> our real Java workloads where wakeups happen frequently.
>
> Consider the normal condition (CPU0 and CPU1 share the same llc)
> Before this patch:
>
>         CPU0                                CPU1
>
>     select_task_rq()                        idle
>     rq_lock(CPU1->rq)
>     enqueue_task(CPU1->rq)
>     notify CPU1 (by sending IPI or CPU1 polling)
>
>                                             resched()
>
> After this patch:
>
>         CPU0                                CPU1
>
>     select_task_rq()                        idle
>     add to wakelist of CPU1
>     notify CPU1 (by sending IPI or CPU1 polling)
>
>                                             rq_lock(CPU1->rq)
>                                             enqueue_task(CPU1->rq)
>                                             resched()
>
> We see that CPU0 can finish its work earlier: it only needs to put the
> task on the wakelist and return, while CPU1 is idle and can handle its
> own runqueue data itself.
>
> This patch makes no difference in the number of IPIs.
> It only takes effect when the wakee cpu is:
> 1) idle polling
> 2) idle not polling
>
> For 1), there will be no IPI with or without this patch.
>
> For 2), there will always be one IPI with or without this patch.
> Before this patch: the waker cpu enqueues the task and checks for
> preemption. Since "idle" is sure to be preempted, the waker cpu must
> send a resched IPI.
> After this patch: the waker cpu puts the task on the wakelist of the
> wakee cpu and sends an IPI.
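
As a side note on the IPI accounting above: when the wakee cpu is idle
polling, the wakelist path notifies it just by setting TIF_NEED_RESCHED on
the polling idle task, so no IPI is sent; an IPI only goes out when the idle
cpu is not polling. A simplified sketch of that notify step (modeled on
send_call_function_single_ipi() in kernel/sched/core.c; notify_idle_wakee()
is just a name used for this illustration):

        /* Illustrative sketch of how the wakelist notifies an idle wakee cpu. */
        static void notify_idle_wakee(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                /*
                 * Case 1) idle polling: set TIF_NEED_RESCHED on the polling idle
                 * task; the wakee notices it on its next poll, so no IPI is
                 * needed (same as before the patch).
                 */
                if (set_nr_if_polling(rq->idle))
                        return;

                /*
                 * Case 2) idle, not polling: exactly one IPI either way -- a
                 * resched IPI before the patch, a wakelist IPI after it.
                 */
                arch_send_call_function_single_ipi(cpu);
        }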
>
> Benchmark:
> We've tested schbench, unixbench, and hackbench on both x86 and arm64.
>
> On x86 (Intel Xeon Platinum 8269CY):
> schbench -m 2 -t 8
>
>   Latency percentiles (usec)        before        after
>       50.0000th:                         8             6
>       75.0000th:                        10             7
>       90.0000th:                        11             8
>       95.0000th:                        12             8
>       *99.0000th:                       13            10
>       99.5000th:                        15            11
>       99.9000th:                        18            14
>
> Unixbench with full threads (104)
>                                               before        after
>   Dhrystone 2 using register variables    3011862938   3009935994  -0.06%
>   Double-Precision Whetstone                617119.3     617298.5   0.03%
>   Execl Throughput                           27667.3      27627.3  -0.14%
>   File Copy 1024 bufsize 2000 maxblocks     785871.4     784906.2  -0.12%
>   File Copy 256 bufsize 500 maxblocks       210113.6     212635.4   1.20%
>   File Copy 4096 bufsize 8000 maxblocks    2328862.2    2320529.1  -0.36%
>   Pipe Throughput                        145535622.8  145323033.2  -0.15%
>   Pipe-based Context Switching             3221686.4    3583975.4  11.25%
>   Process Creation                          101347.1     103345.4   1.97%
>   Shell Scripts (1 concurrent)              120193.5     123977.8   3.15%
>   Shell Scripts (8 concurrent)               17233.4      17138.4  -0.55%
>   System Call Overhead                     5300604.8    5312213.6   0.22%
>
> hackbench -g 1 -l 100000
>                                               before        after
>   Time                                           3.246        2.251
>
> On arm64 (Ampere Altra):
> schbench -m 2 -t 8
>
>   Latency percentiles (usec)        before        after
>       50.0000th:                        14            10
>       75.0000th:                        19            14
>       90.0000th:                        22            16
>       95.0000th:                        23            16
>       *99.0000th:                       24            17
>       99.5000th:                        24            17
>       99.9000th:                        28            25
>
> Unixbench with full threads (80)
>                                               before        after
>   Dhrystone 2 using register variables    3536194249   3536476016  -0.01%
>   Double-Precision Whetstone                629383.6     629333.3  -0.01%
>   Execl Throughput                           65920.5      66288.8  -0.49%
>   File Copy 1024 bufsize 2000 maxblocks    1038402.1    1050181.2   1.13%
>   File Copy 256 bufsize 500 maxblocks       311054.2     310317.2  -0.24%
>   File Copy 4096 bufsize 8000 maxblocks    2276795.6      2297703   0.92%
>   Pipe Throughput                        130409359.9  130390848.7  -0.01%
>   Pipe-based Context Switching             3148440.7    3383705.1   7.47%
>   Process Creation                          111574.3     119728.6   7.31%
>   Shell Scripts (1 concurrent)              122980.7     122657.4  -0.26%
>   Shell Scripts (8 concurrent)               17482.8      17476.8  -0.03%
>   System Call Overhead                     4424103.4    4430062.6   0.13%
>
>   Dhrystone 2 using register variables    3536194249   3537019613   0.02%
>   Double-Precision Whetstone                629383.6     629431.6   0.01%
>   Execl Throughput                           65920.5      65846.2  -0.11%
>   File Copy 1024 bufsize 2000 maxblocks    1063722.8    1064026.8   0.03%
>   File Copy 256 bufsize 500 maxblocks       322684.5     318724.5  -1.23%
>   File Copy 4096 bufsize 8000 maxblocks    2348285.3    2328804.8  -0.83%
>   Pipe Throughput                        133542875.3  131619389.8  -1.44%
>   Pipe-based Context Switching             3215356.1    3576945.1  11.25%
>   Process Creation                          108520.5     120184.6  10.75%
>   Shell Scripts (1 concurrent)              122636.3       121888  -0.61%
>   Shell Scripts (8 concurrent)               17462.1      17381.4  -0.46%
>   System Call Overhead                     4429998.9    4435006.7   0.11%

Oops... I forgot to remove the previous result. Let me resend one.