Subject: Re: [RFC PATCH] sched/fair: Choose the CPU where short task is running during wake up
Date: Thu, 29 Sep 2022 14:59:46 +0800
From: Honglei Wang <wanghonglei@didiglobal.com>
To: Chen Yu, K Prateek Nayak
CC: Peter Zijlstra, Vincent Guittot, Tim Chen, Mel Gorman, Juri Lelli,
 Rik van Riel, Aaron Lu, Abel Wu, Yicong Yang, Gautham R. Shenoy,
 Ingo Molnar, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel@vger.kernel.org

On 2022/9/29 13:25, Chen Yu wrote:
> Hi Prateek,
> On 2022-09-26 at 11:20:16 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> When testing the patch on a dual socket Zen3 system (2 x 64C/128T) we
>> noticed some regressions in some standard benchmarks.
>>
>> tl;dr
>>
>> o Hackbench shows a noticeable regression in most cases. Looking at
>>   schedstat data, we see an increased number of affine wakeups and an
>>   increase in the average wait time. As the LLC size on the Zen3 machine
>>   is only 16 CPUs, there is a good chance the LLC was overloaded and it
>>   required intervention from the load balancer to distribute tasks
>>   optimally.
>>
>> o There is a regression in Stream, caused by more than one Stream thread
>>   piling up on the same LLC. This happens as a result of migration in the
>>   wakeup path, where the logic goes for an affine wakeup if the waker is a
>>   short-lived task, even if the sync flag is not set and the previous CPU
>>   might be idle.
>>
> Nice analysis and thanks for your testing.
>> I'll inline the results and detailed observations below:
>>
>> On 9/15/2022 10:24 PM, Chen Yu wrote:
>>> [Background]
>>> At LPC 2022 Real-time and Scheduling Micro Conference we presented
>>> the cross-CPU wakeup issue. This patch is a text version of the
>>> talk, and hopefully we can clarify the problem; any feedback is
>>> appreciated.
>>>
>>> [re-send due to the previous one not reaching LKML, sorry
>>> for any inconvenience.]
>>>
>>> [Problem Statement]
>>> For a workload that is doing frequent context switches, the throughput
>>> scales well until the number of instances reaches a peak point. After
>>> that peak point, the throughput drops significantly if the number of
>>> instances continues to increase.
>>>
>>> The will-it-scale context_switch1 test case exposes the issue. The
>>> test platform has 112 CPUs per LLC domain. will-it-scale launches
>>> 1, 8, 16 ... 112 instances respectively. Each instance is composed
>>> of 2 tasks, and each pair of tasks does ping-pong scheduling via
>>> pipe_read() and pipe_write(). No task is bound to any CPU.
>>> We found that, once the number of instances is higher than
>>> 56 (112 tasks in total, every CPU has 1 task), the throughput
>>> drops accordingly as the instance number continues to increase:
>>>
>>>           ^
>>> throughput|
>>>           |        X
>>>           |      X   X  X
>>>           |     X         X  X
>>>           |    X               X
>>>           |   X                  X
>>>           |  X
>>>           |  X
>>>           | X
>>>           | X
>>>           |
>>>           +-----------------.------------------->
>>>                             56
>>>                               number of instances
>>>
>>> [Symptom analysis]
>>> Both perf profile and lockstat have shown that the bottleneck
>>> is the runqueue spinlock.
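As a side note, one such pipe ping-pong instance can be sketched in plain C. This is an illustrative stand-in for a will-it-scale context_switch1 instance, not its actual source; the function name and round count are made up for the demo:

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* One ping-pong pair: the parent writes a byte, the child echoes it
 * back. Every round trip forces a context switch on each side, which
 * is the pattern that hammers the runqueue locks at scale. */
static int pingpong_rounds(int rounds)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char tok = 'x';
	int done = 0;
	pid_t pid;

	if (pipe(p2c) || pipe(c2p))
		return -1;

	pid = fork();
	if (pid == 0) {		/* child: echo every byte back */
		for (int i = 0; i < rounds; i++) {
			if (read(p2c[0], &tok, 1) != 1 ||
			    write(c2p[1], &tok, 1) != 1)
				_exit(1);
		}
		_exit(0);
	}
	for (int i = 0; i < rounds; i++) {
		if (write(p2c[1], &tok, 1) != 1 ||
		    read(c2p[0], &tok, 1) != 1)
			break;
		done++;
	}
	waitpid(pid, NULL, 0);
	return done;
}
```

Running many such unbound pairs, as the test does, is what generates the cross-CPU wakeups profiled below.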
Take perf profile for example:
>>>
>>> nr_instance   rq lock percentage
>>> 1             1.22%
>>> 8             1.17%
>>> 16            1.20%
>>> 24            1.22%
>>> 32            1.46%
>>> 40            1.61%
>>> 48            1.63%
>>> 56            1.65%
>>> --------------------------
>>> 64            3.77%    |
>>> 72            5.90%    | increase
>>> 80            7.95%    |
>>> 88            9.98%    v
>>> 96            11.81%
>>> 104           13.54%
>>> 112           15.13%
>>>
>>> And the rq lock bottleneck is composed of two paths (perf profile):
>>>
>>> (path1):
>>> raw_spin_rq_lock_nested.constprop.0;
>>> try_to_wake_up;
>>> default_wake_function;
>>> autoremove_wake_function;
>>> __wake_up_common;
>>> __wake_up_common_lock;
>>> __wake_up_sync_key;
>>> pipe_write;
>>> new_sync_write;
>>> vfs_write;
>>> ksys_write;
>>> __x64_sys_write;
>>> do_syscall_64;
>>> entry_SYSCALL_64_after_hwframe;write
>>>
>>> (path2):
>>> raw_spin_rq_lock_nested.constprop.0;
>>> __sched_text_start;
>>> schedule_idle;
>>> do_idle;
>>> cpu_startup_entry;
>>> start_secondary;
>>> secondary_startup_64_no_verify
>>>
>>> The idle percentage is around 30% when there are 112 instances:
>>> %Cpu0 : 2.7 us, 66.7 sy, 0.0 ni, 30.7 id
>>>
>>> As a comparison, if we set CPU affinity to these workloads,
>>> which stops them from migrating among CPUs, the idle percentage
>>> drops to nearly 0%, and the throughput increases by about 300%.
>>> This indicates that there is room for optimization.
>>>
>>> A possible scenario to describe the lock contention:
>>> task A tries to wake up task B on CPU1, so task A grabs the
>>> runqueue lock of CPU1. If CPU1 is about to quit idle, it needs
>>> to grab its own lock, which has been taken by someone else. Then
>>> CPU1 takes more time to quit idle, which hurts performance.
>>>
>>> TTWU_QUEUE could mitigate the cross-CPU runqueue lock contention.
>>> Since commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU
>>> on wakelist if wakee cpu is idle"), TTWU_QUEUE offloads the work from
>>> the waker and leverages the idle CPU to queue the wakee. However, a long
>>> idle duration is still observed.
>>> The idle task spends quite some time in sched_ttwu_pending() before
>>> it switches out. This long idle duration misleads SIS_UTIL, which
>>> then suggests that the waker scan for more CPUs. The time spent
>>> searching for an idle CPU makes the wakee wait longer, which in turn
>>> leads to more idle time. The NEWLY_IDLE balance fails to pull tasks
>>> to the idle CPU, possibly because no runnable wakee can be found.
>>>
>>> [Proposal]
>>> If a system is busy, and the workloads are doing frequent context
>>> switches, it might not be a good idea to spread the wakees across
>>> different CPUs. Instead, taking the task running time into account
>>> and enhancing wake affine might be applicable.
>>>
>>> This idea was suggested by Rik at LPC 2019 when discussing latency
>>> nice. He asked the following question: if P1 is a small-time-slice
>>> task on a CPU, can we put the waking task P2 on that CPU and wait for
>>> P1 to release the CPU, without wasting time searching for an idle CPU?
>>> At LPC 2021 Vincent Guittot proposed:
>>> 1. If the wakee is a long-running task, should we skip the short idle CPU?
>>> 2. If the wakee is a short-running task, can we put it onto a lightly
>>>    loaded local CPU?
>>>
>>> The current proposal is a variant of 2:
>>> If the target CPU is running a short-time-slice task, and the wakee
>>> is also a short-time-slice task, the target CPU can be chosen as the
>>> candidate when the system is busy.
>>>
>>> The definition of a short-time-slice task is: the average running time
>>> of the task during each run is no more than sysctl_sched_min_granularity.
>>> If a task switches in and then voluntarily relinquishes the CPU
>>> quickly, it is regarded as a short-running task.
>>> sysctl_sched_min_granularity is chosen because it is the minimal slice
>>> when there are too many runnable tasks.
>>>
>>> Reuse the nr_idle_scan of SIS_UTIL to decide whether the system is busy.
>>> If yes, then a compromised "idle" CPU might be acceptable.
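As a side note, the short-task criterion above can be expressed as a tiny C helper mirroring the patch's is_short_task(). The 0.75 ms granularity below is only an assumed demo value; the real sysctl_sched_min_granularity is a tunable that the kernel scales by CPU count:

```c
#include <assert.h>

/* Assumed demo value in ns; the real threshold comes from the
 * sysctl_sched_min_granularity tunable. */
static const unsigned long long min_granularity_ns = 750000ULL;

/* A task is "short" when its average runtime per voluntary context
 * switch does not exceed the minimum granularity. As in the patch,
 * division is avoided by comparing sum_exec_runtime against
 * nvcsw * granularity. */
static int is_short_task_sketch(unsigned long long sum_exec_runtime_ns,
				unsigned long long nvcsw)
{
	return sum_exec_runtime_ns <= nvcsw * min_granularity_ns;
}
```

For example, a task averaging 0.5 ms of runtime per voluntary switch qualifies as short, while one averaging 2 ms does not.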
>>> >>> The reason is that, if the waker is a short running task, it might >>> relinquish the CPU soon, the wakee has the chance to be scheduled. >>> On the other hand, if the wakee is also a short-running task, the >>> impact it brings to the target CPU is small. If the system is >>> already busy, maybe we could lower the bar to find an idle CPU. >>> The effect is, the wake affine is enhanced. >>> >>> [Benchmark results] >>> The baseline is 6.0-rc4. >>> >>> The throughput of will-it-scale.context_switch1 has been increased by >>> 331.13% with this patch applied. >>> >>> netperf >>> ======= >>> case load baseline(std%) compare%( std%) >>> TCP_RR 28 threads 1.00 ( 0.57) +0.29 ( 0.59) >>> TCP_RR 56 threads 1.00 ( 0.49) +0.43 ( 0.43) >>> TCP_RR 84 threads 1.00 ( 0.34) +0.24 ( 0.34) >>> TCP_RR 112 threads 1.00 ( 0.26) +1.57 ( 0.20) >>> TCP_RR 140 threads 1.00 ( 0.20) +178.05 ( 8.83) >>> TCP_RR 168 threads 1.00 ( 10.14) +0.87 ( 10.03) >>> TCP_RR 196 threads 1.00 ( 13.51) +0.90 ( 11.84) >>> TCP_RR 224 threads 1.00 ( 7.12) +0.66 ( 8.28) >>> UDP_RR 28 threads 1.00 ( 0.96) -0.10 ( 0.97) >>> UDP_RR 56 threads 1.00 ( 10.93) +0.24 ( 0.82) >>> UDP_RR 84 threads 1.00 ( 8.99) +0.40 ( 0.71) >>> UDP_RR 112 threads 1.00 ( 0.15) +0.72 ( 7.77) >>> UDP_RR 140 threads 1.00 ( 11.11) +135.81 ( 13.86) >>> UDP_RR 168 threads 1.00 ( 12.58) +147.63 ( 12.72) >>> UDP_RR 196 threads 1.00 ( 19.47) -0.34 ( 16.14) >>> UDP_RR 224 threads 1.00 ( 12.88) -0.35 ( 12.73) >>> >>> hackbench >>> ========= >>> case load baseline(std%) compare%( std%) >>> process-pipe 1 group 1.00 ( 1.02) +0.14 ( 0.62) >>> process-pipe 2 groups 1.00 ( 0.73) +0.29 ( 0.51) >>> process-pipe 4 groups 1.00 ( 0.16) +0.24 ( 0.31) >>> process-pipe 8 groups 1.00 ( 0.06) +11.56 ( 0.11) >>> process-sockets 1 group 1.00 ( 1.59) +0.06 ( 0.77) >>> process-sockets 2 groups 1.00 ( 1.13) -1.86 ( 1.31) >>> process-sockets 4 groups 1.00 ( 0.14) +1.76 ( 0.29) >>> process-sockets 8 groups 1.00 ( 0.27) +2.73 ( 0.10) >>> threads-pipe 1 group 1.00 ( 
0.43) +0.83 ( 2.20) >>> threads-pipe 2 groups 1.00 ( 0.52) +1.03 ( 0.55) >>> threads-pipe 4 groups 1.00 ( 0.44) -0.08 ( 0.31) >>> threads-pipe 8 groups 1.00 ( 0.04) +11.86 ( 0.05) >>> threads-sockets 1 groups 1.00 ( 1.89) +3.51 ( 0.57) >>> threads-sockets 2 groups 1.00 ( 0.04) -1.12 ( 0.69) >>> threads-sockets 4 groups 1.00 ( 0.14) +1.77 ( 0.18) >>> threads-sockets 8 groups 1.00 ( 0.03) +2.75 ( 0.03) >>> >>> tbench >>> ====== >>> case load baseline(std%) compare%( std%) >>> loopback 28 threads 1.00 ( 0.08) +0.51 ( 0.25) >>> loopback 56 threads 1.00 ( 0.15) -0.89 ( 0.16) >>> loopback 84 threads 1.00 ( 0.03) +0.35 ( 0.07) >>> loopback 112 threads 1.00 ( 0.06) +2.84 ( 0.01) >>> loopback 140 threads 1.00 ( 0.07) +0.69 ( 0.11) >>> loopback 168 threads 1.00 ( 0.09) +0.14 ( 0.18) >>> loopback 196 threads 1.00 ( 0.04) -0.18 ( 0.20) >>> loopback 224 threads 1.00 ( 0.25) -0.37 ( 0.03) >>> >>> Other benchmarks are under testing. >> >> Discussed below are the results from running standard benchmarks on >> a dual socket Zen3 (2 x 64C/128T) machine configured in different >> NPS modes. >> >> NPS Modes are used to logically divide single socket into >> multiple NUMA region. >> Following is the NUMA configuration for each NPS mode on the system: >> >> NPS1: Each socket is a NUMA node. >> Total 2 NUMA nodes in the dual socket machine. >> >> Node 0: 0-63, 128-191 >> Node 1: 64-127, 192-255 >> >> NPS2: Each socket is further logically divided into 2 NUMA regions. >> Total 4 NUMA nodes exist over 2 socket. >> >> Node 0: 0-31, 128-159 >> Node 1: 32-63, 160-191 >> Node 2: 64-95, 192-223 >> Node 3: 96-127, 223-255 >> >> NPS4: Each socket is logically divided into 4 NUMA regions. >> Total 8 NUMA nodes exist over 2 socket. 
>>
>>     Node 0: 0-15, 128-143
>>     Node 1: 16-31, 144-159
>>     Node 2: 32-47, 160-175
>>     Node 3: 48-63, 176-191
>>     Node 4: 64-79, 192-207
>>     Node 5: 80-95, 208-223
>>     Node 6: 96-111, 224-239
>>     Node 7: 112-127, 240-255
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip:      5.19.0 tip sched/core
>> - shortrun: 5.19.0 tip sched/core + this patch
>>
>> When we started testing, the tip was at:
>> commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> NPS1
>>
>> Test:          tip               shortrun
>>  1-groups:     4.23 (0.00 pct)   4.24 (-0.23 pct)
>>  2-groups:     4.93 (0.00 pct)   5.68 (-15.21 pct)
>>  4-groups:     5.32 (0.00 pct)   6.21 (-16.72 pct)
>>  8-groups:     5.46 (0.00 pct)   6.49 (-18.86 pct)
>> 16-groups:     7.31 (0.00 pct)   7.78 (-6.42 pct)
>>
>> NPS2
>>
>> Test:          tip               shortrun
>>  1-groups:     4.19 (0.00 pct)   4.19 (0.00 pct)
>>  2-groups:     4.77 (0.00 pct)   5.43 (-13.83 pct)
>>  4-groups:     5.15 (0.00 pct)   6.20 (-20.38 pct)
>>  8-groups:     5.47 (0.00 pct)   6.54 (-19.56 pct)
>> 16-groups:     6.63 (0.00 pct)   7.28 (-9.80 pct)
>>
>> NPS4
>>
>> Test:          tip               shortrun
>>  1-groups:     4.23 (0.00 pct)   4.39 (-3.78 pct)
>>  2-groups:     4.78 (0.00 pct)   5.48 (-14.64 pct)
>>  4-groups:     5.17 (0.00 pct)   6.14 (-18.76 pct)
>>  8-groups:     5.63 (0.00 pct)   6.51 (-15.63 pct)
>> 16-groups:     7.88 (0.00 pct)   7.03 (10.78 pct)
>>
>> ~~~~~~~~~~~~
>> ~ schbench ~
>> ~~~~~~~~~~~~
>>
>> NPS1
>>
>> #workers:   tip                    shortrun
>>   1:        22.00 (0.00 pct)       36.00 (-63.63 pct)
>>   2:        34.00 (0.00 pct)       38.00 (-11.76 pct)
>>   4:        37.00 (0.00 pct)       36.00 (2.70 pct)
>>   8:        55.00 (0.00 pct)       51.00 (7.27 pct)
>>  16:        69.00 (0.00 pct)       68.00 (1.44 pct)
>>  32:        113.00 (0.00 pct)      116.00 (-2.65 pct)
>>  64:        219.00 (0.00 pct)      232.00 (-5.93 pct)
>> 128:        506.00 (0.00 pct)      1019.00 (-101.38 pct)
>> 256:        45440.00 (0.00 pct)    44864.00 (1.26 pct)
>> 512:        76672.00 (0.00 pct)    73600.00 (4.00 pct)
>>
>> NPS2
>>
>> #workers:   tip                    shortrun
>>   1:        31.00 (0.00 pct)       36.00 (-16.12 pct)
>>   2:        36.00 (0.00 pct)       36.00 (0.00 pct)
>>   4:        45.00 (0.00 pct)       39.00 (13.33 pct)
>>   8:        47.00 (0.00 pct)       48.00 (-2.12 pct)
>>  16:        66.00 (0.00 pct)       71.00 (-7.57 pct)
>>  32:        114.00 (0.00 pct)      123.00 (-7.89 pct)
>>  64:        215.00 (0.00 pct)      248.00 (-15.34 pct)
>> 128:        495.00 (0.00 pct)      531.00 (-7.27 pct)
>> 256:        48576.00 (0.00 pct)    47552.00 (2.10 pct)
>> 512:        79232.00 (0.00 pct)    74624.00 (5.81 pct)
>>
>> NPS4
>>
>> #workers:   tip                    shortrun
>>   1:        30.00 (0.00 pct)       36.00 (-20.00 pct)
>>   2:        34.00 (0.00 pct)       38.00 (-11.76 pct)
>>   4:        41.00 (0.00 pct)       44.00 (-7.31 pct)
>>   8:        60.00 (0.00 pct)       53.00 (11.66 pct)
>>  16:        68.00 (0.00 pct)       73.00 (-7.35 pct)
>>  32:        116.00 (0.00 pct)      125.00 (-7.75 pct)
>>  64:        224.00 (0.00 pct)      248.00 (-10.71 pct)
>> 128:        495.00 (0.00 pct)      569.00 (-14.94 pct)
>> 256:        45888.00 (0.00 pct)    38720.00 (15.62 pct)
>> 512:        78464.00 (0.00 pct)    73600.00 (6.19 pct)
>>
>> ~~~~~~~~~~
>> ~ tbench ~
>> ~~~~~~~~~~
>>
>> NPS1
>>
>> Clients:   tip                    shortrun
>>    1       550.66 (0.00 pct)      546.56 (-0.74 pct)
>>    2       1009.69 (0.00 pct)     1010.01 (0.03 pct)
>>    4       1795.32 (0.00 pct)     1782.71 (-0.70 pct)
>>    8       2971.16 (0.00 pct)     3035.58 (2.16 pct)
>>   16       4627.98 (0.00 pct)     4816.82 (4.08 pct)
>>   32       8065.15 (0.00 pct)     9269.52 (14.93 pct)
>>   64       14994.32 (0.00 pct)    14704.38 (-1.93 pct)
>>  128       5175.73 (0.00 pct)     5174.77 (-0.01 pct)
>>  256       48763.57 (0.00 pct)    49649.67 (1.81 pct)
>>  512       43780.78 (0.00 pct)    44717.04 (2.13 pct)
>> 1024       40341.84 (0.00 pct)    42078.99 (4.30 pct)
>>
>> NPS2
>>
>> Clients:   tip                    shortrun
>>    1       551.06 (0.00 pct)      549.17 (-0.34 pct)
>>    2       1000.76 (0.00 pct)     993.75 (-0.70 pct)
>>    4       1737.02 (0.00 pct)     1773.33 (2.09 pct)
>>    8       2992.31 (0.00 pct)     2971.05 (-0.71 pct)
>>   16       4579.29 (0.00 pct)     4470.71 (-2.37 pct)
>>   32       9120.73 (0.00 pct)     8080.89 (-11.40 pct)
>>   64       14918.58 (0.00 pct)    14395.57 (-3.50 pct)
>>  128       20830.61 (0.00 pct)    20579.09 (-1.20 pct)
>>  256       47708.18 (0.00 pct)    47416.37 (-0.61 pct)
>>  512       43721.79 (0.00 pct)    43754.83 (0.07 pct)
>> 1024       40920.49 (0.00 pct)    40701.90 (-0.53 pct)
>>
>> NPS4
>>
>> Clients:   tip                    shortrun
>>    1       549.22 (0.00 pct)      548.36 (-0.15 pct)
>>    2       1000.08 (0.00 pct)     1037.74 (3.76 pct)
>>    4       1794.78 (0.00 pct)     1802.11 (0.40 pct)
>>    8       3008.50 (0.00 pct)     2989.22 (-0.64 pct)
>>   16       4804.71 (0.00 pct)     4706.51 (-2.04 pct)
>>   32       9156.57 (0.00 pct)     8253.84 (-9.85 pct)
>>   64       14901.45 (0.00 pct)    15049.51 (0.99 pct)
>>  128       20771.20 (0.00 pct)    13229.50 (-36.30 pct)
>>  256       47033.88 (0.00 pct)    46737.17 (-0.63 pct)
>>  512       43429.01 (0.00 pct)    43246.64 (-0.41 pct)
>> 1024       39271.27 (0.00 pct)    42194.75 (7.44 pct)
>>
>> ~~~~~~~~~~
>> ~ stream ~
>> ~~~~~~~~~~
>>
>> NPS1
>>
>> 10 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     336311.52 (0.00 pct)    330116.75 (-1.84 pct)
>> Scale:    212955.82 (0.00 pct)    215330.30 (1.11 pct)
>> Add:      251518.23 (0.00 pct)    250926.53 (-0.23 pct)
>> Triad:    262077.88 (0.00 pct)    259618.70 (-0.93 pct)
>>
>> 100 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     339533.83 (0.00 pct)    323452.74 (-4.73 pct)
>> Scale:    194736.72 (0.00 pct)    215789.55 (10.81 pct)
>> Add:      218294.54 (0.00 pct)    244916.33 (12.19 pct)
>> Triad:    262371.40 (0.00 pct)    252997.84 (-3.57 pct)
>>
>> NPS2
>>
>> 10 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     335277.15 (0.00 pct)    305516.57 (-8.87 pct)
>> Scale:    220990.24 (0.00 pct)    207061.22 (-6.30 pct)
>> Add:      264156.13 (0.00 pct)    243368.49 (-7.86 pct)
>> Triad:    268707.53 (0.00 pct)    223486.30 (-16.82 pct)
>>
>> 100 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     334913.73 (0.00 pct)    319677.81 (-4.54 pct)
>> Scale:    230522.47 (0.00 pct)    222757.62 (-3.36 pct)
>> Add:      264567.28 (0.00 pct)    254883.62 (-3.66 pct)
>> Triad:    272974.23 (0.00 pct)    260561.08 (-4.54 pct)
>>
>> NPS4
>>
>> 10 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     356452.47 (0.00 pct)    255911.77 (-28.20 pct)
>> Scale:    242986.42 (0.00 pct)    171587.28 (-29.38 pct)
>> Add:      268512.09 (0.00 pct)    188244.75 (-29.89 pct)
>> Triad:    281622.43 (0.00 pct)    193271.97 (-31.37 pct)
>>
>> 100 Runs:
>>
>> Test:     tip                     shortrun
>> Copy:     367384.81 (0.00 pct)    273101.20 (-25.66 pct)
>> Scale:    254289.04 (0.00 pct)    189986.88 (-25.28 pct)
>> Add:      273683.33 (0.00 pct)    206384.96 (-24.58 pct)
>> Triad:    285696.90 (0.00 pct)    217214.10 (-23.97 pct)
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Notes and Observations ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o Schedstat data for Hackbench with 2 groups in NPS1 mode:
>>
>> ---------------------------------------------------------------------------------------------------
>> cpu: all_cpus (avg) vs cpu: all_cpus (avg)
>> ---------------------------------------------------------------------------------------------------
>> kernel:                                                 :        tip,     shortrun
>> sched_yield count                                       :          0,            0
>> Legacy counter can be ignored                           :          0,            0
>> schedule called                                         :      53305,        40615  | -23.81|
>> schedule left the processor idle                        :      22406,        16919  | -24.49|
>> try_to_wake_up was called                               :      30822,        23625  | -23.35|
>> try_to_wake_up was called to wake up the local cpu      :        984,         2583  | 162.50|
>> total runtime by tasks on this processor (in jiffies)   :  596998654,    481267347  | -19.39| *
>> total waittime by tasks on this processor (in jiffies)  :  514142630,    766745576  |  49.13| * Longer wait time
> Agree, the wait/run ratio is 766745576 / 481267347 = 1.59 after the patch, which is
> much bigger than 514142630 / 596998654 = 0.86 before the patch.
>
>> total timeslices run on this cpu                        :      30893,        23691  | -23.31| *
>> ---------------------------------------------------------------------------------------------------
>>
>> < -------------------------------------- Wakeup info: -------------------------------------- >
>> kernel:                                             :    tip,  shortrun
>> Wakeups on same SMT cpus = all_cpus (avg)           :   1470,      1301  | -11.50|
>> Wakeups on same MC cpus = all_cpus (avg)            :  22913,     18606  | -18.80|
>> Wakeups on same DIE cpus = all_cpus (avg)           :   3634,       693  | -80.93|
>> Wakeups on same NUMA cpus = all_cpus (avg)          :   1819,       440  | -75.81|
>> Affine wakeups on same SMT cpus = all_cpus (avg)    :   1025,      1421  |  38.63| * More affine wakeups on possibly
>> Affine wakeups on same MC cpus = all_cpus (avg)     :  14455,     17514  |  21.16| * busy runqueue leading to longer
>> Affine wakeups on same DIE cpus = all_cpus (avg)    :   2828,       701  | -75.21|   wait time
>> Affine wakeups on same NUMA cpus = all_cpus (avg)   :   1194,       456  | -61.81|
>> ------------------------------------------------------------------------------------------------
> Agree, for the SMT and MC domains, wake affine has been enhanced to suggest picking
> a CPU running a short task rather than an idle one. Then later SIS_UTIL would prefer
> to pick this candidate CPU.
>>
>> We observe a larger wait time with the patch, which points to the
>> fact that the tasks are piling up on the run queue. I believe Tim's
>> suggestion will help here, where we can avoid a pileup as a result
>> of the waker task being a short-running task.
> Yes, we'll raise the bar to pick a CPU running a short task.
>>
>> o Tracepoint data for Stream for 100 runs in NPS4
>>
>> The following tracepoints were enabled for Stream threads:
>> - sched_wakeup_new:   to observe initial placement
>> - sched_waking:       to check if migration is in wakeup context or lb context
>> - sched_wakeup:       to check if migration is in wakeup context or lb context
>> - sched_migrate_task: to observe task movements
>>
>> --> tip:
>>
>> run_stream.sh-3724 [057] d..2.
450.593407: sched_wakeup_new: comm=run_stream.sh pid=3733 prio=120 target_cpu=050 *LLC: 6 >> -0 [182] d.s4. 450.594375: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.594381: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [182] d.s4. 450.594657: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.594661: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3733 [050] d..2. 450.594893: sched_wakeup_new: comm=stream pid=3735 prio=120 target_cpu=057 *LLC: 7 >> stream-3733 [050] d..2. 450.594955: sched_wakeup_new: comm=stream pid=3736 prio=120 target_cpu=078 *LLC: 9 >> stream-3733 [050] d..2. 450.594988: sched_wakeup_new: comm=stream pid=3737 prio=120 target_cpu=045 *LLC: 5 >> stream-3733 [050] d..2. 450.595016: sched_wakeup_new: comm=stream pid=3738 prio=120 target_cpu=008 *LLC: 1 >> stream-3733 [050] d..2. 450.595029: sched_waking: comm=stream pid=3737 prio=120 target_cpu=045 >> -0 [045] dNh2. 450.595037: sched_wakeup: comm=stream pid=3737 prio=120 target_cpu=045 >> stream-3737 [045] d..2. 450.595072: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.595078: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3738 [008] d..2. 450.595102: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.595111: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3733 [050] d..2. 450.595151: sched_wakeup_new: comm=stream pid=3739 prio=120 target_cpu=097 *LLC: 12 >> stream-3733 [050] d..2. 450.595181: sched_wakeup_new: comm=stream pid=3740 prio=120 target_cpu=194 *LLC: 8 >> stream-3733 [050] d..2. 450.595221: sched_wakeup_new: comm=stream pid=3741 prio=120 target_cpu=080 *LLC: 10 >> stream-3733 [050] d..2. 450.595249: sched_wakeup_new: comm=stream pid=3742 prio=120 target_cpu=144 *LLC: 2 >> stream-3733 [050] d..2. 
450.595285: sched_wakeup_new: comm=stream pid=3743 prio=120 target_cpu=239 *LLC: 13 >> stream-3733 [050] d..2. 450.595320: sched_wakeup_new: comm=stream pid=3744 prio=120 target_cpu=130 *LLC: 0 >> stream-3733 [050] d..2. 450.595364: sched_wakeup_new: comm=stream pid=3745 prio=120 target_cpu=113 *LLC: 14 >> stream-3744 [130] d..2. 450.595407: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.595416: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3733 [050] d..2. 450.595423: sched_waking: comm=stream pid=3745 prio=120 target_cpu=113 >> -0 [113] dNh2. 450.595433: sched_wakeup: comm=stream pid=3745 prio=120 target_cpu=113 >> stream-3733 [050] d..2. 450.595452: sched_wakeup_new: comm=stream pid=3746 prio=120 target_cpu=160 *LLC: 4 >> stream-3733 [050] d..2. 450.595486: sched_wakeup_new: comm=stream pid=3747 prio=120 target_cpu=255 *LLC: 15 >> stream-3733 [050] d..2. 450.595513: sched_wakeup_new: comm=stream pid=3748 prio=120 target_cpu=159 *LLC: 3 >> stream-3746 [160] d..2. 450.595533: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.595542: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3747 [255] d..2. 450.595562: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050 >> -0 [050] dNh2. 450.595573: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050 >> stream-3733 [050] d..2. 450.595614: sched_wakeup_new: comm=stream pid=3749 prio=120 target_cpu=222 *LLC: 11 >> stream-3740 [194] d..2. 451.140510: sched_waking: comm=stream pid=3747 prio=120 target_cpu=255 >> -0 [255] dNh2. 451.140523: sched_wakeup: comm=stream pid=3747 prio=120 target_cpu=255 >> stream-3733 [050] d..2. 451.617257: sched_waking: comm=stream pid=3740 prio=120 target_cpu=194 >> stream-3733 [050] d..2. 451.617267: sched_waking: comm=stream pid=3746 prio=120 target_cpu=160 >> stream-3733 [050] d..2. 
451.617269: sched_waking: comm=stream pid=3739 prio=120 target_cpu=097 >> stream-3733 [050] d..2. 451.617272: sched_waking: comm=stream pid=3742 prio=120 target_cpu=144 >> stream-3733 [050] d..2. 451.617275: sched_waking: comm=stream pid=3749 prio=120 target_cpu=222 >> ... (No migrations observed) >> >> In most cases, each LLC is running only 1 stream thread, leading to optimal performance. >> >> --> with patch: >> >> run_stream.sh-4383 [070] d..2. 1237.764236: sched_wakeup_new: comm=run_stream.sh pid=4392 prio=120 target_cpu=206 *LLC: 9 >> stream-4392 [206] d..2. 1237.765121: sched_wakeup_new: comm=stream pid=4394 prio=120 target_cpu=070 *LLC: 8 >> stream-4392 [206] d..2. 1237.765171: sched_wakeup_new: comm=stream pid=4395 prio=120 target_cpu=169 *LLC: 5 >> stream-4392 [206] d..2. 1237.765204: sched_wakeup_new: comm=stream pid=4396 prio=120 target_cpu=111 *LLC: 13 >> stream-4392 [206] d..2. 1237.765243: sched_wakeup_new: comm=stream pid=4397 prio=120 target_cpu=130 *LLC: 0 >> stream-4392 [206] d..2. 1237.765249: sched_waking: comm=stream pid=4396 prio=120 target_cpu=111 >> -0 [111] dNh2. 1237.765260: sched_wakeup: comm=stream pid=4396 prio=120 target_cpu=111 >> stream-4392 [206] d..2. 1237.765281: sched_wakeup_new: comm=stream pid=4398 prio=120 target_cpu=182 *LLC: 6 >> stream-4392 [206] d..2. 1237.765318: sched_wakeup_new: comm=stream pid=4399 prio=120 target_cpu=060 *LLC: 7 >> stream-4392 [206] d..2. 1237.765368: sched_wakeup_new: comm=stream pid=4400 prio=120 target_cpu=124 *LLC: 15 >> stream-4392 [206] d..2. 1237.765408: sched_wakeup_new: comm=stream pid=4401 prio=120 target_cpu=031 *LLC: 3 >> stream-4392 [206] d..2. 1237.765439: sched_wakeup_new: comm=stream pid=4402 prio=120 target_cpu=095 *LLC: 11 >> stream-4392 [206] d..2. 1237.765475: sched_wakeup_new: comm=stream pid=4403 prio=120 target_cpu=015 *LLC: 1 >> stream-4401 [031] d..2. 1237.765497: sched_waking: comm=stream pid=4392 prio=120 target_cpu=206 >> stream-4401 [031] d..2.
1237.765506: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=206 dest_cpu=152 *LLC: 9 -> 3 >> -0 [152] dNh2. 1237.765540: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=152 >> stream-4403 [015] d..2. 1237.765562: sched_waking: comm=stream pid=4392 prio=120 target_cpu=152 >> stream-4403 [015] d..2. 1237.765570: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=152 dest_cpu=136 *LLC: 3 -> 1 >> -0 [136] dNh2. 1237.765602: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=136 >> stream-4392 [136] d..2. 1237.765799: sched_wakeup_new: comm=stream pid=4404 prio=120 target_cpu=097 *LLC: 12 >> stream-4392 [136] d..2. 1237.765893: sched_wakeup_new: comm=stream pid=4405 prio=120 target_cpu=084 *LLC: 10 >> stream-4392 [136] d..2. 1237.765957: sched_wakeup_new: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14 >> stream-4392 [136] d..2. 1237.766018: sched_wakeup_new: comm=stream pid=4407 prio=120 target_cpu=038 *LLC: 4 >> stream-4406 [119] d..2. 1237.766044: sched_waking: comm=stream pid=4392 prio=120 target_cpu=136 >> stream-4406 [119] d..2. 1237.766050: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=136 dest_cpu=240 *LLC: 1 -> 14 >> -0 [240] dNh2. 1237.766154: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=240 >> stream-4392 [240] d..2. 1237.766361: sched_wakeup_new: comm=stream pid=4408 prio=120 target_cpu=023 *LLC: 2 >> stream-4399 [060] d..2. 1238.300605: sched_waking: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14 <--- Two stream threads are >> stream-4399 [060] d..2. 1238.300611: sched_waking: comm=stream pid=4392 prio=120 target_cpu=240 *LLC: 14 <--- on the same LLC leading to >> -0 [119] dNh2. 1238.300620: sched_wakeup: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14 cache contention, degrading >> -0 [240] dNh2. 1238.300621: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=240 *LLC: 14 the Stream throughput. >> ... 
(No more migrations observed)
>>
>> After all the wakeups and migrations, LLC 14 contains two Stream threads (pids 4392 and 4406).
>> All the migrations happen between the sched_waking and sched_wakeup events, showing that the
>> migrations happen during a wakeup and not as a result of load balancing.
>>
>>>
>>> This patch is more about enhancing wake affine, rather than improving
>>> SIS efficiency, so Mel's SIS statistics patch was not deployed for now.
>>>
>>> [Limitations]
>>> When the number of CPUs suggested by SIS_UTIL is lower than 60% of the LLC
>>> CPUs, the LLC domain is regarded as relatively busy. However, the 60% is
>>> somewhat hacky, because it indicates that the util_avg% is around 50%,
>>> a half-busy LLC. I don't have another lightweight/accurate method in mind
>>> to check whether the LLC domain is busy or not.
>>>
>>> [Misc]
>>> At LPC we received useful suggestions. The first one is that we should look
>>> at the time from when the task is woken up to the time the task goes back to
>>> sleep. I assume this is aligned with what is proposed here - we consider the
>>> average running time, rather than the total running time. The second one is
>>> that we should consider the long-running task. This is under investigation.
>>>
>>> Besides, Prateek has mentioned that SIS_UTIL is unable to deal with
>>> burst workloads, because there is a delay in reflecting the instantaneous
>>> utilization and SIS_UTIL expects the workload to be stable. If the system
>>> is idle most of the time, but the workload suddenly bursts, SIS_UTIL
>>> overscans. The current patch might mitigate this symptom somewhat, as a
>>> burst workload is usually regarded as a short-running task.
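As a side note, the overall gate described above — the LLC regarded as busy when SIS_UTIL's suggested scan width drops below 60% of the LLC weight, plus both the waker on the target CPU and the wakee being short tasks — can be sketched in plain C. The function and parameter names are illustrative, not the kernel's:

```c
#include <assert.h>

/* Illustrative sketch (not kernel code): the target CPU only becomes an
 * acceptable "compromised idle" candidate when SIS_UTIL reports a busy
 * LLC (nr_idle_scan below 60% of llc_weight) and both the task currently
 * running on the target CPU and the wakee are short-running tasks. */
static int pick_busy_target(int nr_idle_scan, int llc_weight,
			    int curr_is_short, int wakee_is_short)
{
	int llc_busy = nr_idle_scan * 100 < llc_weight * 60;

	return llc_busy && curr_is_short && wakee_is_short;
}
```

For a 112-CPU LLC, a suggested scan width of 50 CPUs (below 60% of 112) plus two short tasks would satisfy the gate, while a scan width of 80 would not.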
>>>
>>> Suggested-by: Tim Chen
>>> Signed-off-by: Chen Yu
>>> ---
>>>  kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++-
>>>  1 file changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 914096c5b1ae..7519ab5b911c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6020,6 +6020,19 @@ static int wake_wide(struct task_struct *p)
>>>  	return 1;
>>>  }
>>>
>>> +/*
>>> + * If a task switches in and then voluntarily relinquishes the
>>> + * CPU quickly, it is regarded as a short running task.
>>> + * sysctl_sched_min_granularity is chosen as the threshold,
>>> + * as this value is the minimal slice if there are too many
>>> + * runnable tasks, see __sched_period().
>>> + */
>>> +static int is_short_task(struct task_struct *p)
>>> +{
>>> +	return (p->se.sum_exec_runtime <=
>>> +		(p->nvcsw * sysctl_sched_min_granularity));
>>> +}
>>> +
>>>  /*
>>>   * The purpose of wake_affine() is to quickly determine on which CPU we can run
>>>   * soonest. For the purpose of speed we only consider the waking and previous
>>> @@ -6050,7 +6063,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>>>  	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>>>  		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>>>
>>> -	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>>> +	if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
>>> +	    is_short_task(cpu_curr(this_cpu)))

It seems this somewhat breaks the idle (or will-be-idle) purpose of
wake_affine_idle() here. Maybe we can do something like this instead?

	if ((sync || is_short_task(cpu_curr(this_cpu))) &&
	    cpu_rq(this_cpu)->nr_running == 1)

Thanks,
Honglei

>>
>> This change seems to optimize for affine wakeups, which benefit
>> tasks with a producer-consumer pattern but are not ideal for Stream.
>> Currently the logic will do an affine wakeup even if the sync
>> flag is not set:
>>
>> stream-4135 [029] d..2. 353.580953: sched_waking: comm=stream pid=4129 prio=120 target_cpu=082
>> stream-4135 [029] d..2. 353.580957: select_task_rq_fair: wake_affine_idle: Select this_cpu: sync(0) rq->nr_running(1) is_short_task(1)
>> stream-4135 [029] d..2. 353.580960: sched_migrate_task: comm=stream pid=4129 prio=120 orig_cpu=82 dest_cpu=30
>> <idle>-0    [030] dNh2. 353.580993: sched_wakeup: comm=stream pid=4129 prio=120 target_cpu=030
>>
>> I believe a consideration should be made for the sync flag when
>> going for an affine wakeup. Also, the check for short running could
>> be at the end, after checking if prev_cpu is an available_idle_cpu.
>>
> We can move the short running check after the prev_cpu check. If we
> add the sync flag check, would it shrink the coverage of this change?
> I found that only limited scenarios would enable the sync flag, and
> we want to make the short running check a generic optimization.
> But yes, we can test with/without the sync flag constraint to see
> which one gives better data.
>>>  		return this_cpu;
>>>
>>>  	if (available_idle_cpu(prev_cpu))
>>> @@ -6434,6 +6448,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>  			/* overloaded LLC is unlikely to have idle cpu/core */
>>>  			if (nr == 1)
>>>  				return -1;
>>> +
>>> +			/*
>>> +			 * If nr is smaller than 60% of llc_weight, it
>>> +			 * indicates that the util_avg% is higher than 50%.
>>> +			 * This is calculated by SIS_UTIL in
>>> +			 * update_idle_cpu_scan(). The 50% util_avg indicates
>>> +			 * a half-busy LLC domain. A system busier than this
>>> +			 * level could lower its bar to choose a compromised
>>> +			 * "idle" CPU. If the waker on target CPU is a short
>>> +			 * task and the wakee is also a short task, pick
>>> +			 * target directly.
>>> +			 */
>>> +			if (!has_idle_core && (5 * nr < 3 * sd->span_weight) &&
>>> +			    is_short_task(p) && is_short_task(cpu_curr(target)))
>>> +				return target;
>>
>> The pileup seen in hackbench could also be a result of an early
>> bailout here for smaller LLCs, but I don't have any data to
>> substantiate that claim currently.
>>
>>>  		}
>>>  	}
>>>
>> Please let me know if you need any more data from the test
>> system for any of the benchmarks covered, or if you would like
>> me to run any other benchmark on the test system.
> Thank you for your testing. I'll enable SNC to divide the LLC domain
> into smaller ones to see if the issue can be reproduced on my
> platform too, then I'll update my findings on this.
>
> thanks,
> Chenyu
>> --
>> Thanks and Regards,
>> Prateek
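[Editor's sketch] On the `5 * nr < 3 * sd->span_weight` check quoted above: it is the integer form of `nr < 0.6 * span_weight`, avoiding floating point in the scheduler fast path. A minimal user-space sketch of the same comparison (the function name is illustrative, not a kernel symbol):

```c
#include <stdbool.h>

/*
 * Integer form of "nr is below 60% of the LLC span", matching
 * the 5 * nr < 3 * sd->span_weight test in the patch. Per the
 * patch comment, an nr this low implies SIS_UTIL observed
 * util_avg% above ~50%, i.e. a more-than-half-busy LLC.
 * Function name is illustrative only.
 */
static bool llc_relatively_busy(unsigned int nr, unsigned int span_weight)
{
	return 5 * nr < 3 * span_weight;
}
```

For a 16-CPU LLC the cutoff sits between nr = 9 (busy) and nr = 10 (not busy), since 0.6 * 16 = 9.6.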