Received: by 2002:a05:6a11:4021:0:0:0:0 with SMTP id ky33csp277095pxb; Wed, 15 Sep 2021 01:36:53 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyAq9JuHfMh6xHubiAJoOi3c0j9arlGugd+bCJni5epYg4MarTpeiBiesUrWM1zFBYZqS4t X-Received: by 2002:a92:c9c9:: with SMTP id k9mr9137909ilq.82.1631695013006; Wed, 15 Sep 2021 01:36:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1631695013; cv=none; d=google.com; s=arc-20160816; b=oWHpNiJWrp7ft13Pe4vkb1ArS1ZqE72tn9yZoFdX+M0fWKAs0pqhG2rFyUyS118gRh FScUyCX/sDc4a16lPFY9xtdq0S2/o1gaq/1qeV5shy03T6xn7+6M0UaYOi2W0ZR4r4dg MbDG36BKU/EKCI3rQW+MX8XWi8B26xaC8izP6d5vLvR7J6oENmEMMJTSn4jaW522uy7U yV0nWcXEaX0gi2p2zNoDZKaN6B+jzkkZeX56Pm9ovbN56D/XbpgaMZ3w0VZznhPIPEDn Y06xTl0DByWmru1pTN1XoKc8wsVivEViI3620TcvqT0XGBZ/+EhNBwxq8usk8JrM8fS4 cFIw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject; bh=3KDwv7yLVTKaPq7RwH2XmojulAaS6YkAXtrkpSEaePQ=; b=IMEu5+wKfeIrNcV9sFOEq8qwpBXZpmMFuZE2ARNSPOBQFlECXoj67U/0ExCfZ/DDv6 rWiU8wmBzywaDfvKpvska26zY2yWv+IGMvLRELWGBGb0kUQ1hFx2tQjmMjws0brSbnHC injayXvgJKfzFkNLhRYsNrDwYr6HnWqMdm4gwFe4rKZhHVLUm2Kc6QDp6BPWqqR/RcYv TqHLKEBYRlumTfQIXNPlqXWyA5S7SD8Pn1bxSfEAEgwRG8vl5CIb42LBelHf7j3OwITq 9V9Rlvcj8nWtiqL3P1RSZEghYiZ20suylHzR7WWdpiS6s3PY/dakv4hragregK3G5GGR L8XQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=hisilicon.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id q17si4896501ilj.154.2021.09.15.01.36.41; Wed, 15 Sep 2021 01:36:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=hisilicon.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236951AbhIOIfa (ORCPT + 99 others); Wed, 15 Sep 2021 04:35:30 -0400 Received: from szxga03-in.huawei.com ([45.249.212.189]:16260 "EHLO szxga03-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236916AbhIOIf3 (ORCPT ); Wed, 15 Sep 2021 04:35:29 -0400 Received: from dggemv703-chm.china.huawei.com (unknown [172.30.72.53]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4H8YQM6lZ2z8t9Q; Wed, 15 Sep 2021 16:33:31 +0800 (CST) Received: from dggema772-chm.china.huawei.com (10.1.198.214) by dggemv703-chm.china.huawei.com (10.3.19.46) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256) id 15.1.2308.8; Wed, 15 Sep 2021 16:34:08 +0800 Received: from [127.0.0.1] (10.69.38.196) by dggema772-chm.china.huawei.com (10.1.198.214) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.2308.8; Wed, 15 Sep 2021 16:34:07 +0800 Subject: Re: [RFC] Perfomance varies according to sysctl_sched_migration_cost To: Vincent Guittot CC: Ingo Molnar , Peter Zijlstra , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , "Daniel Bristot de Oliveira" , <21cnbao@gmail.com>, Linux Kernel Mailing List , "Song Bao Hua (Barry Song)" , , "guodong.xu@linaro.org" References: From: Yicong Yang Message-ID: Date: Wed, 15 Sep 2021 16:34:07 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.69.38.196] X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To dggema772-chm.china.huawei.com (10.1.198.214) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2021/9/14 20:55, Vincent Guittot wrote: > On Tue, 14 Sept 2021 at 14:08, Yicong Yang wrote: >> >> Hi Vincent, >> >> thanks for the reply! >> >> On 2021/9/14 17:04, Vincent Guittot wrote: >>> Hi Yicong, >>> >>> On Tue, 14 Sept 2021 at 09:27, Yicong Yang wrote: >>>> >>>> Hi all, >>>> >>>> I noticed that some benchmark performance varies after tunning the sysctl_sched_migration_cost >>>> through /sys/kernel/debug/sched/migration_cost_ns on arm64. The default value is 500000, and >>>> I tried 10000, 100000, 1000000. Below are some results from mmtests, based on 5.14-release. >>>> >>>> On Kunpeng920 (128cores, 4numa, 2socket): >>>> >>>> pgbench (config-db-pgbench-timed-ro-medium) >>>> mig-cost-500000 mig-cost-100000 mig-cost-10000 mig-cost-1000000 >>>> Hmean 1 9558.99 ( 0.00%) 9735.31 * 1.84%* 9410.84 * -1.55%* 9602.47 * 0.45%* >>>> Hmean 8 17615.90 ( 0.00%) 17439.78 * -1.00%* 18056.44 * 2.50%* 19222.18 * 9.12%* >>>> Hmean 12 25228.38 ( 0.00%) 25592.69 * 1.44%* 26739.06 * 5.99%* 27575.48 * 9.30%* >>>> Hmean 24 46623.27 ( 0.00%) 48853.30 * 4.78%* 47386.02 * 1.64%* 48542.94 * 4.12%* >>>> Hmean 32 60578.78 ( 0.00%) 62116.81 * 2.54%* 59961.36 * -1.02%* 58681.07 * -3.13%* >>>> Hmean 48 68159.12 ( 0.00%) 67867.90 ( -0.43%) 65631.79 * -3.71%* 66487.16 * -2.45%* >>>> Hmean 80 66894.87 ( 0.00%) 73440.92 * 9.79%* 68751.63 * 2.78%* 67326.70 ( 0.65%) >>>> Hmean 112 68582.27 ( 0.00%) 65339.90 * -4.73%* 68454.99 ( -0.19%) 67211.66 * -2.00%* >>>> Hmean 144 76290.98 ( 0.00%) 70455.65 * -7.65%* 64851.23 * -14.99%* 64940.61 * -14.88%* >>>> Hmean 172 63245.68 ( 0.00%) 68790.24 * 8.77%* 66246.46 * 4.74%* 69536.96 * 9.95%* >>>> Hmean 204 61793.47 ( 0.00%) 63711.62 * 3.10%* 66055.64 * 6.90%* 58023.20 * -6.10%* >>>> Hmean 236 61486.75 ( 0.00%) 68404.44 * 11.25%* 70499.70 * 14.66%* 58285.67 * -5.21%* >>>> Hmean 256 57476.13 ( 0.00%) 65645.83 * 14.21%* 69437.05 * 20.81%* 60518.05 * 5.29%* >>>> >>>> tbench (config-network-tbench) >>>> mig-cost-500000 mig-cost-100000 mig-cost-10000 mig-cost-1000000 >>>> Hmean 1 333.12 ( 0.00%) 332.93 ( -0.06%) 335.34 * 0.67%* 334.36 * 0.37%* >>>> Hmean 2 665.88 ( 0.00%) 667.19 * 0.20%* 666.47 * 0.09%* 667.02 * 0.17%* >>>> Hmean 4 1324.10 ( 0.00%) 1312.23 * -0.90%* 1313.07 * -0.83%* 1315.13 * -0.68%* >>>> Hmean 8 2618.85 ( 0.00%) 2602.00 * -0.64%* 2577.49 * -1.58%* 2600.48 * -0.70%* >>>> Hmean 16 5100.74 ( 0.00%) 5068.80 * -0.63%* 5041.34 * -1.16%* 5069.78 * -0.61%* >>>> Hmean 32 8157.22 ( 0.00%) 8163.50 ( 0.08%) 7936.25 * -2.71%* 8329.18 * 2.11%* >>>> Hmean 64 4824.56 ( 0.00%) 4890.81 * 1.37%* 5319.97 * 10.27%* 4830.68 * 0.13%* >>>> Hmean 128 4635.17 ( 0.00%) 6810.90 * 46.94%* 5304.36 * 14.44%* 4516.06 * -2.57%* >>>> Hmean 256 8816.62 ( 0.00%) 8851.28 * 0.39%* 8448.76 * -4.17%* 6840.12 * -22.42%* >>>> Hmean 512 7825.56 ( 0.00%) 8538.04 * 9.10%* 8002.77 * 2.26%* 7946.54 * 1.55%* >>>> >>>> Also on Raspberrypi 4B: >>>> >>>> pgbench (config-db-pgbench-timed-ro-medium) >>>> mig-cost-500000 mig-cost-100000 >>>> Hmean 1 1651.41 ( 0.00%) 3444.27 * 108.56%* >>>> Hmean 4 4015.83 ( 0.00%) 6883.21 * 71.40%* >>>> Hmean 7 4161.45 ( 0.00%) 6646.18 * 59.71%* >>>> Hmean 8 4277.28 ( 0.00%) 6764.60 * 58.15%* >>>> >>>> For tbench on Raspberrypi 4B and both pgbench and tbench on x86, tuning sysctl_sched_migration_cost >>>> doesn't have such huge difference and will have some degradations (max -8% on x86 for pgbench) in some cases. >>>> >>>> The sysctl_sched_migration_cost will affects the frequency of load balance. It will affect >>> >>> So it doesn't affect the periodic load but only the newly idle load balance >>> >> >> In load_balance(), it's used to judge whether a task is hot in task_hot(). so I think it >> participates in the periodic load balance. > > Not really. The periodic load balance always happens but task_hot is > used to skip task that have recently run on the cpu and select older > tasks instead > At the contrary, sysctl_sched_migration_cost is used to decide if we > should abort newly_idle_load_balance > well. I think I get it. In periodic load balance sysctl_sched_migration_cost will affect which task we choose to migrate but won't abort the process like what it does in new idle balance. > As a side point, would be good to know if the improvement and > regression seen in your tests are more linked to the task hotness or > for skipping/aborting newly idle load balance > sure. I think I can get some hints by comparing the scheduler statistics after tuning sysctl_sched_migration_cost. >> >>>> directly in task_hot() and newidle_balance() to decide whether we can do a migration or load >>>> balance. And affects other parameters like rq->avg_idle, rq->max_idle_balance_cost and >>>> sd->max_newidle_lb_cost to indirectly affect the load balance process. These parameters record >>>> the load_balance() cost and will be limited up to sysctl_sched_migration_cost, so I measure >>>> the average cost of load_balance() on Kunpeng920 with bcc tools(./funclantency load_balance -d 10): >>>> >>>> system status idle 50%load 100%load >>>> avg cost 3160ns 4790ns 7563ns >>> >>> What is the setup of your test ? has this been measured during the >>> benchmarks above ? >>> >> >> I use stress-ng to generate the load. Since it's a 128core server, `stress-ng -c 64` for >> 50% load, and `stress-ng -c 128` for 100% load. This is not measured during the benchmarks' >> process. > > I don't think this is the best benchmark to evaluate the real cost of > load_balance because it create always running task and you measure > only the periodic load balance and not the newly load balance which is > the one really impacted by sysctl_sched_migration_cost > it's right. It doesn't cover the newidle balance case and bcc is based on kprobe which may have large latency on arm64 [1]. My original purpose is not to measure it accurately but to see whehter the cost is comparable to the sysctl_sched_migration_cost. [1] https://lore.kernel.org/lkml/20210818073336.59678-1-liuqi115@huawei.com/ >> >>> Also, do you have more details about the topology and the number of >>> sched domain ? >>> >> >> sure. for `numactl -H`: >> >> available: 4 nodes (0-3) >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 >> node 0 size: 257149 MB >> node 0 free: 253518 MB >> node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 >> node 1 size: 193531 MB >> node 1 free: 192916 MB >> node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 >> node 2 size: 96763 MB >> node 2 free: 92654 MB >> node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 >> node 3 size: 127668 MB >> node 3 free: 125846 MB >> node distances: >> node 0 1 2 3 >> 0: 10 12 20 22 >> 1: 12 10 22 24 >> 2: 20 22 10 12 >> 3: 22 24 12 10 >> >> Kunpeng 920 is non-smt. There're 4 level domains and below is part of the /proc/schedstat: >> [...] >> cpu0 >> domain0 00000000,00000000,00000000,ffffffff >> domain1 00000000,00000000,ffffffff,ffffffff >> domain2 00000000,ffffffff,ffffffff,ffffffff >> domain3 ffffffff,ffffffff,ffffffff,ffffffff > > Because of the large difference between the number of cpus at 1st and > last level, an average duration of load_balance() is not really > meaningful and we can expect a factor of 4 between smallest and larger > one > yes, the larger domain may have larger cost. I only show the average value here while I got a histgram of the cost distribution as well. the min range means where the minimal values fall in while the max range means where the maximum values fall in. Counts means how many times load_balance() is measured. min range(counts) max range(counts) total counts idle 256-511(456) 16384-32767(16) 14047 50% load 256-511(4018) 16382-32767(140) 57908 100%load 1024-2047(64) 32768-65535(8) 2582 Load balance is more frequent on a half loaded system while it takes more time when it's well loaded. funclatency tools: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py >> [...] >> cpu32 >> domain0 00000000,00000000,ffffffff,00000000 >> domain1 00000000,00000000,ffffffff,ffffffff >> domain2 00000000,ffffffff,ffffffff,ffffffff >> domain3 ffffffff,ffffffff,ffffffff,ffffffff >> [...] >> cpu64 >> domain0 00000000,ffffffff,00000000,00000000 >> domain1 ffffffff,ffffffff,00000000,00000000 >> domain2 ffffffff,ffffffff,00000000,ffffffff >> domain3 ffffffff,ffffffff,ffffffff,ffffffff >> [...] >> cpu96 >> domain0 ffffffff,00000000,00000000,00000000 >> domain1 ffffffff,ffffffff,00000000,00000000 >> domain2 ffffffff,ffffffff,00000000,ffffffff >> domain3 ffffffff,ffffffff,ffffffff,ffffffff >> [...] >> >>> Are you using cgroup hierarchy ? >>> >> >> No cgroup hierarchy during the test. > > This can slow down a bit the load_balance so might be good to take > that into account > If I run the test in a cgroup, the load balance will only be performed on the cpuset rather than the whole system and the scan will be faster as the range narrowed. Is that the reason here? Thanks. >> >>>> >>>> The average cost of load balance seems quite smaller than the default sysctl_sched_migration_cost >>>> which is 500000(500ms). >>> >>> AFAICT, it is 500us not 500ms >>> >> >> yes it's 500us. sorry for the wrong unit. >> >>>> >>>> So I have some RFC questions: >>>> 1. how is the default 500000 (500ms) migration cost is measured or caculated? >>> >>> 500us not ms >>> >>> I would say that it's a heuristic value that works for most of system >>> but it should probably be tuned per platform. But also note that it's >>> quite difficult to get a correct value >>> >> >> thanks for the explanation. I agree that it should be tuned per platform, and maybe also >> per workload. Current default value seems to have be well tuned on x86 but not on the some >> arm64 platforms. > > Adjusting the value based on the platform seems reasonable although > i'm not sure which input should be used (arch type / interconnect > bandwidth / cache size / number of cpu per cache level ...) > >> >> Thanks. >> >>>> The value has never changed in the past decade. I dig into the git commits and find it was introduced >>>> in da84d9617672 ("sched: reintroduce cache-hot affinity"). But it didn't explain how did this value come. >>>> 2. The ABI now has been removed from sysctl and moved to debugfs. As tuning this can improve the performance >>>> of some workloads on some platforms, maybe it's better to make it a formal sysctl again with docs? >>>> >>>> I'll be appreciated for any comments and replies! >>>> >>>> Thanks, >>>> Yicong >>>> >>>> >>> >>> . >>> >> > > . >