Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3833898imu; Mon, 10 Dec 2018 08:31:43 -0800 (PST) X-Google-Smtp-Source: AFSGD/XA05ORHqxp18q2EHQyLNh7X32FEcdrZMtW2LoPYU/ACz2wPiIgkAf9hH5T6waOlqc2a6rw X-Received: by 2002:a62:5793:: with SMTP id i19mr13100468pfj.49.1544459503377; Mon, 10 Dec 2018 08:31:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1544459503; cv=none; d=google.com; s=arc-20160816; b=vn6HNleH9RFN9i+Nz4k4zNpUf0CVhoyNInSjtOOLPIuQ/Y0x8XJ6hrsDeA9d1uxA7b kXNlcRI6ImBPsi6HRPXl8fUcfFsvL2Moi7VlHm2KJ9cm3ZCAKqk47I9n+iMAe8kR72LI 5LBC7sTrLSxRPYqBnGC7nps3kibI7rf72SYEv9qOlvNanaLelnTzbjntsubv5GYr00JP nS3SSvCzApXsd5UgXpfYYE5mj45REl1mcWp7i9x2m2CggCv/T0iUjF5jzC+q+IYLzlj0 QKb7NMH+O4SwpkT0gyHbZfeirUks91PbFxAmYT5eSkU48/b/gmf7Ck/WFBHNYH0Umj5n b/OQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=yPVDp9Oewc1/reIdoTFty2Lv1QZJp/zPgRZg2C+IW7g=; b=ZA4syVKMFAG2RUxGGZmzyBYQAFa70sAmhPZc+RZPZgeqTCzNxwqhCxZX4nyRM0qM3T Cw6CgRME3h4lw9LkI4FKivRSn0l6En+0Blgv3xumpHxZxQGboN0+pGPWl2/cytdd8zJH qKpuQahEcl0RAsW0mMdBFa6ithtO6tqD8XHM3TLVKuOyGZtXID2gkJhl+DiTHm48KqFu wVRIumDLD3SbQACCi211Hdih+RwWVeoaXxD4PB1RPvaZ6njaqXkz15xw2dnWpi5lmotq XB0Hn98SQcH04A41Sd9tzaSk4+41jFXQINTSi49hz6kFFz2H6buSiylIrbj+4QEfyw4m XSVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=gNCtlabm; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id h67si11259167pfb.146.2018.12.10.08.31.28; Mon, 10 Dec 2018 08:31:43 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=gNCtlabm; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727414AbeLJQKw (ORCPT + 99 others); Mon, 10 Dec 2018 11:10:52 -0500 Received: from mail-io1-f65.google.com ([209.85.166.65]:40624 "EHLO mail-io1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727304AbeLJQKv (ORCPT ); Mon, 10 Dec 2018 11:10:51 -0500 Received: by mail-io1-f65.google.com with SMTP id n9so9149908ioh.7 for ; Mon, 10 Dec 2018 08:10:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=yPVDp9Oewc1/reIdoTFty2Lv1QZJp/zPgRZg2C+IW7g=; b=gNCtlabm09vtegjAc7EdoLSy07COef+VSfhrdTt0G/j+On8OGbvo/f2wuyIJq4aLk3 5+syKB+Ncwb08gdfheGxB5k/Tud6tjjlsnmuqHd3A5AYy1KOdKnHpZYsIAVZJ54oEGkU KCMlzBybt2VLrWcPtIA7OWYG0QbSGEfp7GB24= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=yPVDp9Oewc1/reIdoTFty2Lv1QZJp/zPgRZg2C+IW7g=; b=RgQ1dA3yHtyDW1kcSLj9laCGAF6Kx9N5Ybytgr+YswXOfUpI+DZ55uJ6UWOYu5kzSs Z+OuZ6Jm0KduCo3GWbo1U28ir8O/k5VYz2n/pjNd47sS1SF9S2VSExaJ/NX2wcQlTQv6 j/oO3tDU68gD/C9doBI4qGpFJXhS6nPz0/5XmvF62Rm0z+m31FmklthZkQ3U4V3fm0qy +MC9vXsnZEDNXKZAtm7H+WfTfZJtF4U71dGhaztfEqQsi7OktXPT/YKNzb627XkYNsHF L6Mu8DCoh77nIxZI0EEb5pafB7CHbehOkHN9zfy630pXXapHTuoqXRQeEjVAV78tyXTv lkRg== X-Gm-Message-State: AA+aEWZkPdG22EcLKWC47b3PkMgqOuM86mqOT1P9BUKpuZf/HKtP0myl uK9/467fnijq4vM6OlvGo7PR7d3x0NZWpjEEM00NJ2YNgKB8UQ== X-Received: by 2002:a6b:c8c9:: with SMTP id y192mr9624896iof.183.1544458249006; Mon, 10 Dec 2018 08:10:49 -0800 (PST) MIME-Version: 1.0 References: <1544131696-2888-1-git-send-email-steven.sistare@oracle.com> In-Reply-To: <1544131696-2888-1-git-send-email-steven.sistare@oracle.com> From: Vincent Guittot Date: Mon, 10 Dec 2018 17:10:37 +0100 Message-ID: Subject: Re: [PATCH v4 00/10] steal tasks to improve CPU utilization To: steven.sistare@oracle.com Cc: Ingo Molnar , Peter Zijlstra , subhra.mazumdar@oracle.com, Dhaval Giani , daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com, Matt Fleming , Mike Galbraith , Rik van Riel , Josef Bacik , Juri Lelli , Valentin Schneider , Quentin Perret , linux-kernel Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Steven, On Thu, 6 Dec 2018 at 22:38, Steve Sistare wrote: > > When a CPU has no more CFS tasks to run, and idle_balance() fails to > find a task, then attempt to steal a task from an overloaded CPU in the > same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently > identify candidates. To minimize search time, steal the first migratable > task that is found when the bitmap is traversed. For fairness, search > for migratable tasks on an overloaded CPU in order of next to run. > > This simple stealing yields a higher CPU utilization than idle_balance() > alone, because the search is cheap, so it may be called every time the CPU > is about to go idle. idle_balance() does more work because it searches > widely for the busiest queue, so to limit its CPU consumption, it declines > to search if the system is too busy. Simple stealing does not offload the > globally busiest queue, but it is much better than running nothing at all. > > The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to > reduce cache contention vs the usual bitmap when many threads concurrently > set, clear, and visit elements. > > Patch 1 defines the sparsemask type and its operations. > > Patches 2, 3, and 4 implement the bitmap of overloaded CPUs. > > Patches 5 and 6 refactor existing code for a cleaner merge of later > patches. > > Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap. > > Patch 9 disables stealing on systems with more than 2 NUMA nodes for the > time being because of performance regressions that are not due to stealing > per-se. See the patch description for details. > > Patch 10 adds schedstats for comparing the new behavior to the old, and > provided as a convenience for developers only, not for integration. > > The patch series is based on kernel 4.20.0-rc1. It compiles, boots, and > runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG, > and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT + > CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES + > CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU > bandwidth control were tested. > > Stealing improves utilization with only a modest CPU overhead in scheduler > code. In the following experiment, hackbench is run with varying numbers > of groups (40 tasks per group), and the delta in /proc/schedstat is shown > for each run, averaged per CPU, augmented with these non-standard stats: > > %find - percent of time spent in old and new functions that search for > idle CPUs and tasks to steal and set the overloaded CPUs bitmap. > > steal - number of times a task is stolen from another CPU. > > X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz > hackbench process 100000 > sched_wakeup_granularity_ns=15000000 > > baseline > grps time %busy slice sched idle wake %find steal > 1 8.084 75.02 0.10 105476 46291 59183 0.31 0 > 2 13.892 85.33 0.10 190225 70958 119264 0.45 0 > 3 19.668 89.04 0.10 263896 87047 176850 0.49 0 > 4 25.279 91.28 0.10 322171 94691 227474 0.51 0 > 8 47.832 94.86 0.09 630636 144141 486322 0.56 0 > > new > grps time %busy slice sched idle wake %find steal %speedup > 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1 > 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9 > 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8 > 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3 > 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6 > > Elapsed time improves by 8 to 36%, and CPU busy utilization is up > by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks). > The cost is at most 0.4% more find time. I have run some hackbench tests on my hikey arm64 octo cores with your patchset. My original intent was to send a tested-by but I have some performances regressions. This hikey is the smp one and not the asymetric hikey960 that Valentin used for his tests The sched domain topology is domain-0: span=0-3 level=MC and domain-0: span=4-7 level=MC domain-1: span=0-7 level=DIE I have run 12 times hackbench -g $j -P -l 2000 with j equals to 1 2 3 4 8 grps time 1 1.396 2 2.699 3 3.617 4 4.498 8 7.721 Then after disabling STEAL in sched_feature with echo NO_STEAL > /sys/kernel/debug/sched_features , the results become: grps time 1 1.217 2 1.973 3 2.855 4 3.932 8 7.674 I haven't looked in details about some possible reasons of such difference yet and haven't collected the stats that you added with patch 10. Have you got a script to collect and post process them ? Regards, Vincent > > Additional performance results follow. A negative "speedup" is a > regression. Note: for all hackbench runs, sched_wakeup_granularity_ns > is set to 15 msec. Otherwise, preemptions increase at higher loads and > distort the comparison between baseline and new. > > ------------------ 1 Socket Results ------------------ > > X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz > Average of 10 runs of: hackbench process 100000 > > --- base -- --- new --- > groups time %stdev time %stdev %speedup > 1 8.008 0.1 5.905 0.2 35.6 > 2 13.814 0.2 11.438 0.1 20.7 > 3 19.488 0.2 16.919 0.1 15.1 > 4 25.059 0.1 22.409 0.1 11.8 > 8 47.478 0.1 44.221 0.1 7.3 > > X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs > Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz > Average of 10 runs of: hackbench process 100000 > > --- base -- --- new --- > groups time %stdev time %stdev %speedup > 1 4.586 0.8 4.596 0.6 -0.3 > 2 7.693 0.2 5.775 1.3 33.2 > 3 10.442 0.3 8.288 0.3 25.9 > 4 13.087 0.2 11.057 0.1 18.3 > 8 24.145 0.2 22.076 0.3 9.3 > 16 43.779 0.1 41.741 0.2 4.8 > > KVM 4-cpu > Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz > tbench, average of 11 runs. > > clients %speedup > 1 16.2 > 2 11.7 > 4 9.9 > 8 12.8 > 16 13.7 > > KVM 2-cpu > Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz > > Benchmark %speedup > specjbb2015_critical_jops 5.7 > mysql_sysb1.0.14_mutex_2 40.6 > mysql_sysb1.0.14_oltp_2 3.9 > > ------------------ 2 Socket Results ------------------ > > X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz > Average of 10 runs of: hackbench process 100000 > > --- base -- --- new --- > groups time %stdev time %stdev %speedup > 1 7.945 0.2 7.219 8.7 10.0 > 2 8.444 0.4 6.689 1.5 26.2 > 3 12.100 1.1 9.962 2.0 21.4 > 4 15.001 0.4 13.109 1.1 14.4 > 8 27.960 0.2 26.127 0.3 7.0 > > X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs > Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz > Average of 10 runs of: hackbench process 100000 > > --- base -- --- new --- > groups time %stdev time %stdev %speedup > 1 5.826 5.4 5.840 5.0 -0.3 > 2 5.041 5.3 6.171 23.4 -18.4 > 3 6.839 2.1 6.324 3.8 8.1 > 4 8.177 0.6 7.318 3.6 11.7 > 8 14.429 0.7 13.966 1.3 3.3 > 16 26.401 0.3 25.149 1.5 4.9 > > > X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs > Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz > Oracle database OLTP, logging disabled, NVRAM storage > > Customers Users %speedup > 1200000 40 -1.2 > 2400000 80 2.7 > 3600000 120 8.9 > 4800000 160 4.4 > 6000000 200 3.0 > > X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs > Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz > Results from the Oracle "Performance PIT". > > Benchmark %speedup > > mysql_sysb1.0.14_fileio_56_rndrd 19.6 > mysql_sysb1.0.14_fileio_56_seqrd 12.1 > mysql_sysb1.0.14_fileio_56_rndwr 0.4 > mysql_sysb1.0.14_fileio_56_seqrewr -0.3 > > pgsql_sysb1.0.14_fileio_56_rndrd 19.5 > pgsql_sysb1.0.14_fileio_56_seqrd 8.6 > pgsql_sysb1.0.14_fileio_56_rndwr 1.0 > pgsql_sysb1.0.14_fileio_56_seqrewr 0.5 > > opatch_time_ASM_12.2.0.1.0_HP2M 7.5 > select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1 > select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4 > swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8 > > lm3_memlat_L2 4.8 > lm3_memlat_L1 0.0 > > ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1 > ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2 > ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0 > ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4 > > X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs > Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz > > NAS_OMP > bench class ncpu %improved(Mops) > dc B 72 1.3 > is C 72 0.9 > is D 72 0.7 > > sysbench mysql, average of 24 runs > --- base --- --- new --- > nthr events %stdev events %stdev %speedup > 1 331.0 0.25 331.0 0.24 -0.1 > 2 661.3 0.22 661.8 0.22 0.0 > 4 1297.0 0.88 1300.5 0.82 0.2 > 8 2420.8 0.04 2420.5 0.04 -0.1 > 16 4826.3 0.07 4825.4 0.05 -0.1 > 32 8815.3 0.27 8830.2 0.18 0.1 > 64 12823.0 0.24 12823.6 0.26 0.0 > > -------------------------------------------------------------- > > Changes from v1 to v2: > - Remove stray find_time hunk from patch 5 > - Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT > - Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2 > - Steal iff avg_idle exceeds the cost of stealing > > Changes from v2 to v3: > - Update series for kernel 4.20. Context changes only. > > Changes from v3 to v4: > - Avoid 64-bit division on 32-bit processors in compute_skid() > - Replace IF_SMP with inline functions to set idle_stamp > - Push ZALLOC_MASK body into calling function > - Set rq->cfs_overload_cpus in update_top_cache_domain instead of > cpu_attach_domain > - Rewrite sparsemask iterator for complete inlining > - Cull and clean up sparsemask functions and moved all into > sched/sparsemask.h > > Steve Sistare (10): > sched: Provide sparsemask, a reduced contention bitmap > sched/topology: Provide hooks to allocate data shared per LLC > sched/topology: Provide cfs_overload_cpus bitmap > sched/fair: Dynamically update cfs_overload_cpus > sched/fair: Hoist idle_stamp up from idle_balance > sched/fair: Generalize the detach_task interface > sched/fair: Provide can_migrate_task_llc > sched/fair: Steal work from an overloaded CPU when CPU goes idle > sched/fair: disable stealing if too many NUMA nodes > sched/fair: Provide idle search schedstats > > include/linux/sched/topology.h | 1 + > kernel/sched/core.c | 31 +++- > kernel/sched/fair.c | 354 +++++++++++++++++++++++++++++++++++++---- > kernel/sched/features.h | 6 + > kernel/sched/sched.h | 13 +- > kernel/sched/sparsemask.h | 210 ++++++++++++++++++++++++ > kernel/sched/stats.c | 11 +- > kernel/sched/stats.h | 13 ++ > kernel/sched/topology.c | 121 +++++++++++++- > 9 files changed, 726 insertions(+), 34 deletions(-) > create mode 100644 kernel/sched/sparsemask.h > > -- > 1.8.3.1 >