Date: Sat, 3 Dec 2016 21:47:07 +0000
From: Matt Fleming
To: Brendan Gregg
Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, LKML,
    Morten.Rasmussen@arm.com, dietmar.eggemann@arm.com,
    kernellwp@gmail.com, yuyang.du@intel.com, umgwanakikbuti@gmail.com,
    Mel Gorman
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group
Message-ID: <20161203214707.GI20785@codeblueprint.co.uk>
References: <1480088073-11642-1-git-send-email-vincent.guittot@linaro.org>
 <1480088073-11642-3-git-send-email-vincent.guittot@linaro.org>

On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
>
> For background, is this from the "A decade of wasted cores" paper's
> patches?

No, this patch fixes an issue I originally reported here,

  https://lkml.kernel.org/r/20160923115808.2330-1-matt@codeblueprint.co.uk

Essentially, if you have an idle or partially-idle system and a
workload that consists of fork()'ing a bunch of tasks, where each of
those tasks immediately sleeps waiting for some wakeup, then those
tasks aren't spread across all idle CPUs very well.

We saw this issue when running hackbench with a small loop count, such
that the actual benchmark setup (fork()'ing) is where the majority of
the runtime is spent.
In that scenario, there's a large potential/blocked load but
essentially no runnable load, and the balance-on-fork scheduler code
only cares about runnable load without Vincent's patch applied.

The closest thing I can find in the "A decade of wasted cores" paper
is "The Overload-on-Wakeup bug", but I don't think that's the issue
here since,

  a) We're balancing on fork, not wakeup
  b) The balance-on-fork code balances across nodes OK

> What's the expected typical gain? Thanks,

The results are still coming back from the SUSE performance test grid,
but they do show that this patch is mainly a win for multi-socket
machines with more than 8 cores or thereabouts.

[ Vincent, I'll follow up to your PATCH 1/2 with the results that are
  specifically for that patch ]

Assuming a fork-intensive or fork-dominated workload and a
multi-socket machine, such as this 2-socket NUMA box with 12 cores per
socket and HT enabled (48 cpus), we saw a very clear win of between
+10% and +15% for processes communicating via pipes,

  (1) tip-sched        = tip/sched/core branch
  (2) fix-fig-for-fork = (1) + PATCH 1/2
  (3) fix-sig          = (2) + PATCH 2/2

hackbench-process-pipes
                        4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                        tip-sched      fix-fig-for-fork               fix-sig
Amean    1       0.0717 (  0.00%)      0.0696 (  2.99%)      0.0730 ( -1.79%)
Amean    4       0.1244 (  0.00%)      0.1200 (  3.56%)      0.1190 (  4.36%)
Amean    7       0.1891 (  0.00%)      0.1937 ( -2.42%)      0.1831 (  3.17%)
Amean    12      0.2964 (  0.00%)      0.3116 ( -5.11%)      0.2784 (  6.07%)
Amean    21      0.4011 (  0.00%)      0.4090 ( -1.96%)      0.3574 ( 10.90%)
Amean    30      0.4944 (  0.00%)      0.4654 (  5.87%)      0.4171 ( 15.63%)
Amean    48      0.6113 (  0.00%)      0.6309 ( -3.20%)      0.5331 ( 12.78%)
Amean    79      0.8616 (  0.00%)      0.8706 ( -1.04%)      0.7710 ( 10.51%)
Amean    110     1.1304 (  0.00%)      1.2211 ( -8.02%)      1.0163 ( 10.10%)
Amean    141     1.3754 (  0.00%)      1.4279 ( -3.81%)      1.2803 (  6.92%)
Amean    172     1.6217 (  0.00%)      1.7367 ( -7.09%)      1.5363 (  5.27%)
Amean    192     1.7809 (  0.00%)      2.0199 (-13.42%)      1.7129 (  3.82%)

Things look even better when using threads and pipes, with wins
between 11% and 29% when looking at results outside of the noise,

hackbench-thread-pipes
                        4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                        tip-sched      fix-fig-for-fork               fix-sig
Amean    1       0.0736 (  0.00%)      0.0794 ( -7.96%)      0.0779 ( -5.83%)
Amean    4       0.1709 (  0.00%)      0.1690 (  1.09%)      0.1663 (  2.68%)
Amean    7       0.2836 (  0.00%)      0.3080 ( -8.61%)      0.2640 (  6.90%)
Amean    12      0.4393 (  0.00%)      0.4843 (-10.24%)      0.4090 (  6.89%)
Amean    21      0.5821 (  0.00%)      0.6369 ( -9.40%)      0.5126 ( 11.95%)
Amean    30      0.6557 (  0.00%)      0.6459 (  1.50%)      0.5711 ( 12.90%)
Amean    48      0.7924 (  0.00%)      0.7760 (  2.07%)      0.6286 ( 20.68%)
Amean    79      1.0534 (  0.00%)      1.0551 ( -0.16%)      0.8481 ( 19.49%)
Amean    110     1.5286 (  0.00%)      1.4504 (  5.11%)      1.1121 ( 27.24%)
Amean    141     1.9507 (  0.00%)      1.7790 (  8.80%)      1.3804 ( 29.23%)
Amean    172     2.2261 (  0.00%)      2.3330 ( -4.80%)      1.6336 ( 26.62%)
Amean    192     2.3753 (  0.00%)      2.3307 (  1.88%)      1.8246 ( 23.19%)

Somewhat surprisingly, I can also see improvements for UMA machines
with fewer cores, when the workload heavily saturates the machine and
isn't dominated by fork. Such heavy saturation isn't super realistic,
but it's still interesting. I haven't dug into why these results
occurred, but I'm happy things didn't fall off a cliff instead.

Here's a 4-cpu UMA box showing some improvement at the higher end,

hackbench-process-pipes
                        4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
                        tip-sched      fix-fig-for-fork               fix-sig
Amean    1       3.5060 (  0.00%)      3.5747 ( -1.96%)      3.5117 ( -0.16%)
Amean    3       7.7113 (  0.00%)      7.8160 ( -1.36%)      7.7747 ( -0.82%)
Amean    5      11.4453 (  0.00%)     11.5710 ( -1.10%)     11.3870 (  0.51%)
Amean    7      15.3147 (  0.00%)     15.9420 ( -4.10%)     15.8450 ( -3.46%)
Amean    12     25.5110 (  0.00%)     24.3410 (  4.59%)     22.6717 ( 11.13%)
Amean    16     32.3010 (  0.00%)     28.5897 ( 11.49%)     25.7473 ( 20.29%)