Date: Fri, 30 Jun 2017 14:11:24 +0100
From: Matt Fleming
To: Josef Bacik
Cc: Joel Fernandes, Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli,
	Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath,
	Vincent Guittot
Subject: Re: wake_wide mechanism clarification
Message-ID: <20170630131124.GB12077@codeblueprint.co.uk>
References: <20170630004912.GA2457@destiny>
In-Reply-To: <20170630004912.GA2457@destiny>

On Thu, 29 Jun, at 08:49:13PM, Josef Bacik wrote:
>
> It may be worth to try with schedbench and trace it to see how this turns out in
> practice, as that's the workload that generated all this discussion before.  I
> imagine generally speaking this works out properly.  The small regression I
> reported before was at low RPS, so we wouldn't be waking up as many tasks as
> often, so we would be returning 0 from wake_wide() and we'd get screwed.  This
> is where I think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure.  Thanks,

Just 2 weeks ago I was poking at wake_wide() because it's impacting
hackbench times now we're better at balancing on fork() (see commit
6b94780e45c1 ("sched/core: Use load_avg for selecting idlest group")).

What's happening is that occasionally the hackbench times will be
pretty large because the hackbench tasks are being pulled back and
forth across NUMA domains due to the wake_wide() logic.

Reproducing this issue does require a NUMA box with more CPUs than
hackbench tasks. I was using an 80-cpu 2 NUMA node box with 1
hackbench group (20 readers, 20 writers).

I did the following very quick hack,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f5efa51dc7..c1bc1b0434bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,7 @@ static int wake_wide(struct task_struct *p)
 
 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if (master < slave * factor)
 		return 0;
 	return 1;
 }

Which produces the following results for the 1 group (40 tasks) on one
of SUSE's enterprise kernels:

hackbench-process-pipes
                            4.4.71                  4.4.71
                          patched+  patched+-wake-wide-fix
Min       1      0.7000 (  0.00%)         0.8480 (-21.14%)
Amean     1      1.0343 (  0.00%)         0.9073 ( 12.28%)
Stddev    1      0.2373 (  0.00%)         0.0447 ( 81.15%)
CoeffVar  1     22.9447 (  0.00%)         4.9300 ( 78.51%)
Max       1      1.2270 (  0.00%)         0.9560 ( 22.09%)

You'll see that the minimum value is worse with my change, but the
maximum is much better.

So the current wake_wide() code does help sometimes, but it also hurts
sometimes too.

I'm happy to gather performance data for any code suggestions.
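
For anyone who wants to see where the two conditions from the hack
above diverge, here is a minimal user-space sketch. It is not the
kernel code: the function names wake_wide_orig() and
wake_wide_patched() are made up for illustration, the flip counts and
the factor are passed in as plain parameters rather than read from the
tasks and the LLC domain, and the factor value of 40 used in the note
below is an assumption, not a measurement from the test box.

```c
/*
 * User-space sketch of the wake_wide() test, before and after the hack.
 * Hypothetical names and plain parameters; not the kernel implementation.
 */

/* Order the flip counts so that master holds the larger one. */
static void order_flips(unsigned int *master, unsigned int *slave)
{
	if (*master < *slave) {
		unsigned int tmp = *master;

		*master = *slave;
		*slave = tmp;
	}
}

/* Original test: the smaller count must reach factor AND the ratio must hold. */
static int wake_wide_orig(unsigned int master, unsigned int slave,
			  unsigned int factor)
{
	order_flips(&master, &slave);
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

/* Hacked test: only the master/slave ratio is checked. */
static int wake_wide_patched(unsigned int master, unsigned int slave,
			     unsigned int factor)
{
	order_flips(&master, &slave);
	/* Note: in this sketch slave == 0 makes the product 0, so this returns 1. */
	if (master < slave * factor)
		return 0;
	return 1;
}
```

With an assumed factor of 40, wake_wide_orig(200, 2, 40) returns 0
because the smaller flip count is below the factor, while
wake_wide_patched(200, 2, 40) returns 1 because 200 is not less than
2 * 40. The two agree once the smaller count reaches the factor, e.g.
both return 1 for (2000, 40, 40) and both return 0 for (50, 45, 40).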