Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752440AbbGBRpr (ORCPT ); Thu, 2 Jul 2015 13:45:47 -0400
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:32463 "EHLO
        mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751382AbbGBRpk (ORCPT );
        Thu, 2 Jul 2015 13:45:40 -0400
Message-ID: <55957871.7080906@fb.com>
Date: Thu, 2 Jul 2015 13:44:17 -0400
From: Josef Bacik
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Mike Galbraith
CC: Peter Zijlstra , , , , , kernel-team
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
References: <1432761736-22093-1-git-send-email-jbacik@fb.com>
        <20150528102127.GD3644@twins.programming.kicks-ass.net>
        <20150528110514.GR18673@twins.programming.kicks-ass.net>
        <1434087305.3674.26.camel@gmail.com> <5581B70D.2000800@fb.com>
        <1434588939.3444.25.camel@gmail.com> <55823F33.7040005@fb.com>
        <1434600765.3393.9.camel@gmail.com>
In-Reply-To: <1434600765.3393.9.camel@gmail.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [192.168.52.123]
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.14.151,1.0.33,0.0.0000 definitions=2015-07-02_12:2015-07-02,2015-07-02,1970-01-01 signatures=0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5350
Lines: 133

On 06/18/2015 12:12 AM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote:
>> On 06/17/2015 05:55 PM, Mike Galbraith wrote:
>>> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>>>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>>>
>>>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>>>> tsk_cpus_allowed() or wake_affine().
>>>>>
>>>>
>>>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>>>> like this
>>>>
>>>> if (tmp >= 0) {
>>>>         new_cpu = tmp;
>>>>         goto unlock;
>>>> } else if (!want_affine) {
>>>>         new_cpu = prev_cpu;
>>>> }
>>>>
>>>> so we can make sure we're not being pushed onto a cpu that we aren't
>>>> allowed on? Thanks,
>>>
>>> The buglet is a messenger methinks. You saying the patch helped without
>>> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
>>> that preferring cache is not harming your load after all. It now sounds
>>> as though wake_wide() may be what you're squabbling with.
>>>
>>> Things aren't adding up all that well.
>>
>> Yeah I'm horribly confused. The other thing is I had to switch clusters
>> (I know, I know, I'm changing the parameters of the test). So these new
>> boxes are haswell boxes, but basically the same otherwise, 2 socket 12
>> core with HT, just newer/faster CPUs. I'll re-run everything again and
>> give the numbers so we're all on the same page again, but as it stands
>> now I think we have this
>>
>> 3.10 with wake_idle forward ported - good
>> 4.0 stock - 20% perf drop
>> 4.0 w/ Peter's patch - good
>> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>>
>> I can do all these iterations again to verify, is there any other
>> permutation you'd like to see? Thanks,
>
> Yeah, after re-baseline, please apply/poke these buttons individually in
> 4.0-virgin.
>
> (cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)
>

Sorry it took me a while to get these numbers to you, migrating the
whole fleet to a new setup broke the performance test suite thing so
I've only just been able to run tests again. I'll do my best to
describe what is going on and hopefully that will make the results make
sense.

This is on our webservers, which run HHVM. A request comes in for a
page and this goes onto one of the two hhvm.node.# threads, one thread
per NUMA node.
From there it is farmed off to one of the worker threads. If there are
no idle workers the request gets put on what is called the
"select_queue". Basically the select_queue should never be larger than
0 in a perfect world. If it's more than that we've hit latency
somewhere and that's not good. The other measurement we care about is
how long a thread spends on a request before it sends a response (this
would be the actual work being done).

Our tester slowly increases load to a group of servers until the select
queue is consistently >= 1. That means we've loaded the boxes so high
that they can't process the requests as soon as they've come in. Then
it backs down and then ramps up a second time. It takes all of these
measurements and puts them into these pretty graphs. There are 2 graphs
we care about, the duration of the requests vs the requests per second
and the probability that our select queue is >= 1 vs requests per
second.

Now for 3.10 vs 4.0 our request duration time is the same if not
slightly better on 4.0, so once the workers are doing their job
everything is a-ok. The problem is the probability the select queue
>= 1 is way different on 4.0 vs 3.10. Normally this graph looks like an
S, it's essentially 0 up to some RPS (requests per second) threshold
and then shoots up to 100% after the threshold.

I'll make a table of these graphs that hopefully makes sense, the
numbers are different from run to run because of traffic and such, the
test and control are both run at the same time. The header is the
probability the select queue >= 1

                    25%    50%    75%
4.0 plain:          371    388    402
control:            386    394    402
difference:          15      6      0

So with 4.0 it's basically a straight line, at lower RPS we are getting
a higher probability of a select queue >= 1. We are measuring the cpu
delay avg ms thing from the scheduler netlink stuff which is how I
noticed it was scheduler related, our cpu delay is way higher on 4.0
than it is on 3.10 or 4.0 with the wake idle patch.

So the next test is NO_PREFER_IDLE.
This is slightly better than 4.0 plain

                    25%    50%    75%
NO_PREFER_IDLE:     399    401    414
control:            385    408    416
difference:          14      7      2

The numbers don't really show it well, but the graphs are closer
together, it's slightly more S shaped, but still not great.

Next is NO_WAKE_WIDE, which is horrible

                    25%    50%    75%
NO_WAKE_WIDE:       315    344    369
control:            373    380    388
difference:          58     36     19

This isn't even in the same ballpark, it's a way worse regression than
plain.

The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad

                    25%    50%    75%
EVERYTHING:         327    360    383
control:            381    390    399
difference:          54     30     19

Hopefully that helps. Thanks,

Josef
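
For anyone wanting to redo the arithmetic in the tables above, here is a minimal Python sketch of how the "difference" rows can be recomputed. It assumes each figure is the RPS at which the probability of the select queue being >= 1 reaches the column's percentage, and that the difference is control minus test; that sign convention is inferred from the 4.0 plain and NO_WAKE_WIDE tables and is not stated in the message. The names `RUNS` and `rps_gap` are made up for illustration.

```python
# Figures copied from the "4.0 plain" and "NO_WAKE_WIDE" tables above.
# Each tuple holds the RPS at the 25%, 50% and 75% probability columns.
PERCENTILES = ("25%", "50%", "75%")

RUNS = {
    "4.0 plain": {"test": (371, 388, 402), "control": (386, 394, 402)},
    "NO_WAKE_WIDE": {"test": (315, 344, 369), "control": (373, 380, 388)},
}


def rps_gap(test, control):
    """Return control minus test for each column (assumed sign convention).

    A larger value means the test kernel started queueing requests at a
    lower RPS than the control kernel did.
    """
    return tuple(c - t for t, c in zip(test, control))


# Print one line per run, pairing each column header with its gap.
for name, run in RUNS.items():
    gaps = rps_gap(run["test"], run["control"])
    cells = ", ".join(f"{p}: {g}" for p, g in zip(PERCENTILES, gaps))
    print(f"{name}: {cells}")
```

Under these assumptions, a gap near zero across all three columns would mean the test kernel's curve tracks the control's, while a large gap at 25% means queueing starts at a much lower RPS.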