Date: Tue, 12 Apr 2016 09:27:58 -0400
From: Chris Mason
To: Mike Galbraith
CC: Peter Zijlstra, Ingo Molnar, Matt Fleming
Subject: Re: sched: tweak select_idle_sibling to look for idle threads
Message-ID: <20160412132758.7apgqqwl2c2wksy6@floor.thefacebook.com>
References: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>
 <20160409190554.honue3gtian2p6vr@floor.thefacebook.com>
 <1460282661.4251.44.camel@suse.de>
 <20160410195543.fp2tpixaafsts5x3@floor.thefacebook.com>
 <1460350461.3870.36.camel@suse.de>
 <20160412003044.smr24xzuom3locvo@floor.thefacebook.com>
 <1460436248.3839.80.camel@suse.de>
In-Reply-To: <1460436248.3839.80.camel@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 12, 2016 at 06:44:08AM +0200, Mike Galbraith wrote:
> On Mon, 2016-04-11 at 20:30 -0400, Chris Mason wrote:
> > On Mon, Apr 11, 2016 at 06:54:21AM +0200, Mike Galbraith wrote:
>
> > > > Ok, I was able to reproduce this by stuffing tbench_srv and tbench onto
> > > > just socket 0. Version 2 below fixes things for me, but I'm hoping
> > > > someone can suggest a way to get task_hot() buddy checks without the rq
> > > > lock.
> > > >
> > > > I haven't run this on production loads yet, but our 4.0 patch for this
> > > > uses task_hot(), so I'd expect it to be on par. If this doesn't fix it
> > > > for you, I'll dig up a similar machine on Monday.
> > >
> > > My box stopped caring. I personally would be reluctant to apply it
> > > without a "you asked for it" button or a large pile of benchmark
> > > results. Lock banging or not, full scan existing makes me nervous.
> >
> >
> > We can use a bitmap at the socket level to keep track of which cpus are
> > idle. I'm sure there are better places for the array and better ways to
> > allocate, this is just a rough cut to make sure the idle tracking works.
>
> See e0a79f529d5b:
>
> pre   15.22 MB/sec 1 procs
> post 252.01 MB/sec 1 procs
>
> You can make traverse cycles go away, but those cycles, while precious,
> are not the most costly cycles. The above was 1 tbench pair in an
> otherwise idle box.. ie it wasn't traverse cycles that demolished it.

Agreed, this is why the decision not to scan is so important. But while
I've been describing this patch in terms of latency, latency is really
the symptom instead of the goal. Without these patches, workloads that
do want to fully utilize the hardware are basically getting one fewer
core of utilization. It's true that we define 'fully utilize' with an
upper bound on application response time, but we're not talking high
frequency trading here.

It clearly shows up in our graphs. CPU idle is higher (the lost core),
CPU user time is lower, average system load is higher (procs waiting on
a fewer number of core).

We measure this internally with scheduling latency because that's the
easiest way to talk about it across a wide variety of hardware.

>
> -Mike
>
> (p.s. SCHED_IDLE is dinky bandwidth fair class)

Ugh, not my best quick patch, but you get the idea I was going for. I
can always add the tunable to flip things on/off but I'd prefer that we
find a good set of defaults, mostly so the FB production runtime is the
common config instead of the special snowflake.

-chris
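
For readers following the thread, the per-socket idle bitmap mentioned in the quoted text can be illustrated with a small userspace sketch. This is not the patch discussed here. The names socket_idle_map, mark_idle, mark_busy and find_idle_cpu are hypothetical, and the sketch assumes at most 64 CPUs per socket and uses C11 atomics in place of the kernel's own cpumask helpers.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical per-socket idle map: bit n set means CPU n of this
 * socket is currently idle. Assumes at most 64 CPUs per socket;
 * passing cpu >= 64 to the helpers below is undefined.
 */
struct socket_idle_map {
	_Atomic uint64_t idle_bits;
};

/* Record that a CPU went idle (hypothetical hook, lock-free). */
static inline void mark_idle(struct socket_idle_map *m, unsigned int cpu)
{
	atomic_fetch_or(&m->idle_bits, UINT64_C(1) << cpu);
}

/* Record that a CPU picked up work again. */
static inline void mark_busy(struct socket_idle_map *m, unsigned int cpu)
{
	atomic_fetch_and(&m->idle_bits, ~(UINT64_C(1) << cpu));
}

/*
 * Return the lowest-numbered idle CPU, or -1 if none is marked idle.
 * The result is only a hint: it may be stale by the time the caller
 * acts on it, so a real scheduler would still have to recheck.
 */
static inline int find_idle_cpu(struct socket_idle_map *m)
{
	uint64_t bits = atomic_load(&m->idle_bits);
	int cpu = 0;

	if (bits == 0)
		return -1;
	while (!(bits & 1)) {
		bits >>= 1;
		cpu++;
	}
	return cpu;
}
```

The point of the sketch is that the wakeup path reads one word instead of walking every runqueue in the socket. It does nothing about Mike's objection above, which is that the traverse cycles are not the most costly part.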