Subject: Re: [PATCH RFC] select_idle_sibling experiments
From: Bastien Philbert
To: Matt Fleming, Chris Mason, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
 linux-kernel@vger.kernel.org
Cc: Mel Gorman
Date: Tue, 5 Apr 2016 17:05:54 -0400
Message-ID: <570428B2.5020900@gmail.com>
In-Reply-To: <20160405200302.GL2701@codeblueprint.co.uk>
References: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>
 <20160405200302.GL2701@codeblueprint.co.uk>

On 2016-04-05 04:03 PM, Matt Fleming wrote:
> On Tue, 05 Apr, at 02:08:22PM, Chris Mason wrote:
>>
>> I started with a small-ish program to benchmark wakeup latencies.  The
>> basic idea is a bunch of worker threads who sit around and burn CPU.
>> Every once and a while they send a message to a message thread.
>
> This reminds me of something I've been looking at recently; a similar
> workload in Mel's mmtests based on pgbench with 1-client that also has
> this problem of cpu_idle() being false at an inconvenient time in
> select_idle_sibling(), so we move the task off the cpu and the cpu
> then immediately goes idle.
>
> This leads to tasks bouncing around the socket as we search for idle
> cpus.
>
>> It has knobs for cpu think time, and for how long the messenger thread
>> waits before replying.  Here's how I'm running it with my patch:
>
> [...]
>
> Cool, I'll go have a play with this.
>
>> Now, on to the patch.  I pushed some code around and narrowed the
>> problem down to select_idle_sibling()  We have cores going into and out
>> of idle fast enough that even this cut our latencies in half:
>>
>> static int select_idle_sibling(struct task_struct *p, int target)
>> 				goto next;
>>
>> 			for_each_cpu(i, sched_group_cpus(sg)) {
>> -				if (i == target || !idle_cpu(i))
>> +				if (!idle_cpu(i))
>> 					goto next;
>> 			}
>>
>> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
>> done at the top of the function is no longer valid.
>
> Yeah. The problem is that because we're racing with the cpu going in
> and out of idle, and since you're exploiting that race condition, this
> is highly tuned to your specific workload.
>
> Which is a roundabout way of saying, this is probably going to
> negatively impact other workloads.
>
>> I tried a few variations on select_idle_sibling() that preserved the
>> underlying goal of returning idle cores before idle SMT threads.  They
>> were all horrible in different ways, and none of them were fast.
>
> I toyed with ignoring cpu_idle() in select_idle_sibling() for my
> workload. That actually was faster ;)
>
>> The patch below just makes select_idle_sibling pick the first idle
>> thread it can find.  When I ran it through production workloads here, it
>> was faster than the patch we've been carrying around for the last few
>> years.
>
> It would be really nice if we had a lightweight way to gauge the
> "idleness" of a cpu, and whether we expect it to be idle again soon.

The best way to do this is probably either to embed it in an already
used structure so we can check it quickly, or else I am curious whether
writing a macro for it may prove useful.  It seems the idleness tracking
needs to be accounted for as part of scheduling itself: to keep it
lightweight enough that we avoid adding work during a context switch,
the challenge is making the reference counting cheap enough to be worth
doing inside the current scheduling functions.  A rough sketch of what
I mean is at the end of this mail.

> Failing that, could we just force the task onto 'target' when it makes
> sense and skip the idle search (and the race) altogether?
>
Doesn't this possibly cause a context switch, or even an expensive
migration of the task to another CPU on certain architectures?  It
seems we need to add reference counting or tracking of idle CPUs
somewhere.

Bastien
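
P.S. For illustration only, roughly the kind of check I have in mind.
This is just a sketch, not against any particular tree; the last_idle_ns
field, the helper names and the 500us cutoff are all made up here.  The
idea is that the idle entry path stamps the rq, and a caller like
select_idle_sibling() gets a cheap "was this cpu idle very recently?"
test instead of racing with idle_cpu() on every sibling:

/* Stamp the rq when the cpu enters the idle loop (a hot path already). */
static inline void rq_stamp_idle(struct rq *rq, u64 now)
{
	WRITE_ONCE(rq->last_idle_ns, now);	/* hypothetical field */
}

/*
 * Cheap idleness gauge: the cpu is idle right now, or it went idle
 * within the last 500us, so it is likely a reasonable place for the
 * wakee even if we lose the race with the idle exit path.
 */
static inline bool cpu_recently_idle(int cpu, u64 now)
{
	struct rq *rq = cpu_rq(cpu);

	if (idle_cpu(cpu))
		return true;

	return (now - READ_ONCE(rq->last_idle_ns)) < 500 * NSEC_PER_USEC;
}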