Date: Tue, 5 Apr 2016 21:03:02 +0100
From: Matt Fleming
To: Chris Mason, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
	linux-kernel@vger.kernel.org
Cc: Mel Gorman
Subject: Re: [PATCH RFC] select_idle_sibling experiments
Message-ID: <20160405200302.GL2701@codeblueprint.co.uk>
In-Reply-To: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>

On Tue, 05 Apr, at 02:08:22PM, Chris Mason wrote:
> 
> I started with a small-ish program to benchmark wakeup latencies.  The
> basic idea is a bunch of worker threads who sit around and burn CPU.
> Every once in a while they send a message to a message thread.

This reminds me of something I've been looking at recently: a similar
workload in Mel's mmtests, based on pgbench with 1 client, also hits
this problem of idle_cpu() being false at an inconvenient time in
select_idle_sibling(), so we move the task off the cpu and the cpu
then immediately goes idle.

This leads to tasks bouncing around the socket as we search for idle
cpus.

> It has knobs for cpu think time, and for how long the messenger thread
> waits before replying.  Here's how I'm running it with my patch:

[...]

Cool, I'll go have a play with this.
> Now, on to the patch.  I pushed some code around and narrowed the
> problem down to select_idle_sibling().  We have cores going into and
> out of idle fast enough that even this cut our latencies in half:
> 
> static int select_idle_sibling(struct task_struct *p, int target)
> 			goto next;
> 
> 		for_each_cpu(i, sched_group_cpus(sg)) {
> -			if (i == target || !idle_cpu(i))
> +			if (!idle_cpu(i))
> 				goto next;
> 		}
> 
> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
> done at the top of the function is no longer valid.

Yeah. The problem is that we're racing with the cpu going in and out
of idle, and since you're exploiting that race condition, this is
highly tuned to your specific workload.

Which is a roundabout way of saying: this is probably going to
negatively impact other workloads.

> I tried a few variations on select_idle_sibling() that preserved the
> underlying goal of returning idle cores before idle SMT threads.  They
> were all horrible in different ways, and none of them were fast.

I toyed with ignoring idle_cpu() in select_idle_sibling() for my
workload. That actually was faster ;)

> The patch below just makes select_idle_sibling pick the first idle
> thread it can find.  When I ran it through production workloads here,
> it was faster than the patch we've been carrying around for the last
> few years.

It would be really nice if we had a lightweight way to gauge the
"idleness" of a cpu, and whether we expect it to be idle again soon.

Failing that, could we just force the task onto 'target' when it makes
sense and skip the idle search (and the race) altogether?