From: Vincent Guittot
Date: Thu, 27 Apr 2017 10:29:10 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170426225202.GC11348@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>
            <20170424201444.GC14169@wtj.duckdns.org>
            <20170426225202.GC11348@wtj.duckdns.org>

On 27 April 2017 at 00:52, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> On 24 April 2017 at 22:14, Tejun Heo wrote:
>> Can the problem be on the load balance side instead? And more
>> precisely in the wakeup path?
>> After looking at the trace, it seems that task placement happens in
>> the wakeup path, and if it fails to select the right idle CPU at
>> wakeup, you will have to wait for a load balance, which is already
>> too late.
>
> Oh, I was tracing most of the scheduler activity and the ratios of
> wakeups picking idle CPUs were about the same regardless of cgroup
> membership.  I can confidently say that the latency issue that I'm
> seeing is from the load balancer picking the wrong busiest CPU, which
> is not to say that there can't be other problems.

OK. Is there any trace that you can share? Your behavior seems
different from mine.

>
>> > another queued wouldn't report the correspondingly higher
>>
>> It will, as load_avg includes runnable_load_avg, so whatever load is
>> in runnable_load_avg will be in load_avg too. But on the contrary,
>> runnable_load_avg will not include the blocked task that is going to
>> wake up soon in the case of schbench.
>
> The decaying contribution of blocked tasks doesn't affect the busiest
> CPU selection.  Without cgroups, runnable_load_avg is immediately
> increased and decreased as tasks enter and leave the queue; otherwise
> we end up with CPUs which are idle while threads queued on different
> CPUs accumulate scheduling latencies.
>
> The patch doesn't change how the busiest CPU is picked.  It already
> uses runnable_load_avg.  The change that cgroup causes is that it
> blocks updates to runnable_load_avg from newly scheduled or sleeping
> tasks.
>
> The issue isn't about whether runnable_load_avg or load_avg should be
> used but the unexpected differences in the metrics that the load

I think that's the root of the problem. I explain my view a bit more
in the other thread.

> balancer uses depending on whether cgroup is used or not.
>
>> One last thing: the load_avg of an idle CPU can stay blocked for a
>> while (until a load balance happens that will update the blocked
>> load) and can be seen as "busy" whereas it is not. Could it be a
>> reason for your problem?
>
> AFAICS, the load balancer doesn't use load_avg.
>
> Thanks.
>
> --
> tejun
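
To make the distinction we keep coming back to concrete, below is a
minimal user-space sketch of the two metrics. It is only an
illustration: the struct, helpers and numbers are invented for the
example and are not the actual code or fields in kernel/sched/fair.c.
The point it shows is the one above: runnable_load_avg moves
immediately on enqueue/dequeue, while load_avg also carries the
decaying contribution of blocked tasks, which is why an idle CPU can
still look "busy" by load_avg.

/*
 * Illustrative user-space sketch only, not kernel code.  All names
 * below (cpu_load, enqueue, dequeue_sleep, pick_busiest, ...) are
 * made up for this example.
 */
#include <stdio.h>

struct cpu_load {
	unsigned long runnable_load_avg;  /* load of queued tasks only */
	unsigned long blocked_load_avg;   /* decaying load of sleeping tasks */
};

/* load_avg as discussed above: runnable plus blocked contributions */
static unsigned long load_avg(const struct cpu_load *c)
{
	return c->runnable_load_avg + c->blocked_load_avg;
}

static void enqueue(struct cpu_load *c, unsigned long w)
{
	c->runnable_load_avg += w;        /* visible to the balancer at once */
}

static void dequeue_sleep(struct cpu_load *c, unsigned long w)
{
	c->runnable_load_avg -= w;
	c->blocked_load_avg += w;         /* still in load_avg, decays over time */
}

/* crude geometric decay of the blocked contribution, one step per period */
static void decay(struct cpu_load *c)
{
	c->blocked_load_avg -= c->blocked_load_avg / 2;
}

/* busiest-CPU choice by runnable_load_avg, as described above */
static int pick_busiest(const struct cpu_load *cpus, int nr)
{
	int i, busiest = 0;

	for (i = 1; i < nr; i++)
		if (cpus[i].runnable_load_avg > cpus[busiest].runnable_load_avg)
			busiest = i;
	return busiest;
}

int main(void)
{
	struct cpu_load cpus[2] = { { 0, 0 }, { 0, 0 } };

	enqueue(&cpus[0], 1024);          /* one task queued on CPU0 */
	enqueue(&cpus[1], 1024);          /* one task queued on CPU1 */
	dequeue_sleep(&cpus[1], 1024);    /* the CPU1 task blocks */

	printf("CPU1 runnable_load_avg=%lu load_avg=%lu\n",
	       cpus[1].runnable_load_avg, load_avg(&cpus[1]));
	printf("busiest by runnable_load_avg: CPU%d\n", pick_busiest(cpus, 2));

	decay(&cpus[1]);                  /* blocked load only fades away over time */
	printf("after decay: CPU1 load_avg=%lu\n", load_avg(&cpus[1]));

	return 0;
}

The cgroup case Tejun describes would correspond to enqueue() and
dequeue_sleep() happening on a child cfs_rq without being reflected
immediately in the root's runnable_load_avg, so the equivalent of
pick_busiest() above ends up working on stale numbers.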