Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1030246AbdD0I2j (ORCPT ); Thu, 27 Apr 2017 04:28:39 -0400
Received: from mail-oi0-f50.google.com ([209.85.218.50]:33636 "EHLO
	mail-oi0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S967759AbdD0I2W (ORCPT );
	Thu, 27 Apr 2017 04:28:22 -0400
MIME-Version: 1.0
In-Reply-To: <20170427003020.GD11348@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>
	<20170424201444.GC14169@wtj.duckdns.org>
	<20170425184941.GB15593@wtj.duckdns.org>
	<20170425210810.GB20255@wtj.duckdns.org>
	<20170427003020.GD11348@wtj.duckdns.org>
From: Vincent Guittot
Date: Thu, 27 Apr 2017 10:28:01 +0200
Message-ID:
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4825
Lines: 114

On 27 April 2017 at 02:30, Tejun Heo wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
>> > This is from the follow-up patch. I was confused. Because we don't
>> > propagate decays, we still should decay the runnable_load_avg;
>> > otherwise, we end up accumulating errors in the counter. I'll drop
>> > the last patch.
>>
>> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
>> see runnable_load_avg sometimes significantly higher than load_avg,
>> which is normally not possible, as load_avg = runnable_load_avg +
>> sleeping tasks' load_avg
>
> So, while load_avg would eventually converge on runnable_load_avg +
> blocked load_avg given stable enough workload for long enough,
> runnable_load_avg jumping above load_avg temporarily is expected,

No, it's not.
Look at load_avg/runnable_load_avg at the root domain when only tasks
are involved: runnable_load_avg will never be higher than load_avg,
because

  load_avg          = \sum load_avg of tasks attached to the cfs_rq
  runnable_load_avg = \sum load_avg of tasks attached and enqueued to
                      the cfs_rq
  load_avg          = runnable_load_avg + blocked tasks' load_avg

and as a result runnable_load_avg is always lower than or equal to
load_avg. And with the propagate load/util_avg patchset, we can even
reflect a task migration directly at the root domain, whereas before
we had to wait for util/load_avg and runnable_load_avg to converge to
the new value.

Just to confirm one of my assumptions: the latency regression was
already there in previous kernel versions and is not a result of the
propagate load/util_avg patchset, isn't it?

> AFAICS. That's the whole point of it, a sum closely tracking what's
> currently on the cpu so that we can pick the cpu which has the most on
> it now. It doesn't make sense to try to pick threads off of a cpu
> which is generally loaded but doesn't have much going on right now,
> after all.

The only interest of runnable_load_avg is to be null when a cfs_rq is
idle whereas load_avg is not, not to be higher than load_avg. The root
cause is that load_balance only looks at "load" but not at the number
of tasks currently running, and that's probably the main problem:
runnable_load_avg was added because load_balance fails to filter out
idle groups and idle rqs. We would do better to add a new type in
group_classify to tag groups that are idle, and do the same in
find_busiest_queue.

>
>> Then, I just have the opposite behavior on my platform. I see an
>> increase of latency at p99 with your patches.
>> My platform is a hikey: 2x4 cores ARM and I have used schbench -m 2
>> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU which
>> is similar to what you are doing on your platform
>>
>> With v4.11-rc8, I have run the test 10 times and get consistent results
> ...
>> *99.0000th: 539
> ...
>> With your patches I see an increase of the latency for p99. I run 10
> ...
>> *99.0000th: 2034
>
> I see. This is surprising given that at least the purpose of the
> patch is restoring cgroup behavior to match !cgroup one. I could have
> totally messed it up tho. Hmm... there are several ways forward I
> guess.
>
> * Can you please double check that the higher latencies w/ the patch
>   is reliably reproducible? The test machines that I use have
>   variable management load. They never dominate the machine but are
>   enough to disturb the results so that drawing out a reliable
>   pattern takes a lot of repeated runs. I'd really appreciate if you
>   could double check that the pattern is reliable with different run
>   patterns (ie. instead of 10 consecutive runs after another,
>   interleaved).

I always leave some time between 2 consecutive runs, and the 10
consecutive runs take around 2 min to execute. I have also run these
10 consecutive tests several times, and the results stay the same.

> * Is the board something easily obtainable? It'd be the easiest for
>   me to set up the same environment and reproduce the problem. I
>   looked up hikey boards on amazon but couldn't easily find 2x4 core

It is often named hikey octo core, but I wrote 2x4 cores just to
point out that there are 2 clusters, which is important for scheduler
topology and behavior.

>   ones. If there's something I can easily buy, please point me to it.
>   If there's something I can loan, that'd be great too.

It looks like most web sites are currently out of stock.

> * If not, I'll try to clean up the debug patches I have and send them
>   your way to get more visibility but given these things tend to be
>   very iterative, it might take quite a few back and forth.

Yes, that could be useful. Even a trace of the regression could be
useful. I can also push to my git tree the debug patch that I use for
tracking load metrics, if you want. It's ugly but it does the job.

Thanks

> Thanks!
>
> --
> tejun
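PS: the load_avg vs. runnable_load_avg relation I describe above can be
sketched as a toy model. This is only an illustration of the bookkeeping
rule (every attached task contributes to load_avg, only enqueued tasks
contribute to runnable_load_avg); the name ToyCfsRq is made up and this
is nothing like the kernel's fixed-point PELT implementation:

```python
# Toy model of the invariant: load_avg sums all attached tasks,
# runnable_load_avg sums only the enqueued (runnable) ones, so
# runnable_load_avg <= load_avg always holds.

class ToyCfsRq:
    def __init__(self):
        self.tasks = []  # list of (load_avg, enqueued) pairs

    def attach(self, load_avg, enqueued):
        self.tasks.append((load_avg, enqueued))

    @property
    def load_avg(self):
        # every attached task contributes, runnable or blocked
        return sum(load for load, _ in self.tasks)

    @property
    def runnable_load_avg(self):
        # only enqueued tasks contribute
        return sum(load for load, enq in self.tasks if enq)

rq = ToyCfsRq()
rq.attach(1024, enqueued=True)    # running task
rq.attach(512, enqueued=False)    # sleeping (blocked) task

# load_avg = runnable_load_avg + blocked load
assert rq.load_avg == rq.runnable_load_avg + 512
assert rq.runnable_load_avg <= rq.load_avg
```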
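PPS: since the whole comparison hinges on the p99 number that schbench
prints, here is a minimal stand-alone sketch of a nearest-rank
percentile, just to make the metric concrete. This is an assumption
about the method: schbench keeps its own internal histogram, so its
exact rounding may differ.

```python
# Nearest-rank percentile: smallest sample value such that at least
# p percent of the samples are <= that value.
import math

def percentile(samples, p):
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical per-request latencies (us) from one 10-sample run: a
# single outlier is enough to dominate p99 at this sample count.
latencies_us = [480, 490, 500, 505, 510, 515, 520, 530, 535, 2034]
print(percentile(latencies_us, 99))  # -> 2034 (with 10 samples, p99 is the max)
```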