Date: Mon, 3 Jul 2017 09:41:06 -0400
From: Josef Bacik
To: Vincent Guittot
Cc: josef@toxicpanda.com, mingo@redhat.com, Peter Zijlstra, linux-kernel, kernel-team@fb.com, Josef Bacik
Subject: Re: [RFC][PATCH] sched: attach extra runtime to the right avg
Message-ID: <20170703134105.GB27097@destiny>
References: <1498787766-9593-1-git-send-email-jbacik@fb.com>

On Mon, Jul 03, 2017 at 09:26:10AM +0200, Vincent Guittot wrote:
> Hi Josef,
>
> On 30 June 2017 at 03:56, wrote:
> > From: Josef Bacik
> >
> > We only track the load avg of a se in 1024 ns chunks, so in order to
> > make up for the loss of the < 1024 ns part of a run/sleep delta we only
> > add the time we processed to the se->avg.last_update_time. The problem
> > is there is no way to know if this extra time was while we were asleep
> > or while we were running. Instead keep track of the remainder and apply
> > it in the appropriate place. If the remainder was while we were
> > running, add it to the delta the next time we update the load avg while
> > running, and the same for sleeping. This (coupled with other fixes)
> > mostly fixes the regression to my workload introduced by Peter's
> > experimental runnable load propagation patches.
>
> IIUC, your workload is sensitive to the fact that the min granularity
> of the load tracking is 1us?
> The contribution seems to be quite small to have a real impact on the
> load_avg. Maybe rounding last_update_time to the closest value instead
> of the bottom value would be enough? We would have 512ns precision.
>
> Have you got details about your use case that needs this sub
> microsecond precision?
>

Yup, here's the artificial reproducer:

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

The problem is we put two sets of tasks in two different cgroups that have
equal weight. One group is a cpu hog; it is never taken off the runqueue
because it never sleeps. The other is a process that does actual work; the
reproducer has an rt-app config file that is a rough analog of the real
workload. This one sleeps and wakes up repeatedly. The task that sleeps and
wakes up ends up with only about 75% of the CPU time the cpu hog gets.

But this patch is only 1/3 of the solution. I'm on top of peterz's
sched/experimental branch plus some fixes for the regression those patches
introduce to my workload. This patch is needed because the 'interactive'
tasks slowly lose load average, which means that every time they go onto
the cpu they contribute less and less to the load of the cpu and thus screw
up the load balancing. With this fix and all of my other fixes in place I
get an even 50-50 split between the two groups.

Note this is only for two groups with disparate levels of interactivity.
If I put two copies of my sample workload in two different groups
everything works out fine; same if I put two cpu hogs in the two groups,
all is well. We only see this huge difference if one group is on the CPU
more, thus losing less load average over time.
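To make the remainder idea from the changelog concrete, here is a rough
userspace sketch of the accounting, not the actual kernel code; the names
(struct toy_avg, toy_update, the chunk counters) are made up for the
illustration and do not appear in the patch:

/*
 * Toy model of the idea: time is accounted in 1024 ns chunks, and the
 * sub-chunk leftover is remembered separately for running and sleeping
 * periods so it is credited to the next delta of the same kind instead
 * of bleeding into whichever delta happens to come next.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct toy_avg {
        uint64_t last_update_time;      /* ns, advanced by the raw delta */
        uint64_t run_remainder;         /* leftover < 1024 ns from running */
        uint64_t sleep_remainder;       /* leftover < 1024 ns from sleeping */
        uint64_t running_chunks;        /* accumulated 1024 ns running chunks */
        uint64_t sleeping_chunks;       /* accumulated 1024 ns sleeping chunks */
};

static void toy_update(struct toy_avg *a, uint64_t now, bool running)
{
        uint64_t delta = now - a->last_update_time;

        /* Credit the leftover from the *same* kind of period. */
        delta += running ? a->run_remainder : a->sleep_remainder;

        uint64_t chunks = delta >> 10;  /* whole 1024 ns chunks */
        uint64_t rem = delta & 1023;    /* sub-chunk remainder */

        if (running) {
                a->running_chunks += chunks;
                a->run_remainder = rem;
        } else {
                a->sleeping_chunks += chunks;
                a->sleep_remainder = rem;
        }
        /* Advance over the full wall-clock delta; the remainder is carried
         * per state rather than folded into the next delta blindly. */
        a->last_update_time = now;
}

int main(void)
{
        struct toy_avg a = { 0 };

        toy_update(&a, 1500, true);     /* ran 1500 ns: 1 chunk, 476 ns left */
        toy_update(&a, 3000, false);    /* slept 1500 ns: 1 chunk, 476 ns left */
        toy_update(&a, 4500, true);     /* ran again: the old 476 ns running
                                           leftover is credited here */

        printf("running chunks %llu, sleeping chunks %llu\n",
               (unsigned long long)a.running_chunks,
               (unsigned long long)a.sleeping_chunks);
        return 0;
}

The point is just that the sub-1024ns leftover from a running period only
ever gets credited to a later running period, and likewise for sleeping,
so the sleeper doesn't get its small slices of runtime silently dropped.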
Thanks,

Josef