From: Josef Bacik
To: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org, umgwanakikbuti@gmail.com, tj@kernel.org, kernel-team@fb.com
Subject: [PATCH 0/7][RESEND] Fix cpu imbalance with equal weighted groups
Date: Fri, 14 Jul 2017 13:20:57 +0000
Message-Id: <1500038464-8742-1-git-send-email-josef@toxicpanda.com>
X-Mailing-List: linux-kernel@vger.kernel.org

(Sorry to anybody who got this already; FB email is acting weird and ate the linux-kernel submission, so I have no idea who got this and who didn't.)

Hello,

While testing stacked services we noticed that if you started a normal CPU-heavy application in one cgroup and a CPU stress test in another cgroup of equal weight, the stress group would get significantly more CPU time, usually around 50% more.

Peter fixed some load propagation issues for Tejun a few months ago. Those fixes resolved the latency issues Tejun was seeing, but they made this imbalance worse: the stress group was now getting more like 70% more CPU time. The following patches first fix the regression introduced by Peter's patches and then fix the imbalance itself. Part of the imbalance fix comes from Peter's propagation patches; we just needed the runnable weight to be calculated differently to fix the regression.

Essentially what happens is that the "stress" group has tasks that never leave the CPU, so its load average and runnable load average skew towards its load.weight.
The interactive tasks, however, go on and off the CPU, resulting in a lower load average. Peter's changes, which lean more heavily on the runnable load average, exacerbated this problem.

To solve it I've done a few things. First, we use the max of the weight and the load average for our cfs_rq weight calculations. This lets tasks that have a lower load average but a higher weight have an appropriate effect on the cfs_rq at enqueue time.

The second part of the fix is how we decide to do wake affinity. With the other patches applied, simply disabling wake affinity also made the imbalance disappear. Fixing wake affinity involves a few things. First, effective_load() needs to re-calculate the historic weight in addition to the new weight with the new process, because simply using our old weight/load_avg would be inaccurate if the task_group's load_avg had changed at all since we calculated our load. In practice this meant that effective_load() would often (almost always for my testcase) return a negative delta for adding the process to the given CPU, so we always did the affine wakeup even though the load on the current CPU was too high.

Those patches get us 95% of the way there. The final patch is probably the more controversial one, but it brings us to complete balance between the two groups. One thing we observed was that we would wake affine and then promptly load balance tasks off of the CPU we woke to; you'd see tasks bounce around CPUs constantly. To avoid this thrashing, record the last time we were load balanced and wait HZ duration before allowing an affine wakeup to occur. This reduced the thrashing quite a bit and brought our CPU usage to equality.
I have a stripped-down reproducer here:

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

unbalanced.sh uses the cgroup2 interface, which requires Tejun's cgroup2 cpu controller patch; unbalanced-v1.sh uses the old cgroup v1 interface and assumes you have cpuacct,cpu mounted at /sys/fs/cgroup/cpuacct. You also need rt-app installed.

Thanks,

Josef