From: Jeffrey Hugo <jhugo@codeaurora.org>
To: Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo
Subject: [PATCH V4 0/2] load_balance() fixes for affinity
Date: Fri, 2 Jun 2017 16:27:10 -0600
Message-Id: <1496442432-330-1-git-send-email-jhugo@codeaurora.org>

We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler traces to profile the issue and noticed that, in the low-score
runs, at least one core was left idle for the duration of the test.

The test methodology used in our investigation involved an executable
compiled to fork a specific number of workers. The executable was then
affined, using the userspace taskset utility, to a number of cores equal
to the number of workers forked. The expectation was that all cores in the
mask of runnable CPUs would run a worker until the work was complete. In
practice, for several affinity masks, one or two cores would not receive
work until late in the test, or not at all.

Our first observation was that the initial CPU assignment of the forked
workers appeared suboptimal. It was highly probable that many or all of
the workers would initially be assigned to the same CPU. Our limited
investigation into this indicated that the forking happens fast enough
that the CPU load estimates have not yet adjusted to the work the CPUs
have already been assigned. select_task_rq_fair() does not take the
number of running tasks into consideration and thus can easily assign
several tasks spawned close together to the same CPU within the allowed
CPUs for a workload. While this seems suboptimal, the load_balance()
algorithm running on the other idle cores in the affinity mask should
resolve the imbalances that result from it.

As our test case was leaving cores idle for many seconds at a time, we dug
into the load_balance() code, specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessment of the algorithm's ability to
migrate load. The group_imbalance path correctly sets the flag to indicate
that the group cannot be properly balanced due to affinity, but the redo
condition right after this branch incorrectly assumes that there may be
other cores with work to be pulled, because it considers cores outside of
the scheduling domain in question.
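To make the walkthrough below easier to follow, the two adjacent branches
look roughly like this (a simplified excerpt of load_balance() in
kernel/sched/fair.c, comments ours; 'cpus' is initialized from
cpu_active_mask near the top of load_balance(), which is where the
out-of-domain cores come from):

	/*
	 * We failed to reach balance because of affinity; record that in
	 * the parent domain so other groups can see this one needs help.
	 */
	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}

	/* All tasks on this runqueue were pinned by CPU affinity. */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		/*
		 * 'cpus' still contains every active CPU in the system,
		 * not just the CPUs of this scheduling domain, so this
		 * test almost always succeeds and a redo is attempted.
		 */
		if (!cpumask_empty(cpus)) {
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_all_pinned;
	}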
Consider the following circumstance, where (T) represents tasks and (*)
represents the allowed CPUs:

        A                B                C
    { 0 1 }          { 2 3 }          { 4 5 }
        T             T  T            T  T
        *             *  *            *  *

In this case, core 0 in scheduling group 'A' should run load balance on
its local scheduling domain and try to pull work from core 1. When it
fails to move any work due to affinity, it will mark scheduling group 'A'
as "group_imbalance". However, right after doing so, core 0's
load_balance() evaluates the redo path with core 1 removed from
consideration. This should leave no other cores to consider and result in
a jump to "out_all_pinned", leaving the "group_imbalance" flag to be seen
by future runs of the algorithm on core 3.

As the code stands currently, core 0 evaluates all cores instead of only
those in the scheduling domain under consideration. It incorrectly sees
other cores in the system and attempts a redo. With core 1 removed from
the mask, the redo identifies no load imbalance and results in a jump to
"out_balanced", clearing the "group_imbalance" flag. (A small userspace
model of just this mask check is appended at the end of this mail.) We
propose correcting this logic with patch 1 of this series.

Making this correction revealed an issue in the calculate_imbalance()
path. Patch 2 removes a branch that does not make sense with the current
load_balance() algorithm: there is no scenario in which it benefits the
"group_imbalance" case in calculate_imbalance(), and it causes problems in
systems with affined workloads and many idle or lightly loaded CPUs.

Co-authored-by: Austin Christ
Signed-off-by: Jeffrey Hugo

[V4]
 -Restricted the active cpus mask to the domain span
 -Fixed the dst_cpu masking logic to work for the entire local group

[V3]
 -Fixed botched subject lines

[V2]
 -Corrected use of Signed-off-by tags
 -Removed temp cpumask variable from stack

Jeffrey Hugo (2):
  sched/fair: Fix load_balance() affinity redo path
  sched/fair: Remove group imbalance from calculate_imbalance()

 kernel/sched/fair.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
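Appendix: a tiny self-contained userspace model of just the redo mask
check discussed above, for the 3x2 topology in the example. The bitmask
helpers and values below are ours, not kernel code, and the "proposed"
condition only sketches the direction of patch 1 (restrict the mask to the
domain span and treat a mask that has collapsed to the local group as
having nothing left to pull from):

	#include <stdio.h>

	/* One bit per CPU: bit n == CPU n.  Six CPUs, 0-5. */
	typedef unsigned int cpumask_t;

	static int mask_empty(cpumask_t m)               { return m == 0; }
	static int mask_subset(cpumask_t a, cpumask_t b) { return (a & ~b) == 0; }

	int main(void)
	{
		cpumask_t cpu_active_mask = 0x3f; /* CPUs 0-5 active              */
		cpumask_t domain_span     = 0x03; /* core 0's local domain: {0 1} */
		cpumask_t dst_grpmask     = 0x01; /* local (destination) group    */
		cpumask_t cpus;

		/*
		 * Current code: the candidate mask starts as every active CPU,
		 * and core 1 (all of its tasks pinned away from core 0) is then
		 * dropped from it.
		 */
		cpus = cpu_active_mask & ~(1u << 1);
		printf("current check:  redo = %s\n",
		       !mask_empty(cpus) ? "yes (cores 2-5 counted, flag gets cleared)"
					 : "no");

		/*
		 * With the mask restricted to the domain span, removing core 1
		 * leaves only the destination group, so there is nothing left
		 * to pull from and we should go to out_all_pinned instead.
		 */
		cpus = (domain_span & cpu_active_mask) & ~(1u << 1);
		printf("proposed check: redo = %s\n",
		       !mask_subset(cpus, dst_grpmask) ? "yes"
						       : "no (out_all_pinned)");

		return 0;
	}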