From: Jeffrey Hugo <jhugo@codeaurora.org>
To: Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo
Subject: [PATCH V4 0/2] load_balance() fixes for affinity
Date: Fri, 2 Jun 2017 16:27:10 -0600
Message-Id: <1496442432-330-1-git-send-email-jhugo@codeaurora.org>

We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler traces to profile the issue and noticed that, in the low-score
runs, at least one core was left idle for the duration of the test.

The test methodology used in our investigation involved an executable
compiled to fork a specific number of workers. The executable was then
affined, using the userspace taskset utility, to a number of cores equal
to the number of workers forked. The expectation was that all cores in the
mask of runnable CPUs would run a worker until the work was complete. In
practice, for several affinity masks, one or two cores would not receive
work until late in the test, or not at all.

Our first observation was that the initial CPU assignment of the forked
workers appeared suboptimal. It was highly probable that many or all of
the workers would initially be assigned to the same CPU. Our limited
investigation into this indicated that the forking happens fast enough
that the CPU load estimates have not yet adjusted to the work the CPUs
have already been assigned. select_task_rq_fair() does not take the
number of running tasks into consideration and thus can easily assign
several tasks spawned close together to the same CPU within the allowed
CPUs for a workload. While this seems suboptimal, the load_balance()
algorithm running on the other idle cores in the affinity mask should
resolve the imbalances that result from it.

As our test case was leaving cores idle for many seconds at a time, we dug
into the load_balance() code, specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessment of the algorithm's ability to
migrate load. The group_imbalance path correctly sets the flag to indicate
that the group cannot be properly balanced due to affinity, but the redo
condition right after this branch incorrectly assumes that there may be
other cores with work to be pulled, because it considers cores outside of
the scheduling domain in question.
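To make the walkthrough below easier to follow, the two adjacent branches
look roughly like this (a simplified excerpt of load_balance() in
kernel/sched/fair.c, comments ours; 'cpus' is initialized from
cpu_active_mask near the top of load_balance(), which is where the
out-of-domain cores come from):

	/*
	 * We failed to reach balance because of affinity; record that in
	 * the parent domain so other groups can see this one needs help.
	 */
	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}

	/* All tasks on this runqueue were pinned by CPU affinity. */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		/*
		 * 'cpus' still contains every active CPU in the system,
		 * not just the CPUs of this scheduling domain, so this
		 * test almost always succeeds and a redo is attempted.
		 */
		if (!cpumask_empty(cpus)) {
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_all_pinned;
	}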
Consider the following circumstance, where (T) represents tasks and (*)
represents the allowed CPUs:

        A                B                C
    { 0 1 }          { 2 3 }          { 4 5 }
        T             T  T            T  T
        *             *  *            *  *

In this case, core 0 in scheduling group 'A' should run load balance on
its local scheduling domain and try to pull work from core 1. When it
fails to move any work due to affinity, it will mark scheduling group 'A'
as "group_imbalance". However, right after doing so, core 0's
load_balance() evaluates the redo path with core 1 removed from
consideration. This should leave no other cores to consider and result in
a jump to "out_all_pinned", leaving the "group_imbalance" flag to be seen
by future runs of the algorithm on core 3.

As the code stands currently, core 0 evaluates all cores instead of only
those in the scheduling domain under consideration. It incorrectly sees
other cores in the system and attempts a redo. With core 1 removed from
the mask, the redo identifies no load imbalance and results in a jump to
"out_balanced", clearing the "group_imbalance" flag. (A small userspace
model of just this mask check is appended at the end of this mail.) We
propose correcting this logic with patch 1 of this series.

Making this correction revealed an issue in the calculate_imbalance()
path. Patch 2 removes a branch that does not make sense with the current
load_balance() algorithm: there is no scenario in which it benefits the
"group_imbalance" case in calculate_imbalance(), and it causes problems in
systems with affined workloads and many idle or lightly loaded CPUs.

Co-authored-by: Austin Christ
Signed-off-by: Jeffrey Hugo

[V4]
 -Restricted the active cpus mask to the domain span
 -Fixed the dst_cpu masking logic to work for the entire local group

[V3]
 -Fixed botched subject lines

[V2]
 -Corrected use of Signed-off-by tags
 -Removed temp cpumask variable from stack

Jeffrey Hugo (2):
  sched/fair: Fix load_balance() affinity redo path
  sched/fair: Remove group imbalance from calculate_imbalance()

 kernel/sched/fair.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
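Appendix: a tiny self-contained userspace model of just the redo mask
check discussed above, for the 3x2 topology in the example. The bitmask
helpers and values below are ours, not kernel code, and the "proposed"
condition only sketches the direction of patch 1 (restrict the mask to the
domain span and treat a mask that has collapsed to the local group as
having nothing left to pull from):

	#include <stdio.h>

	/* One bit per CPU: bit n == CPU n.  Six CPUs, 0-5. */
	typedef unsigned int cpumask_t;

	static int mask_empty(cpumask_t m)               { return m == 0; }
	static int mask_subset(cpumask_t a, cpumask_t b) { return (a & ~b) == 0; }

	int main(void)
	{
		cpumask_t cpu_active_mask = 0x3f; /* CPUs 0-5 active              */
		cpumask_t domain_span     = 0x03; /* core 0's local domain: {0 1} */
		cpumask_t dst_grpmask     = 0x01; /* local (destination) group    */
		cpumask_t cpus;

		/*
		 * Current code: the candidate mask starts as every active CPU,
		 * and core 1 (all of its tasks pinned away from core 0) is then
		 * dropped from it.
		 */
		cpus = cpu_active_mask & ~(1u << 1);
		printf("current check:  redo = %s\n",
		       !mask_empty(cpus) ? "yes (cores 2-5 counted, flag gets cleared)"
					 : "no");

		/*
		 * With the mask restricted to the domain span, removing core 1
		 * leaves only the destination group, so there is nothing left
		 * to pull from and we should go to out_all_pinned instead.
		 */
		cpus = (domain_span & cpu_active_mask) & ~(1u << 1);
		printf("proposed check: redo = %s\n",
		       !mask_subset(cpus, dst_grpmask) ? "yes"
						       : "no (out_all_pinned)");

		return 0;
	}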