Subject: Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance
From: Parth Shah
To: Vincent Guittot, linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com
Date: Wed, 16 Oct 2019 12:51:17 +0530
Message-Id: <66c8f739-aecd-1e1c-0571-047ce7bafcc7@linux.ibm.com>
In-Reply-To: <1568878421-12301-1-git-send-email-vincent.guittot@linaro.org>
References: <1568878421-12301-1-git-send-email-vincent.guittot@linaro.org>

On 9/19/19 1:03 PM, Vincent Guittot wrote:
> Several wrong task placements have been raised with the current load
> balance algorithm, but their fixes are not always straightforward and
> often end up using biased values to force migrations. A cleanup and
> rework of the load balance will help handle such use cases and make it
> possible to fine-tune the behavior of the scheduler for other cases.
>
> Patch 1 has already been sent separately; it only consolidates the asym
> policy in one place and helps the review of the changes in load_balance.
>
> Patch 2 renames the sum of h_nr_running in stats.
>
> Patch 3 removes a meaningless imbalance computation to make the review
> of patch 4 easier.
>
> Patch 4 reworks the load_balance algorithm and fixes some wrong task
> placements while trying to stay conservative.
>
> Patch 5 adds the sum of nr_running to monitor non-CFS tasks and takes
> that into account when pulling tasks.
>
> Patch 6 replaces runnable_load with load now that the signal is only
> used when overloaded.
>
> Patch 7 improves the spread of tasks at the first scheduling level.
>
> Patch 8 uses utilization instead of load in all steps of the misfit
> task path.
>
> Patch 9 replaces runnable_load_avg with load_avg in the wake-up path.
>
> Patch 10 optimizes find_idlest_group(), which was using both
> runnable_load and load. This has not been squashed with the previous
> patch to ease the review.
>
> Some benchmark results, based on 8 iterations of each test:
> - small arm64 dual quad-core system
>
>                tip/sched/core       w/ this patchset     improvement
> schedpipe      54981  +/-0.36%      55459  +/-0.31%      (+0.97%)
>
> hackbench
> 1 groups       0.906  +/-2.34%      0.906  +/-2.88%      (+0.06%)
>
> - large arm64 2 nodes / 224 cores system
>
>                tip/sched/core       w/ this patchset     improvement
> schedpipe      125323 +/-0.98%      125624 +/-0.71%      (+0.24%)
>
> hackbench -l (256000/#grp) -g #grp
> 1 groups       15.360 +/-1.76%      14.206 +/-1.40%      (+8.69%)
> 4 groups       5.822  +/-1.02%      5.508  +/-6.45%      (+5.38%)
> 16 groups      3.103  +/-0.80%      3.244  +/-0.77%      (-4.52%)
> 32 groups      2.892  +/-1.23%      2.850  +/-1.81%      (+1.47%)
> 64 groups      2.825  +/-1.51%      2.725  +/-1.51%      (+3.54%)
> 128 groups     3.149  +/-8.46%      3.053  +/-13.15%     (+3.06%)
> 256 groups     3.511  +/-8.49%      3.019  +/-1.71%      (+14.03%)
>
> dbench
> 1 groups       329.677  +/-0.46%    329.771  +/-0.11%    (+0.03%)
> 4 groups       931.499  +/-0.79%    947.118  +/-0.94%    (+1.68%)
> 16 groups      1924.210 +/-0.89%    1947.849 +/-0.76%    (+1.23%)
> 32 groups      2350.646 +/-5.75%    2351.549 +/-6.33%    (+0.04%)
> 64 groups      2201.524 +/-3.35%    2192.749 +/-5.84%    (-0.40%)
> 128 groups     2206.858 +/-2.50%    2376.265 +/-7.44%    (+7.68%)
> 256 groups     1263.520 +/-3.34%    1633.143 +/-13.02%   (+29.25%)
>
> tip/sched/core sha1:
>   0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')
> [...]

I am quietly impressed with this patch series, as it makes it easy to
understand the behavior of the load balancer just by looking at the code.

I have tested v3 on an IBM POWER9 system with the following configuration:
- CPU(s): 176
- Thread(s) per core: 4
- Core(s) per socket: 22
- Socket(s): 2
- Model name: POWER9, altivec supported
- NUMA node0 CPU(s): 0-87
- NUMA node8 CPU(s): 88-175

I see results on par with the baseline (tip/sched/core) in most of my
tests.
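For reference, the hackbench numbers below were collected along these
lines (a rough sketch only: the -l/-g formula and the 8 iterations per
data point come from the quoted cover letter, while the loop structure
and the output handling are my own; hackbench here is the rt-tests
variant):

    # sweep the same group counts as in the table below;
    # each run prints a "Time:" line, which is averaged afterwards
    for grp in 1 4 8 16 32 64 128 256 512; do
        for iter in $(seq 1 8); do
            hackbench -l $((256000 / grp)) -g "$grp"
        done
    done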
hackbench
=========
hackbench -l (256000/#grp) -g #grp (lower is better):

+--------+--------------------+-------------------+----------------------+
| groups | w/ patches         | Baseline          | Performance gain (%) |
+--------+--------------------+-------------------+----------------------+
| 1      | 14.948 (+/- 0.10)  | 15.13  (+/- 0.47) | +1.20                |
| 4      | 5.938  (+/- 0.034) | 6.085  (+/- 0.07) | +2.4                 |
| 8      | 6.594  (+/- 0.072) | 6.223  (+/- 0.03) | -5.9                 |
| 16     | 5.916  (+/- 0.05)  | 5.559  (+/- 0.00) | -6.4                 |
| 32     | 5.288  (+/- 0.034) | 5.23   (+/- 0.01) | -1.1                 |
| 64     | 5.147  (+/- 0.036) | 5.193  (+/- 0.09) | +0.8                 |
| 128    | 5.368  (+/- 0.0245)| 5.446  (+/- 0.04) | +1.4                 |
| 256    | 5.637  (+/- 0.088) | 5.596  (+/- 0.07) | -0.7                 |
| 512    | 5.78   (+/- 0.0637)| 5.934  (+/- 0.06) | +2.5                 |
+--------+--------------------+-------------------+----------------------+

dbench
======
dbench (throughput, higher is better):

+--------+--------------------+---------------------+----------+
| groups | w/ patches         | baseline            | gain (%) |
+--------+--------------------+---------------------+----------+
| 1      | 12.6419 (+/- 0.58) | 12.6511 (+/- 0.277) | -0.00    |
| 4      | 23.7712 (+/- 2.22) | 21.8526 (+/- 0.844) | +8.7     |
| 8      | 40.1333 (+/- 0.85) | 37.0623 (+/- 3.283) | +8.2     |
| 16     | 60.5529 (+/- 2.35) | 60.0972 (+/- 9.655) | +0.7     |
| 32     | 98.2194 (+/- 1.69) | 87.6701 (+/- 10.72) | +12.0    |
| 64     | 150.733 (+/- 9.91) | 109.782 (+/- 0.503) | +37.3    |
| 128    | 173.443 (+/- 22.4) | 130.006 (+/- 21.84) | +33.4    |
| 256    | 121.011 (+/- 15.2) | 120.603 (+/- 11.82) | +0.3     |
| 512    | 10.9889 (+/- 0.39) | 12.5518 (+/- 1.030) | -12      |
+--------+--------------------+---------------------+----------+

I am happy with the results, as the patch set turns out to be beneficial
in most cases. Still, I will keep testing different scenarios and
workloads.

BTW, do you have any specific test case that might show different
behavior on SMT-4/8 systems with this patch set?

Thanks,
Parth