Date: Fri, 13 Jan 2023 13:49:51 +0000
From: Kajetan Puchalski <kajetan.puchalski@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@kernel.org, peterz@infradead.org, dietmar.eggemann@arm.com,
    qyousef@layalina.io, rafael@kernel.org, viresh.kumar@linaro.org,
    vschneid@redhat.com, linux-pm@vger.kernel.org,
    linux-kernel@vger.kernel.org, lukasz.luba@arm.com, wvw@google.com,
    xuewen.yan94@gmail.com, han.lin@mediatek.com,
    Jonathan.JMChen@mediatek.com, kajetan.puchalski@arm.com
Subject: Re: [PATCH v2] sched/fair: unlink misfit task from cpu overutilized
References: <20221228165415.3436-1-vincent.guittot@linaro.org>
In-Reply-To: <20221228165415.3436-1-vincent.guittot@linaro.org>

Hi,

> By taking into account uclamp_min, the 1:1 relation between task misfit
> and cpu overutilized is no longer true, as a task with a small util_avg
> may not fit a high capacity cpu because of the uclamp_min constraint.
>
> Add a new state in util_fits_cpu() to reflect the case where a task would
> fit a CPU except for the uclamp_min hint, which is a performance
> requirement.
>
> Use -1 to reflect that a CPU doesn't fit only because of uclamp_min, so
> we can use this new value to take additional action to select the best
> CPU that doesn't match the uclamp_min hint.
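Just to make sure I'm reading the description right, the new
util_fits_cpu() contract amounts to roughly the following. This is a
standalone sketch of my understanding with a hypothetical name, not the
actual patch; the real function also applies capacity margins and
handles more of the uclamp corner cases:

/*
 * Return values (my reading of the description above):
 *    1: the task fits the CPU
 *    0: the task does not fit the CPU
 *   -1: the task would fit, except the CPU cannot honour its
 *      uclamp_min performance hint
 */
static inline int util_fits_cpu_sketch(unsigned long util,
				       unsigned long uclamp_min,
				       unsigned long uclamp_max,
				       unsigned long capacity)
{
	/* utilization is capped by the uclamp_max request */
	if (util > uclamp_max)
		util = uclamp_max;

	if (util > capacity)
		return 0;

	/* fits on utilization alone, but not the performance request */
	if (uclamp_min > capacity)
		return -1;

	return 1;
}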
I just wanted to flag some issues I noticed with this patch and the
topic as a whole. I was testing this on a Pixel 6 running a 5.18
android-mainline kernel with all the relevant uclamp and CFS scheduling
patches backported to it from mainline.

From what I can see, the 'uclamp fits capacity' patchset introduced some
alarming power usage & performance issues that this patch makes even
worse.

The patch stack for the following tables is as follows:

(ufc_patched)
sched/fair: unlink misfit task from cpu overutilized
sched/uclamp: Fix a uninitialized variable warnings
(baseline_ufc)
sched/fair: Check if prev_cpu has highest spare cap in feec()
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
sched/uclamp: Fix fits_capacity() check in feec()
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
sched/uclamp: Fix relationship between uclamp and migration margin
(previous 'baseline' was here)

I omitted the three patches relating directly to capacity inversion, but
the other tests I ran with them applied showed similar issues. It's
probably easier to consider the uclamp parts and their effects in
isolation.
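One note on reading the tables below: the perc_diff column is always
computed against the corresponding baseline row, i.e.

    perc_diff = (value - baseline_value) / baseline_value * 100

e.g. for the patched multicore score:
(2443.2 - 2765.4) / 2765.4 * 100 ~= -11.65%.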
1. Geekbench 5 (performance regression)

+-----------------+--------------+--------+-----------+
| metric          | kernel       | value  | perc_diff |
+-----------------+--------------+--------+-----------+
| multicore_score | baseline     | 2765.4 | 0.0%      |
| multicore_score | baseline_ufc | 2704.3 | -2.21%    | <-- a noticeable score decrease already
| multicore_score | ufc_patched  | 2443.2 | -11.65%   | <-- a massive score decrease
+-----------------+--------------+--------+-----------+

+-------------+--------+--------------+--------+-----------+
| chan_name   | metric | kernel       | value  | perc_diff |
+-------------+--------+--------------+--------+-----------+
| total_power | gmean  | baseline     | 2664.0 | 0.0%      |
| total_power | gmean  | baseline_ufc | 2621.5 | -1.6%     | <-- worse performance per watt
| total_power | gmean  | ufc_patched  | 2601.2 | -2.36%    | <-- much worse performance per watt
+-------------+--------+--------------+--------+-----------+

The most likely cause of the regression seen above is the decrease in
the amount of time spent overutilized with these patches. Maximising
overutilization is the desired outcome for GB5, as for almost its entire
duration the benchmark keeps either 1 core or all the cores completely
saturated, so EAS cannot be effective. These patches have the opposite
of the desired effect in this area.

Time spent overutilized while running the benchmark:

+--------------+---------+------------+------------+
| kernel       | time    | total_time | percentage |
+--------------+---------+------------+------------+
| baseline     | 121.979 | 181.065    | 67.46      |
| baseline_ufc | 120.355 | 184.255    | 65.32      |
| ufc_patched  | 60.715  | 196.135    | 30.98      | <-- !!!
+--------------+---------+------------+------------+

2. Jankbench (power usage regression)

+--------+---------------+-------------------+-------+-----------+
| metric | variable      | kernel            | value | perc_diff |
+--------+---------------+-------------------+-------+-----------+
| gmean  | mean_duration | baseline_60hz     | 14.6  | 0.0%      |
| gmean  | mean_duration | baseline_ufc_60hz | 15.2  | 3.83%     |
| gmean  | mean_duration | ufc_patched_60hz  | 14.0  | -4.12%    |
+--------+---------------+-------------------+-------+-----------+

+--------+-----------+-------------------+-------+-----------+
| metric | variable  | kernel            | value | perc_diff |
+--------+-----------+-------------------+-------+-----------+
| gmean  | jank_perc | baseline_60hz     | 1.9   | 0.0%      |
| gmean  | jank_perc | baseline_ufc_60hz | 2.2   | 15.39%    |
| gmean  | jank_perc | ufc_patched_60hz  | 2.0   | 3.61%     |
+--------+-----------+-------------------+-------+-----------+

+-------------+--------+-------------------+-------+-----------+
| chan_name   | metric | kernel            | value | perc_diff |
+-------------+--------+-------------------+-------+-----------+
| total_power | gmean  | baseline_60hz     | 135.9 | 0.0%      |
| total_power | gmean  | baseline_ufc_60hz | 155.7 | 14.61%    | <-- !!!
| total_power | gmean  | ufc_patched_60hz  | 157.1 | 15.63%    | <-- !!!
+-------------+--------+-------------------+-------+-----------+

With these patches, running Jankbench uses ~15% more power just to
achieve roughly the same results. I'm not sure exactly where this issue
comes from, but all the results above are very consistent across
different runs.
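As a closing aside, for anyone not familiar with why the overutilized
residency numbers above matter so much: EAS wake-up placement is only
attempted while the system is not overutilized, and a CPU counts as
overutilized once its utilization exceeds roughly 80% of its capacity
(mainline's fits_capacity() uses a 1280/1024 margin). A standalone toy
model of that gate, with hypothetical names, not the kernel code:

#include <stdbool.h>
#include <stddef.h>

/* A CPU is overutilized when util exceeds ~80% of its capacity. */
static bool cpu_overutilized_sketch(unsigned long util,
				    unsigned long capacity)
{
	return util * 1280 >= capacity * 1024;
}

/*
 * EAS placement is attempted only while no CPU is overutilized; once
 * any CPU trips the check above, wake-ups fall back to the regular
 * capacity-spreading path.
 */
static bool eas_active_sketch(const unsigned long *util,
			      const unsigned long *capacity,
			      size_t nr_cpus)
{
	for (size_t i = 0; i < nr_cpus; i++) {
		if (cpu_overutilized_sketch(util[i], capacity[i]))
			return false;
	}
	return true;
}

This is why the drop from ~67% to ~31% overutilized residency in the GB5
table cuts so deep: it keeps EAS active during a saturating benchmark
where spreading across all the CPUs is what you actually want.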