2024-03-04 20:18:00

by Christian Loehle

Subject: [RFC PATCH 0/2] Introduce per-task io utilization boost

There is a feature inside of both schedutil and intel_pstate called
iowait boosting which tries to prevent selecting a low frequency
during IO workloads when it impacts throughput.
The feature is implemented by checking for task wakeups that have
the in_iowait flag set and boosting the CPU of the rq accordingly
(implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
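
For reference, the mainline trigger sits in the CFS wakeup path, while the
boost value itself lives in sugov (simplified; see also the code removed by
patch 2/2):

        /* kernel/sched/fair.c, enqueue_task_fair() */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

        /*
         * kernel/sched/cpufreq_schedutil.c, sugov_iowait_boost() (simplified):
         * each flagged update doubles the per-CPU boost, starting from
         * IOWAIT_BOOST_MIN up to SCHED_CAPACITY_SCALE, and sugov_iowait_apply()
         * halves it again whenever no new iowait wakeup was seen.
         */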

The necessity of the feature is argued with the potentially low
utilization of a task that is frequently in_iowait (i.e. most of the
time it is not enqueued on any rq and therefore cannot build up
utilization).

The RFC focuses on the schedutil implementation.
intel_pstate frequency selection isn't touched for now, suggestions are
very welcome.
Current schedutil iowait boosting has several issues:
1. Boosting happens even in scenarios where it doesn't improve
throughput. [1]
2. The boost is not accounted for in EAS: a) feec() will only consider
the actual utilization for task placement, but another CPU might be
more energy-efficient at that capacity than the boosted one.
b) When placing a non-IO task while a CPU is boosted, compute_energy()
will not consider the (potentially 'free') boosted capacity, but the
one it would have without the boost (since the boost is only applied
in sugov).
3. Actual IO heavy workloads are hardly distinguished from infrequent
in_iowait wakeups.
4. The boost isn't associated with a task, so it isn't considered
for task placement, potentially missing out on higher capacity CPUs on
heterogeneous CPU topologies.
5. The boost isn't associated with a task, so it lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task, so it needs to ramp
up again when the task migrates.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks might lead to boosting.

We attempt to mitigate all of the above by reworking the way the
iowait boosting (io boosting from here on) works in two major ways:
- Carry the boost in task_struct, so it is a per-task attribute that
behaves similarly to the task's utilization in some ways.
- Employ a counting-based tracking strategy that only boosts as long
as it sees benefits and returns to no boosting dynamically.
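
A stand-alone sketch of the counting strategy follows (illustrative only;
the helper name here is made up, the actual logic, field names and corner
cases are in dequeue_io_boost() of patch 1/2, constants as defined there):

        #define IO_BOOST_LEVELS         8
        #define IO_BOOST_IOWAITS_MIN    25      /* = IO_BOOST_INTERVAL_MSEC */
        #define IO_BOOST_IOWAITS_STEP   5

        /*
         * Evaluated once per elapsed interval with the number of iowait
         * wakeups ('ios') observed in it; returns the task's new boost level.
         */
        static unsigned int io_boost_update_level(unsigned int level,
                                                  unsigned int ios,
                                                  unsigned int threshold_down)
        {
                if (ios < IO_BOOST_IOWAITS_MIN)
                        return 0;               /* too few iowaits, drop the boost */
                if (ios > threshold_down + IO_BOOST_IOWAITS_STEP &&
                    level < IO_BOOST_LEVELS)
                        return level + 1;       /* boosting pays off, boost more */
                if (ios < threshold_down && level)
                        return level - 1;       /* no benefit seen, back off */
                return level;                   /* maintain the current boost */
        }

        /* The resulting boost in utilization units, max-aggregated per rq: */
        /* io_boost_util = level * IO_BOOST_UTIL_STEP */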

Note that some of the issues (1, 3) can be solved by using a
counting-based strategy on a per-rq basis, i.e. in sugov entirely.
Experiments with Android in particular showed that such a strategy
(which necessarily needs longer intervals to be reasonably stable)
is too prone to migrations to be useful generally.
We therefore consider the additional complexity of the proposed
per-task approach to be worth it.

We require a minimum of 1000 iowait wakeups per second to start
boosting.
This isn't too far off from what sugov currently does, since it resets
the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
We justify this by the small potential improvement that boosting
offers in the first place when there are only a few 'rare' iowait
wakeups.
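
With the constants from patch 1/2 this works out as IO_BOOST_IOWAITS_MIN = 25
wakeups per IO_BOOST_INTERVAL_MSEC = 25ms interval, i.e. 25 / 0.025s = 1000
iowait wakeups per second.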

When IO actually leads to a task being in iowait isn't straightforward
to explain.
Of course if the issued IO can be served by the page cache (e.g. on
reads because the pages are already cached, on writes because they can
be marked dirty and writeback takes care of them later) the actual
issuing task is usually not in iowait.
We consider this the good case, since whenever the scheduler and a
potential userspace / kernel switch is in the critical path for IO
there is possibly overhead impacting throughput.
We therefore focus on random read from here on, because (on synchronous
IO [3]) this will lead to the task being set in iowait for every IO.
This is where iowait boosting shows its biggest throughput improvement.
From here on IOPS (IO operations per second) and iowait wakeups may
therefore be used interchangeably.

Performance:
For reasonably long-lived workloads, random-read throughput aims to be
on par with the sugov implementation of iowait boosting.
See the following table for some results; values are in IOPS, the
tests are run for 30s with pauses in-between, and results are sorted.

nvme on rk3399
[3588, 3590, 3597, 3632, 3745] sugov mainline
[3581, 3751, 3770, 3771, 3885] per-task tracking
[2592, 2639, 2701, 2717, 2784] sugov no iowait boost
[3218, 3451, 3598, 3848, 3921] performance governor

emmc with cqe on rk3399
[4146, 4155, 4159, 4161, 4193] sugov mainline
[2848, 3217, 4375, 4380, 4454] per-task tracking
[2510, 2665, 3093, 3101, 3105] sugov no iowait boost
[4690, 4803, 4860, 4976, 5069] performance governor

sd card on rk3399
[1777, 1780, 1806, 1827, 1850] sugov mainline
[1470, 1476, 1507, 1534, 1586] per-task tracking
[1356, 1372, 1373, 1377, 1416] sugov no iowait boost
[1861, 1890, 1901, 1905, 1908] performance governor

Pixel 6 ufs Android 14 (7 runs because the device showed some variance)
[6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
[7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
[2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
[7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor

Apple M1 apple-nvme
[27421, 28331, 28515, 28699, 29529] sugov mainline
[27274, 27344, 27345, 27384, 27930] per-task tracking
[14480, 14512, 14625, 14872, 14967] sugov no iowait boost
[31595, 32085, 32386, 32465, 32643] performance governor

Showcasing some different IO scenarios, again all random read,
median out of 5 runs, all on rk3399 with NVMe.
e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
obtained using:
fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8

+---------------+----------------+-------------------+----------------+-------------+-----------+
| | Sugov mainline | Per-task tracking | Sugov no boost | Performance | Powersave |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx1 | 4073 | 3793 | 2979 | 4190 | 2788 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx4 | 13921 | 13503 | 10635 | 13931 | 10225 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx6 | 18473 | 17866 | 15902 | 19080 | 15789 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx8 | 22498 | 21242 | 19867 | 22650 | 18837 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx10 | 24801 | 23552 | 23658 | 25096 | 21474 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| psyncx12 | 26743 | 25377 | 26372 | 26663 | 23613 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio1x1 | 4054 | 3542 | 2776 | 4055 | 2780 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio1x128 | 3959 | 3516 | 2758 | 3590 | 2560 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio4x128 | 13451 | 12517 | 10313 | 13403 | 9994 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio6x1 | 18394 | 17432 | 15340 | 18954 | 15251 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio6x4 | 18329 | 17100 | 15238 | 18623 | 15270 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| libaio6x128 | 18066 | 16964 | 15139 | 18577 | 15192 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring1x1 | 4043 | 3548 | 2810 | 4039 | 2689 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring4x64 | 35790 | 32814 | 35983 | 34934 | 33254 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring1x128 | 32651 | 30427 | 32429 | 33232 | 9973 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring2x128 | 34928 | 32595 | 34922 | 33726 | 18790 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring4x128 | 34414 | 32173 | 34932 | 33332 | 33005 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring6x4 | 31578 | 29260 | 31714 | 31399 | 31784 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring6x128 | 34480 | 32634 | 34973 | 33390 | 36452 |
+---------------+----------------+-------------------+----------------+-------------+-----------+

Based on the above we can basically categorize these into the following
three:
a) boost is useful
b) boost irrelevant (util dominates)
c) boost is energy-inefficient (boost dominates)

The aim of patch 1/2 is to boost as much as necessary for a) while
boosting little for c) (thus saving energy).

Energy-savings:
Regarding sugov iowait boosting problem 1 mentioned earlier,
some improvement can be seen:
Tested on rk3399 (LLLL)(bb) with an NVMe, 30s runtime
CPU0 perf domain spans 0-3 with 400MHz to 1400MHz
CPU4 perf domain spans 4-5 with 400MHz to 1800MHz

io_uring6x128:
Sugov iowait boost:
Average frequency for CPU0 : 1.180 GHz
Average frequency for CPU4 : 1.504 GHz
Per-task tracking:
Average frequency for CPU0 : 1.070 GHz
Average frequency for CPU4 : 1.211 GHz

io_uring12x128:
Sugov iowait boost:
Average frequency for CPU0 : 1.324 GHz
Average frequency for CPU4 : 1.444 GHz
Per-task tracking:
Average frequency for CPU0 : 1.260 GHz
Average frequency for CPU4 : 1.062 GHz
(In both cases actually 400MHz on both perf domains is optimal, more
fine-tuning could get us closer [2])

[1]
There are many scenarios in which it doesn't, so let's start by
explaining when it does:
Boosting improves throughput if there is frequent IO to a device from
one or few origins, such that the device is likely idle while the task
is enqueued on the rq, and reducing this time cuts down on the storage
device's idle time.
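
As an illustration with made-up numbers: if the device needs 100us to serve
a 4k random read and the CPU-side submit/wakeup path adds another 50us at a
low OPP, one synchronous IO completes every 150us (~6.7k IOPS); cutting the
CPU part to 20us by boosting yields ~8.3k IOPS, even though the device itself
got no faster.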
This might not be true (and boosting doesn't help) if:
- The storage device uses the idle time to actually commit the IO to
persistent storage or do other management activity (this can be
observed with e.g. writes to flash-based storage, which will usually
write to cache and flush the cache when idle or necessary).
- The device is under thermal pressure and needs idle time to cool off
(not uncommon for e.g. nvme devices).
Furthermore the assumption (the device being idle while the task is
enqueued) is false altogether if:
- Other tasks use the same storage device.
- The task uses asynchronous IO with iodepth > 1, like io_uring; the
in_iowait is then just to fill the queue on the host again.
- The task just sets in_iowait to signal it is waiting on IO so as not
to appear as system idle; it might not issue any IO at all (cf. the
various occurrences of in_iowait, io_mutex_lock and io_schedule*).

[3]
Unfortunately iowait may be set even for asynchronous IO; in the case
of io_uring this is specifically so that the iowait boost triggers, see
commit 8a796565cec3 ("io_uring: Use io_schedule* in cqring wait").
This is why the energy savings are so significant here, as the io_uring
load on the CPU is minimal.

Problems encountered:
- Higher cap is not always beneficial, we might place the task away
from the CPU where the interrupt handler is running, making it run
on an unboosted CPU which may have a bigger impact than the difference
between the CPU's capacity the task moved to. (Of course the boost will
then be reverted again, but a ping-pong every interval is possible).
- [2] Tracking and scaling can be improved (io_uring12x128 still shows
boosting): unfortunately tracking purely per-task has some limits.
One task might show more iowaits per second when boosted even though
overall throughput doesn't increase, so some boost remains.
The per-task throughput improvement is somewhat limited though, so
fine-tuning the thresholds could mitigate this.

Christian Loehle (2):
sched/fair: Introduce per-task io util boost
cpufreq/schedutil: Remove iowait boost

include/linux/sched.h | 15 +++
kernel/sched/cpufreq_schedutil.c | 150 ++--------------------------
kernel/sched/fair.c | 165 +++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +-
4 files changed, 182 insertions(+), 152 deletions(-)

--
2.34.1



2024-03-04 20:18:04

by Christian Loehle

Subject: [RFC PATCH 1/2] sched/fair: Introduce per-task io util boost

Implement an io boost utilization enhancement that is tracked for each
task_struct. Tasks that frequently wake up from in_iowait will have
an io_boost associated with them, which counts iowait wakeups and only
boosts when it seems to improve the per-task throughput.

The patch is intended to replace the current iowait boosting strategy,
implemented in both schedutil and intel_pstate, which boosts the CPU for
iowait wakeups on the rq.
The primary benefits are:
1. EAS can take the io boost into account.
2. Boosting is limited when it doesn't seem to improve throughput.
3. The io boost is carried with the task when it migrates.

This is implemented by observing the iowait wakeups over an interval.
The boost is divided into 8 levels. If the task achieves the
required number of iowait wakeups per interval its boost level is
increased.
To reflect that we can't expect the number of iowait wakeups to
increase linearly with the applied boost (the time the task spends in
iowait isn't decreased by boosting), we scale the intervals.
Intervals for the lower boost levels are shorter, which also allows
for a faster ramp-up.
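
Concretely, with the values used below: going from level 0 to 1 requires
roughly 5 additional iowait wakeups within a 25ms interval (+200 per second),
while going from level 7 to 8 requires 5 additional wakeups within a 200ms
interval (+25 per second); see the comment in io_boost_interval_nsec().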

If multiple tasks are io-boosted their boost will be max-aggregated
per rq. The energy calculations of EAS have been adapted to reflect
this.

Signed-off-by: Christian Loehle <[email protected]>
---
include/linux/sched.h | 15 +++
kernel/sched/cpufreq_schedutil.c | 6 ++
kernel/sched/fair.c | 165 +++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +-
4 files changed, 181 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..4e0dfa6fbd65 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1547,6 +1547,21 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif

+ /* IO boost tracking */
+ u64 io_boost_timeout;
+ u64 io_boost_interval_start;
+#define IO_BOOST_INTERVAL_MSEC 25
+/* Require 1000 iowait wakeups per second to start the boosting */
+#define IO_BOOST_IOWAITS_MIN (IO_BOOST_INTERVAL_MSEC)
+#define IO_BOOST_LEVELS 8
+/* The util boost given to the task per io boost level, account for headroom */
+#define IO_BOOST_UTIL_STEP ((unsigned long)((SCHED_CAPACITY_SCALE / 1.25) / IO_BOOST_LEVELS))
+#define IO_BOOST_IOWAITS_STEP 5
+ /* Minimum number of iowaits per interval to maintain current boost */
+ unsigned int io_boost_threshold_down;
+ unsigned int io_boost_level;
+ unsigned int io_boost_curr_ios;
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..cd0ca3cbd212 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -198,7 +198,13 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long io_boost = cpu_util_io_boost(sg_cpu->cpu);

+ /*
+ * XXX: This already includes io boost now, makes little sense with
+ * sugov iowait boost on top
+ */
+ util = max(util, io_boost);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
sg_cpu->bw_min = min;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..b983e4399c53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4959,6 +4959,11 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}

+static inline unsigned int io_boost_util(struct task_struct *p)
+{
+ return p->io_boost_level * IO_BOOST_UTIL_STEP;
+}
+
static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_min,
unsigned long uclamp_max,
@@ -6695,6 +6700,137 @@ static int sched_idle_cpu(int cpu)
}
#endif

+static unsigned long io_boost_rq(struct cfs_rq *cfs_rq)
+{
+ int i;
+
+ for (i = IO_BOOST_LEVELS; i > 0; i--)
+ if (atomic_read(&cfs_rq->io_boost_tasks[i - 1]))
+ return i * IO_BOOST_UTIL_STEP;
+ return 0;
+}
+
+static inline unsigned long io_boost_interval_nsec(unsigned int io_boost_level)
+{
+ /*
+ * We require 5 iowaits per interval increase to consider the boost
+ * worth having, that leads to:
+ * level 0->1: 25ms -> 200 iowaits per second increase
+ * level 1->2: 50ms -> 125 iowaits per second increase
+ * level 2->3: 75ms -> 66 iowaits per second increase
+ * level 3->4: 100ms -> 50 iowaits per second increase
+ * level 4->5: 125ms -> 40 iowaits per second increase
+ * level 5->6: 150ms -> 33 iowaits per second increase
+ * level 6->7: 175ms -> 28 iowaits per second increase
+ * level 7->8: 200ms -> 25 iowaits per second increase
+ * => level 8 can be maintained with >=1567 iowaits per second.
+ */
+ return (io_boost_level + 1) * IO_BOOST_INTERVAL_MSEC * NSEC_PER_MSEC;
+}
+
+static inline void io_boost_scale_interval(struct task_struct *p, bool inc)
+{
+ unsigned int level = p->io_boost_level + (inc ? 1 : -1);
+
+ p->io_boost_level = level;
+ /* We change interval length, scale iowaits per interval accordingly. */
+ if (inc)
+ p->io_boost_threshold_down = (p->io_boost_curr_ios *
+ (level + 1) / level) + IO_BOOST_IOWAITS_STEP;
+ else
+ p->io_boost_threshold_down = (p->io_boost_curr_ios *
+ level / (level + 1)) - IO_BOOST_IOWAITS_STEP;
+}
+
+static void enqueue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+ u64 now = sched_clock();
+
+ /* Only what's necessary here because this is the critical path */
+ if (now > p->io_boost_timeout) {
+ /* Last iowait took too long, reset boost */
+ p->io_boost_interval_start = 0;
+ p->io_boost_level = 0;
+ }
+ if (p->io_boost_level)
+ atomic_inc(&cfs_rq->io_boost_tasks[p->io_boost_level - 1]);
+}
+
+static inline void io_boost_start_interval(struct task_struct *p, u64 now)
+{
+ p->io_boost_interval_start = now;
+ p->io_boost_curr_ios = 1;
+}
+
+static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+ u64 now;
+
+ if (p->io_boost_level)
+ atomic_dec(&cfs_rq->io_boost_tasks[p->io_boost_level - 1]);
+
+ /*
+ * Doing all this at dequeue instead of at enqueue might seem wrong,
+ * but it really doesn't matter as the task won't be enqueued anywhere
+ * anyway. At enqueue we then only need to check if the in_iowait
+ * wasn't too long. We can then act as if the current in_iowait has
+ * already completed 'in time'.
+ * Doing all this at dequeue has a performance benefit as at this time
+ * the io is issued and we aren't in the io critical path.
+ */
+
+ if (!p->in_iowait) {
+ /* Even if no boost is active, we reset the interval */
+ p->io_boost_interval_start = 0;
+ p->io_boost_level = 0;
+ return;
+ }
+
+ /* The maximum in_iowait time we allow to continue boosting */
+ now = sched_clock();
+ p->io_boost_timeout = now + 10 * NSEC_PER_MSEC;
+
+ if (!p->io_boost_interval_start) {
+ io_boost_start_interval(p, now);
+ return;
+ }
+ p->io_boost_curr_ios++;
+
+ if (now < p->io_boost_interval_start +
+ io_boost_interval_nsec(p->io_boost_level))
+ return;
+
+ if (!p->io_boost_level) {
+ if (likely(p->io_boost_curr_ios < IO_BOOST_IOWAITS_MIN)) {
+ io_boost_start_interval(p, now);
+ return;
+ }
+ io_boost_scale_interval(p, true);
+ } else if (p->io_boost_curr_ios < IO_BOOST_IOWAITS_MIN) {
+ p->io_boost_level = 0;
+ } else if (p->io_boost_curr_ios > p->io_boost_threshold_down + IO_BOOST_IOWAITS_STEP) {
+ /* Increase boost */
+ if (p->io_boost_level < IO_BOOST_LEVELS)
+ io_boost_scale_interval(p, true);
+ else
+ p->io_boost_threshold_down =
+ p->io_boost_curr_ios - IO_BOOST_IOWAITS_STEP;
+ } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
+ /* Reduce boost */
+ if (p->io_boost_level > 1)
+ io_boost_scale_interval(p, false);
+ else
+ p->io_boost_level = 0;
+ } else if (p->io_boost_level == IO_BOOST_LEVELS) {
+ /* Allow for reducing boost on max when conditions changed. */
+ p->io_boost_threshold_down = max(p->io_boost_threshold_down,
+ p->io_boost_curr_ios - IO_BOOST_IOWAITS_STEP);
+ }
+ /* On maintaining boost we just start a new interval. */
+
+ io_boost_start_interval(p, now);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -6716,11 +6852,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);

- /*
- * If in_iowait is set, the code below may not trigger any cpufreq
- * utilization updates, so do it here explicitly with the IOWAIT flag
- * passed.
- */
+ if (p->in_iowait || p->io_boost_interval_start)
+ enqueue_io_boost(&rq->cfs, p);
+ /* Ensure new io boost can be applied. */
if (p->in_iowait)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

@@ -6804,6 +6938,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

util_est_dequeue(&rq->cfs, p);

+ dequeue_io_boost(&rq->cfs, p);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@@ -7429,11 +7565,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
int fits, best_fits = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
+ unsigned long io_boost = io_boost_util(p);

cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

task_util = task_util_est(p);
+ task_util = max(task_util, io_boost);
util_min = uclamp_eff_value(p, UCLAMP_MIN);
util_max = uclamp_eff_value(p, UCLAMP_MAX);

@@ -7501,7 +7639,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
- task_util = task_util_est(p);
+ task_util = max(task_util_est(p), io_boost_util(p));
util_min = uclamp_eff_value(p, UCLAMP_MIN);
util_max = uclamp_eff_value(p, UCLAMP_MAX);
}
@@ -7615,12 +7753,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;
}

+unsigned long cpu_util_io_boost(int cpu)
+{
+ return io_boost_rq(&cpu_rq(cpu)->cfs);
+}
+
/**
* cpu_util() - Estimates the amount of CPU capacity used by CFS tasks.
* @cpu: the CPU to get the utilization for
* @p: task for which the CPU utilization should be predicted or NULL
* @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
- * @boost: 1 to enable boosting, otherwise 0
+ * @boost: 1 to enable runnable boosting, otherwise 0
*
* The unit of the return value must be the same as the one of CPU capacity
* so that CPU utilization can be compared with CPU capacity.
@@ -7843,8 +7986,10 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+ unsigned long io_boost = max(io_boost_util(p), cpu_util_io_boost(cpu));
unsigned long eff_util, min, max;

+ util = max(util, io_boost);
/*
* Performance domain frequency: utilization clamping
* must be considered since it affects the selection
@@ -7970,7 +8115,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;

sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
+ if (!task_util_est(p) && p_util_min == 0 && io_boost_util(p) == 0)
goto unlock;

eenv_task_busy_time(&eenv, p, prev_cpu);
@@ -7983,6 +8128,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long cur_delta, base_energy;
int max_spare_cap_cpu = -1;
int fits, max_fits = -1;
+ unsigned long p_io_boost = io_boost_util(p);

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

@@ -7999,6 +8145,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

for_each_cpu(cpu, cpus) {
struct rq *rq = cpu_rq(cpu);
+ unsigned long io_boost;

eenv.pd_cap += cpu_thermal_cap;

@@ -8009,6 +8156,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
continue;

util = cpu_util(cpu, p, cpu, 0);
+ io_boost = max(p_io_boost, cpu_util_io_boost(cpu));
+ util = max(util, io_boost);
cpu_cap = capacity_of(cpu);

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..5f42b72b3cde 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -598,6 +598,8 @@ struct cfs_rq {
struct sched_entity *curr;
struct sched_entity *next;

+ atomic_t io_boost_tasks[IO_BOOST_LEVELS];
+
#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
#endif
@@ -3039,7 +3041,7 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
return READ_ONCE(rq->avg_dl.util_avg);
}

-
+extern unsigned long cpu_util_io_boost(int cpu);
extern unsigned long cpu_util_cfs(int cpu);
extern unsigned long cpu_util_cfs_boost(int cpu);

--
2.34.1


2024-03-04 20:18:47

by Christian Loehle

Subject: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

The previous commit provides a new cpu_util_cfs_boost_io interface for
schedutil which uses the io boosted utilization of the per-task
tracking strategy. Schedutil iowait boosting is therefore no longer
necessary so remove it.

Signed-off-by: Christian Loehle <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 152 +------------------------------
1 file changed, 5 insertions(+), 147 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index cd0ca3cbd212..ed9fc88a74fc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -6,8 +6,6 @@
* Author: Rafael J. Wysocki <[email protected]>
*/

-#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
-
struct sugov_tunables {
struct gov_attr_set attr_set;
unsigned int rate_limit_us;
@@ -42,10 +40,6 @@ struct sugov_cpu {
struct sugov_policy *sg_policy;
unsigned int cpu;

- bool iowait_boost_pending;
- unsigned int iowait_boost;
- u64 last_update;
-
unsigned long util;
unsigned long bw_min;

@@ -195,141 +189,17 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
return max(min, max);
}

-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
unsigned long io_boost = cpu_util_io_boost(sg_cpu->cpu);

- /*
- * XXX: This already includes io boost now, makes little sense with
- * sugov iowait boost on top
- */
util = max(util, io_boost);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
- util = max(util, boost);
sg_cpu->bw_min = min;
sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
}

-/**
- * sugov_iowait_reset() - Reset the IO boost status of a CPU.
- * @sg_cpu: the sugov data for the CPU to boost
- * @time: the update time from the caller
- * @set_iowait_boost: true if an IO boost has been requested
- *
- * The IO wait boost of a task is disabled after a tick since the last update
- * of a CPU. If a new IO wait boost is requested after more then a tick, then
- * we enable the boost starting from IOWAIT_BOOST_MIN, which improves energy
- * efficiency by ignoring sporadic wakeups from IO.
- */
-static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
- bool set_iowait_boost)
-{
- s64 delta_ns = time - sg_cpu->last_update;
-
- /* Reset boost only if a tick has elapsed since last request */
- if (delta_ns <= TICK_NSEC)
- return false;
-
- sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
- sg_cpu->iowait_boost_pending = set_iowait_boost;
-
- return true;
-}
-
-/**
- * sugov_iowait_boost() - Updates the IO boost status of a CPU.
- * @sg_cpu: the sugov data for the CPU to boost
- * @time: the update time from the caller
- * @flags: SCHED_CPUFREQ_IOWAIT if the task is waking up after an IO wait
- *
- * Each time a task wakes up after an IO operation, the CPU utilization can be
- * boosted to a certain utilization which doubles at each "frequent and
- * successive" wakeup from IO, ranging from IOWAIT_BOOST_MIN to the utilization
- * of the maximum OPP.
- *
- * To keep doubling, an IO boost has to be requested at least once per tick,
- * otherwise we restart from the utilization of the minimum OPP.
- */
-static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
- unsigned int flags)
-{
- bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
-
- /* Reset boost if the CPU appears to have been idle enough */
- if (sg_cpu->iowait_boost &&
- sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
- return;
-
- /* Boost only tasks waking up after IO */
- if (!set_iowait_boost)
- return;
-
- /* Ensure boost doubles only one time at each request */
- if (sg_cpu->iowait_boost_pending)
- return;
- sg_cpu->iowait_boost_pending = true;
-
- /* Double the boost at each request */
- if (sg_cpu->iowait_boost) {
- sg_cpu->iowait_boost =
- min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE);
- return;
- }
-
- /* First wakeup after IO: start with minimum boost */
- sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
-}
-
-/**
- * sugov_iowait_apply() - Apply the IO boost to a CPU.
- * @sg_cpu: the sugov data for the cpu to boost
- * @time: the update time from the caller
- * @max_cap: the max CPU capacity
- *
- * A CPU running a task which woken up after an IO operation can have its
- * utilization boosted to speed up the completion of those IO operations.
- * The IO boost value is increased each time a task wakes up from IO, in
- * sugov_iowait_apply(), and it's instead decreased by this function,
- * each time an increase has not been requested (!iowait_boost_pending).
- *
- * A CPU which also appears to have been idle for at least one tick has also
- * its IO boost utilization reset.
- *
- * This mechanism is designed to boost high frequently IO waiting tasks, while
- * being more conservative on tasks which does sporadic IO operations.
- */
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
- unsigned long max_cap)
-{
- /* No boost currently required */
- if (!sg_cpu->iowait_boost)
- return 0;
-
- /* Reset boost if the CPU appears to have been idle enough */
- if (sugov_iowait_reset(sg_cpu, time, false))
- return 0;
-
- if (!sg_cpu->iowait_boost_pending) {
- /*
- * No boost pending; reduce the boost value.
- */
- sg_cpu->iowait_boost >>= 1;
- if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
- sg_cpu->iowait_boost = 0;
- return 0;
- }
- }
-
- sg_cpu->iowait_boost_pending = false;
-
- /*
- * sg_cpu->util is already in capacity scale; convert iowait_boost
- * into the same scale so we can compare.
- */
- return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
-}
-
#ifdef CONFIG_NO_HZ_COMMON
static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
{
@@ -357,18 +227,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap,
unsigned int flags)
{
- unsigned long boost;
-
- sugov_iowait_boost(sg_cpu, time, flags);
- sg_cpu->last_update = time;
-
ignore_dl_rate_limit(sg_cpu);

if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false;

- boost = sugov_iowait_apply(sg_cpu, time, max_cap);
- sugov_get_util(sg_cpu, boost);
+ sugov_get_util(sg_cpu);

return true;
}
@@ -458,7 +322,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sg_cpu->sg_policy->last_freq_update_time = time;
}

-static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
{
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
@@ -469,11 +333,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
- unsigned long boost;
-
- boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
- sugov_get_util(j_sg_cpu, boost);

+ sugov_get_util(j_sg_cpu);
util = max(j_sg_cpu->util, util);
}

@@ -489,13 +350,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)

raw_spin_lock(&sg_policy->update_lock);

- sugov_iowait_boost(sg_cpu, time, flags);
- sg_cpu->last_update = time;
-
ignore_dl_rate_limit(sg_cpu);

if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq_shared(sg_cpu, time);
+ next_f = sugov_next_freq_shared(sg_cpu);

if (!sugov_update_next_freq(sg_policy, time, next_f))
goto unlock;
--
2.34.1


2024-03-05 00:20:22

by Bart Van Assche

Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 3/4/24 12:16, Christian Loehle wrote:
> Pixel 6 ufs Android 14 (7 runs because the device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor

Variance of performance results for Pixel devices can be reduced greatly
by disabling devfreq scaling, e.g. as follows (this may cause thermal
issues if the system load is high enough):

for d in $(adb shell echo /sys/class/devfreq/*); do
adb shell "cat $d/available_frequencies |
tr ' ' '\n' |
sort -n |
case $devfreq in
min) head -n1;;
max) tail -n1;;
esac > $d/min_freq"
done

> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8

So buffered I/O was used during this test? Shouldn't direct I/O be used
for this kind of tests (--buffered=0)? Additionally, which I/O scheduler
was configured? I recommend --ioscheduler=none for this kind of tests.

> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).

In the above I see "the interrupt handler". Does this mean that the NVMe
controller in the test setup only supports one completion interrupt for
all completion queues instead of one completion interrupt per completion
queue? There are already Android phones and developer boards available
that support the latter, namely the boards equipped with a UFSHCI 4.0
controller.

Thanks,

Bart.

2024-03-05 09:13:33

by Christian Loehle

Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

Hi Bart,

On 05/03/2024 00:20, Bart Van Assche wrote:
> On 3/4/24 12:16, Christian Loehle wrote:
>> Pixel 6 ufs Android 14 (7 runs because the device showed some variance)
>> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
>> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
>> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
>> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
>
> Variance of performance results for Pixel devices can be reduced greatly
> by disabling devfreq scaling, e.g. as follows (this may cause thermal
> issues if the system load is high enough):
>
>      for d in $(adb shell echo /sys/class/devfreq/*); do
>     adb shell "cat $d/available_frequencies |
>         tr ' ' '\n' |
>         sort -n |
>         case $devfreq in
>             min) head -n1;;
>             max) tail -n1;;
>         esac > $d/min_freq"
>     done
>

Thanks for the hint!

>> Showcasing some different IO scenarios, again all random read,
>> median out of 5 runs, all on rk3399 with NVMe.
>> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
>> obtained using:
>> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
>
> So buffered I/O was used during this test? Shouldn't direct I/O be used
> for this kind of tests (--buffered=0)? Additionally, which I/O scheduler
> was configured? I recommend --ioscheduler=none for this kind of tests.

Yes, I opted for buffered I/O; I guess it's the eternal question whether you
should benchmark the device/stack (O_DIRECT) or be more realistic to actual
use cases (probably the latter). I opted for the latter, but since it's 4K
randread on sufficiently large devices the results don't differ too much.


>> - Higher cap is not always beneficial, we might place the task away
>> from the CPU where the interrupt handler is running, making it run
>> on an unboosted CPU which may have a bigger impact than the difference
>> between the CPU's capacity the task moved to. (Of course the boost will
>> then be reverted again, but a ping-pong every interval is possible).
>
> In the above I see "the interrupt handler". Does this mean that the NVMe
> controller in the test setup only supports one completion interrupt for
> all completion queues instead of one completion interrupt per completion
> queue? There are already Android phones and developer boards available
> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.

No, both NVMe test setups have one completion interrupt per completion queue,
so this caveat doesn't affect them, higher capacity CPU is strictly better.
The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
interrupt (on CPU0 on my setup).
The difference between the CPU capacities on the Pixel6 is able to make up for this.
The big CPU is still the best to run these single-threaded fio benchmarks on in terms
of throughput.
FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
interrupt to a big CPU, too. Similarly for the mmc.
Unfortunately the infrastructure is far from being there for the scheduler to move the
interrupt to the same performance domain as the task, which is often optimal both in
terms of throughput and in terms of power.
I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
patch will of course be greatly increased.
Thanks!

Best Regards,
Christian

2024-03-05 18:36:16

by Bart Van Assche

Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 3/5/24 01:13, Christian Loehle wrote:
> On 05/03/2024 00:20, Bart Van Assche wrote:
>> On 3/4/24 12:16, Christian Loehle wrote:
>>> - Higher cap is not always beneficial, we might place the task away
>>> from the CPU where the interrupt handler is running, making it run
>>> on an unboosted CPU which may have a bigger impact than the difference
>>> between the CPU's capacity the task moved to. (Of course the boost will
>>> then be reverted again, but a ping-pong every interval is possible).
>>
>> In the above I see "the interrupt handler". Does this mean that the NVMe
>> controller in the test setup only supports one completion interrupt for
>> all completion queues instead of one completion interrupt per completion
>> queue? There are already Android phones and developer boards available
>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>
> No, both NVMe test setups have one completion interrupt per completion queue,
> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> interrupt (on CPU0 on my setup).

I think that measurements should be provided in the cover letter for the
two types of storage controllers: one series of measurements for a
storage controller with a single completion interrupt and a second
series of measurements for storage controllers with one completion
interrupt per CPU.

> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> interrupt to a big CPU, too. Similarly for the mmc.
> Unfortunately the infrastructure is far from being there for the scheduler to move the
> interrupt to the same performance domain as the task, which is often optimal both in
> terms of throughput and in terms of power.
> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> patch will of course be greatly increased.

I'm not sure whether making the completion interrupt follow the workload
is a good solution. I'm concerned that this would increase energy
consumption by keeping the big cores active longer than necessary. I
like this solution better (improves storage performance on at least
devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
Handle HMP systems when completing IO"
(https://lore.kernel.org/linux-block/[email protected]/).

Thanks,

Bart.


2024-03-06 10:49:41

by Christian Loehle

Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

Hi Bart,

On 05/03/2024 18:36, Bart Van Assche wrote:
> On 3/5/24 01:13, Christian Loehle wrote:
>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>> - Higher cap is not always beneficial, we might place the task away
>>>> from the CPU where the interrupt handler is running, making it run
>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>> then be reverted again, but a ping-pong every interval is possible).
>>>
>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>> controller in the test setup only supports one completion interrupt for
>>> all completion queues instead of one completion interrupt per completion
>>> queue? There are already Android phones and developer boards available
>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>
>> No, both NVMe test setups have one completion interrupt per completion queue,
>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>> interrupt (on CPU0 on my setup).
>
> I think that measurements should be provided in the cover letter for the
> two types of storage controllers: one series of measurements for a
> storage controller with a single completion interrupt and a second
> series of measurements for storage controllers with one completion
> interrupt per CPU.

Of the same type of storage controller? Or what is missing for you in
the cover letter exactly (ufs/emmc: single completion interrupt,
nvme: one completion interrupt per CPU).

>
>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>> interrupt to a big CPU, too. Similarly for the mmc.
>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>> interrupt to the same performance domain as the task, which is often optimal both in
>> terms of throughput and in terms of power.
>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>> patch will of course be greatly increased.
>
> I'm not sure whether making the completion interrupt follow the workload
> is a good solution. I'm concerned that this would increase energy
> consumption by keeping the big cores active longer than necessary. I
> like this solution better (improves storage performance on at least
> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> Handle HMP systems when completing IO"
> (https://lore.kernel.org/linux-block/[email protected]/).

That patch is good, don't get me wrong, but you still lose out by running everything
up to blk_mq_complete_request() on (potentially) a LITTLE (that might be running at a low OPP),
while having a big CPU available at a high OPP anyway ("for free").
It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):

Pretty numbers (IOPS):
Base irq@CPU0 median: 6969
Base irq@CPU6 median: 8407 (+20.6%)
Applied irq@CPU0 median: 7144 (+2.5%)
Applied irq@CPU6 median: 8288 (+18.9%)

This is with psyncx1 4K Random Read again, of course anything with queue depth
takes advantage of batch completions to significantly reduce irq pressure.

Not so pretty numbers and full list commands used:

w/o patch:
irq on CPU0 (default):
psyncx1: 7000 6969 7025 6954 6964
io_uring4x128: 28766 28280 28339 28310 28349
irq on CPU6:
psyncx1: 8342 8492 8355 8407 8532
io_uring4x128: 28641 28356 25908 25787 25853

with patch:
irq on CPU0:
psyncx1: 7672 7144 7301 6976 6889
io_uring4x128: 28266 26314 27648 24482 25301
irq on CPU6:
psyncx1: 8208 8401 8351 8221 8288
io_uring4x128: 25603 25438 25453 25514 25402


for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --minimal | awk -F ";" '{print $8}'; sleep 30; done

for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --ioengine=io_uring --iodepth=128 --numjobs=4 --group_reporting --minimal | awk -F ";" '{print $8}'; sleep 30; done

echo 6 > /proc/irq/296/smp_affinity_list


Kind Regards,
Christian

2024-03-18 14:09:22

by Rafael J. Wysocki

Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
<[email protected]> wrote:
>
> The previous commit provides a new cpu_util_cfs_boost_io interface for
> schedutil which uses the io boosted utilization of the per-task
> tracking strategy. Schedutil iowait boosting is therefore no longer
> necessary so remove it.

I'm wondering about the cases when schedutil is used without EAS.

Are they still going to be handled as before after this change?

2024-03-18 16:40:55

by Christian Loehle

Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> <[email protected]> wrote:
>>
>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>> schedutil which uses the io boosted utilization of the per-task
>> tracking strategy. Schedutil iowait boosting is therefore no longer
>> necessary so remove it.
>
> I'm wondering about the cases when schedutil is used without EAS.
>
> Are they still going to be handled as before after this change?

Well they should still get boosted (under the new conditions) and according
to my tests that does work. Anything in particular you're worried about?

So in terms of throughput I see similar results with EAS and CAS+sugov.
I'm happy including numbers in the cover letter for future versions, too.
So far my intuition was that nobody would care enough to include them
(as long as it generally still works).

Kind Regards,
Christian

2024-03-18 17:21:54

by Rafael J. Wysocki

Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
<[email protected]> wrote:
>
> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> > On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> > <[email protected]> wrote:
> >>
> >> The previous commit provides a new cpu_util_cfs_boost_io interface for
> >> schedutil which uses the io boosted utilization of the per-task
> >> tracking strategy. Schedutil iowait boosting is therefore no longer
> >> necessary so remove it.
> >
> > I'm wondering about the cases when schedutil is used without EAS.
> >
> > Are they still going to be handled as before after this change?
>
> Well they should still get boosted (under the new conditions) and according
> to my tests that does work.

OK

> Anything in particular you're worried about?

It is not particularly clear to me how exactly the boost is taken into
account without EAS.

> So in terms of throughput I see similar results with EAS and CAS+sugov.
> I'm happy including numbers in the cover letter for future versions, too.
> So far my intuition was that nobody would care enough to include them
> (as long as it generally still works).

Well, IMV clear understanding of the changes is more important.

2024-03-19 14:09:16

by Christian Loehle

Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 18/03/2024 17:08, Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
> <[email protected]> wrote:
>>
>> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
>>> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
>>> <[email protected]> wrote:
>>>>
>>>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>>>> schedutil which uses the io boosted utilization of the per-task
>>>> tracking strategy. Schedutil iowait boosting is therefore no longer
>>>> necessary so remove it.
>>>
>>> I'm wondering about the cases when schedutil is used without EAS.
>>>
>>> Are they still going to be handled as before after this change?
>>
>> Well they should still get boosted (under the new conditions) and according
>> to my tests that does work.
>
> OK
>
>> Anything in particular you're worried about?
>
> It is not particularly clear to me how exactly the boost is taken into
> account without EAS.

So here is a quick rundown for now; I'll try to include something along those
lines in future versions, too.
Every task_struct carries an io_boost_level in the range of [0..8] with it.
The boost is in units of utilization (w.r.t. SCHED_CAPACITY_SCALE, independent
of the CPU the task might currently be enqueued on).
The boost is taken into account for:
1. sugov frequency selection with
io_boost = cpu_util_io_boost(sg_cpu->cpu);
util = max(util, io_boost);

The io boost of all tasks enqueued on the rq will be max-aggregated with the
util here. (See cfs_rq->io_boost_tasks).

2. Task placement: for EAS in feec(), otherwise in select_idle_sibling() /
select_idle_capacity(), to ensure the CPU satisfies the requested io_boost
of the task to be enqueued.

Determining the io_boost_level is a bit more involved than with sugov's
implementation and happens in dequeue_io_boost(), hopefully that part
is reasonably understandable from the code.
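
To put rough numbers on it: IO_BOOST_UTIL_STEP is
(SCHED_CAPACITY_SCALE / 1.25) / IO_BOOST_LEVELS = (1024 / 1.25) / 8 ~= 102,
so level 1 corresponds to ~102 units of utilization and the maximum level 8
to ~816 (i.e. roughly SCHED_CAPACITY_SCALE / 1.25), which once the usual
~1.25 DVFS headroom is applied on top is enough to request roughly the
highest OPP.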

Hope that helps.

Kind Regards,
Christian


>
>> So in terms of throughput I see similar results with EAS and CAS+sugov.
>> I'm happy including numbers in the cover letter for future versions, too.
>> So far my intuition was that nobody would care enough to include them
>> (as long as it generally still works).
>
> Well, IMV clear understanding of the changes is more important.


2024-03-21 12:39:52

by Qais Yousef

Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

(Thanks for the CC Bart)

On 03/06/24 10:49, Christian Loehle wrote:
> Hi Bart,
>
> On 05/03/2024 18:36, Bart Van Assche wrote:
> > On 3/5/24 01:13, Christian Loehle wrote:
> >> On 05/03/2024 00:20, Bart Van Assche wrote:
> >>> On 3/4/24 12:16, Christian Loehle wrote:
> >>>> - Higher cap is not always beneficial, we might place the task away
> >>>> from the CPU where the interrupt handler is running, making it run
> >>>> on an unboosted CPU which may have a bigger impact than the difference
> >>>> between the CPU's capacity the task moved to. (Of course the boost will
> >>>> then be reverted again, but a ping-pong every interval is possible).
> >>>
> >>> In the above I see "the interrupt handler". Does this mean that the NVMe
> >>> controller in the test setup only supports one completion interrupt for
> >>> all completion queues instead of one completion interrupt per completion
> >>> queue? There are already Android phones and developer boards available
> >>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
> >>
> >> No, both NVMe test setups have one completion interrupt per completion queue,
> >> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> >> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> >> interrupt (on CPU0 on my setup).
> >
> > I think that measurements should be provided in the cover letter for the
> > two types of storage controllers: one series of measurements for a
> > storage controller with a single completion interrupt and a second
> > series of measurements for storage controllers with one completion
> > interrupt per CPU.
>
> Of the same type of storage controller? Or what is missing for you in
> the cover letter exactly (ufs/emmc: single completion interrupt,
> nvme: one completion interrupt per CPU).
>
> >
> >> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> >> interrupt to a big CPU, too. Similarly for the mmc.
> >> Unfortunately the infrastructure is far from being there for the scheduler to move the
> >> interrupt to the same performance domain as the task, which is often optimal both in
> >> terms of throughput and in terms of power.
> >> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> >> patch will of course be greatly increased.
> >
> > I'm not sure whether making the completion interrupt follow the workload
> > is a good solution. I'm concerned that this would increase energy
> > consumption by keeping the big cores active longer than necessary. I
> > like this solution better (improves storage performance on at least
> > devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> > Handle HMP systems when completing IO"
> > (https://lore.kernel.org/linux-block/[email protected]/).
>
> That patch is good, don't get me wrong, but you still lose out by running everything
> up to blk_mq_complete_request() on (potentially) a LITTLE (that might be running at a low OPP),
> while having a big CPU available at a high OPP anyway ("for free").
> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):

So you want the hardirq to move to the big core? Unlike softirq, there will be
a single hardirq for the controller (to my limited knowledge), so if there are
multiple requests I'm not sure we can easily match which one relates to which
before it triggers. So we can end up waking up the wrong core.

Generally this should be a userspace policy. If there's a scenario where the
throughput is that important they can easily move the hardirq to the big core
unconditionally and move it back again once this high throughput scenario is no
longer important.

Or where you describing a different problem?

Glad to see your series by the way :-) I'll get a chance to review it over the
weekend hopefully.


Cheers

--
Qais Yousef

>
> Pretty numbers (IOPS):
> Base irq@CPU0 median: 6969
> Base irq@CPU6 median: 8407 (+20.6%)
> Applied irq@CPU0 median: 7144 (+2.5%)
> Applied irq@CPU6 median: 8288 (+18.9%)
>
> This is with psyncx1 4K Random Read again, of course anything with queue depth
> takes advantage of batch completions to significantly reduce irq pressure.
>
> Not so pretty numbers and full list commands used:
>
> w/o patch:
> irq on CPU0 (default):
> psyncx1: 7000 6969 7025 6954 6964
> io_uring4x128: 28766 28280 28339 28310 28349
> irq on CPU6:
> psyncx1: 8342 8492 8355 8407 8532
> io_uring4x128: 28641 28356 25908 25787 25853
>
> with patch:
> irq on CPU0:
> psyncx1: 7672 7144 7301 6976 6889
> io_uring4x128: 28266 26314 27648 24482 25301
> irq on CPU6:
> psyncx1: 8208 8401 8351 8221 8288
> io_uring4x128: 25603 25438 25453 25514 25402
>
>
> for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --minimal | awk -F ";" '{print $8}'; sleep 30; done
>
> for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --ioengine=io_uring --iodepth=128 --numjobs=4 --group_reporting --minimal | awk -F ";" '{print $8}'; sleep 30; done
>
> echo 6 > /proc/irq/296/smp_affinity_list
>
>
> Kind Regards,
> Christian

2024-03-21 17:57:25

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 21/03/2024 12:39, Qais Yousef wrote:
[snip]
>> On 05/03/2024 18:36, Bart Van Assche wrote:
>>> On 3/5/24 01:13, Christian Loehle wrote:
>>>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>>>> - Higher cap is not always beneficial, we might place the task away
>>>>>> from the CPU where the interrupt handler is running, making it run
>>>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>>>> then be reverted again, but a ping-pong every interval is possible).
>>>>>
>>>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>>>> controller in the test setup only supports one completion interrupt for
>>>>> all completion queues instead of one completion interrupt per completion
>>>>> queue? There are already Android phones and developer boards available
>>>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>>>
>>>> No, both NVMe test setups have one completion interrupt per completion queue,
>>>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>>>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>>>> interrupt (on CPU0 on my setup).
>>>
>>> I think that measurements should be provided in the cover letter for the
>>> two types of storage controllers: one series of measurements for a
>>> storage controller with a single completion interrupt and a second
>>> series of measurements for storage controllers with one completion
>>> interrupt per CPU.
>>
>> Of the same type of storage controller? Or what is missing for you in
>> the cover letter exactly (ufs/emmc: single completion interrupt,
>> nvme: one completion interrupt per CPU).
>>
>>>
>>>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>>>> interrupt to a big CPU, too. Similarly for the mmc.
>>>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>>>> interrupt to the same performance domain as the task, which is often optimal both in
>>>> terms of throughput and in terms of power.
>>>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>>>> patch will of course be greatly increased.
>>>
>>> I'm not sure whether making the completion interrupt follow the workload
>>> is a good solution. I'm concerned that this would increase energy
>>> consumption by keeping the big cores active longer than necessary. I
>>> like this solution better (improves storage performance on at least
>>> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
>>> Handle HMP systems when completing IO"
>>> (https://lore.kernel.org/linux-block/[email protected]/).
>>
>> That patch is good, don't get me wrong, but you still lose out by running everything
>> up to blk_mq_complete_request() on (potentially) a LITTLE (that might be run on a low OPP),
>> while having a big CPU available at a high OPP anyway ("for free").
>> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
>> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
>> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):
>
> So you want the hardirq to move to the big core? Unlike softirq, there will be
> a single hardirq for the controller (to my limited knowledge), so if there are
> multiple requests I'm not sure we can easily match which one relates to which
> before it triggers. So we can end up waking up the wrong core.

It would be beneficial to move the hardirq to a big core if the IO task
is using it anyway.
I'm not sure I actually want to. There are quite a few pitfalls (like you
mentioned) that the scheduler really shouldn't be concerned about.
Moving the hardirq, if implemented in the kernel, would have to be done by the
host controller driver anyway, which would explode this series.
(host controller drivers are quite fragmented e.g. on mmc)

The fact that having a higher capacity CPU available ("running faster") for an
IO task doesn't (always) imply higher throughput, because of the hardirq staying
on some LITTLE CPU, is bothersome (for this series), though.

>
> Generally this should be a userspace policy. If there's a scenario where the
> throughput is that important they can easily move the hardirq to the big core
> unconditionally and move it back again once this high throughput scenario is no
> longer important.

It also feels wrong to let this be a userspace policy, as the hardirq must be
migrated to the perf domain of the task, which userspace isn't aware of.
Unless you expect userspace to do
CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
but then you could just as well ask them to set performance governor for
big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
any iowait boosting.

Furthermore you can't generally expect userspace to know if their IO will lead
to any interrupt at all, much less which one. They ideally don't even know if
the file IO they are doing is backed by any physical storage in the first place.
(Or even further, that they are doing file IO at all, they might just be
e.g. page-faulting.)

>
> Or were you describing a different problem?

That is the problem I mentioned in the series and Bart and I were discussing.
It's a problem of the series as in "the numbers aren't that impressive".
Current iowait boosting on embedded/mobile systems will perform quite well by
chance, as the (low util) task will often be on the same perf domain the hardirq
will be run on. As can be seen in the cover letter, the benefit of running the
task on a (2x LITTLE capacity) big CPU is therefore practically non-existent;
for tri-gear systems where the big CPU is more like 10x LITTLE capacity the benefit
will be much greater.
I just wanted to point this out. We might just acknowledge the problem and say
"don't care" about the potential performance benefits of those scenarios that
would require hardirq moving.
In the long-term it looks like for UFS the problem will disappear as we are
expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
is already the case.

I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
is available (and it doesn't look like this is changing anytime soon).
I would also love to hear what Bart or other UFS folks think about it.
Furthermore, if I forgot any storage subsystem with the same behavior in that
regard, do tell me.

Lastly, you could consider the IO workload:
an IO task that is in iowait very frequently [1], with just a single IO
inflight [2], and with only very little time spent on the CPU in-between
iowaits [3], so that the interrupt handler is on the critical path for IO
throughput to a non-negligible degree. You could consider that workload to be
niche, but it's precisely the use-case where iowait boosting shows its biggest
benefit.

See the footnotes for the reasons behind each of these conditions.

[1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
won't apply any significant boost currently.
[2] If the storage device has enough in-flight requests to serve, iowait
boosting is unnecessary/wasteful, see cover letter.
[3] If the task actually uses the CPU in-between iowaits, it will build up
utilization and the benefit of iowait boosting diminishes.

>
> Glad to see your series by the way :-) I'll get a chance to review it over the
> weekend hopefully.

Thank you!
Apologies for not CCing you in the first place, I am curious about your opinion
on the concept!

FWIW I did mess up a last-minute change that was supposed to be cosmetic and
only received a quick smoke test, so 1/2 needs the following fix:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aaf64023b03..2b6f521be658 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
 	} else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
 		/* Reduce boost */
 		if (p->io_boost_level > 1)
-			io_boost_scale_interval(p, true);
+			io_boost_scale_interval(p, false);
 		else
 			p->io_boost_level = 0;
 	} else if (p->io_boost_level == IO_BOOST_LEVELS) {


I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
changes are mostly cosmetic and addressing Bart's comments about the benchmark
numbers in the cover letter.

Kind Regards,
Christian

2024-03-21 19:53:05

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 3/21/24 10:57, Christian Loehle wrote:
> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.

Why the focus on storage controllers with a single completion interrupt?
It probably won't take long (one year?) until all new high-end
smartphones may have support for multiple completion interrupts.

Thanks,

Bart.


2024-03-22 18:09:23

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

Hi Christian,

On Mon, 4 Mar 2024 at 21:17, Christian Loehle <[email protected]> wrote:
>
> There is a feature inside of both schedutil and intel_pstate called
> iowait boosting which tries to prevent selecting a low frequency
> during IO workloads when it impacts throughput.
> The feature is implemented by checking for task wakeups that have
> the in_iowait flag set and boost the CPU of the rq accordingly
> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>
> The necessity of the feature is argued with the potentially low
> utilization of a task being frequently in_iowait (i.e. most of the
> time not enqueued on any rq and cannot build up utilization).
>
> The RFC focuses on the schedutil implementation.
> intel_pstate frequency selection isn't touched for now, suggestions are
> very welcome.
> Current schedutil iowait boosting has several issues:
> 1. Boosting happens even in scenarios where it doesn't improve
> throughput. [1]
> 2. The boost is not accounted for in EAS: a) feec() will only consider
> the actual utilization for task placement, but another CPU might be
> more energy-efficient at that capacity than the boosted one.)
> b) When placing a non-IO task while a CPU is boosted compute_energy()
> will not consider the (potentially 'free') boosted capacity, but the
> one it would have without the boost (since the boost is only applied
> in sugov).
> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.
> 4. The boost isn't associated with a task, it therefore isn't considered
> for task placement, potentially missing out on higher capacity CPUs on
> heterogeneous CPU topologies.
> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.
> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.
> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks might lead to boosting.
>
> We attempt to mitigate all of the above by reworking the way the
> iowait boosting (io boosting from here on) works in two major ways:
> - Carry the boost in task_struct, so it is a per-task attribute and
> behaves similar to utilization of the task in some ways.
> - Employ a counting-based tracking strategy that only boosts as long
> as it sees benefits and returns to no boosting dynamically.

Thanks for working on improving IO boosting. I have started to read
your patchset and have few comments about your proposal:

The main one is that the io boosting decision should remain a cpufreq
governor decision and so the io boosting value should be applied by
the governor like in sugov_effective_cpu_perf() as an example instead
of everywhere in the scheduler code.

Then, the algorithm to track the right interval bucket and the mapping
of intervals into utilization really looks like a policy which has
been defined with heuristics and as a result further seems to be a
governor decision

Finally adding some atomic operation in the fast path is not really desirable

I will continue to review your patchset

>
> Note that some the issues (1, 3) can be solved by using a
> counting-based strategy on a per-rq basis, i.e. in sugov entirely.
> Experiments with Android in particular showed that such a strategy
> (which necessarily needs longer intervals to be reasonably stable)
> is too prone to migrations to be useful generally.
> We therefore consider the additional complexity of such a per-task
> based approach like proposed to be worth it.
>
> We require a minimum of 1000 iowait wakeups per second to start
> boosting.
> This isn't too far off from what sugov currently does, since it resets
> the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
> For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
> We justify this by the small possible improvement by boosting in the
> first place with 'rare' few iowait wakeups.
>
> When IO even leads to a task being in iowait isn't as straightforward
> to explain.
> Of course if the issued IO can be served by the page cache (e.g. on
> reads because the pages are contained, on writes because they can be
> marked dirty and the writeback takes care of it later) the actual
> issuing task is usually not in iowait.
> We consider this the good case, since whenever the scheduler and a
> potential userspace / kernel switch is in the critical path for IO
> there is possibly overhead impacting throughput.
> We therefore focus on random read from here on, because (on synchronous
> IO [3]) this will lead to the task being set in iowait for every IO.
> This is where iowait boosting shows its biggest throughput improvement.
> From here on IOPS (IO operations per second) and iowait wakeups may
> therefore be used interchangeably.
>
> Performance:
> Throughput for random read tries to be on par with the sugov
> implementation of iowait boosting for reasonably long-lived workloads.
> See the following table for some results, values are in IOPS, the
> tests are ran for 30s with pauses in-between, results are sorted.
>
> nvme on rk3399
> [3588, 3590, 3597, 3632, 3745] sugov mainline
> [3581, 3751, 3770, 3771, 3885] per-task tracking
> [2592, 2639, 2701, 2717, 2784] sugov no iowait boost
> [3218, 3451, 3598, 3848, 3921] performance governor
>
> emmc with cqe on rk3399
> [4146, 4155, 4159, 4161, 4193] sugov mainline
> [2848, 3217, 4375, 4380, 4454] per-task tracking
> [2510, 2665, 3093, 3101, 3105] sugov no iowait boost
> [4690, 4803, 4860, 4976, 5069] performance governor
>
> sd card on rk3399
> [1777, 1780, 1806, 1827, 1850] sugov mainline
> [1470, 1476, 1507, 1534, 1586] per-task tracking
> [1356, 1372, 1373, 1377, 1416] sugov no iowait boost
> [1861, 1890, 1901, 1905, 1908] performance governor
>
> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
>
> Apple M1 apple-nvme
> [27421, 28331, 28515, 28699, 29529] sugov mainline
> [27274, 27344, 27345, 27384, 27930] per-task tracking
> [14480, 14512, 14625, 14872, 14967] sugov no iowait boost
> [31595, 32085, 32386, 32465, 32643] performance governor
>
> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
>
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | | Sugov mainline | Per-task tracking | Sugov no boost | Performance | Powersave |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx1 | 4073 | 3793 | 2979 | 4190 | 2788 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx4 | 13921 | 13503 | 10635 | 13931 | 10225 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx6 | 18473 | 17866 | 15902 | 19080 | 15789 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx8 | 22498 | 21242 | 19867 | 22650 | 18837 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx10 | 24801 | 23552 | 23658 | 25096 | 21474 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx12 | 26743 | 25377 | 26372 | 26663 | 23613 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio1x1 | 4054 | 3542 | 2776 | 4055 | 2780 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio1x128 | 3959 | 3516 | 2758 | 3590 | 2560 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio4x128 | 13451 | 12517 | 10313 | 13403 | 9994 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x1 | 18394 | 17432 | 15340 | 18954 | 15251 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x4 | 18329 | 17100 | 15238 | 18623 | 15270 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x128 | 18066 | 16964 | 15139 | 18577 | 15192 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring1x1 | 4043 | 3548 | 2810 | 4039 | 2689 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring4x64 | 35790 | 32814 | 35983 | 34934 | 33254 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring1x128 | 32651 | 30427 | 32429 | 33232 | 9973 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring2x128 | 34928 | 32595 | 34922 | 33726 | 18790 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring4x128 | 34414 | 32173 | 34932 | 33332 | 33005 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring6x4 | 31578 | 29260 | 31714 | 31399 | 31784 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring6x128 | 34480 | 32634 | 34973 | 33390 | 36452 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
>
> Based on the above we can basically categorize these into the following
> three:
> a) boost is useful
> b) boost irrelevant (util dominates)
> c) boost is energy-inefficient (boost dominates)
>
> The aim of the patch 1/2 is to boost as much as necessary for a) while
> boosting little for c) (thus saving energy).
>
> Energy-savings:
> Regarding sugov iowait boosting problem 1 mentioned earlier,
> some improvement can be seen:
> Tested on rk3399 (LLLL)(bb) with an NVMe, 30s runtime
> CPU0 perf domain spans 0-3 with 400MHz to 1400MHz
> CPU4 perf domain spans 4-5 with 400MHz to 1800MHz
>
> io_uring6x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.180 GHz
> Average frequency for CPU4 : 1.504 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.070 GHz
> Average frequency for CPU4 : 1.211 GHz
>
> io_uring12x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.324 GHz
> Average frequency for CPU4 : 1.444 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.260 GHz
> Average frequency for CPU4 : 1.062 GHz
> (In both cases actually 400MHz on both perf domains is optimal, more
> fine-tuning could get us closer [2])
>
> [1]
> There are many scenarios when it doesn't, so let's start with
> explaining when it does:
> Boosting improves throughput if there is frequent IO to a device from
> one or few origins, such that the device is likely idle when the task
> is enqueued on the rq and reducing this time cuts down on the storage
> device idle time.
> This might not be true (and boosting doesn't help) if:
> - The storage device uses the idle time to actually commit the IO to
> persistent storage or do other management activity (this can be
> observed with e.g. writes to flash-based storage, which will usually
> write to cache and flush the cache when idle or necessary).
> - The device is under thermal pressure and needs idle time to cool off
> (not uncommon for e.g. nvme devices).
> Furthermore the assumption (the device being idle while task is
> enqueued) is false altogether if:
> - Other tasks use the same storage device.
> - The task uses asynchronous IO with iodepth > 1 like io_uring, the
> in_iowait is then just to fill the queue on the host again.
> - The task just sets in_iowait to signal it is waiting on io to not
> appear as system idle, it might not send any io at all (cf with
> the various occurrences of in_iowait, io_mutex_lock and io_schedule*).
>
> [3]
> Unfortunately even for asynchronous IO iowait may be set, in the case
> of io_uring this is specifically for the iowait boost to trigger, see
> commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
> which is why the energy-savings are so significant here, as io_uring
> load on the CPU is minimal.
>
> Problems encountered:
> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).
> - [2] tracking and scaling can be improved (io_uring12x128 still shows
> boosting): Unfortunately tracking purely per-task shows some limits.
> One task might show more iowaits per second when boosted, but overall
> throughput doesn't increase => there is still some boost.
> The task throughput improvement is somewhat limited though,
> so by fine-tuning the thresholds there could be mitigations.
>
> Christian Loehle (2):
> sched/fair: Introduce per-task io util boost
> cpufreq/schedutil: Remove iowait boost
>
> include/linux/sched.h | 15 +++
> kernel/sched/cpufreq_schedutil.c | 150 ++--------------------------
> kernel/sched/fair.c | 165 +++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 4 +-
> 4 files changed, 182 insertions(+), 152 deletions(-)
>
> --
> 2.34.1
>

2024-03-25 08:49:46

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 03/18/24 18:08, Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
> <[email protected]> wrote:
> >
> > On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> > > On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> > > <[email protected]> wrote:
> > >>
> > >> The previous commit provides a new cpu_util_cfs_boost_io interface for
> > >> schedutil which uses the io boosted utilization of the per-task
> > >> tracking strategy. Schedutil iowait boosting is therefore no longer
> > >> necessary so remove it.
> > >
> > > I'm wondering about the cases when schedutil is used without EAS.
> > >
> > > Are they still going to be handled as before after this change?
> >
> > Well they should still get boosted (under the new conditions) and according
> > to my tests that does work.
>
> OK
>
> > Anything in particular you're worried about?
>
> It is not particularly clear to me how exactly the boost is taken into
> account without EAS.
>
> > So in terms of throughput I see similar results with EAS and CAS+sugov.
> > I'm happy including numbers in the cover letter for future versions, too.
> > So far my intuition was that nobody would care enough to include them
> > (as long as it generally still works).
>
> Well, IMV clear understanding of the changes is more important.

I think the major thing we need to be careful about is the behavior when the
task is sleeping. I think the boosting will be removed when the task is
dequeued, and I can bet there will be systems out there where having the BLOCK
softirq boosted while the task is sleeping will matter.

FWIW I do have an implementation for per-task iowait boost where I went a step
further and converted intel_pstate too, and like Christian I didn't notice
a regression. But I am not sure (rather don't think) I triggered this use case.
I can't tell whether the systems truly have per-CPU cpufreq control or just
appear to, while actually being shared but not visible at the Linux level.

2024-03-25 08:52:41

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

(piggy backing on this reply)

On 03/22/24 19:08, Vincent Guittot wrote:
> Hi Christian,
>
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <[email protected]> wrote:
> >
> > There is a feature inside of both schedutil and intel_pstate called
> > iowait boosting which tries to prevent selecting a low frequency
> > during IO workloads when it impacts throughput.
> > The feature is implemented by checking for task wakeups that have
> > the in_iowait flag set and boost the CPU of the rq accordingly
> > (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
> >
> > The necessity of the feature is argued with the potentially low
> > utilization of a task being frequently in_iowait (i.e. most of the
> > time not enqueued on any rq and cannot build up utilization).
> >
> > The RFC focuses on the schedutil implementation.
> > intel_pstate frequency selection isn't touched for now, suggestions are
> > very welcome.
> > Current schedutil iowait boosting has several issues:
> > 1. Boosting happens even in scenarios where it doesn't improve
> > throughput. [1]
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> > the actual utilization for task placement, but another CPU might be
> > more energy-efficient at that capacity than the boosted one.)
> > b) When placing a non-IO task while a CPU is boosted compute_energy()
> > will not consider the (potentially 'free') boosted capacity, but the
> > one it would have without the boost (since the boost is only applied
> > in sugov).
> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
> > 4. The boost isn't associated with a task, it therefore isn't considered
> > for task placement, potentially missing out on higher capacity CPUs on
> > heterogeneous CPU topologies.
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks might lead to boosting.

You forgot an important problem, which was the main request from Android
when this first came up a few years back: iowait boost is a power-hungry
feature and not all tasks require it. By having it per task we want
to be able to prevent tasks from causing frequency spikes due to iowait boost
when it is not warranted.

> >
> > We attempt to mitigate all of the above by reworking the way the
> > iowait boosting (io boosting from here on) works in two major ways:
> > - Carry the boost in task_struct, so it is a per-task attribute and
> > behaves similar to utilization of the task in some ways.
> > - Employ a counting-based tracking strategy that only boosts as long
> > as it sees benefits and returns to no boosting dynamically.
>
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have few comments about your proposal:
>
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code.

I have similar thoughts.

I think we want the scheduler to treat iowait boost like uclamp_min, but
requested by the block subsystem rather than by the user.

I think we should create new task_min/max_perf() helpers and replace all current
callers of uclamp_eff_value() in the scheduler with task_min/max_perf(), where:

unsigned long task_min_perf(struct task_struct *p)
{
        return max(uclamp_eff_value(p, UCLAMP_MIN), p->iowait_boost);
}

unsigned long task_max_perf(struct task_struct *p)
{
        return uclamp_eff_value(p, UCLAMP_MAX);
}

then all users of uclamp_min in the scheduler will see the request for boost
from iowait and make the correct task placement decision, including under thermal
pressure, and ensure the boost doesn't accidentally escape uclamp_max, which I am
not sure your series caters for with the open coding. You're also missing the
load balancer paths from what I see.
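
To illustrate, a sketch only (based on the select_idle_capacity() hunk in patch
1/2, with the helpers from above; nothing here is meant as the actual change):

        /*
         * Sketch: with task_min/max_perf() the io boost flows through
         * util_fits_cpu() like any other min perf request, instead of a
         * separate max() against io_boost_util().
         */
        task_util = task_util_est(p);
        util_min = task_min_perf(p);    /* max of uclamp_min and iowait boost */
        util_max = task_max_perf(p);    /* unchanged: uclamp_max */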

It will also solve the problem I mentioned above. The tasks that should not use
iowait boost are likely restricted with uclamp_max already. If we treat iowait
boost as an additional source of min_perf request, then uclamp_max will prevent
it from going above a certain perf level and give us the desired impact without
any additional hint. I don't think it is important to disable it completely, but
rather to have a way to prevent tasks from consuming too many resources when not
needed, which we already have with uclamp_max.

I am not sure it makes sense to have a separate control where a task can run
fast due to util but can't have iowait boost or vice versa. I think existing
uclamp_max should be enough to restrict tasks from exceeding a performance
limit.

>
> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

Hmm, do you think this should not be a per-task value then, Vincent?

Oh, I think I see what you mean. Make effective_cpu_util() set the min parameter
correctly. I think that would work too, yes. iowait boost is just another min
perf request and as long as it is treated as such, it is good for me. We'll
just need to add a new parameter for the task like I did in the remove uclamp
max aggregation series.

Generally I think it's better to split the patches so that the conversion of
iowait boost with the current algorithm to being per-task is a separate patch,
and then look at improving the algorithm logic on top. These are two different
problems IMHO.

One major problem and big difference with per-task iowait boost, which I think
Christian alluded to, is that the CPU will no longer be boosted when the task is
sleeping. I think there will be cases out there where some users relied on that
for the BLOCK softirq to run faster too. We need an additional way to ensure that
the softirq runs at a similar performance level to the task that initiated the
request. So we need a way to hold the cpufreq policy's min perf until the
softirq is serviced. Or just keep the CPU boosted until the task is migrated.
I'm not sure what is better yet.
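
Something like the below is what I have in mind for the governor-side option
(purely a sketch: struct, field and function names are made up, and the hold
period is arbitrary):

/*
 * Sketch only: remember the largest per-task io boost seen on this CPU and
 * keep honouring it for a short grace period after the task dequeues, so the
 * BLOCK softirq still runs at a raised frequency.
 */
#define IO_BOOST_HOLD_NS        (4 * NSEC_PER_MSEC)

struct io_boost_hold {
        unsigned long   boost;          /* last per-task boost seen here */
        u64             expires;        /* sched_clock() deadline of the hold */
};

static unsigned long io_boost_effective(struct io_boost_hold *h,
                                        unsigned long boost, u64 now)
{
        if (boost >= h->boost) {
                /* A boosted task is running: refresh the hold. */
                h->boost = boost;
                h->expires = now + IO_BOOST_HOLD_NS;
        } else if (time_after64(now, h->expires)) {
                /* Grace period expired: forget the old boost. */
                h->boost = 0;
        }
        return max(boost, h->boost);
}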

>
> Finally adding some atomic operation in the fast path is not really desirable

Yes, I was wondering whether we can apply the value when we set the
p->in_iowait flag instead?
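
E.g. roughly like this (sketch only: io_boost_begin()/io_boost_end() are made-up
helpers standing in for the interval accounting currently done at
enqueue/dequeue, hooked into the existing io_schedule_prepare()/
io_schedule_finish() pair where in_iowait is set and restored):

/* Sketch only: hook the accounting where in_iowait is actually toggled. */
int io_schedule_prepare(void)
{
        int old_iowait = current->in_iowait;

        current->in_iowait = 1;
        io_boost_begin(current);        /* hypothetical: note the iowait start */
        blk_flush_plug(current->plug, true);

        return old_iowait;
}

void io_schedule_finish(int token)
{
        io_boost_end(current);          /* hypothetical: count the completed iowait */
        current->in_iowait = token;
}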

2024-03-25 08:52:41

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] sched/fair: Introduce per-task io util boost

I didn't spend much time on it, but here are some higher-level comments.

On 03/04/24 20:16, Christian Loehle wrote:
> Implement an io boost utilization enhancement that is tracked for each
> task_struct. Tasks that wake up from in_iowait frequently will have
> a io_boost associated with them, which counts iowait wakeups and only
> boosts when it seems to improve the per-task throughput.
>
> The patch is intended to replace the current iowait boosting strategy,
> implemented in both schedutil and intel_pstate which boost the CPU for
> iowait wakeups on the rq.
> The primary benefits are:
> 1. EAS can take the io boost into account.
> 2. Boosting is limited when it doesn't seem to improve throughput.
> 3. io boost is being carried with the task when it migrates.

4. io boost, which can be wasteful, can now be restricted to save power for
tasks that don't need it, e.g. background tasks with no perf requirements

>
> This is implemented by observing the iowait wakeups for an interval.
> The boost is divided into 8 levels. If the task achieves the
> required number of iowait wakeups per interval it's boost level is
> increased.

So one of the other problems I am looking at is the response time of the
system. And I must stress *time*. Did you measure how long it takes to go from
0 to max with your new logic? (edit: I think I found the answer below)

As mentioned elsewhere, better to split the conversion of iowait boost to being
per-task from the algorithm improvements.

> To reflect that we can't expect an increase of iowait wakeups linear
> to the applied boost (the time the task spends in iowait isn't
> decreased by boosting) we scale the intervals.
> Intervals for the lower boost levels are shorter, also allowing for
> a faster ramp up.
>
> If multiple tasks are io-boosted their boost will be max-aggregated
> per rq. The energy calculations of EAS have been adapted to reflect
> this.
>
> Signed-off-by: Christian Loehle <[email protected]>
> ---
> include/linux/sched.h | 15 +++
> kernel/sched/cpufreq_schedutil.c | 6 ++
> kernel/sched/fair.c | 165 +++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 4 +-
> 4 files changed, 181 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..4e0dfa6fbd65 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1547,6 +1547,21 @@ struct task_struct {
> struct user_event_mm *user_event_mm;
> #endif
>
> + /* IO boost tracking */
> + u64 io_boost_timeout;
> + u64 io_boost_interval_start;
> +#define IO_BOOST_INTERVAL_MSEC 25
> +/* Require 1000 iowait wakeups per second to start the boosting */
> +#define IO_BOOST_IOWAITS_MIN (IO_BOOST_INTERVAL_MSEC)
> +#define IO_BOOST_LEVELS 8
> +/* The util boost given to the task per io boost level, account for headroom */
> +#define IO_BOOST_UTIL_STEP ((unsigned long)((SCHED_CAPACITY_SCALE / 1.25) / IO_BOOST_LEVELS))
> +#define IO_BOOST_IOWAITS_STEP 5

This is too crammed to be readable :)

Better to put the defines somewhere at the top. And I think they probably belong
in schedutil as Vincent pointed out.

> + /* Minimum number of iowaits per interval to maintain current boost */
> + unsigned int io_boost_threshold_down;
> + unsigned int io_boost_level;
> + unsigned int io_boost_curr_ios;
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..cd0ca3cbd212 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -198,7 +198,13 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long io_boost = cpu_util_io_boost(sg_cpu->cpu);
>
> + /*
> + * XXX: This already includes io boost now, makes little sense with
> + * sugov iowait boost on top
> + */
> + util = max(util, io_boost);
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);

I am not keen on this open coding and another max aggregation.

I think this new boost is better treated like a new min_perf request and either
make effective_cpu_util() fill up &min here correctly (you'd need to extend it
like I did in [1]) or generalize the concept of perf_min like I suggested.
Seems Vincent has a preference for the former.

[1] https://lore.kernel.org/lkml/[email protected]/

> util = max(util, boost);
> sg_cpu->bw_min = min;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..b983e4399c53 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4959,6 +4959,11 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> trace_sched_util_est_se_tp(&p->se);
> }
>
> +static inline unsigned int io_boost_util(struct task_struct *p)
> +{
> + return p->io_boost_level * IO_BOOST_UTIL_STEP;
> +}
> +
> static inline int util_fits_cpu(unsigned long util,
> unsigned long uclamp_min,
> unsigned long uclamp_max,
> @@ -6695,6 +6700,137 @@ static int sched_idle_cpu(int cpu)
> }
> #endif
>
> +static unsigned long io_boost_rq(struct cfs_rq *cfs_rq)
> +{
> + int i;
> +
> + for (i = IO_BOOST_LEVELS; i > 0; i--)
> + if (atomic_read(&cfs_rq->io_boost_tasks[i - 1]))

Why cfs_rq not rq?

So uclamp has taught us something about keeping stuff at the rq level: things
can get complex and not work as well, besides causing undesirable cache
bouncing effects. For uclamp_min it is not that bad, but we want to remove
uclamp accounting at the rq level and I am not keen on adding a similar type
of accounting here.

One way we can address this, until my work is (hopefully) done, is to send
a special cpufreq update to the governor for it to actually do the tracking.
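
For example (sketch only: the flag and the helper are made up, the only existing
piece is the cpufreq_update_util() hook already used for SCHED_CPUFREQ_IOWAIT):

/*
 * Sketch only: the scheduler merely reports the boost event and the governor
 * keeps all per-CPU tracking state.
 */
#define SCHED_CPUFREQ_IO_BOOST  (1U << 1)

static inline void io_boost_notify(struct rq *rq, struct task_struct *p)
{
        /* Called instead of touching cfs_rq->io_boost_tasks[] directly. */
        if (p->io_boost_level)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IO_BOOST);
}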

For iowait boost we do have the additional complexity that the boost must be
applied even while the task is sleeping.

I think this needs to be something taken care of by schedutil/cpufreq governor
where it remembers that it needs to keep iowait boost sticky for a period of
time.

Let's try to keep the clutter out of the scheduler and make cpufreq governors
do the work they ought to in terms of keeping track of info when needed. Like
stickiness of the boost and the current level.

I hope my work to make cpufreq updates done at context switch will help
simplify these decisions. Generally hardware has gotten a lot better in DVFS
and we want to ensure cpufreq governor can take best advantage of hardware
without any policy enforced by the scheduler. So deferring to the governor is
better in general.

> + return i * IO_BOOST_UTIL_STEP;
> + return 0;
> +}
> +
> +static inline unsigned long io_boost_interval_nsec(unsigned int io_boost_level)
> +{
> + /*
> + * We require 5 iowaits per interval increase to consider the boost
> + * worth having, that leads to:
> + * level 0->1: 25ms -> 200 iowaits per second increase
> + * level 1->2: 50ms -> 125 iowaits per second increase
> + * level 2->3: 75ms -> 66 iowaits per second increase
> + * level 3->4: 100ms -> 50 iowaits per second increase
> + * level 4->5: 125ms -> 40 iowaits per second increase
> + * level 5->6: 150ms -> 33 iowaits per second increase
> + * level 6->7: 175ms -> 28 iowaits per second increase
> + * level 7->8: 200ms -> 25 iowaits per second increase

Ah, seems this is the answer to my question above. 200ms is the time to reach
the max perf point of the system? It almost matches the utilization ramp-up time.
Good. It's not 100% accurate (it actually depends on the capacity at the
2nd highest freq), but I can't complain for now.

> + * => level 8 can be maintained with >=1567 iowaits per second.
> + */
> + return (io_boost_level + 1) * IO_BOOST_INTERVAL_MSEC * NSEC_PER_MSEC;
> +}
> +
> +static inline void io_boost_scale_interval(struct task_struct *p, bool inc)
> +{
> + unsigned int level = p->io_boost_level + (inc ? 1 : -1);
> +
> + p->io_boost_level = level;
> + /* We change interval length, scale iowaits per interval accordingly. */
> + if (inc)
> + p->io_boost_threshold_down = (p->io_boost_curr_ios *
> + (level + 1) / level) + IO_BOOST_IOWAITS_STEP;
> + else
> + p->io_boost_threshold_down = (p->io_boost_curr_ios *
> + level / (level + 1)) - IO_BOOST_IOWAITS_STEP;
> +}
> +
> +static void enqueue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
> +{
> + u64 now = sched_clock();
> +
> + /* Only what's necessary here because this is the critical path */
> + if (now > p->io_boost_timeout) {
> + /* Last iowait took too long, reset boost */
> + p->io_boost_interval_start = 0;
> + p->io_boost_level = 0;
> + }
> + if (p->io_boost_level)
> + atomic_inc(&cfs_rq->io_boost_tasks[p->io_boost_level - 1]);
> +}
> +
> +static inline void io_boost_start_interval(struct task_struct *p, u64 now)
> +{
> + p->io_boost_interval_start = now;
> + p->io_boost_curr_ios = 1;
> +}
> +
> +static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
> +{
> + u64 now;
> +
> + if (p->io_boost_level)
> + atomic_dec(&cfs_rq->io_boost_tasks[p->io_boost_level - 1]);
> +
> + /*
> + * Doing all this at dequeue instead of at enqueue might seem wrong,
> + * but it really doesn't matter as the task won't be enqueued anywhere
> + * anyway. At enqueue we then only need to check if the in_iowait
> + * wasn't too long. We can then act as if the current in_iowait has
> + * already completed 'in time'.
> + * Doing all this at dequeue has a performance benefit as at this time
> + * the io is issued and we aren't in the io critical path.
> + */
> +
> + if (!p->in_iowait) {
> + /* Even if no boost is active, we reset the interval */
> + p->io_boost_interval_start = 0;
> + p->io_boost_level = 0;
> + return;
> + }
> +
> + /* The maximum in_iowait time we allow to continue boosting */
> + now = sched_clock();
> + p->io_boost_timeout = now + 10 * NSEC_PER_MSEC;
> +
> + if (!p->io_boost_interval_start) {
> + io_boost_start_interval(p, now);
> + return;
> + }
> + p->io_boost_curr_ios++;
> +
> + if (now < p->io_boost_interval_start +
> + io_boost_interval_nsec(p->io_boost_level))
> + return;
> +
> + if (!p->io_boost_level) {
> + if (likely(p->io_boost_curr_ios < IO_BOOST_IOWAITS_MIN)) {
> + io_boost_start_interval(p, now);
> + return;
> + }
> + io_boost_scale_interval(p, true);
> + } else if (p->io_boost_curr_ios < IO_BOOST_IOWAITS_MIN) {
> + p->io_boost_level = 0;
> + } else if (p->io_boost_curr_ios > p->io_boost_threshold_down + IO_BOOST_IOWAITS_STEP) {
> + /* Increase boost */
> + if (p->io_boost_level < IO_BOOST_LEVELS)
> + io_boost_scale_interval(p, true);
> + else
> + p->io_boost_threshold_down =
> + p->io_boost_curr_ios - IO_BOOST_IOWAITS_STEP;
> + } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
> + /* Reduce boost */
> + if (p->io_boost_level > 1)
> + io_boost_scale_interval(p, true);
> + else
> + p->io_boost_level = 0;
> + } else if (p->io_boost_level == IO_BOOST_LEVELS) {
> + /* Allow for reducing boost on max when conditions changed. */
> + p->io_boost_threshold_down = max(p->io_boost_threshold_down,
> + p->io_boost_curr_ios - IO_BOOST_IOWAITS_STEP);
> + }
> + /* On maintaining boost we just start a new interval. */
> +
> + io_boost_start_interval(p, now);
> +}

Can we avoid doing updates at enqueue/dequeue and instead do them when we
set/unset p->in_iowait?


Cheers

--
Qais Yousef

> +
> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -6716,11 +6852,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> - /*
> - * If in_iowait is set, the code below may not trigger any cpufreq
> - * utilization updates, so do it here explicitly with the IOWAIT flag
> - * passed.
> - */
> + if (p->in_iowait || p->io_boost_interval_start)
> + enqueue_io_boost(&rq->cfs, p);
> + /* Ensure new io boost can be applied. */
> if (p->in_iowait)
> cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
>
> @@ -6804,6 +6938,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> util_est_dequeue(&rq->cfs, p);
>
> + dequeue_io_boost(&rq->cfs, p);
> +
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> @@ -7429,11 +7565,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
> + unsigned long io_boost = io_boost_util(p);
>
> cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> task_util = task_util_est(p);
> + task_util = max(task_util, io_boost);
> util_min = uclamp_eff_value(p, UCLAMP_MIN);
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> @@ -7501,7 +7639,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> */
> if (sched_asym_cpucap_active()) {
> sync_entity_load_avg(&p->se);
> - task_util = task_util_est(p);
> + task_util = max(task_util_est(p), io_boost_util(p));
> util_min = uclamp_eff_value(p, UCLAMP_MIN);
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
> }
> @@ -7615,12 +7753,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> return target;
> }
>
> +unsigned long cpu_util_io_boost(int cpu)
> +{
> + return io_boost_rq(&cpu_rq(cpu)->cfs);
> +}
> +
> /**
> * cpu_util() - Estimates the amount of CPU capacity used by CFS tasks.
> * @cpu: the CPU to get the utilization for
> * @p: task for which the CPU utilization should be predicted or NULL
> * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
> - * @boost: 1 to enable boosting, otherwise 0
> + * @boost: 1 to enable runnable boosting, otherwise 0
> *
> * The unit of the return value must be the same as the one of CPU capacity
> * so that CPU utilization can be compared with CPU capacity.
> @@ -7843,8 +7986,10 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
> for_each_cpu(cpu, pd_cpus) {
> struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> + unsigned long io_boost = max(io_boost_util(p), cpu_util_io_boost(cpu));
> unsigned long eff_util, min, max;
>
> + util = max(util, io_boost);
> /*
> * Performance domain frequency: utilization clamping
> * must be considered since it affects the selection
> @@ -7970,7 +8115,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> target = prev_cpu;
>
> sync_entity_load_avg(&p->se);
> - if (!task_util_est(p) && p_util_min == 0)
> + if (!task_util_est(p) && p_util_min == 0 && io_boost_util(p) == 0)
> goto unlock;
>
> eenv_task_busy_time(&eenv, p, prev_cpu);
> @@ -7983,6 +8128,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> unsigned long cur_delta, base_energy;
> int max_spare_cap_cpu = -1;
> int fits, max_fits = -1;
> + unsigned long p_io_boost = io_boost_util(p);
>
> cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>
> @@ -7999,6 +8145,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> for_each_cpu(cpu, cpus) {
> struct rq *rq = cpu_rq(cpu);
> + unsigned long io_boost;
>
> eenv.pd_cap += cpu_thermal_cap;
>
> @@ -8009,6 +8156,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> continue;
>
> util = cpu_util(cpu, p, cpu, 0);
> + io_boost = max(p_io_boost, cpu_util_io_boost(cpu));
> + util = max(util, io_boost);
> cpu_cap = capacity_of(cpu);
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 001fe047bd5d..5f42b72b3cde 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -598,6 +598,8 @@ struct cfs_rq {
> struct sched_entity *curr;
> struct sched_entity *next;
>
> + atomic_t io_boost_tasks[IO_BOOST_LEVELS];
> +
> #ifdef CONFIG_SCHED_DEBUG
> unsigned int nr_spread_over;
> #endif
> @@ -3039,7 +3041,7 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> return READ_ONCE(rq->avg_dl.util_avg);
> }
>
> -
> +extern unsigned long cpu_util_io_boost(int cpu);
> extern unsigned long cpu_util_cfs(int cpu);
> extern unsigned long cpu_util_cfs_boost(int cpu);
>
> --
> 2.34.1
>

2024-03-25 15:19:04

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 22/03/2024 18:08, Vincent Guittot wrote:
> Hi Christian,
Hi Vincent,
thanks for taking a look.

>
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <[email protected]> wrote:
>>
>> There is a feature inside of both schedutil and intel_pstate called
>> iowait boosting which tries to prevent selecting a low frequency
>> during IO workloads when it impacts throughput.
>> The feature is implemented by checking for task wakeups that have
>> the in_iowait flag set and boost the CPU of the rq accordingly
>> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>>
>> The necessity of the feature is argued with the potentially low
>> utilization of a task being frequently in_iowait (i.e. most of the
>> time not enqueued on any rq and cannot build up utilization).
>>
>> The RFC focuses on the schedutil implementation.
>> intel_pstate frequency selection isn't touched for now, suggestions are
>> very welcome.
>> Current schedutil iowait boosting has several issues:
>> 1. Boosting happens even in scenarios where it doesn't improve
>> throughput. [1]
>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>> the actual utilization for task placement, but another CPU might be
>> more energy-efficient at that capacity than the boosted one.)
>> b) When placing a non-IO task while a CPU is boosted compute_energy()
>> will not consider the (potentially 'free') boosted capacity, but the
>> one it would have without the boost (since the boost is only applied
>> in sugov).
>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>> in_iowait wakeups.
>> 4. The boost isn't associated with a task, it therefore isn't considered
>> for task placement, potentially missing out on higher capacity CPUs on
>> heterogeneous CPU topologies.
>> 5. The boost isn't associated with a task, it therefore lingers on the
>> rq even after the responsible task has migrated / stopped.
>> 6. The boost isn't associated with a task, it therefore needs to ramp
>> up again when migrated.
>> 7. Since schedutil doesn't know which task is getting woken up,
>> multiple unrelated in_iowait tasks might lead to boosting.
>>
>> We attempt to mitigate all of the above by reworking the way the
>> iowait boosting (io boosting from here on) works in two major ways:
>> - Carry the boost in task_struct, so it is a per-task attribute and
>> behaves similar to utilization of the task in some ways.
>> - Employ a counting-based tracking strategy that only boosts as long
>> as it sees benefits and returns to no boosting dynamically.
>
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have few comments about your proposal:
>
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code
Moving it into the scheduler is to enable it for EAS (e.g. boosting
a LITTLE to its highest OPP is often much less energy-efficient than
running a higher-capacity CPU at a lower OPP) and to make the higher capacities
reachable on other CPUs usable, too.
I guess for you the first one is the more interesting one.

>
> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

I did have a comparable thing as a governor decision, but the entire
"test if the util boost increases the iowaits seen per interval and only boost
accordingly" approach really only works if the interval is long enough, my
proposed starting length of 25ms really being the lower limit for the storage
devices we want to cover (IO latency not being constant and therefore iowaits
per interval being somewhat noisy).
Given that the IO tasks will be enqueued/dequeued very frequently, it just
isn't credible to expect them to land on the same CPU for many intervals,
unless your system is very bare-bones and idle; even on an idle Android
I find any interval above 50ms to be unusable, providing no throughput
improvement.
The idea of tracking the iowaits I do find the best option in this vague and
noisy environment of "iowait wakeups", and definitely worth having, which is
why I opted for having it in the scheduler code, but I'd love to hear your
thoughts/alternatives.
I'd also like an improvement on the definition of iowait, or some separate
flag for boostable IO; the entire "boost on any iowait wakeup" is groping in
the dark, which I'm trying to combat, but that's somewhat out of scope here.

>
> Finally adding some atomic operation in the fast path is not really desirable
Agreed, I'll look into it. For now I wanted to get as much feedback as possible
on the two major changes:
- iowait boost now per-task
- boosting based upon iowaits seen per interval

>
> I will continue to review your patchset

Thank you, looking forward to seeing your review.

>>[snip]
Kind Regards,
Christian

2024-03-25 15:19:38

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 21/03/2024 19:52, Bart Van Assche wrote:
> On 3/21/24 10:57, Christian Loehle wrote:
>> In the long-term it looks like for UFS the problem will disappear as we are
>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>> is already the case.
>
> Why the focus on storage controllers with a single completion interrupt?
> It probably won't take long (one year?) until all new high-end
> smartphones may have support for multiple completion interrupts.
>
> Thanks,
>
> Bart.
>

Apart from going from "This patch shows significant performance improvements on
hardware that runs mainline today" to "This patch will show significant
performance improvements on devices running mainline in a couple of years",
nothing in particular.
I'm fine with leaving it at having acknowledged the problem.
Maybe I would just gate the task placement on the task having recently been in
a UFS (with multiple completion interrupts) or NVMe submission, to avoid
regressions from current behavior in future versions. I did have that
already at some point, although it was a bit hacky.
Anyway, thank you for your input on that, it is what I wanted to hear!

Kind Regards,
Christian

2024-03-25 15:58:11

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 03/21/24 17:57, Christian Loehle wrote:

> > So you want the hardirq to move to the big core? Unlike softirq, there will be
> > a single hardirq for the controller (to my limited knowledge), so if there are
> > multiple requests I'm not sure we can easily match which one relates to which
> > before it triggers. So we can end up waking up the wrong core.
>
> It would be beneficial to move the hardirq to a big core if the IO task
> is using it anyway.
> I'm not sure I actually want to. There are quite a few pitfalls (like you

I'm actually against it. I think it's too much complexity for a not necessarily
big gain. FWIW, one of the design requests for per-task iowait boost was so
that we can *disable* it. It wastes power when only a handful of tasks actually
care about perf.

Caring where the hardirq runs for perf is unlikely to be a problem in practice.
The softirq should already follow the requester when it matters.

> mentioned) that the scheduler really shouldn't be concerned about.
> Moving the hardirq, if implemented in the kernel, would have to be done by the
> host controller driver anyway, which would explode this series.
> (host controller drivers are quite fragmented e.g. on mmc)
>
> The fact that having a higher capacity CPU available ("running faster") for an
> IO task doesn't (always) imply higher throughput, because of the hardirq staying
> on some LITTLE CPU, is bothersome (for this series), though.
>
> >
> > Generally this should be a userspace policy. If there's a scenario where the
> > throughput is that important they can easily move the hardirq to the big core
> > unconditionally and move it back again once this high throughput scenario is no
> > longer important.
>
> It also feels wrong to let this be a userspace policy, as the hardirq must be
> migrated to the perf domain of the task, which userspace isn't aware of.
> Unless you expect userspace to do

The irq balancer is a userspace policy. For the kernel to make an automatic
decision, a lot of ifs must be satisfied. Again, I don't see maximizing
throughput being a concern on such systems. And userspace can fix the problem
simply - they know, after all, when throughput really matters to the point
where the hardirq placement becomes a bottleneck. In practice, I don't think it
is a bottleneck. But this is my handwavy judgement; the experts know better.
And note, I mean use cases that are not benchmarks ;-)

> CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
> but then you could just as well ask them to set performance governor for
> big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
> any iowait boosting.
>
> Furthermore you can't generally expect userspace to know if their IO will lead
> to any interrupt at all, much less which one. They ideally don't even know if
> the file IO they are doing is backed by any physical storage in the first place.
> (Or even further, that they are doing file IO at all, they might just be
> e.g. page-faulting.)

The way I see it, it's like gigabit networking. The hardirq will matter once
you reach such high throughput scenarios. Which are corner cases and not the
norm?

>
> >
> > Or were you describing a different problem?
>
> That is the problem I mentioned in the series and Bart and I were discussing.
> It's a problem of the series as in "the numbers aren't that impressive".
> Current iowait boosting on embedded/mobile systems will perform quite well by
> chance, as the (low util) task will often be on the same perf domain the hardirq
> will be run on. As can be seen in the cover letter the benefit of running the
> task on a (2xLITTLE capacity) big CPU therefore are practically non-existent,
> for tri-gear systems where big CPU is more like 10xLITTLE capacity the benefit
> will be much greater.
> I just wanted to point this out. We might just acknowledge the problem and say
> "don't care" about the potential performance benefits of those scenarios that
> would require hardirq moving.

I thought the softirq does the bulk of the work. The hardirq being such
a bottleneck is (naively maybe) a red flag for me that it's doing more than
simple interrupt servicing.

You don't boost when the task is sleeping, right? I think this is likely
a cause of the problem where the softirq is not running as fast - whereas
before the series the CPU would be iowait boosted regardless of whether the
task is blocked or not.

> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.
>
> I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
> 'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
> is available (and it doesn't look like this is changing anytime soon).
> I would also love to hear what Bart or other UFS folks think about it.
> Furthermore if I forgot any storage subsystem with the same behavior in that
> regards do tell me.
>
> Lastly, you could consider the IO workload:
> IO task being in iowait very frequently [1] with just a single IO inflight [2]
> and only very little time being spent on the CPU in-between iowaits[3],
> therefore the interrupt handler being on the critical path for IO throughput
> to a non-negligible degree, to be niche, but it's precisely the use-case where
> iowait boosting shows it's biggest benefit.
>
> Sorry for the abomination of a sentence, see footnotes for the reasons.
>
> [1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
> won't apply any significant boost currently.

I CCed you on a patch where I fix this. I've been sleeping on it for too long.
Maybe I should have split this fix out of the consolidation patch.

> [2] If the storage device has enough in-flight requests to serve, iowait
> boosting is unnecessary/wasteful, see cover letter.
> [3] If the task actually uses the CPU in-between iowaits, it will build up
> utilization, iowait boosting benefit diminishes.

The current mechanism is very aggressive. It needs to evolve for sure.
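For reference, a paraphrased model of what sugov does today (not the actual
kernel code; CAP_MAX stands in for SCHED_CAPACITY_SCALE):

#define CAP_MAX		1024
#define BOOST_MIN	(CAP_MAX / 8)	/* roughly IOWAIT_BOOST_MIN */

static unsigned int iowait_boost_update(unsigned int boost, int iowait_wakeup,
					int tick_without_iowait)
{
	if (iowait_wakeup) {
		if (tick_without_iowait || !boost)
			return BOOST_MIN;	/* (re)start from the minimum */
		return boost * 2 > CAP_MAX ? CAP_MAX : boost * 2; /* double */
	}
	if (tick_without_iowait)
		return 0;			/* too long without iowait: drop */
	boost /= 2;				/* no fresh wakeup: decay */
	return boost < BOOST_MIN ? 0 : boost;
}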

>
> >
> > Glad to see your series by the way :-) I'll get a chance to review it over the
> > weekend hopefully.
>
> Thank you!
> Apologies for not CCing you in the first place, I am curious about your opinion
> on the concept!

I actually had a patch that implements iowait boost per-task (on top of my
remove uclamp max aggregation series) where I did actually take the extra step
to remove iowait boosting from intel_pstate. I can share the patches if you
think you'll find them useful.

Just want to note that this mechanism can end up wasting power, and that is an
important direction to consider. It's not about perf only (which matters too).

>
> FWIW I did mess up a last-minute, what was supposed to be, cosmetic change that
> only received a quick smoke test, so 1/2 needs the following:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4aaf64023b03..2b6f521be658 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
> } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
> /* Reduce boost */
> if (p->io_boost_level > 1)
> - io_boost_scale_interval(p, true);
> + io_boost_scale_interval(p, false);
> else
> p->io_boost_level = 0;
> } else if (p->io_boost_level == IO_BOOST_LEVELS) {
>
>
> I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
> changes are mostly cosmetic and addressing Bart's comments about the benchmark
> numbers in the cover letter.

I didn't spend a lot of time on the series, but I can see a number of problems.
Let us discuss them first and plan a future direction. No need for a v2 if it's
just for this fix IMO.

2024-03-25 17:59:01

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 25/03/2024 02:20, Qais Yousef wrote:
> (piggy backing on this reply)
>
> On 03/22/24 19:08, Vincent Guittot wrote:
>> Hi Christian,
>>
>> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <[email protected]> wrote:
>>>
>>> There is a feature inside of both schedutil and intel_pstate called
>>> iowait boosting which tries to prevent selecting a low frequency
>>> during IO workloads when it impacts throughput.
>>> The feature is implemented by checking for task wakeups that have
>>> the in_iowait flag set and boost the CPU of the rq accordingly
>>> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>>>
>>> The necessity of the feature is argued with the potentially low
>>> utilization of a task being frequently in_iowait (i.e. most of the
>>> time not enqueued on any rq and cannot build up utilization).
>>>
>>> The RFC focuses on the schedutil implementation.
>>> intel_pstate frequency selection isn't touched for now, suggestions are
>>> very welcome.
>>> Current schedutil iowait boosting has several issues:
>>> 1. Boosting happens even in scenarios where it doesn't improve
>>> throughput. [1]
>>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>> the actual utilization for task placement, but another CPU might be
>>> more energy-efficient at that capacity than the boosted one.)
>>> b) When placing a non-IO task while a CPU is boosted compute_energy()
>>> will not consider the (potentially 'free') boosted capacity, but the
>>> one it would have without the boost (since the boost is only applied
>>> in sugov).
>>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>>> in_iowait wakeups.
>>> 4. The boost isn't associated with a task, it therefore isn't considered
>>> for task placement, potentially missing out on higher capacity CPUs on
>>> heterogeneous CPU topologies.
>>> 5. The boost isn't associated with a task, it therefore lingers on the
>>> rq even after the responsible task has migrated / stopped.
>>> 6. The boost isn't associated with a task, it therefore needs to ramp
>>> up again when migrated.
>>> 7. Since schedutil doesn't know which task is getting woken up,
>>> multiple unrelated in_iowait tasks might lead to boosting.
>
> You forgot an important problem, which was the main request from Android
> when this first came up a few years back: iowait boost is a power-hungry
> feature and not all tasks require it. By having it per task we want
> to be able to prevent tasks from causing frequency spikes due to iowait boost
> when it is not warranted.

It is, and most of the time I see it triggering (in day-to-day workloads) it
doesn't help in any measurable way.
Being able to toggle this per-task is the logical next step, although I would
expect very little over-boosting overall compared to the current sugov
implementation. If you observe otherwise please do tell me for which workloads!

>>>
>>> We attempt to mitigate all of the above by reworking the way the
>>> iowait boosting (io boosting from here on) works in two major ways:
>>> - Carry the boost in task_struct, so it is a per-task attribute and
>>> behaves similar to utilization of the task in some ways.
>>> - Employ a counting-based tracking strategy that only boosts as long
>>> as it sees benefits and returns to no boosting dynamically.
>>
>> Thanks for working on improving IO boosting. I have started to read
>> your patchset and have few comments about your proposal:
>>
>> The main one is that the io boosting decision should remain a cpufreq
>> governor decision and so the io boosting value should be applied by
>> the governor like in sugov_effective_cpu_perf() as an example instead
>> of everywhere in the scheduler code.
>
> I have similar thoughts.
>
> I think we want the scheduler to treat iowait boost like uclamp_min, but
> requested by block subsystem rather than by the user.
>
> I think we should create a new task_min/max_perf() and replace all current
> callers of uclamp_eff_value() in the scheduler with task_min/max_perf(), where
> task_min/max_perf() is:
>
> unsigned long task_min_perf(struct task_struct *p)
> {
> 	return max(uclamp_eff_value(p, UCLAMP_MIN), p->iowait_boost);
> }
>
> unsigned long task_max_perf(struct task_struct *p)
> {
> 	return uclamp_eff_value(p, UCLAMP_MAX);
> }
>
> then all users of uclamp_min in the scheduler will see the request for boost
> from iowait and make the correct task placement decision, including under
> thermal pressure, and ensuring that they don't accidentally escape uclamp_max,
> which I am not sure your series caters for with the open coding. You're
> missing the load balancer paths from what I see.

io_boost doesn't have to be clamped in the load balancer path because it isn't
included there (unless I messed up).
Essentially io_boost should never trigger a load balance: we are talking about
tasks that get constantly enqueued and only spend very little time on the CPU
until sleeping again, so any load balancing should be overkill.
For the rest I'm open to anything, it's all a 'minor' implementation detail for
me :)

>
> It will also solve the problem I mention above. The tasks that should not use
> iowait boost are likely restricted with uclamp_max already. If we treat iowait
> boost as an additional source of min_perf request, then uclamp_max will prevent
> it from going above a certain perf level and give us the desired impact without
> any additional hint. I don't think it is important to disable it completely but
> rather have a way to prevent tasks from consuming too much resources when not
> needed, which we already have from uclamp_max.
>
> I am not sure it makes sense to have a separate control where a task can run
> fast due to util but can't have iowait boost or vice versa. I think existing
> uclamp_max should be enough to restrict tasks from exceeding a performance
> limit.
>
>>
>> Then, the algorithm to track the right interval bucket and the mapping
>> of intervals into utilization really looks like a policy which has
>> been defined with heuristics and as a result further seems to be a
>> governor decision
>
> Hmm do you think this should not be a per-task value then Vincent?

That's how I understood Vincent anyway.
See my other reply.

>
> Or oh, I think I see what you mean. Make effective_cpu_util() set min parameter
> correctly. I think that would work too, yes. iowait boost is just another min
> perf request and as long as it is treated as such, it is good for me. We'll
> just need to add a new parameter for the task like I did in the remove uclamp
> max aggregation series.

I did have that at some point, too, although before Vincent's rework.
Should be fine from what I can see now.
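For reference, the governor-side handling I have in mind would be roughly along
these lines (illustrative names only, not the actual sugov code):

static unsigned long effective_perf(unsigned long actual_util,
				    unsigned long min, /* incl. io boost */
				    unsigned long max)
{
	if (actual_util < min)
		actual_util = min;	/* honour min-perf requests */
	if (actual_util > max)
		actual_util = max;	/* never escape uclamp_max */
	return actual_util;
}

The io boost only influences the min input; the utilization itself stays
untouched, mirroring how uclamp_min is handled.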

>
> Generally I think it's better to split the patches so that the conversion of
> iowait boost (with the current algorithm) to being per-task is a separate
> patch, and then look at improving the algorithm logic on top. These are two
> different problems IMHO.

That's possible, although the current iowait boosting is based on the
consecutiveness of the iowait wakeups on the rq (oversimplifying away all that
rate_limit_us stuff), which doesn't really translate well into a per-task
property, but I can come up with something that works well enough here.
As I said in my other reply, this entire piggybacking on top of iowait wakeups
is such an unfortunate beast, see all the different occurrences of io_schedule*()
and mutex_lock_io(). The entire interval-based tracking strategy attempts to
mitigate that somewhat without touching the entire tree.
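To make the intent of the counting concrete, here is a heavily simplified model
of the per-interval decision (not the exact patch logic, names shortened):

#define IO_BOOST_LEVELS	4

struct io_boost_state {
	unsigned int level;		/* 0 = no boost */
	unsigned int curr_ios;		/* iowait wakeups seen this interval */
	unsigned int threshold_up;	/* need at least this many to ramp up */
	unsigned int threshold_down;	/* below this, back off */
};

static void io_boost_interval_expired(struct io_boost_state *s)
{
	if (s->curr_ios >= s->threshold_up && s->level < IO_BOOST_LEVELS) {
		s->level++;		/* boosting paid off, go one level up */
	} else if (s->curr_ios < s->threshold_down) {
		if (s->level > 1)
			s->level--;	/* benefit gone, step down */
		else
			s->level = 0;	/* stop boosting entirely */
	}
	s->curr_ios = 0;		/* start counting the next interval */
}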

> One major problem and big difference in per-task iowait that I see Christian
> alluded to is that the CPU will no longer be boosted when the task is sleeping.
> I think there will be cases out there where some users relied on that for the
> BLOCK softirq to run faster too. We need an additional way to ensure that the
> softirq runs at a similar performance level to the task that initiated the
> request. So we need a way to hold the cpufreq policy's min perf until the
> softirq is serviced. Or just keep the CPU boosted until the task is migrated.
> I'm not sure what is better yet.

Yes, right now rate_limit_us (which is usually at least TICK_NSEC currently)
'protects' this. Almost all of the cpufreq updates will come from the iowait
tasks' enqueues anyway (in cases where we apply some io boost).
Having the per-task boost 'linger' around on the runqueue more explicitly is
a bit awkward though, as you would have to remove it if the scheduler picks a
different CPU once the task is being re-enqueued.
Not impossible to do, but lots of awkwardness there.

>>
>> Finally adding some atomic operation in the fast path is not really desirable
>
> Yes I was thinking if we can apply the value when we set the p->in_iowait flag
> instead?

Yeah, I thought about it too; again the awkwardness is that you don't know
which rq the task will be enqueued on after the wakeup.
(Boosting the current CPU and then removing the boost if we switched CPUs can
be done, but then we also need to arm a timer for tasks that go into iowait for
a long time (and thus don't deserve boosting anymore).)
Might be worse than the current atomic.
But I'll come up with something; it should be the least critical part of this series ;)

Thanks for taking a look, I'll gather some additional numbers for the other replies
and get back to you.

Kind Regards,
Christian

2024-03-25 18:03:09

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On 3/25/24 05:06, Christian Loehle wrote:
> On 21/03/2024 19:52, Bart Van Assche wrote:
>> On 3/21/24 10:57, Christian Loehle wrote:
>>> In the long-term it looks like for UFS the problem will disappear as we are
>>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>>> is already the case.
>>
>> Why the focus on storage controllers with a single completion interrupt?
>> It probably won't take long (one year?) until all new high-end
>> smartphones may have support for multiple completion interrupts.
>
> Apart from going from "This patch shows significant performance improvements on
> hardware that runs mainline today" to "This patch will have significant
> performance improvements on devices running mainline in a couple of years",
> nothing in particular.

That doesn't make sense to me. Smartphones with UFSHCI 4.0 controllers
are available from multiple vendors. See also
https://en.wikipedia.org/wiki/Universal_Flash_Storage. See also
https://www.gsmarena.com/samsung_galaxy_s24-12773.php.

Bart.

2024-03-28 10:10:29

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

On Mon, 25 Mar 2024 at 13:24, Christian Loehle <[email protected]> wrote:
>
> On 22/03/2024 18:08, Vincent Guittot wrote:
> > Hi Christian,
> Hi Vincent,
> thanks for taking a look.
>
> >
> > On Mon, 4 Mar 2024 at 21:17, Christian Loehle <[email protected]> wrote:
> >>
> >> There is a feature inside of both schedutil and intel_pstate called
> >> iowait boosting which tries to prevent selecting a low frequency
> >> during IO workloads when it impacts throughput.
> >> The feature is implemented by checking for task wakeups that have
> >> the in_iowait flag set and boost the CPU of the rq accordingly
> >> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
> >>
> >> The necessity of the feature is argued with the potentially low
> >> utilization of a task being frequently in_iowait (i.e. most of the
> >> time not enqueued on any rq and cannot build up utilization).
> >>
> >> The RFC focuses on the schedutil implementation.
> >> intel_pstate frequency selection isn't touched for now, suggestions are
> >> very welcome.
> >> Current schedutil iowait boosting has several issues:
> >> 1. Boosting happens even in scenarios where it doesn't improve
> >> throughput. [1]
> >> 2. The boost is not accounted for in EAS: a) feec() will only consider
> >> the actual utilization for task placement, but another CPU might be
> >> more energy-efficient at that capacity than the boosted one.)
> >> b) When placing a non-IO task while a CPU is boosted compute_energy()
> >> will not consider the (potentially 'free') boosted capacity, but the
> >> one it would have without the boost (since the boost is only applied
> >> in sugov).
> >> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> >> in_iowait wakeups.
> >> 4. The boost isn't associated with a task, it therefore isn't considered
> >> for task placement, potentially missing out on higher capacity CPUs on
> >> heterogeneous CPU topologies.
> >> 5. The boost isn't associated with a task, it therefore lingers on the
> >> rq even after the responsible task has migrated / stopped.
> >> 6. The boost isn't associated with a task, it therefore needs to ramp
> >> up again when migrated.
> >> 7. Since schedutil doesn't know which task is getting woken up,
> >> multiple unrelated in_iowait tasks might lead to boosting.
> >>
> >> We attempt to mitigate all of the above by reworking the way the
> >> iowait boosting (io boosting from here on) works in two major ways:
> >> - Carry the boost in task_struct, so it is a per-task attribute and
> >> behaves similar to utilization of the task in some ways.
> >> - Employ a counting-based tracking strategy that only boosts as long
> >> as it sees benefits and returns to no boosting dynamically.
> >
> > Thanks for working on improving IO boosting. I have started to read
> > your patchset and have few comments about your proposal:
> >
> > The main one is that the io boosting decision should remain a cpufreq
> > governor decision and so the io boosting value should be applied by
> > the governor like in sugov_effective_cpu_perf() as an example instead
> > of everywhere in the scheduler code
> Having it move into the scheduler is to enable it for EAS (e.g. boosting
> a LITTLE to its highest OPP often being much less energy-efficient than
> having a higher-capacity CPU at a lower OPP) and to enable higher capacities
> reachable on other CPUs, too.

sugov_effective_cpu_perf() is used by EAS when finding the final OPP
and computing the energy so I don't see a problem of moving the policy
(converting some iowait boost information into some performance level)
into the cpufreq governor. EAS should be able to select the more
efficient CPU for the waking task.
Furthermore, you add it into the utilization, whereas iowait boost is
not a capacity that will be used by the task but more like a minimum
bandwidth requirement to speed up its execution; this could be seen
like uclamp_min, especially if you also want to use the iowait boosting
to migrate tasks. But I don't think that this is exactly the same.
Uclamp_min helps when a task always has the same small amount of work
to do periodically. Whatever the frequency, its utilization remains
(almost) the same and the frequency is not really expected to impact its
period. In the case of iowait boost, when you increase the frequency, the
task will do more work and its utilization will increase (because the
overall period will decrease). This increase of the utilization
should be the trigger for migrating the task to another CPU.
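(To illustrate with made-up numbers: a task doing 1ms of CPU work followed by
9ms of IO wait at the highest OPP has a frequency-invariant utilization of
roughly 1024 * 1 / (1 + 9) ~= 102. At half the frequency the same cycle takes
2ms + 9ms while the invariant running time is still worth ~1ms, so utilization
is roughly 1024 * 1 / (2 + 9) ~= 93. Raising the frequency shrinks the period
and therefore raises the utilization.)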

> I guess for you the first one is the more interesting one.
>
> >
> > Then, the algorithm to track the right interval bucket and the mapping
> > of intervals into utilization really looks like a policy which has
> > been defined with heuristics and as a result further seems to be a
> > governor decision
>
> I did have a comparable thing as a governor decision, but the entire
> "Test if util boost increases iowaits seen per interval and only boost
> accordingly" really only works if the interval is long enough, my proposed
> starting length of 25ms really being the lower limit for the storage devices
> we want to cover (IO latency not being constant and therefore iowaits per
> interval being somewhat noisy).

Your explanation above confirms that it's a policy for your storage devices.

> Given that the IO tasks will be enqueued/dequeued very frequently it just
> isn't credible to expect them to land on the same CPU for many intervals,
> unless your system is very bare-bones and idle, but even on an idle Android
> I see any interval above 50ms to be unusable and not provide any throughput
> improvement.
> The idea of tracking the iowaits I do find the best option in this vague and
> noisy environment of "iowait wakeups" and definitely worth having, so that's
> why I opted for it being in the scheduler code, but I'd love to hear your
> thoughts/alternatives.

There are 3 parts in your proposal:
1- tracking per-task iowait statistics
2- translating that into a capacity rather than a utilization
3- using this value in EAS

Having 1- in the scheduler seems OK, but 2- and 3- should not be
injected directly into the scheduler.



> I'd also like an improved definition of iowait, or a more specific flag for
> boostable IO; the current "boost on any iowait wakeup" is groping in the dark,
> which I'm trying to combat, but that's somewhat out of scope here.
>
> >
> > Finally adding some atomic operation in the fast path is not really desirable
> Agreed, I'll look into it; for now I wanted as much feedback as possible on the
> two major changes:
> - iowait boost now per-task
> - boosting based upon iowaits seen per interval
>
> >
> > I will continue to review your patchset
>
> Thank you, looking forward to seeing your review.
>
> >>[snip]
> Kind Regards,
> Christian

2024-04-19 13:56:57

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 25/03/2024 02:37, Qais Yousef wrote:
> On 03/18/24 18:08, Rafael J. Wysocki wrote:
>> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
>> <[email protected]> wrote:
>>>
>>> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
>>>> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
>>>> <[email protected]> wrote:
>>>>>
>>>>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>>>>> schedutil which uses the io boosted utilization of the per-task
>>>>> tracking strategy. Schedutil iowait boosting is therefore no longer
>>>>> necessary so remove it.
>>>>
>>>> I'm wondering about the cases when schedutil is used without EAS.
>>>>
>>>> Are they still going to be handled as before after this change?
>>>
>>> Well they should still get boosted (under the new conditions) and according
>>> to my tests that does work.
>>
>> OK
>>
>>> Anything in particular you're worried about?
>>
>> It is not particularly clear to me how exactly the boost is taken into
>> account without EAS.
>>
>>> So in terms of throughput I see similar results with EAS and CAS+sugov.
>>> I'm happy including numbers in the cover letter for future versions, too.
>>> So far my intuition was that nobody would care enough to include them
>>> (as long as it generally still works).
>>
>> Well, IMV clear understanding of the changes is more important.
>
> I think the major thing we need to be careful about is the behavior when the
> task is sleeping. I think the boosting will be removed when the task is
> dequeued and I can bet there will be systems out there where the BLOCK softirq
> being boosted when the task is sleeping will matter.

Currently I see this mainly protected by the sugov rate_limit_us.
With the enqueues being the dominating cpufreq updates it's not really an
issue: the boost is expected to survive the sleep duration, during which it
wouldn't be active.
I did experiment with some sort of 'stickiness' of the boost to the rq, but
it is somewhat of a pain to deal with if we want to remove it once the task is
enqueued on a different rq. A sugov 1ms timer is much simpler of course.
Currently it's not necessary IMO, but for the sake of being future-proof in
terms of more frequent freq updates I might include it in v2.

>
> FWIW I do have an implementation for per-task iowait boost where I went a step
> further and converted intel_pstate too and like Christian didn't notice
> a regression. But I am not sure (rather don't think) I triggered this use case.
> I can't tell when the systems truly have per-cpu cpufreq control or just appear
> so and they are actually shared but not visible at linux level.

Please do share your intel_pstate proposal!

Kind Regards,
Christian

2024-04-29 11:32:16

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 04/19/24 14:42, Christian Loehle wrote:

> > I think the major thing we need to be careful about is the behavior when the
> > task is sleeping. I think the boosting will be removed when the task is
> > dequeued and I can bet there will be systems out there where the BLOCK softirq
> > being boosted when the task is sleeping will matter.
>
> Currently I see this mainly protected by the sugov rate_limit_us.
> With the enqueue's being the dominating cpufreq updates it's not really an
> issue, the boost is expected to survive the sleep duration, during which it
> wouldn't be active.
> I did experiment with some sort of 'stickiness' of the boost to the rq, but
> it is somewhat of a pain to deal with if we want to remove it once enqueued
> on a different rq. A sugov 1ms timer is much simpler of course.
> Currently it's not necessary IMO, but for the sake of being future-proof in
> terms of more frequent freq updates I might include it in v2.

Making sure things work with purpose would be really great. This implicit
dependency is not great IMHO and makes both testing and reasoning about why
things are good or bad harder when analysing real workloads, especially for
non-kernel developers.

>
> >
> > FWIW I do have an implementation for per-task iowait boost where I went a step
> > further and converted intel_pstate too and like Christian didn't notice
> > a regression. But I am not sure (rather don't think) I triggered this use case.
> > I can't tell when the systems truly have per-cpu cpufreq control or just appear
> > so and they are actually shared but not visible at linux level.
>
> Please do share your intel_pstate proposal!

This is what I had. I haven't been working on this for the past few months, but
I remember trying several tests on different machines back then without a
problem. I tried to re-order patches at some point though, and I hope I didn't
accidentally break something and forget the state.

https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation

2024-05-07 15:50:13

by Christian Loehle

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 29/04/2024 12:18, Qais Yousef wrote:
> On 04/19/24 14:42, Christian Loehle wrote:
>
>>> I think the major thing we need to be careful about is the behavior when the
>>> task is sleeping. I think the boosting will be removed when the task is
>>> dequeued and I can bet there will be systems out there where the BLOCK softirq
>>> being boosted when the task is sleeping will matter.
>>
>> Currently I see this mainly protected by the sugov rate_limit_us.
>> With the enqueue's being the dominating cpufreq updates it's not really an
>> issue, the boost is expected to survive the sleep duration, during which it
>> wouldn't be active.
>> I did experiment with some sort of 'stickiness' of the boost to the rq, but
>> it is somewhat of a pain to deal with if we want to remove it once enqueued
>> on a different rq. A sugov 1ms timer is much simpler of course.
>> Currently it's not necessary IMO, but for the sake of being future-proof in
>> terms of more frequent freq updates I might include it in v2.
>
> Making sure things work with purpose would be really great. This implicit
> dependency is not great IMHO and make both testing and reasoning about why
> things are good or bad harder when analysing real workloads. Especially by non
> kernel developers.

Agreed.
Even without your proposed changes [1] relying on sugov rate_limit_us is
unfortunate.
There is a problem with an arbitrarily low rate_limit_us more generally, not
just because we kind of rely on the CPU being boosted right before the task is
actually enqueued (for the interrupt/softirq part of it), but also because of
the latency from requested frequency improvement to actually running on that
frequency. If the task is 90% done by the time it sees the improvement and
the frequency will be updated (back to a lower one) before the next enqueue,
then that's hardly worth the effort.
Currently this is covered by rate_limit_us probabilistically and that seems
to be good enough in practice, but it's not very pleasing (and also EAS can't
take it into consideration).
That's not exclusive to iowait wakeup tasks of course; in theory it applies to
any task that is off the rq frequently (and still requests a higher frequency
than it could realistically build up through util_avg, e.g. through uclamp_min).

>>>
>>> FWIW I do have an implementation for per-task iowait boost where I went a step
>>> further and converted intel_pstate too and like Christian didn't notice
>>> a regression. But I am not sure (rather don't think) I triggered this use case.
>>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
>>> so and they are actually shared but not visible at linux level.
>>
>> Please do share your intel_pstate proposal!
>
> This is what I had. I haven't been working on this for the past few months, but
> I remember tried several tests on different machines then without a problem.
> I tried to re-order patches at some point though and I hope I didn't break
> something accidentally and forgot the state.
>
> https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
>

Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
A couple of thoughts on yours; I'm sure you're aware, but consider it me thinking
out loud:
- iowait boost is taken into consideration for task placement, but with just the
existing 4 boost steps, which makes it more aggressive on HMP (potentially 2-3
consecutive iowait wakeups to land on the big instead of running at max OPP of
a LITTLE; numbers spelled out after this list).
- Whether the current iowait boost decay is sensible is questionable, but there
should probably be some decay. Taken to the extreme (no decay at all), this
would mean something like blk_wait_io() demanding 1024 utilization if it waits
for a very long time.
Repeating myself here, but iowait wakeups themselves are tricky to work with
(and I try to work around that).
- The intel_pstate solution will increase the boost even if
previous_wakeup->iowait_boost > current->iowait_boost,
right? But using current->iowait_boost is a clever idea.
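
(Spelling out those numbers, assuming the current IOWAIT_BOOST_MIN of
SCHED_CAPACITY_SCALE / 8: the boost doubles 128 -> 256 -> 512 -> 1024, i.e.
four steps, so after two or three consecutive iowait wakeups the boost already
exceeds a typical LITTLE capacity.)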

[1]
https://lore.kernel.org/lkml/[email protected]/T/

Kind Regards,
Christian

2024-05-12 15:30:00

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 05/07/24 16:19, Christian Loehle wrote:
> On 29/04/2024 12:18, Qais Yousef wrote:
> > On 04/19/24 14:42, Christian Loehle wrote:
> >
> >>> I think the major thing we need to be careful about is the behavior when the
> >>> task is sleeping. I think the boosting will be removed when the task is
> >>> dequeued and I can bet there will be systems out there where the BLOCK softirq
> >>> being boosted when the task is sleeping will matter.
> >>
> >> Currently I see this mainly protected by the sugov rate_limit_us.
> >> With the enqueue's being the dominating cpufreq updates it's not really an
> >> issue, the boost is expected to survive the sleep duration, during which it
> >> wouldn't be active.
> >> I did experiment with some sort of 'stickiness' of the boost to the rq, but
> >> it is somewhat of a pain to deal with if we want to remove it once enqueued
> >> on a different rq. A sugov 1ms timer is much simpler of course.
> >> Currently it's not necessary IMO, but for the sake of being future-proof in
> >> terms of more frequent freq updates I might include it in v2.
> >
> > Making sure things work with purpose would be really great. This implicit
> > dependency is not great IMHO and make both testing and reasoning about why
> > things are good or bad harder when analysing real workloads. Especially by non
> > kernel developers.
>
> Agreed.
> Even without your proposed changes [1] relying on sugov rate_limit_us is
> unfortunate.
> There is a problem with an arbitrarily low rate_limit_us more generally, not
> just because we kind of rely on the CPU being boosted right before the task is
> actually enqueued (for the interrupt/softirq part of it), but also because of
> the latency from requested frequency improvement to actually running on that
> frequency. If the task is 90% done by the time it sees the improvement and
> the frequency will be updated (back to a lower one) before the next enqueue,
> then that's hardly worth the effort.

I think that's why iowait boost is done the way it is today. You need to
sustain the boost, as the things that need it only run for a very short amount
of time.

Have you looked at how long the iowait boosted tasks run for in your tests?

> Currently this is covered by rate_limit_us probabillistically and that seems

Side note: while looking more at the history of rate_limit_us and why it is set
so much higher than the transition_latency reported by the driver, I am slowly
reaching the conclusion that what is happening here is similar to the
introduction of down_rate_limit_us in Android. Our ability to predict system
requirements is not great and we end up prematurely lowering frequencies, and
this rate limit just happens to be more likely to keep them higher, so on
average you see better perf in various workloads. I hope we can improve on this
so we don't rely on this magic and make better use of the hardware's ability to
transition between frequencies fast and be more exact. I am trying to work on
this as part of my magic margins series. But I think the story is bigger than
that.

> to be good enough in practice, but it's not very pleasing (and also EAS can't
> take it into consideration).
> That's not just exclusive for iowait wakeup tasks of course, but in theory any
> that is off the rq frequently (and still requests a higher frequency than it can
> realistically build up through util_avg like through uclamp_min).

For uclamp_min, if the task's RUNNING duration is shorter than the hardware's
ability to apply the boost, I think there's little we can do. The user can opt
to boost the system in general. Note that this is likely a problem on systems
with multi-ms transition_delay_us. If the task runs for a few hundred us and
it really wants to be boosted for this short time, then a static system-wide
boost is all we can do to guarantee what it wants. The hardware is a poor fit
for the use case in this scenario. And personally I'd push back against
introducing complexity to deal with such poor-fit scenarios. We can already set
a min freq for the policy via sysfs, and uclamp_min can still help with task
placement on HMP systems.

Now, the problem we could have, which is similar to the iowait boost scenario,
is when there's a chaining effect that requires the boost for the duration of
the chain.

I think we can do something about it if we:

1. Guarantee the chain will run on the same CPU.
2. Introduce a 'sticky' flag for the boost to stay while the chain is
running.
3. Introduce a start/finish indication for the chain.

I think we can do something like that with sched-qos [1] to tag the chain via
a cookie and request the boost to apply to them collectively.

Generally, userspace would be better off collapsing this chain into a single
task that runs in one go.

I don't know how often this scenario exists in practice and what limitations
exist that make the simple collapse solution infeasible, so I'd leave this
out until more info is available.

>
> >>>
> >>> FWIW I do have an implementation for per-task iowait boost where I went a step
> >>> further and converted intel_pstate too and like Christian didn't notice
> >>> a regression. But I am not sure (rather don't think) I triggered this use case.
> >>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
> >>> so and they are actually shared but not visible at linux level.
> >>
> >> Please do share your intel_pstate proposal!
> >
> > This is what I had. I haven't been working on this for the past few months, but
> > I remember tried several tests on different machines then without a problem.
> > I tried to re-order patches at some point though and I hope I didn't break
> > something accidentally and forgot the state.
> >
> > https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
> >
>
> Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
> Couple of thoughts on yours, I'm sure you're aware, but consider it me thinking out
> loud:
> - iowait boost is taken into consideration for task placement, but with just the
> 4 steps that made it more aggressive on HMP. (Potentially 2-3 consecutive iowait
> wakeups to land on the big instead of running at max OPP of a LITTLE).

Yeah I opted to keep the logic the same. I think there are gains to be had even
without being smarter about the algorithm. But we do need to improve it, yes.
The current logic is too aggressive and the perf/power trade-off will be tricky
in practice.

> - If the current iowait boost decay is sensible is questionable, but there should
> probably be some decay. Taken to the extreme this would mean something
> like blk_wait_io() demands 1024 utilization, if it waits for a very long time.
> Repeating myself here, but iowait wakeups itself is tricky to work with (and I
> try to work around that).

I didn't get you here. But generally the story can go a few levels deep, yes. My
approach was to make incremental progress without breaking existing stuff, but
to help move things in the right direction over time. Fixing everything in one
go will be hard and not productive.

> - The intel_pstate solution will increase boost even if
> previous_wakeup->iowait_boost > current->iowait_boost
> right? But using current->iowait_boost is a clever idea.

I forgot the details now. But I seem to remember intel_pstate had its own
accounting when it sees iowait boost in intel_pstate_hwp_boost_up/down().

Note that the major worry I had is about the softirq not being boosted.
Although in my testing this didn't seem to show up as things seemed fine. But
I haven't dug down to see how accidental this was. I could easily see my
patches making some use case out there not happy as the softirq might not get
a chance to see the boost. I got distracted with other stuff and didn't get
back to the topic since then. I am more than happy to support your efforts
though :)

[1] https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/