Hello,
This patchset improves iocost in three areas to make its internal
operations more accurate and immediate, with the goals of improving work
conservation and distribution fairness, and of removing the dependence on
vrate adjustments for masking work conservation issues. This improves
overall control quality and allows regulating vrate more tightly for more
consistent behavior, as vrate now only needs to respond to changes in
device behavior.
1. Donation
iocost implements work conservation by making under-utilized cgroups donate
their unused budgets to saturated cgroups. This approach has the
significant advantage that calculation or synchronization inaccuracies never
lead to over-utilization of the device while allowing all hot path
operations to be local to each cgroup - it's inherently safe w/o needing
system-wide synchronization in hot paths.
However, this approach requires dynamically adjusting weights according to
the current usage of each cgroup. If a cgroup with weight X is using only a
portion of its hierarchical absolute share, X needs to be scaled down so
that the share matches the observed usage. With nesting and multiple nodes
needing adjustments at once, the math is non-trivial. The current
implementation works around the issue by repeatedly under-adjusting the
weight of each cgroup and letting the result converge over multiple periods.
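As a toy illustration (numbers made up): take two sibling cgroups A and B
with equal weight 100 under the root, so each holds 50% of the device. If
A's target share is 25%, naively scaling A's weight by 25/50 down to 50
yields 50/(50+100) = 33%, not 25%, because shrinking A's weight also
shrinks the sibling sum it's divided by; hitting 25% actually requires a
weight of ~33 (33/133 ~= 25%). Every proportional scaling thus falls short
of the target, which is why the old logic had to under-adjust repeatedly
and wait for convergence, and the errors compound further with nesting and
multiple simultaneous donors.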
The innate inaccuracies can lead to significant errors impacting work
conservation and fairness, and the workarounds built around them weigh down
the rest of the control logic.
Andy Newell devised a method to calculate the exact weight updates for a
given set of target hierarchical shares, which is described in the
following PDFs.
https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
This patchset implements Andy's method for precise donation weight
adjustments on each period timer.
The donation amount is also adjusted during a period if the donor is
running out of budget. This mechanism used to be very coarse as the
donation calculations weren't accurate to begin with. Now that the donation
calculations are exact, this patchset improves the in-period adjustments
too.
2. Debt
Some IOs which are attributed to a low priority cgroup can cause severe
priority inversions when blocked - e.g. swap outs and filesystem metadata
IOs. These IOs are issued right away even when the cgroup doesn't have
enough budget. When this happens, the cgroup incurs debt, which it has to
pay off before issuing more IOs.
There were several issues in debt handling: how weight is adjusted while
under debt, how payments are calculated, and how the anonymous memory delay
duration is determined. This patchset fixes and improves debt handling and
adds a debt forgiveness mechanism which avoids extended pathological stalls
on very slow devices.
3. Excess handling
During a period, each cgroup mostly runs on its own without constantly
synchronizing with other cgroups. This often leads to excess budget which
needs to be thrown away at the end of the period, which can have a negative
impact on work conservation. This is somewhat offset by vrate adjustments,
but vrate compensation is delayed, can sometimes be erratic, and prevents
us from confining vrate for more consistent behavior.
This patchset implements excess vrate compensation where the effective vrate
is transparently boosted to compensate for excesses without affecting the
regular latency-based vrate adjustment mechanism. This compensates for
excesses immediately and accurately and allows the regular vrate adjustment
mechanism to worry only about device behavior changes.
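A rough sketch of the idea (a simplification; the exact bookkeeping is in
patch 0024): if the active iocgs collectively let some fraction of the
period's budget go unused, the effective vtime_rate is boosted above the
latency-governed base rate (vtime_base_rate, which the updated
iocost_monitor.py reports) by roughly that fraction, handing the lost
capacity back immediately instead of waiting for busy_level-driven vrate
bumps.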
This patchset is on top of for-5.10/block (2b64038972e4) and available in
the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git iocost-andys
It contains the following 27 patches.
0001-blk-iocost-ioc_pd_free-shouldn-t-assume-irq-disabled.patch
0002-blk-stat-make-q-stats-lock-irqsafe.patch
0003-blk-iocost-use-local-64-_t-for-percpu-stat.patch
0004-blk-iocost-rename-propagate_active_weights-to-propag.patch
0005-blk-iocost-clamp-inuse-and-skip-noops-in-__propagate.patch
0006-blk-iocost-move-iocg_kick_delay-above-iocg_kick_wait.patch
0007-blk-iocost-make-iocg_kick_waitq-call-iocg_kick_delay.patch
0008-blk-iocost-s-HWEIGHT_WHOLE-WEIGHT_ONE-g.patch
0009-blk-iocost-use-WEIGHT_ONE-based-fixed-point-number-f.patch
0010-blk-iocost-make-ioc_now-now-and-ioc-period_at-64bit.patch
0011-blk-iocost-streamline-vtime-margin-and-timer-slack-h.patch
0012-blk-iocost-grab-ioc-lock-for-debt-handling.patch
0013-blk-iocost-add-absolute-usage-stat.patch
0014-blk-iocost-calculate-iocg-usages-from-iocg-local_sta.patch
0015-blk-iocost-replace-iocg-has_surplus-with-surplus_lis.patch
0016-blk-iocost-decouple-vrate-adjustment-from-surplus-tr.patch
0017-blk-iocost-restructure-surplus-donation-logic.patch
0018-blk-iocost-implement-Andy-s-method-for-donation-weig.patch
0019-blk-iocost-revamp-donation-amount-determination.patch
0020-blk-iocost-revamp-in-period-donation-snapbacks.patch
0021-blk-iocost-revamp-debt-handling.patch
0022-blk-iocost-implement-delay-adjustment-hysteresis.patch
0023-blk-iocost-halve-debts-if-device-stays-idle.patch
0024-blk-iocost-implement-vtime-loss-compensation.patch
0025-blk-iocost-restore-inuse-update-tracepoints.patch
0026-blk-iocost-add-three-debug-stat-cost.wait-indebt-and.patch
0027-blk-iocost-update-iocost_monitor.py.patch
0001-0002 are fixes w/ stable cc'd.
0003-0012 are prep patches - increasing calculation precision for weights,
switching some fields to 64bit, code reorganization, locking changes and so
on.
0013-0014 implement per-cgroup absolute usage tracking so that control
decisions aren't affected by weight distribution changes.
0015-0017 restructure donation logic to prepare for Andy's weight adjustment
method.
0018-0020 implement Andy's weight adjustment method and improve the
donation logic both at period boundaries and in-period.
0021-0023 improve debt and delay handling.
0024 implements budget excess compensation.
0025-0027 update tracepoints, monitoring script, debug stat.
diffstat follows. Thanks.
block/blk-cgroup.c | 23
block/blk-iocost.c | 1540 +++++++++++++++++++++++++++++++----------
block/blk-stat.c | 17
include/trace/events/iocost.h | 26
tools/cgroup/iocost_monitor.py | 54 -
5 files changed, 1227 insertions(+), 433 deletions(-)
--
tejun
blk-iocost has been reading percpu stat counters from remote cpus which, on
some archs, can lead to torn reads on really rare occasions. Use local[64]_t
for those counters.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 37 +++++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 10 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index d37b55db2409..e2266e7692b4 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -179,6 +179,8 @@
#include <linux/parser.h>
#include <linux/sched/signal.h>
#include <linux/blk-cgroup.h>
+#include <asm/local.h>
+#include <asm/local64.h>
#include "blk-rq-qos.h"
#include "blk-stat.h"
#include "blk-wbt.h"
@@ -373,8 +375,8 @@ struct ioc_params {
};
struct ioc_missed {
- u32 nr_met;
- u32 nr_missed;
+ local_t nr_met;
+ local_t nr_missed;
u32 last_met;
u32 last_missed;
};
@@ -382,7 +384,7 @@ struct ioc_missed {
struct ioc_pcpu_stat {
struct ioc_missed missed[2];
- u64 rq_wait_ns;
+ local64_t rq_wait_ns;
u64 last_rq_wait_ns;
};
@@ -1278,8 +1280,8 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
u64 this_rq_wait_ns;
for (rw = READ; rw <= WRITE; rw++) {
- u32 this_met = READ_ONCE(stat->missed[rw].nr_met);
- u32 this_missed = READ_ONCE(stat->missed[rw].nr_missed);
+ u32 this_met = local_read(&stat->missed[rw].nr_met);
+ u32 this_missed = local_read(&stat->missed[rw].nr_missed);
nr_met[rw] += this_met - stat->missed[rw].last_met;
nr_missed[rw] += this_missed - stat->missed[rw].last_missed;
@@ -1287,7 +1289,7 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
stat->missed[rw].last_missed = this_missed;
}
- this_rq_wait_ns = READ_ONCE(stat->rq_wait_ns);
+ this_rq_wait_ns = local64_read(&stat->rq_wait_ns);
rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns;
stat->last_rq_wait_ns = this_rq_wait_ns;
}
@@ -1908,6 +1910,7 @@ static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
{
struct ioc *ioc = rqos_to_ioc(rqos);
+ struct ioc_pcpu_stat *ccs;
u64 on_q_ns, rq_wait_ns, size_nsec;
int pidx, rw;
@@ -1931,13 +1934,17 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
+ ccs = get_cpu_ptr(ioc->pcpu_stat);
+
if (on_q_ns <= size_nsec ||
on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC)
- this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met);
+ local_inc(&ccs->missed[rw].nr_met);
else
- this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed);
+ local_inc(&ccs->missed[rw].nr_missed);
+
+ local64_add(rq_wait_ns, &ccs->rq_wait_ns);
- this_cpu_add(ioc->pcpu_stat->rq_wait_ns, rq_wait_ns);
+ put_cpu_ptr(ccs);
}
static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos)
@@ -1977,7 +1984,7 @@ static int blk_iocost_init(struct request_queue *q)
{
struct ioc *ioc;
struct rq_qos *rqos;
- int ret;
+ int i, cpu, ret;
ioc = kzalloc(sizeof(*ioc), GFP_KERNEL);
if (!ioc)
@@ -1989,6 +1996,16 @@ static int blk_iocost_init(struct request_queue *q)
return -ENOMEM;
}
+ for_each_possible_cpu(cpu) {
+ struct ioc_pcpu_stat *ccs = per_cpu_ptr(ioc->pcpu_stat, cpu);
+
+ for (i = 0; i < ARRAY_SIZE(ccs->missed); i++) {
+ local_set(&ccs->missed[i].nr_met, 0);
+ local_set(&ccs->missed[i].nr_missed, 0);
+ }
+ local64_set(&ccs->rq_wait_ns, 0);
+ }
+
rqos = &ioc->rqos;
rqos->id = RQ_QOS_COST;
rqos->ops = &ioc_rqos_ops;
--
2.26.2
We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.
Pure rename.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index b2b8dfbeee5a..5e6d56eec1c9 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -68,7 +68,7 @@
* gets 300/(100+300) or 75% share, and A0 and A1 equally splits the rest,
* 12.5% each. The distribution mechanism only cares about these flattened
* shares. They're called hweights (hierarchical weights) and always add
- * upto 1 (HWEIGHT_WHOLE).
+ * upto 1 (WEIGHT_ONE).
*
* A given cgroup's vtime runs slower in inverse proportion to its hweight.
* For example, with 12.5% weight, A0's time runs 8 times slower (100/12.5)
@@ -246,7 +246,7 @@ enum {
MIN_VALID_USAGES = 2,
/* 1/64k is granular enough and can easily be handled w/ u32 */
- HWEIGHT_WHOLE = 1 << 16,
+ WEIGHT_ONE = 1 << 16,
/*
* As vtime is used to calculate the cost of each IO, it needs to
@@ -285,8 +285,8 @@ enum {
* donate the surplus.
*/
SURPLUS_SCALE_PCT = 125, /* * 125% */
- SURPLUS_SCALE_ABS = HWEIGHT_WHOLE / 50, /* + 2% */
- SURPLUS_MIN_ADJ_DELTA = HWEIGHT_WHOLE / 33, /* 3% */
+ SURPLUS_SCALE_ABS = WEIGHT_ONE / 50, /* + 2% */
+ SURPLUS_MIN_ADJ_DELTA = WEIGHT_ONE / 33, /* 3% */
/* switch iff the conditions are met for longer than this */
AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC,
@@ -491,7 +491,7 @@ struct ioc_gq {
struct hrtimer waitq_timer;
struct hrtimer delay_timer;
- /* usage is recorded as fractions of HWEIGHT_WHOLE */
+ /* usage is recorded as fractions of WEIGHT_ONE */
int usage_idx;
u32 usages[NR_USAGE_SLOTS];
@@ -658,7 +658,7 @@ static struct ioc_cgrp *blkcg_to_iocc(struct blkcg *blkcg)
*/
static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse)
{
- return DIV64_U64_ROUND_UP(abs_cost * HWEIGHT_WHOLE, hw_inuse);
+ return DIV64_U64_ROUND_UP(abs_cost * WEIGHT_ONE, hw_inuse);
}
/*
@@ -666,7 +666,7 @@ static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse)
*/
static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse)
{
- return DIV64_U64_ROUND_UP(cost * hw_inuse, HWEIGHT_WHOLE);
+ return DIV64_U64_ROUND_UP(cost * hw_inuse, WEIGHT_ONE);
}
static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost)
@@ -980,7 +980,7 @@ static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep
*/
smp_rmb();
- hwa = hwi = HWEIGHT_WHOLE;
+ hwa = hwi = WEIGHT_ONE;
for (lvl = 0; lvl <= iocg->level - 1; lvl++) {
struct ioc_gq *parent = iocg->ancestors[lvl];
struct ioc_gq *child = iocg->ancestors[lvl + 1];
@@ -2088,8 +2088,8 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
atomic64_set(&iocg->done_vtime, now.vnow);
atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period));
INIT_LIST_HEAD(&iocg->active_list);
- iocg->hweight_active = HWEIGHT_WHOLE;
- iocg->hweight_inuse = HWEIGHT_WHOLE;
+ iocg->hweight_active = WEIGHT_ONE;
+ iocg->hweight_inuse = WEIGHT_ONE;
init_waitqueue_head(&iocg->waitq);
hrtimer_init(&iocg->waitq_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
--
2.26.2
Budget donations are inaccurate and could take multiple periods to converge.
To prevent triggering vrate adjustments while surplus transfers were
catching up, vrate adjustment was suppressed if donations were increasing,
which was indicated by non-zero nr_surpluses.
This entangling won't be necessary with the scheduled rewrite of the
donation mechanism which will make it precise and immediate. Let's decouple
the two in preparation.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 19 +++++++------------
include/trace/events/iocost.h | 13 ++++---------
2 files changed, 11 insertions(+), 21 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index c1cd66cfa2a8..a3889a8b0a33 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1508,7 +1508,7 @@ static void ioc_timer_fn(struct timer_list *timer)
struct ioc_gq *iocg, *tiocg;
struct ioc_now now;
LIST_HEAD(surpluses);
- int nr_surpluses = 0, nr_shortages = 0, nr_lagging = 0;
+ int nr_shortages = 0, nr_lagging = 0;
u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
u32 missed_ppm[2], rq_wait_pct;
@@ -1640,10 +1640,8 @@ static void ioc_timer_fn(struct timer_list *timer)
atomic64_add(delta, &iocg->vtime);
atomic64_add(delta, &iocg->done_vtime);
/* if usage is sufficiently low, maybe it can donate */
- if (surplus_adjusted_hweight_inuse(usage, hw_inuse)) {
+ if (surplus_adjusted_hweight_inuse(usage, hw_inuse))
list_add(&iocg->surplus_list, &surpluses);
- nr_surpluses++;
- }
} else if (hw_inuse < hw_active) {
u32 new_hwi, new_inuse;
@@ -1673,7 +1671,7 @@ static void ioc_timer_fn(struct timer_list *timer)
}
}
- if (!nr_shortages || !nr_surpluses)
+ if (!nr_shortages || list_empty(&surpluses))
goto skip_surplus_transfers;
/* there are both shortages and surpluses, transfer surpluses */
@@ -1738,11 +1736,9 @@ static void ioc_timer_fn(struct timer_list *timer)
/*
* If there are IOs spanning multiple periods, wait
- * them out before pushing the device harder. If
- * there are surpluses, let redistribution work it
- * out first.
+ * them out before pushing the device harder.
*/
- if (!nr_lagging && !nr_surpluses)
+ if (!nr_lagging)
ioc->busy_level--;
} else {
/*
@@ -1796,15 +1792,14 @@ static void ioc_timer_fn(struct timer_list *timer)
}
trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
- nr_lagging, nr_shortages,
- nr_surpluses);
+ nr_lagging, nr_shortages);
atomic64_set(&ioc->vtime_rate, vrate);
ioc_refresh_margins(ioc);
} else if (ioc->busy_level != prev_busy_level || nr_lagging) {
trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
missed_ppm, rq_wait_pct, nr_lagging,
- nr_shortages, nr_surpluses);
+ nr_shortages);
}
ioc_refresh_params(ioc, false);
diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h
index a905ecc0342f..ee024fe8fef6 100644
--- a/include/trace/events/iocost.h
+++ b/include/trace/events/iocost.h
@@ -128,11 +128,9 @@ DEFINE_EVENT(iocg_inuse_update, iocost_inuse_reset,
TRACE_EVENT(iocost_ioc_vrate_adj,
TP_PROTO(struct ioc *ioc, u64 new_vrate, u32 *missed_ppm,
- u32 rq_wait_pct, int nr_lagging, int nr_shortages,
- int nr_surpluses),
+ u32 rq_wait_pct, int nr_lagging, int nr_shortages),
- TP_ARGS(ioc, new_vrate, missed_ppm, rq_wait_pct, nr_lagging, nr_shortages,
- nr_surpluses),
+ TP_ARGS(ioc, new_vrate, missed_ppm, rq_wait_pct, nr_lagging, nr_shortages),
TP_STRUCT__entry (
__string(devname, ioc_name(ioc))
@@ -144,7 +142,6 @@ TRACE_EVENT(iocost_ioc_vrate_adj,
__field(u32, rq_wait_pct)
__field(int, nr_lagging)
__field(int, nr_shortages)
- __field(int, nr_surpluses)
),
TP_fast_assign(
@@ -157,15 +154,13 @@ TRACE_EVENT(iocost_ioc_vrate_adj,
__entry->rq_wait_pct = rq_wait_pct;
__entry->nr_lagging = nr_lagging;
__entry->nr_shortages = nr_shortages;
- __entry->nr_surpluses = nr_surpluses;
),
- TP_printk("[%s] vrate=%llu->%llu busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d surpluses=%d",
+ TP_printk("[%s] vrate=%llu->%llu busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d",
__get_str(devname), __entry->old_vrate, __entry->new_vrate,
__entry->busy_level,
__entry->read_missed_ppm, __entry->write_missed_ppm,
- __entry->rq_wait_pct, __entry->nr_lagging, __entry->nr_shortages,
- __entry->nr_surpluses
+ __entry->rq_wait_pct, __entry->nr_lagging, __entry->nr_shortages
)
);
--
2.26.2
Instead of marking iocgs with surplus with a flag and filtering for them
while walking all active iocgs, build a surpluses list. This doesn't make
much difference now but will help implement the improved donation logic,
which will iterate over iocgs with surplus multiple times.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 2496674bbbf4..c1cd66cfa2a8 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -494,9 +494,9 @@ struct ioc_gq {
int hweight_gen;
u32 hweight_active;
u32 hweight_inuse;
- bool has_surplus;
struct list_head walk_list;
+ struct list_head surplus_list;
struct wait_queue_head waitq;
struct hrtimer waitq_timer;
@@ -1507,6 +1507,7 @@ static void ioc_timer_fn(struct timer_list *timer)
struct ioc *ioc = container_of(timer, struct ioc, timer);
struct ioc_gq *iocg, *tiocg;
struct ioc_now now;
+ LIST_HEAD(surpluses);
int nr_surpluses = 0, nr_shortages = 0, nr_lagging = 0;
u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
@@ -1630,8 +1631,7 @@ static void ioc_timer_fn(struct timer_list *timer)
/* see whether there's surplus vtime */
vmin = now.vnow - ioc->margins.max;
- iocg->has_surplus = false;
-
+ WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
if (!waitqueue_active(&iocg->waitq) &&
time_before64(vtime, vmin)) {
u64 delta = vmin - vtime;
@@ -1641,7 +1641,7 @@ static void ioc_timer_fn(struct timer_list *timer)
atomic64_add(delta, &iocg->done_vtime);
/* if usage is sufficiently low, maybe it can donate */
if (surplus_adjusted_hweight_inuse(usage, hw_inuse)) {
- iocg->has_surplus = true;
+ list_add(&iocg->surplus_list, &surpluses);
nr_surpluses++;
}
} else if (hw_inuse < hw_active) {
@@ -1677,13 +1677,10 @@ static void ioc_timer_fn(struct timer_list *timer)
goto skip_surplus_transfers;
/* there are both shortages and surpluses, transfer surpluses */
- list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
+ list_for_each_entry(iocg, &surpluses, surplus_list) {
u32 usage, hw_active, hw_inuse, new_hwi, new_inuse;
int nr_valid = 0;
- if (!iocg->has_surplus)
- continue;
-
/* base the decision on max historical usage */
for (i = 0, usage = 0; i < NR_USAGE_SLOTS; i++) {
if (iocg->usages[i]) {
@@ -1711,6 +1708,10 @@ static void ioc_timer_fn(struct timer_list *timer)
skip_surplus_transfers:
commit_weights(ioc);
+ /* surplus list should be dissolved after use */
+ list_for_each_entry_safe(iocg, tiocg, &surpluses, surplus_list)
+ list_del_init(&iocg->surplus_list);
+
/*
* If q is getting clogged or we're missing too much, we're issuing
* too much IO and should lower vtime rate. If we're not missing
@@ -2284,6 +2285,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period));
INIT_LIST_HEAD(&iocg->active_list);
INIT_LIST_HEAD(&iocg->walk_list);
+ INIT_LIST_HEAD(&iocg->surplus_list);
iocg->hweight_active = WEIGHT_ONE;
iocg->hweight_inuse = WEIGHT_ONE;
@@ -2320,6 +2322,7 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
}
WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+ WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
spin_unlock_irqrestore(&ioc->lock, flags);
--
2.26.2
Currently, iocost doesn't collect or expose any statistics, punting all
monitoring duties to the drgn-based iocost_monitor.py. While this works for
some scenarios, there are some usability and data availability challenges.
For example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the numbers available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.
This patch implements a per-cgroup absolute usage stat counter and exposes
it through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocgs and the
only hot path overhead is preemption toggling and adding to a percpu
counter.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 155 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 149 insertions(+), 6 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 23b173e34591..f30f9b37fcf0 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -431,6 +431,14 @@ struct ioc {
bool user_cost_model:1;
};
+struct iocg_pcpu_stat {
+ local64_t abs_vusage;
+};
+
+struct iocg_stat {
+ u64 usage_us;
+};
+
/* per device-cgroup pair */
struct ioc_gq {
struct blkg_policy_data pd;
@@ -492,10 +500,19 @@ struct ioc_gq {
u32 hweight_inuse;
bool has_surplus;
+ struct list_head walk_list;
+
struct wait_queue_head waitq;
struct hrtimer waitq_timer;
struct hrtimer delay_timer;
+ /* statistics */
+ struct iocg_pcpu_stat __percpu *pcpu_stat;
+ struct iocg_stat local_stat;
+ struct iocg_stat desc_stat;
+ struct iocg_stat last_stat;
+ u64 last_stat_abs_vusage;
+
/* usage is recorded as fractions of WEIGHT_ONE */
int usage_idx;
u32 usages[NR_USAGE_SLOTS];
@@ -674,10 +691,17 @@ static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse)
return DIV64_U64_ROUND_UP(cost * hw_inuse, WEIGHT_ONE);
}
-static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost)
+static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio,
+ u64 abs_cost, u64 cost)
{
+ struct iocg_pcpu_stat *gcs;
+
bio->bi_iocost_cost = cost;
atomic64_add(cost, &iocg->vtime);
+
+ gcs = get_cpu_ptr(iocg->pcpu_stat);
+ local64_add(abs_cost, &gcs->abs_vusage);
+ put_cpu_ptr(gcs);
}
static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
@@ -1221,7 +1245,7 @@ static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
if (ctx->vbudget < 0)
return -1;
- iocg_commit_bio(ctx->iocg, wait->bio, cost);
+ iocg_commit_bio(ctx->iocg, wait->bio, wait->abs_cost, cost);
/*
* autoremove_wake_function() removes the wait entry only when it
@@ -1382,6 +1406,87 @@ static bool iocg_is_idle(struct ioc_gq *iocg)
return true;
}
+/*
+ * Call this function on the target leaf @iocg's to build pre-order traversal
+ * list of all the ancestors in @inner_walk. The inner nodes are linked through
+ * ->walk_list and the caller is responsible for dissolving the list after use.
+ */
+static void iocg_build_inner_walk(struct ioc_gq *iocg,
+ struct list_head *inner_walk)
+{
+ int lvl;
+
+ WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+
+ /* find the first ancestor which hasn't been visited yet */
+ for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
+ if (!list_empty(&iocg->ancestors[lvl]->walk_list))
+ break;
+ }
+
+ /* walk down and visit the inner nodes to get pre-order traversal */
+ while (++lvl <= iocg->level - 1) {
+ struct ioc_gq *inner = iocg->ancestors[lvl];
+
+ /* record traversal order */
+ list_add_tail(&inner->walk_list, inner_walk);
+ }
+}
+
+/* collect per-cpu counters and propagate the deltas to the parent */
+static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
+{
+ struct iocg_stat new_stat;
+ u64 abs_vusage = 0;
+ u64 vusage_delta;
+ int cpu;
+
+ lockdep_assert_held(&iocg->ioc->lock);
+
+ /* collect per-cpu counters */
+ for_each_possible_cpu(cpu) {
+ abs_vusage += local64_read(
+ per_cpu_ptr(&iocg->pcpu_stat->abs_vusage, cpu));
+ }
+ vusage_delta = abs_vusage - iocg->last_stat_abs_vusage;
+ iocg->last_stat_abs_vusage = abs_vusage;
+
+ iocg->local_stat.usage_us += div64_u64(vusage_delta, now->vrate);
+
+ new_stat.usage_us =
+ iocg->local_stat.usage_us + iocg->desc_stat.usage_us;
+
+ /* propagate the deltas to the parent */
+ if (iocg->level > 0) {
+ struct iocg_stat *parent_stat =
+ &iocg->ancestors[iocg->level - 1]->desc_stat;
+
+ parent_stat->usage_us +=
+ new_stat.usage_us - iocg->last_stat.usage_us;
+ }
+
+ iocg->last_stat = new_stat;
+}
+
+/* get stat counters ready for reading on all active iocgs */
+static void iocg_flush_stat(struct list_head *target_iocgs, struct ioc_now *now)
+{
+ LIST_HEAD(inner_walk);
+ struct ioc_gq *iocg, *tiocg;
+
+ /* flush leaves and build inner node walk list */
+ list_for_each_entry(iocg, target_iocgs, active_list) {
+ iocg_flush_stat_one(iocg, now);
+ iocg_build_inner_walk(iocg, &inner_walk);
+ }
+
+ /* keep flushing upwards by walking the inner list backwards */
+ list_for_each_entry_safe_reverse(iocg, tiocg, &inner_walk, walk_list) {
+ iocg_flush_stat_one(iocg, now);
+ list_del_init(&iocg->walk_list);
+ }
+}
+
/* returns usage with margin added if surplus is large enough */
static u32 surplus_adjusted_hweight_inuse(u32 usage, u32 hw_inuse)
{
@@ -1422,6 +1527,8 @@ static void ioc_timer_fn(struct timer_list *timer)
return;
}
+ iocg_flush_stat(&ioc->active_iocgs, &now);
+
/*
* Waiters determine the sleep durations based on the vrate they
* saw at the time of sleep. If vrate has increased, some waiters
@@ -1824,7 +1931,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
*/
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
time_before_eq64(vtime + cost, now.vnow)) {
- iocg_commit_bio(iocg, bio, cost);
+ iocg_commit_bio(iocg, bio, abs_cost, cost);
return;
}
@@ -1849,7 +1956,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
*/
if (unlikely(list_empty(&iocg->active_list))) {
iocg_unlock(iocg, ioc_locked, &flags);
- iocg_commit_bio(iocg, bio, cost);
+ iocg_commit_bio(iocg, bio, abs_cost, cost);
return;
}
@@ -1948,7 +2055,7 @@ static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
*/
if (rq->bio && rq->bio->bi_iocost_cost &&
time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) {
- iocg_commit_bio(iocg, bio, cost);
+ iocg_commit_bio(iocg, bio, abs_cost, cost);
return;
}
@@ -1962,7 +2069,7 @@ static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
iocg->abs_vdebt += abs_cost;
iocg_kick_delay(iocg, &now);
} else {
- iocg_commit_bio(iocg, bio, cost);
+ iocg_commit_bio(iocg, bio, abs_cost, cost);
}
spin_unlock_irqrestore(&iocg->waitq.lock, flags);
}
@@ -2133,6 +2240,12 @@ static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q,
if (!iocg)
return NULL;
+ iocg->pcpu_stat = alloc_percpu_gfp(struct iocg_pcpu_stat, gfp);
+ if (!iocg->pcpu_stat) {
+ kfree(iocg);
+ return NULL;
+ }
+
return &iocg->pd;
}
@@ -2152,6 +2265,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
atomic64_set(&iocg->done_vtime, now.vnow);
atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period));
INIT_LIST_HEAD(&iocg->active_list);
+ INIT_LIST_HEAD(&iocg->walk_list);
iocg->hweight_active = WEIGHT_ONE;
iocg->hweight_inuse = WEIGHT_ONE;
@@ -2181,18 +2295,46 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
if (ioc) {
spin_lock_irqsave(&ioc->lock, flags);
+
if (!list_empty(&iocg->active_list)) {
propagate_weights(iocg, 0, 0);
list_del_init(&iocg->active_list);
}
+
+ WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+
spin_unlock_irqrestore(&ioc->lock, flags);
hrtimer_cancel(&iocg->waitq_timer);
hrtimer_cancel(&iocg->delay_timer);
}
+ free_percpu(iocg->pcpu_stat);
kfree(iocg);
}
+static size_t ioc_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size)
+{
+ struct ioc_gq *iocg = pd_to_iocg(pd);
+ struct ioc *ioc = iocg->ioc;
+ size_t pos = 0;
+
+ if (!ioc->enabled)
+ return 0;
+
+ if (iocg->level == 0) {
+ unsigned vp10k = DIV64_U64_ROUND_CLOSEST(
+ atomic64_read(&ioc->vtime_rate) * 10000,
+ VTIME_PER_USEC);
+ pos += scnprintf(buf + pos, size - pos, " cost.vrate=%u.%02u",
+ vp10k / 100, vp10k % 100);
+ }
+
+ pos += scnprintf(buf + pos, size - pos, " cost.usage=%llu",
+ iocg->last_stat.usage_us);
+
+ return pos;
+}
+
static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
@@ -2606,6 +2748,7 @@ static struct blkcg_policy blkcg_policy_iocost = {
.pd_alloc_fn = ioc_pd_alloc,
.pd_init_fn = ioc_pd_init,
.pd_free_fn = ioc_pd_free,
+ .pd_stat_fn = ioc_pd_stat,
};
static int __init ioc_init(void)
--
2.26.2
iocost implements work conservation by reducing iocg->inuse and propagating
the adjustment upwards proportionally. However, while I knew the target
absolute hierarchical proportion - the adjusted hweight_inuse - I couldn't
figure out how to determine the iocg->inuse adjustment to achieve that, and
approximated the adjustment by scaling iocg->inuse using the proportion of
the needed hweight_inuse changes.

When nested, these scalings aren't accurate even when adjusting a single
node, as the donating node also receives the benefit of the donated portion.
When multiple nodes are donating, as they often do, they can be wildly
wrong.
iocost employed various safety nets to combat the inaccuracies. There were
ample buffers in determining how much to donate, and the adjustments were
conservative and gradual. While this can achieve a reasonable level of work
conservation in simple scenarios, the inaccuracies can easily add up,
leading to significant loss of total work. This in turn makes it difficult
to closely cap vrate as vrate adjustment is needed to compensate for the
loss of work. The combination of inaccurate donation calculations and vrate
adjustments can lead to wide fluctuations and clunky overall behaviors.
Andy Newell devised a method to calculate the needed ->inuse updates to
achieve the target hweight_inuse's. The method is compatible with the
proportional inuse adjustment propagation which allows all hot path
operations to be local to each iocg.

To roughly summarize, Andy's method divides the tree into donating and
non-donating parts, calculates the global donation rate which is used to
determine the target hweight_inuse for each node, and then derives per-level
proportions. There's a non-trivial amount of math involved. Please refer to
the following PDFs for detailed descriptions.
https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
This patch implements Andy's method in transfer_surpluses(). This makes the
donation calculations accurate per cycle and enables further improvements in
other parts of the donation logic.
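To give a feel for the core step (using the symbols from the pdfs and the
comments in the code below, with made-up numbers): if the donating nodes
hold t_r = 40% of the root's budget and will retain t'_r = 10% after
donation, the global donation rate works out to gamma = (1 - 0.1) /
(1 - 0.4) = 1.5, i.e. every non-donating budget is scaled up by 1.5x to
absorb the donated 30%. The code performs this in WEIGHT_ONE (1 << 16)
fixed point and then derives per-level inuse values so that the existing
proportional propagation yields exactly these hweights.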
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
---
block/blk-iocost.c | 252 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 244 insertions(+), 8 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 61b008d0801f..ecc23b827e5d 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -491,9 +491,11 @@ struct ioc_gq {
/* see __propagate_weights() and current_hweight() for details */
u64 child_active_sum;
u64 child_inuse_sum;
+ u64 child_adjusted_sum;
int hweight_gen;
u32 hweight_active;
u32 hweight_inuse;
+ u32 hweight_donating;
u32 hweight_after_donation;
struct list_head walk_list;
@@ -1551,20 +1553,252 @@ static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
return usage;
}
+/*
+ * For work-conservation, an iocg which isn't using all of its share should
+ * donate the leftover to other iocgs. There are two ways to achieve this - 1.
+ * bumping up vrate accordingly 2. lowering the donating iocg's inuse weight.
+ *
+ * #1 is mathematically simpler but has the drawback of requiring synchronous
+ * global hweight_inuse updates when idle iocg's get activated or inuse weights
+ * change due to donation snapbacks as it has the possibility of grossly
+ * overshooting what's allowed by the model and vrate.
+ *
+ * #2 is inherently safe with local operations. The donating iocg can easily
+ * snap back to higher weights when needed without worrying about impacts on
+ * other nodes as the impacts will be inherently correct. This also makes idle
+ * iocg activations safe. The only effect activations have is decreasing
+ * hweight_inuse of others, the right solution to which is for those iocgs to
+ * snap back to higher weights.
+ *
+ * So, we go with #2. The challenge is calculating how each donating iocg's
+ * inuse should be adjusted to achieve the target donation amounts. This is done
+ * using Andy's method described in the following pdf.
+ *
+ * https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
+ *
+ * Given the weights and target after-donation hweight_inuse values, Andy's
+ * method determines how the proportional distribution should look like at each
+ * sibling level to maintain the relative relationship between all non-donating
+ * pairs. To roughly summarize, it divides the tree into donating and
+ * non-donating parts, calculates global donation rate which is used to
+ * determine the target hweight_inuse for each node, and then derives per-level
+ * proportions.
+ *
+ * The following pdf shows that global distribution calculated this way can be
+ * achieved by scaling inuse weights of donating leaves and propagating the
+ * adjustments upwards proportionally.
+ *
+ * https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
+ *
+ * Combining the above two, we can determine how each leaf iocg's inuse should
+ * be adjusted to achieve the target donation.
+ *
+ * https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
+ *
+ * The inline comments use symbols from the last pdf.
+ *
+ * b is the sum of the absolute budgets in the subtree. 1 for the root node.
+ * f is the sum of the absolute budgets of non-donating nodes in the subtree.
+ * t is the sum of the absolute budgets of donating nodes in the subtree.
+ * w is the weight of the node. w = w_f + w_t
+ * w_f is the non-donating portion of w. w_f = w * f / b
+ * w_b is the donating portion of w. w_t = w * t / b
+ * s is the sum of all sibling weights. s = Sum(w) for siblings
+ * s_f and s_t are the non-donating and donating portions of s.
+ *
+ * Subscript p denotes the parent's counterpart and ' the adjusted value - e.g.
+ * w_pt is the donating portion of the parent's weight and w'_pt the same value
+ * after adjustments. Subscript r denotes the root node's values.
+ */
static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
{
- struct ioc_gq *iocg;
+ LIST_HEAD(over_hwa);
+ LIST_HEAD(inner_walk);
+ struct ioc_gq *iocg, *tiocg, *root_iocg;
+ u32 after_sum, over_sum, over_target, gamma;
+
+ /*
+ * It's pretty unlikely but possible for the total sum of
+ * hweight_after_donation's to be higher than WEIGHT_ONE, which will
+ * confuse the following calculations. If such condition is detected,
+ * scale down everyone over its full share equally to keep the sum below
+ * WEIGHT_ONE.
+ */
+ after_sum = 0;
+ over_sum = 0;
+ list_for_each_entry(iocg, surpluses, surplus_list) {
+ u32 hwa;
+
+ current_hweight(iocg, &hwa, NULL);
+ after_sum += iocg->hweight_after_donation;
+
+ if (iocg->hweight_after_donation > hwa) {
+ over_sum += iocg->hweight_after_donation;
+ list_add(&iocg->walk_list, &over_hwa);
+ }
+ }
+
+ if (after_sum >= WEIGHT_ONE) {
+ /*
+ * The delta should be deducted from the over_sum, calculate
+ * target over_sum value.
+ */
+ u32 over_delta = after_sum - (WEIGHT_ONE - 1);
+ WARN_ON_ONCE(over_sum <= over_delta);
+ over_target = over_sum - over_delta;
+ } else {
+ over_target = 0;
+ }
+
+ list_for_each_entry_safe(iocg, tiocg, &over_hwa, walk_list) {
+ if (over_target)
+ iocg->hweight_after_donation =
+ div_u64((u64)iocg->hweight_after_donation *
+ over_target, over_sum);
+ list_del_init(&iocg->walk_list);
+ }
+
+ /*
+ * Build pre-order inner node walk list and prepare for donation
+ * adjustment calculations.
+ */
+ list_for_each_entry(iocg, surpluses, surplus_list) {
+ iocg_build_inner_walk(iocg, &inner_walk);
+ }
+
+ root_iocg = list_first_entry(&inner_walk, struct ioc_gq, walk_list);
+ WARN_ON_ONCE(root_iocg->level > 0);
+ list_for_each_entry(iocg, &inner_walk, walk_list) {
+ iocg->child_adjusted_sum = 0;
+ iocg->hweight_donating = 0;
+ iocg->hweight_after_donation = 0;
+ }
+
+ /*
+ * Propagate the donating budget (b_t) and after donation budget (b'_t)
+ * up the hierarchy.
+ */
list_for_each_entry(iocg, surpluses, surplus_list) {
- u32 old_hwi, new_hwi, new_inuse;
+ struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
+
+ parent->hweight_donating += iocg->hweight_donating;
+ parent->hweight_after_donation += iocg->hweight_after_donation;
+ }
- current_hweight(iocg, NULL, &old_hwi);
- new_hwi = iocg->hweight_after_donation;
+ list_for_each_entry_reverse(iocg, &inner_walk, walk_list) {
+ if (iocg->level > 0) {
+ struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
- new_inuse = DIV64_U64_ROUND_UP((u64)iocg->inuse * new_hwi,
- old_hwi);
- __propagate_weights(iocg, iocg->weight, new_inuse);
+ parent->hweight_donating += iocg->hweight_donating;
+ parent->hweight_after_donation += iocg->hweight_after_donation;
+ }
+ }
+
+ /*
+ * Calculate inner hwa's (b) and make sure the donation values are
+ * within the accepted ranges as we're doing low res calculations with
+ * roundups.
+ */
+ list_for_each_entry(iocg, &inner_walk, walk_list) {
+ if (iocg->level) {
+ struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
+
+ iocg->hweight_active = DIV64_U64_ROUND_UP(
+ (u64)parent->hweight_active * iocg->active,
+ parent->child_active_sum);
+
+ }
+
+ iocg->hweight_donating = min(iocg->hweight_donating,
+ iocg->hweight_active);
+ iocg->hweight_after_donation = min(iocg->hweight_after_donation,
+ iocg->hweight_donating - 1);
+ if (WARN_ON_ONCE(iocg->hweight_active <= 1 ||
+ iocg->hweight_donating <= 1 ||
+ iocg->hweight_after_donation == 0)) {
+ pr_warn("iocg: invalid donation weights in ");
+ pr_cont_cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup);
+ pr_cont(": active=%u donating=%u after=%u\n",
+ iocg->hweight_active, iocg->hweight_donating,
+ iocg->hweight_after_donation);
+ }
}
+
+ /*
+ * Calculate the global donation rate (gamma) - the rate to adjust
+ * non-donating budgets by. No need to use 64bit multiplication here as
+ * the first operand is guaranteed to be smaller than WEIGHT_ONE
+ * (1<<16).
+ *
+ * gamma = (1 - t_r') / (1 - t_r)
+ */
+ gamma = DIV_ROUND_UP(
+ (WEIGHT_ONE - root_iocg->hweight_after_donation) * WEIGHT_ONE,
+ WEIGHT_ONE - root_iocg->hweight_donating);
+
+ /*
+ * Calculate adjusted hwi, child_adjusted_sum and inuse for the inner
+ * nodes.
+ */
+ list_for_each_entry(iocg, &inner_walk, walk_list) {
+ struct ioc_gq *parent;
+ u32 inuse, wpt, wptp;
+ u64 st, sf;
+
+ if (iocg->level == 0) {
+ /* adjusted weight sum for 1st level: s' = s * b_pf / b'_pf */
+ iocg->child_adjusted_sum = DIV64_U64_ROUND_UP(
+ iocg->child_active_sum * (WEIGHT_ONE - iocg->hweight_donating),
+ WEIGHT_ONE - iocg->hweight_after_donation);
+ continue;
+ }
+
+ parent = iocg->ancestors[iocg->level - 1];
+
+ /* b' = gamma * b_f + b_t' */
+ iocg->hweight_inuse = DIV64_U64_ROUND_UP(
+ (u64)gamma * (iocg->hweight_active - iocg->hweight_donating),
+ WEIGHT_ONE) + iocg->hweight_after_donation;
+
+ /* w' = s' * b' / b'_p */
+ inuse = DIV64_U64_ROUND_UP(
+ (u64)parent->child_adjusted_sum * iocg->hweight_inuse,
+ parent->hweight_inuse);
+
+ /* adjusted weight sum for children: s' = s_f + s_t * w'_pt / w_pt */
+ st = DIV64_U64_ROUND_UP(
+ iocg->child_active_sum * iocg->hweight_donating,
+ iocg->hweight_active);
+ sf = iocg->child_active_sum - st;
+ wpt = DIV64_U64_ROUND_UP(
+ (u64)iocg->active * iocg->hweight_donating,
+ iocg->hweight_active);
+ wptp = DIV64_U64_ROUND_UP(
+ (u64)inuse * iocg->hweight_after_donation,
+ iocg->hweight_inuse);
+
+ iocg->child_adjusted_sum = sf + DIV64_U64_ROUND_UP(st * wptp, wpt);
+ }
+
+ /*
+ * All inner nodes now have ->hweight_inuse and ->child_adjusted_sum and
+ * we can finally determine leaf adjustments.
+ */
+ list_for_each_entry(iocg, surpluses, surplus_list) {
+ struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
+ u32 inuse;
+
+ /* w' = s' * b' / b'_p, note that b' == b'_t for donating leaves */
+ inuse = DIV64_U64_ROUND_UP(
+ parent->child_adjusted_sum * iocg->hweight_after_donation,
+ parent->hweight_inuse);
+ __propagate_weights(iocg, iocg->active, inuse);
+ }
+
+ /* walk list should be dissolved after use */
+ list_for_each_entry_safe(iocg, tiocg, &inner_walk, walk_list)
+ list_del_init(&iocg->walk_list);
}
static void ioc_timer_fn(struct timer_list *timer)
@@ -1705,16 +1939,18 @@ static void ioc_timer_fn(struct timer_list *timer)
if (hw_inuse < hw_active ||
(!waitqueue_active(&iocg->waitq) &&
time_before64(vtime, now.vnow - ioc->margins.max))) {
- u32 hwm, new_hwi;
+ u32 hwa, hwm, new_hwi;
/*
* Already donating or accumulated enough to start.
* Determine the donation amount.
*/
+ current_hweight(iocg, &hwa, NULL);
hwm = current_hweight_max(iocg);
new_hwi = hweight_after_donation(iocg, hwm, usage,
&now);
if (new_hwi < hwm) {
+ iocg->hweight_donating = hwa;
iocg->hweight_after_donation = new_hwi;
list_add(&iocg->surplus_list, &surpluses);
} else {
--
2.26.2
iocost went through significant internal changes. Update iocost_monitor.py
accordingly.
Signed-off-by: Tejun Heo <[email protected]>
---
tools/cgroup/iocost_monitor.py | 54 ++++++++++++----------------------
1 file changed, 19 insertions(+), 35 deletions(-)
diff --git a/tools/cgroup/iocost_monitor.py b/tools/cgroup/iocost_monitor.py
index f4699f9b46ba..c4ff907c078b 100644
--- a/tools/cgroup/iocost_monitor.py
+++ b/tools/cgroup/iocost_monitor.py
@@ -45,8 +45,7 @@ args = parser.parse_args()
err('The kernel does not have iocost enabled')
IOC_RUNNING = prog['IOC_RUNNING'].value_()
-NR_USAGE_SLOTS = prog['NR_USAGE_SLOTS'].value_()
-HWEIGHT_WHOLE = prog['HWEIGHT_WHOLE'].value_()
+WEIGHT_ONE = prog['WEIGHT_ONE'].value_()
VTIME_PER_SEC = prog['VTIME_PER_SEC'].value_()
VTIME_PER_USEC = prog['VTIME_PER_USEC'].value_()
AUTOP_SSD_FAST = prog['AUTOP_SSD_FAST'].value_()
@@ -100,7 +99,7 @@ autop_names = {
self.period_ms = ioc.period_us.value_() / 1_000
self.period_at = ioc.period_at.value_() / 1_000_000
self.vperiod_at = ioc.period_at_vtime.value_() / VTIME_PER_SEC
- self.vrate_pct = ioc.vtime_rate.counter.value_() * 100 / VTIME_PER_USEC
+ self.vrate_pct = ioc.vtime_base_rate.value_() * 100 / VTIME_PER_USEC
self.busy_level = ioc.busy_level.value_()
self.autop_idx = ioc.autop_idx.value_()
self.user_cost_model = ioc.user_cost_model.value_()
@@ -136,7 +135,7 @@ autop_names = {
def table_header_str(self):
return f'{"":25} active {"weight":>9} {"hweight%":>13} {"inflt%":>6} ' \
- f'{"dbt":>3} {"delay":>6} {"usages%"}'
+ f'{"debt":>7} {"delay":>7} {"usage%"}'
class IocgStat:
def __init__(self, iocg):
@@ -144,11 +143,11 @@ autop_names = {
blkg = iocg.pd.blkg
self.is_active = not list_empty(iocg.active_list.address_of_())
- self.weight = iocg.weight.value_()
- self.active = iocg.active.value_()
- self.inuse = iocg.inuse.value_()
- self.hwa_pct = iocg.hweight_active.value_() * 100 / HWEIGHT_WHOLE
- self.hwi_pct = iocg.hweight_inuse.value_() * 100 / HWEIGHT_WHOLE
+ self.weight = iocg.weight.value_() / WEIGHT_ONE
+ self.active = iocg.active.value_() / WEIGHT_ONE
+ self.inuse = iocg.inuse.value_() / WEIGHT_ONE
+ self.hwa_pct = iocg.hweight_active.value_() * 100 / WEIGHT_ONE
+ self.hwi_pct = iocg.hweight_inuse.value_() * 100 / WEIGHT_ONE
self.address = iocg.value_()
vdone = iocg.done_vtime.counter.value_()
@@ -160,23 +159,13 @@ autop_names = {
else:
self.inflight_pct = 0
- # vdebt used to be an atomic64_t and is now u64, support both
- try:
- self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000
- except:
- self.debt_ms = iocg.abs_vdebt.value_() / VTIME_PER_USEC / 1000
-
- self.use_delay = blkg.use_delay.counter.value_()
- self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000
-
- usage_idx = iocg.usage_idx.value_()
- self.usages = []
- self.usage = 0
- for i in range(NR_USAGE_SLOTS):
- usage = iocg.usages[(usage_idx + 1 + i) % NR_USAGE_SLOTS].value_()
- upct = usage * 100 / HWEIGHT_WHOLE
- self.usages.append(upct)
- self.usage = max(self.usage, upct)
+ self.usage = (100 * iocg.usage_delta_us.value_() /
+ ioc.period_us.value_()) if self.active else 0
+ self.debt_ms = iocg.abs_vdebt.value_() / VTIME_PER_USEC / 1000
+ if blkg.use_delay.counter.value_() != 0:
+ self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000
+ else:
+ self.delay_ms = 0
def dict(self, now, path):
out = { 'cgroup' : path,
@@ -189,25 +178,20 @@ autop_names = {
'hweight_inuse_pct' : self.hwi_pct,
'inflight_pct' : self.inflight_pct,
'debt_ms' : self.debt_ms,
- 'use_delay' : self.use_delay,
'delay_ms' : self.delay_ms,
'usage_pct' : self.usage,
'address' : self.address }
- for i in range(len(self.usages)):
- out[f'usage_pct_{i}'] = str(self.usages[i])
return out
def table_row_str(self, path):
out = f'{path[-28:]:28} ' \
f'{"*" if self.is_active else " "} ' \
- f'{self.inuse:5}/{self.active:5} ' \
+ f'{round(self.inuse):5}/{round(self.active):5} ' \
f'{self.hwi_pct:6.2f}/{self.hwa_pct:6.2f} ' \
f'{self.inflight_pct:6.2f} ' \
- f'{min(math.ceil(self.debt_ms), 999):3} ' \
- f'{min(self.use_delay, 99):2}*'\
- f'{min(math.ceil(self.delay_ms), 999):03} '
- for u in self.usages:
- out += f'{min(round(u), 999):03d}:'
+ f'{self.debt_ms:7.2f} ' \
+ f'{self.delay_ms:7.2f} '\
+ f'{min(self.usage, 999):6.2f}'
out = out.rstrip(':')
return out
--
2.26.2
Debt handling had several issues.
* How much inuse a debtor carries wasn't clearly defined. inuse would be
driven down over time from not issuing IOs but it'd be better to clamp it
to the minimum immediately once in debt.
* How much can be paid off was determined by hweight_inuse. As inuse was
driven down, the payment amount would fall together regardless of the
debtor's active weight. This means that the debtors were punished harshly.
* ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
iocg_kick_delay().
This patch revamps debt handling so that
* Debt handling owns inuse for iocgs in debt and keeps them at zero.
* Payment amount is determined by hweight_active (see the example below).
This is more deterministic and safer than hweight_inuse but still far from
ideal in that it doesn't factor in possible donations from other iocgs for
debt payments. This likely needs further improvements in the future.
* ioc_rqos_merge() now calls blkcg_schedule_throttle() as necessary.
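As an illustration of the payment change (numbers made up): paying off
abs_cost of debt consumes abs_cost_to_cost(abs_cost, hw) = abs_cost *
WEIGHT_ONE / hw of local vtime budget. With hweight_active at 25%, paying
off 1ms worth of absolute debt costs 4ms of vtime - bounded and
predictable. With the old hweight_inuse-based payment, inuse being driven
down would push hw towards ~0, so the same 1ms of debt could cost an
essentially unbounded amount of vtime, which is how debtors ended up
punished far more harshly than their configured weights warranted.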
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
---
block/blk-iocost.c | 117 +++++++++++++++++++++++++++++++++++----------
1 file changed, 93 insertions(+), 24 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index d09b4011449c..d2b69d87f3e7 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1206,13 +1206,13 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
u64 vtime = atomic64_read(&iocg->vtime);
u64 delta_ns, expires, oexpires;
- u32 hw_inuse;
+ u32 hwa;
lockdep_assert_held(&iocg->waitq.lock);
/* debt-adjust vtime */
- current_hweight(iocg, NULL, &hw_inuse);
- vtime += abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
+ current_hweight(iocg, &hwa, NULL);
+ vtime += abs_cost_to_cost(iocg->abs_vdebt, hwa);
/*
* Clear or maintain depending on the overage. Non-zero vdebt is what
@@ -1258,6 +1258,47 @@ static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+static void iocg_incur_debt(struct ioc_gq *iocg, u64 abs_cost,
+ struct ioc_now *now)
+{
+ struct iocg_pcpu_stat *gcs;
+
+ lockdep_assert_held(&iocg->ioc->lock);
+ lockdep_assert_held(&iocg->waitq.lock);
+ WARN_ON_ONCE(list_empty(&iocg->active_list));
+
+ /*
+ * Once in debt, debt handling owns inuse. @iocg stays at the minimum
+ * inuse donating all of it share to others until its debt is paid off.
+ */
+ if (!iocg->abs_vdebt && abs_cost)
+ propagate_weights(iocg, iocg->active, 0, false, now);
+
+ iocg->abs_vdebt += abs_cost;
+
+ gcs = get_cpu_ptr(iocg->pcpu_stat);
+ local64_add(abs_cost, &gcs->abs_vusage);
+ put_cpu_ptr(gcs);
+}
+
+static void iocg_pay_debt(struct ioc_gq *iocg, u64 abs_vpay,
+ struct ioc_now *now)
+{
+ lockdep_assert_held(&iocg->ioc->lock);
+ lockdep_assert_held(&iocg->waitq.lock);
+
+ /* make sure that nobody messed with @iocg */
+ WARN_ON_ONCE(list_empty(&iocg->active_list));
+ WARN_ON_ONCE(iocg->inuse > 1);
+
+ iocg->abs_vdebt -= min(abs_vpay, iocg->abs_vdebt);
+
+ /* if debt is paid in full, restore inuse */
+ if (!iocg->abs_vdebt)
+ propagate_weights(iocg, iocg->active, iocg->last_inuse,
+ false, now);
+}
+
static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
int flags, void *key)
{
@@ -1296,26 +1337,25 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
struct iocg_wake_ctx ctx = { .iocg = iocg };
u64 vshortage, expires, oexpires;
s64 vbudget;
- u32 hw_inuse;
+ u32 hwa;
lockdep_assert_held(&iocg->waitq.lock);
- current_hweight(iocg, NULL, &hw_inuse);
+ current_hweight(iocg, &hwa, NULL);
vbudget = now->vnow - atomic64_read(&iocg->vtime);
/* pay off debt */
if (pay_debt && iocg->abs_vdebt && vbudget > 0) {
- u64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
- u64 delta = min_t(u64, vbudget, vdebt);
- u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse),
- iocg->abs_vdebt);
+ u64 abs_vbudget = cost_to_abs_cost(vbudget, hwa);
+ u64 abs_vpay = min_t(u64, abs_vbudget, iocg->abs_vdebt);
+ u64 vpay = abs_cost_to_cost(abs_vpay, hwa);
lockdep_assert_held(&ioc->lock);
- atomic64_add(delta, &iocg->vtime);
- atomic64_add(delta, &iocg->done_vtime);
- iocg->abs_vdebt -= abs_delta;
- vbudget -= vdebt;
+ atomic64_add(vpay, &iocg->vtime);
+ atomic64_add(vpay, &iocg->done_vtime);
+ iocg_pay_debt(iocg, abs_vpay, now);
+ vbudget -= vpay;
iocg_kick_delay(iocg, now);
}
@@ -1327,17 +1367,20 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
* not positive.
*/
if (iocg->abs_vdebt) {
- s64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
+ s64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hwa);
vbudget = min_t(s64, 0, vbudget - vdebt);
}
/*
- * Wake up the ones which are due and see how much vtime we'll need
- * for the next one.
+ * Wake up the ones which are due and see how much vtime we'll need for
+ * the next one. As paying off debt restores hw_inuse, it must be read
+ * after the above debt payment.
*/
- ctx.hw_inuse = hw_inuse;
ctx.vbudget = vbudget;
+ current_hweight(iocg, NULL, &ctx.hw_inuse);
+
__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);
+
if (!waitqueue_active(&iocg->waitq))
return;
if (WARN_ON_ONCE(ctx.vbudget >= 0))
@@ -1525,6 +1568,10 @@ static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
u64 vtime = atomic64_read(&iocg->vtime);
s64 excess, delta, target, new_hwi;
+ /* debt handling owns inuse for debtors */
+ if (iocg->abs_vdebt)
+ return 1;
+
/* see whether minimum margin requirement is met */
if (waitqueue_active(&iocg->waitq) ||
time_after64(vtime, now->vnow - ioc->margins.min))
@@ -1798,6 +1845,18 @@ static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
u32 inuse;
+ /*
+ * In-debt iocgs participated in the donation calculation with
+ * the minimum target hweight_inuse. Configuring inuse
+ * accordingly would work fine but debt handling expects
+ * @iocg->inuse stay at the minimum and we don't wanna
+ * interfere.
+ */
+ if (iocg->abs_vdebt) {
+ WARN_ON_ONCE(iocg->inuse > 1);
+ continue;
+ }
+
/* w' = s' * b' / b'_p, note that b' == b'_t for donating leaves */
inuse = DIV64_U64_ROUND_UP(
parent->child_adjusted_sum * iocg->hweight_after_donation,
@@ -2081,6 +2140,10 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
cost = abs_cost_to_cost(abs_cost, hwi);
margin = now->vnow - vtime - cost;
+ /* debt handling owns inuse for debtors */
+ if (iocg->abs_vdebt)
+ return cost;
+
/*
* We only increase inuse during period and do so iff the margin has
* deteriorated since the previous adjustment.
@@ -2092,7 +2155,7 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
spin_lock_irq(&ioc->lock);
/* we own inuse only when @iocg is in the normal active state */
- if (list_empty(&iocg->active_list)) {
+ if (iocg->abs_vdebt || list_empty(&iocg->active_list)) {
spin_unlock_irq(&ioc->lock);
return cost;
}
@@ -2266,7 +2329,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
* penalizing the cgroup and its descendants.
*/
if (use_debt) {
- iocg->abs_vdebt += abs_cost;
+ iocg_incur_debt(iocg, abs_cost, &now);
if (iocg_kick_delay(iocg, &now))
blkcg_schedule_throttle(rqos->q,
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
@@ -2275,7 +2338,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
}
/* guarantee that iocgs w/ waiters have maximum inuse */
- if (iocg->inuse != iocg->active) {
+ if (!iocg->abs_vdebt && iocg->inuse != iocg->active) {
if (!ioc_locked) {
iocg_unlock(iocg, false, &flags);
ioc_locked = true;
@@ -2363,14 +2426,20 @@ static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
* be for the vast majority of cases. See debt handling in
* ioc_rqos_throttle() for details.
*/
- spin_lock_irqsave(&iocg->waitq.lock, flags);
+ spin_lock_irqsave(&ioc->lock, flags);
+ spin_lock(&iocg->waitq.lock);
+
if (likely(!list_empty(&iocg->active_list))) {
- iocg->abs_vdebt += abs_cost;
- iocg_kick_delay(iocg, &now);
+ iocg_incur_debt(iocg, abs_cost, &now);
+ if (iocg_kick_delay(iocg, &now))
+ blkcg_schedule_throttle(rqos->q,
+ (bio->bi_opf & REQ_SWAP) == REQ_SWAP);
} else {
iocg_commit_bio(iocg, bio, abs_cost, cost);
}
- spin_unlock_irqrestore(&iocg->waitq.lock, flags);
+
+ spin_unlock(&iocg->waitq.lock);
+ spin_unlock_irqrestore(&ioc->lock, flags);
}
static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
--
2.26.2
Currently, iocost syncs the delay duration to the outstanding debt amount,
which seemed enough to protect the system from anon memory hogs. However,
that was mostly because the delay calculation was using hweight_inuse,
which quickly converges towards zero under debt, often punishing debtors
overly harshly and for longer than deserved.
The previous patch fixed the delay calculation, but now the protection
against anonymous memory hogs isn't enough because the effect of delay is
indirect and non-linear and a huge amount of future debt can accumulate
abruptly while unthrottled.
This patch implements delay hysteresis so that delay is decayed
exponentially over time instead of being cleared immediately as debt is
paid off. While the overall behavior is similar to the blk-cgroup
implementation used by blk-iolatency, a lot of the details are different
and, due to the empirical nature of the mechanism, it's challenging to
adapt it for one controller without negatively impacting the other.
As the delay is gradually decayed now, there's no point in running it from
its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
the dedicated hrtimer is removed.
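Concretely, as the diff below computes it, the delay currently in effect is
iocg->delay >> div64_u64(now - iocg->delay_at, USEC_PER_SEC), i.e. the
delay halves for every full second elapsed since delay_at: a 128ms delay
decays to 64ms after one second, 32ms after two, and so on, giving the long
tail described above. (The MIN_DELAY/MAX_DELAY constants, 250us and 250ms,
presumably bound the range while debt-driven delay is in effect.)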
Signed-off-by: Tejun Heo <[email protected]>
Cc: Josef Bacik <[email protected]>
---
block/blk-cgroup.c | 23 ++++++---
block/blk-iocost.c | 119 ++++++++++++++++++++++++++-------------------
2 files changed, 86 insertions(+), 56 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c195365c9817..d33dd6be1d9c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1613,16 +1613,24 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 now)
static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
{
unsigned long pflags;
+ bool clamp;
u64 now = ktime_to_ns(ktime_get());
u64 exp;
u64 delay_nsec = 0;
int tok;
while (blkg->parent) {
- if (atomic_read(&blkg->use_delay)) {
+ int use_delay = atomic_read(&blkg->use_delay);
+
+ if (use_delay) {
+ u64 this_delay;
+
blkcg_scale_delay(blkg, now);
- delay_nsec = max_t(u64, delay_nsec,
- atomic64_read(&blkg->delay_nsec));
+ this_delay = atomic64_read(&blkg->delay_nsec);
+ if (this_delay > delay_nsec) {
+ delay_nsec = this_delay;
+ clamp = use_delay > 0;
+ }
}
blkg = blkg->parent;
}
@@ -1634,10 +1642,13 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
* Let's not sleep for all eternity if we've amassed a huge delay.
* Swapping or metadata IO can accumulate 10's of seconds worth of
* delay, and we want userspace to be able to do _something_ so cap the
- * delays at 1 second. If there's 10's of seconds worth of delay then
- * the tasks will be delayed for 1 second for every syscall.
+ * delays at 0.25s. If there's 10's of seconds worth of delay then the
+ * tasks will be delayed for 0.25 second for every syscall. If
+ * blkcg_set_delay() was used as indicated by negative use_delay, the
+ * caller is responsible for regulating the range.
*/
- delay_nsec = min_t(u64, delay_nsec, 250 * NSEC_PER_MSEC);
+ if (clamp)
+ delay_nsec = min_t(u64, delay_nsec, 250 * NSEC_PER_MSEC);
if (use_memdelay)
psi_memstall_enter(&pflags);
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index d2b69d87f3e7..9cb8f29f01f5 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -270,6 +270,31 @@ enum {
/* unbusy hysterisis */
UNBUSY_THR_PCT = 75,
+ /*
+ * The effect of delay is indirect and non-linear and a huge amount of
+ * future debt can accumulate abruptly while unthrottled. Linearly scale
+ * up delay as debt is going up and then let it decay exponentially.
+ * This gives us quick ramp ups while delay is accumulating and long
+ * tails which can help reducing the frequency of debt explosions on
+ * unthrottle. The parameters are experimentally determined.
+ *
+ * The delay mechanism provides adequate protection and behavior in many
+ * cases. However, this is far from ideal and falls short on both
+ * fronts. The debtors are often throttled too harshly costing a
+ * significant level of fairness and possibly total work while the
+ * protection against their impacts on the system can be choppy and
+ * unreliable.
+ *
+ * The shortcoming primarily stems from the fact that, unlike for page
+ * cache, the kernel doesn't have well-defined back-pressure propagation
+ * mechanism and policies for anonymous memory. Fully addressing this
+ * issue will likely require substantial improvements in the area.
+ */
+ MIN_DELAY_THR_PCT = 500,
+ MAX_DELAY_THR_PCT = 25000,
+ MIN_DELAY = 250,
+ MAX_DELAY = 250 * USEC_PER_MSEC,
+
/* don't let cmds which take a very long time pin lagging for too long */
MAX_LAGGING_PERIODS = 10,
@@ -473,6 +498,10 @@ struct ioc_gq {
atomic64_t done_vtime;
u64 abs_vdebt;
+ /* current delay in effect and when it started */
+ u64 delay;
+ u64 delay_at;
+
/*
* The period this iocg was last active in. Used for deactivation
* and invalidating `vtime`.
@@ -495,7 +524,6 @@ struct ioc_gq {
struct wait_queue_head waitq;
struct hrtimer waitq_timer;
- struct hrtimer delay_timer;
/* timestamp at the latest activation */
u64 activated_at;
@@ -1204,58 +1232,50 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
- u64 vtime = atomic64_read(&iocg->vtime);
- u64 delta_ns, expires, oexpires;
+ u64 tdelta, delay, new_delay;
+ s64 vover, vover_pct;
u32 hwa;
lockdep_assert_held(&iocg->waitq.lock);
- /* debt-adjust vtime */
+ /* calculate the current delay in effect - 1/2 every second */
+ tdelta = now->now - iocg->delay_at;
+ if (iocg->delay)
+ delay = iocg->delay >> div64_u64(tdelta, USEC_PER_SEC);
+ else
+ delay = 0;
+
+ /* calculate the new delay from the debt amount */
current_hweight(iocg, &hwa, NULL);
- vtime += abs_cost_to_cost(iocg->abs_vdebt, hwa);
+ vover = atomic64_read(&iocg->vtime) +
+ abs_cost_to_cost(iocg->abs_vdebt, hwa) - now->vnow;
+ vover_pct = div64_s64(100 * vover, ioc->period_us * now->vrate);
+
+ if (vover_pct <= MIN_DELAY_THR_PCT)
+ new_delay = 0;
+ else if (vover_pct >= MAX_DELAY_THR_PCT)
+ new_delay = MAX_DELAY;
+ else
+ new_delay = MIN_DELAY +
+ div_u64((MAX_DELAY - MIN_DELAY) *
+ (vover_pct - MIN_DELAY_THR_PCT),
+ MAX_DELAY_THR_PCT - MIN_DELAY_THR_PCT);
- /*
- * Clear or maintain depending on the overage. Non-zero vdebt is what
- * guarantees that @iocg is online and future iocg_kick_delay() will
- * clear use_delay. Don't leave it on when there's no vdebt.
- */
- if (!iocg->abs_vdebt || time_before_eq64(vtime, now->vnow)) {
- blkcg_clear_delay(blkg);
- return false;
+ /* pick the higher one and apply */
+ if (new_delay > delay) {
+ iocg->delay = new_delay;
+ iocg->delay_at = now->now;
+ delay = new_delay;
}
- if (!atomic_read(&blkg->use_delay) &&
- time_before_eq64(vtime, now->vnow + ioc->margins.target))
- return false;
-
- /* use delay */
- delta_ns = DIV64_U64_ROUND_UP(vtime - now->vnow,
- now->vrate) * NSEC_PER_USEC;
- blkcg_set_delay(blkg, delta_ns);
- expires = now->now_ns + delta_ns;
- /* if already active and close enough, don't bother */
- oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer));
- if (hrtimer_is_queued(&iocg->delay_timer) &&
- abs(oexpires - expires) <= ioc->timer_slack_ns)
+ if (delay >= MIN_DELAY) {
+ blkcg_set_delay(blkg, delay * NSEC_PER_USEC);
return true;
-
- hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires),
- ioc->timer_slack_ns, HRTIMER_MODE_ABS);
- return true;
-}
-
-static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer)
-{
- struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer);
- struct ioc_now now;
- unsigned long flags;
-
- spin_lock_irqsave(&iocg->waitq.lock, flags);
- ioc_now(iocg->ioc, &now);
- iocg_kick_delay(iocg, &now);
- spin_unlock_irqrestore(&iocg->waitq.lock, flags);
-
- return HRTIMER_NORESTART;
+ } else {
+ iocg->delay = 0;
+ blkcg_clear_delay(blkg);
+ return false;
+ }
}
static void iocg_incur_debt(struct ioc_gq *iocg, u64 abs_cost,
@@ -1356,9 +1376,10 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
atomic64_add(vpay, &iocg->done_vtime);
iocg_pay_debt(iocg, abs_vpay, now);
vbudget -= vpay;
+ }
+ if (iocg->abs_vdebt || iocg->delay)
iocg_kick_delay(iocg, now);
- }
/*
* Debt can still be outstanding if we haven't paid all yet or the
@@ -1906,12 +1927,13 @@ static void ioc_timer_fn(struct timer_list *timer)
*/
list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) {
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
- !iocg_is_idle(iocg))
+ !iocg->delay && !iocg_is_idle(iocg))
continue;
spin_lock(&iocg->waitq.lock);
- if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt) {
+ if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt ||
+ iocg->delay) {
/* might be oversleeping vtime / hweight changes, kick */
iocg_kick_waitq(iocg, true, &now);
} else if (iocg_is_idle(iocg)) {
@@ -2641,8 +2663,6 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
init_waitqueue_head(&iocg->waitq);
hrtimer_init(&iocg->waitq_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
iocg->waitq_timer.function = iocg_waitq_timer_fn;
- hrtimer_init(&iocg->delay_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
- iocg->delay_timer.function = iocg_delay_timer_fn;
iocg->level = blkg->blkcg->css.cgroup->level;
@@ -2679,7 +2699,6 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
spin_unlock_irqrestore(&ioc->lock, flags);
hrtimer_cancel(&iocg->waitq_timer);
- hrtimer_cancel(&iocg->delay_timer);
}
free_percpu(iocg->pcpu_stat);
kfree(iocg);
--
2.26.2
propagate_active_weights() already propagates two weights - active and
inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().
This is a pure rename.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index e2266e7692b4..78e6919153d8 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -479,7 +479,7 @@ struct ioc_gq {
atomic64_t active_period;
struct list_head active_list;
- /* see __propagate_active_weight() and current_hweight() for details */
+ /* see __propagate_weights() and current_hweight() for details */
u64 child_active_sum;
u64 child_inuse_sum;
int hweight_gen;
@@ -890,7 +890,7 @@ static void ioc_start_period(struct ioc *ioc, struct ioc_now *now)
* Update @iocg's `active` and `inuse` to @active and @inuse, update level
* weight sums and propagate upwards accordingly.
*/
-static void __propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse)
+static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
{
struct ioc *ioc = iocg->ioc;
int lvl;
@@ -935,7 +935,7 @@ static void __propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse
ioc->weights_updated = true;
}
-static void commit_active_weights(struct ioc *ioc)
+static void commit_weights(struct ioc *ioc)
{
lockdep_assert_held(&ioc->lock);
@@ -947,10 +947,10 @@ static void commit_active_weights(struct ioc *ioc)
}
}
-static void propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse)
+static void propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
{
- __propagate_active_weight(iocg, active, inuse);
- commit_active_weights(iocg->ioc);
+ __propagate_weights(iocg, active, inuse);
+ commit_weights(iocg->ioc);
}
static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep)
@@ -966,9 +966,9 @@ static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep
goto out;
/*
- * Paired with wmb in commit_active_weights(). If we saw the
- * updated hweight_gen, all the weight updates from
- * __propagate_active_weight() are visible too.
+ * Paired with wmb in commit_weights(). If we saw the updated
+ * hweight_gen, all the weight updates from __propagate_weights() are
+ * visible too.
*
* We can race with weight updates during calculation and get it
* wrong. However, hweight_gen would have changed and a future
@@ -1018,7 +1018,7 @@ static void weight_updated(struct ioc_gq *iocg)
weight = iocg->cfg_weight ?: iocc->dfl_weight;
if (weight != iocg->weight && iocg->active)
- propagate_active_weight(iocg, weight,
+ propagate_weights(iocg, weight,
DIV64_U64_ROUND_UP(iocg->inuse * weight, iocg->weight));
iocg->weight = weight;
}
@@ -1090,8 +1090,8 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
*/
iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1;
list_add(&iocg->active_list, &ioc->active_iocgs);
- propagate_active_weight(iocg, iocg->weight,
- iocg->last_inuse ?: iocg->weight);
+ propagate_weights(iocg, iocg->weight,
+ iocg->last_inuse ?: iocg->weight);
TRACE_IOCG_PATH(iocg_activate, iocg, now,
last_period, cur_period, vtime);
@@ -1384,13 +1384,13 @@ static void ioc_timer_fn(struct timer_list *timer)
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
iocg->last_inuse = iocg->inuse;
- __propagate_active_weight(iocg, 0, 0);
+ __propagate_weights(iocg, 0, 0);
list_del_init(&iocg->active_list);
}
spin_unlock(&iocg->waitq.lock);
}
- commit_active_weights(ioc);
+ commit_weights(ioc);
/* calc usages and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
@@ -1483,8 +1483,8 @@ static void ioc_timer_fn(struct timer_list *timer)
TRACE_IOCG_PATH(inuse_takeback, iocg, &now,
iocg->inuse, new_inuse,
hw_inuse, new_hwi);
- __propagate_active_weight(iocg, iocg->weight,
- new_inuse);
+ __propagate_weights(iocg, iocg->weight,
+ new_inuse);
}
} else {
/* genuninely out of vtime */
@@ -1524,11 +1524,11 @@ static void ioc_timer_fn(struct timer_list *timer)
TRACE_IOCG_PATH(inuse_giveaway, iocg, &now,
iocg->inuse, new_inuse,
hw_inuse, new_hwi);
- __propagate_active_weight(iocg, iocg->weight, new_inuse);
+ __propagate_weights(iocg, iocg->weight, new_inuse);
}
}
skip_surplus_transfers:
- commit_active_weights(ioc);
+ commit_weights(ioc);
/*
* If q is getting clogged or we're missing too much, we're issuing
@@ -1753,7 +1753,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
TRACE_IOCG_PATH(inuse_reset, iocg, &now,
iocg->inuse, iocg->weight, hw_inuse, hw_active);
spin_lock_irq(&ioc->lock);
- propagate_active_weight(iocg, iocg->weight, iocg->weight);
+ propagate_weights(iocg, iocg->weight, iocg->weight);
spin_unlock_irq(&ioc->lock);
current_hweight(iocg, &hw_active, &hw_inuse);
}
@@ -2114,7 +2114,7 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
if (ioc) {
spin_lock_irqsave(&ioc->lock, flags);
if (!list_empty(&iocg->active_list)) {
- propagate_active_weight(iocg, 0, 0);
+ propagate_weights(iocg, 0, 0);
list_del_init(&iocg->active_list);
}
spin_unlock_irqrestore(&ioc->lock, flags);
--
2.26.2
iocost has various safety nets to combat inuse adjustment calculation
inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
adjustment calculations are now accurate and we can make donation amount
determinations accurate too.
* Stop keeping track of past usage history and using the maximum. Act on the
immediate usage information.
* Remove donation constraints defined by SURPLUS_* constants. Donate
whatever isn't used.
* Determine the donation amount so that the iocg will end up with
MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
usage as the previous period. TARGET is set at 50% of period, which is the
previous maximum. This provides smooth convergence for most repetitive IO
patterns.
* Apply donation logic early at 20% budget. There's no risk in doing so as
the calculation is based on the delta between the current budget and the
target budget at the end of the coming period.
* Remove preemptive iocg activation for zero cost IOs. As donation can reach
near zero now, the mere activation doesn't provide any protection anymore.
In the unlikely case that this becomes a problem, the right solution is
assigning appropriate costs for such IOs.
This significantly improves the donation determination logic while also
simplifying it. Now all donations are immediate, exact and smooth.
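As a worked example of the target calculation: if an iocg consumed
usage = 25% of the device and currently sits delta = 70% of a period behind
vnow, then with MARGIN_TARGET_PCT = 50,
new_hwi = 0.25 / (1 - 0.5 + 0.7) ~= 0.21 - just enough share to sustain the
same usage while burning the excess budget down to the 50% target by the
end of the coming period.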
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
---
block/blk-iocost.c | 133 +++++++++++++++++----------------------------
1 file changed, 51 insertions(+), 82 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ecc23b827e5d..694f1487208a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -217,12 +217,14 @@ enum {
MAX_PERIOD = USEC_PER_SEC,
/*
- * A cgroup's vtime can run 50% behind the device vtime, which
+ * iocg->vtime is targeted at 50% behind the device vtime, which
* serves as its IO credit buffer. Surplus weight adjustment is
* immediately canceled if the vtime margin runs below 10%.
*/
MARGIN_MIN_PCT = 10,
- MARGIN_MAX_PCT = 50,
+ MARGIN_LOW_PCT = 20,
+ MARGIN_TARGET_PCT = 50,
+ MARGIN_MAX_PCT = 100,
/* Have some play in timer operations */
TIMER_SLACK_PCT = 1,
@@ -234,17 +236,6 @@ enum {
*/
VTIME_VALID_DUR = 300 * USEC_PER_SEC,
- /*
- * Remember the past three non-zero usages and use the max for
- * surplus calculation. Three slots guarantee that we remember one
- * full period usage from the last active stretch even after
- * partial deactivation and re-activation periods. Don't start
- * giving away weight before collecting two data points to prevent
- * hweight adjustments based on one partial activation period.
- */
- NR_USAGE_SLOTS = 3,
- MIN_VALID_USAGES = 2,
-
/* 1/64k is granular enough and can easily be handled w/ u32 */
WEIGHT_ONE = 1 << 16,
@@ -280,14 +271,6 @@ enum {
/* don't let cmds which take a very long time pin lagging for too long */
MAX_LAGGING_PERIODS = 10,
- /*
- * If usage% * 1.25 + 2% is lower than hweight% by more than 3%,
- * donate the surplus.
- */
- SURPLUS_SCALE_PCT = 125, /* * 125% */
- SURPLUS_SCALE_ABS = WEIGHT_ONE / 50, /* + 2% */
- SURPLUS_MIN_ADJ_DELTA = WEIGHT_ONE / 33, /* 3% */
-
/* switch iff the conditions are met for longer than this */
AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC,
@@ -376,6 +359,8 @@ struct ioc_params {
struct ioc_margins {
s64 min;
+ s64 low;
+ s64 target;
s64 max;
};
@@ -514,11 +499,7 @@ struct ioc_gq {
struct iocg_stat desc_stat;
struct iocg_stat last_stat;
u64 last_stat_abs_vusage;
-
- /* usage is recorded as fractions of WEIGHT_ONE */
- u32 usage_delta_us;
- int usage_idx;
- u32 usages[NR_USAGE_SLOTS];
+ u64 usage_delta_us;
/* this iocg's depth in the hierarchy and ancestors including self */
int level;
@@ -737,6 +718,8 @@ static void ioc_refresh_margins(struct ioc *ioc)
u64 vrate = atomic64_read(&ioc->vtime_rate);
margins->min = (period_us * MARGIN_MIN_PCT / 100) * vrate;
+ margins->low = (period_us * MARGIN_LOW_PCT / 100) * vrate;
+ margins->target = (period_us * MARGIN_TARGET_PCT / 100) * vrate;
margins->max = (period_us * MARGIN_MAX_PCT / 100) * vrate;
}
@@ -1228,7 +1211,7 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
return false;
}
if (!atomic_read(&blkg->use_delay) &&
- time_before_eq64(vtime, now->vnow + ioc->margins.max))
+ time_before_eq64(vtime, now->vnow + ioc->margins.target))
return false;
/* use delay */
@@ -1527,7 +1510,7 @@ static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
{
struct ioc *ioc = iocg->ioc;
u64 vtime = atomic64_read(&iocg->vtime);
- s64 excess;
+ s64 excess, delta, target, new_hwi;
/* see whether minimum margin requirement is met */
if (waitqueue_active(&iocg->waitq) ||
@@ -1542,15 +1525,28 @@ static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
vtime += excess;
}
- /* add margin */
- usage = DIV_ROUND_UP(usage * SURPLUS_SCALE_PCT, 100);
- usage += SURPLUS_SCALE_ABS;
-
- /* don't bother if the surplus is too small */
- if (usage + SURPLUS_MIN_ADJ_DELTA > hwm)
- return hwm;
+ /*
+ * Let's say the distance between iocg's and device's vtimes as a
+ * fraction of period duration is delta. Assuming that the iocg will
+ * consume the usage determined above, we want to determine new_hwi so
+ * that delta equals MARGIN_TARGET at the end of the next period.
+ *
+ * We need to execute usage worth of IOs while spending the sum of the
+ * new budget (1 - MARGIN_TARGET) and the leftover from the last period
+ * (delta):
+ *
+ * usage = (1 - MARGIN_TARGET + delta) * new_hwi
+ *
+ * Therefore, the new_hwi is:
+ *
+ * new_hwi = usage / (1 - MARGIN_TARGET + delta)
+ */
+ delta = div64_s64(WEIGHT_ONE * (now->vnow - vtime),
+ now->vnow - ioc->period_at_vtime);
+ target = WEIGHT_ONE * MARGIN_TARGET_PCT / 100;
+ new_hwi = div64_s64(WEIGHT_ONE * usage, WEIGHT_ONE - target + delta);
- return usage;
+ return clamp_t(s64, new_hwi, 1, hwm);
}
/*
@@ -1812,7 +1808,7 @@ static void ioc_timer_fn(struct timer_list *timer)
u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
u32 missed_ppm[2], rq_wait_pct;
u64 period_vtime;
- int prev_busy_level, i;
+ int prev_busy_level;
/* how were the latencies during the period? */
ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
@@ -1857,11 +1853,10 @@ static void ioc_timer_fn(struct timer_list *timer)
}
commit_weights(ioc);
- /* calc usages and see whether some weights need to be moved around */
+ /* calc usage and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
- u64 vdone, vtime, usage_us;
- u32 hw_active, hw_inuse, usage;
- int uidx, nr_valid;
+ u64 vdone, vtime, usage_us, usage_dur;
+ u32 usage, hw_active, hw_inuse;
/*
* Collect unused and wind vtime closer to vnow to prevent
@@ -1886,15 +1881,11 @@ static void ioc_timer_fn(struct timer_list *timer)
nr_lagging++;
/*
- * Determine absolute usage factoring in pending and in-flight
- * IOs to avoid stalls and high-latency completions appearing as
- * idle.
+ * Determine absolute usage factoring in in-flight IOs to avoid
+ * high-latency completions appearing as idle.
*/
usage_us = iocg->usage_delta_us;
- if (waitqueue_active(&iocg->waitq) && time_before64(vtime, now.vnow))
- usage_us += DIV64_U64_ROUND_UP(
- cost_to_abs_cost(now.vnow - vtime, hw_inuse),
- now.vrate);
+
if (vdone != vtime) {
u64 inflight_us = DIV64_U64_ROUND_UP(
cost_to_abs_cost(vtime - vdone, hw_inuse),
@@ -1902,43 +1893,22 @@ static void ioc_timer_fn(struct timer_list *timer)
usage_us = max(usage_us, inflight_us);
}
- /* convert to hweight based usage ratio and record */
- uidx = (iocg->usage_idx + 1) % NR_USAGE_SLOTS;
-
- if (time_after64(vtime, now.vnow - ioc->margins.min)) {
- iocg->usage_idx = uidx;
- iocg->usages[uidx] = WEIGHT_ONE;
- } else if (usage_us) {
- u64 started_at, dur;
-
- if (time_after64(iocg->activated_at, ioc->period_at))
- started_at = iocg->activated_at;
- else
- started_at = ioc->period_at;
-
- dur = max_t(u64, now.now - started_at, 1);
+ /* convert to hweight based usage ratio */
+ if (time_after64(iocg->activated_at, ioc->period_at))
+ usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
+ else
+ usage_dur = max_t(u64, now.now - ioc->period_at, 1);
- iocg->usage_idx = uidx;
- iocg->usages[uidx] = clamp_t(u32,
- DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, dur),
+ usage = clamp_t(u32,
+ DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
+ usage_dur),
1, WEIGHT_ONE);
- }
-
- /* base the decision on max historical usage */
- for (i = 0, usage = 0, nr_valid = 0; i < NR_USAGE_SLOTS; i++) {
- if (iocg->usages[i]) {
- usage = max(usage, iocg->usages[i]);
- nr_valid++;
- }
- }
- if (nr_valid < MIN_VALID_USAGES)
- usage = WEIGHT_ONE;
/* see whether there's surplus vtime */
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
if (hw_inuse < hw_active ||
(!waitqueue_active(&iocg->waitq) &&
- time_before64(vtime, now.vnow - ioc->margins.max))) {
+ time_before64(vtime, now.vnow - ioc->margins.low))) {
u32 hwa, hwm, new_hwi;
/*
@@ -2175,15 +2145,14 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
if (!ioc->enabled || !iocg->level)
return;
- /* always activate so that even 0 cost IOs get protected to some level */
- if (!iocg_activate(iocg, &now))
- return;
-
/* calculate the absolute vtime cost */
abs_cost = calc_vtime_cost(bio, iocg, false);
if (!abs_cost)
return;
+ if (!iocg_activate(iocg, &now))
+ return;
+
iocg->cursor = bio_end_sector(bio);
vtime = atomic64_read(&iocg->vtime);
--
2.26.2
The way the surplus donation logic is structured isn't great. There are two
separate paths for starting/increasing donations and decreasing them,
making the logic harder to follow and prone to unnecessary behavior
differences.
In preparation for improved donation handling, this patch restructures the
code so that
* All donors - new, increasing and decreasing - are funneled through the
same code path.
* The target donation calculation is factored into hweight_after_donation()
which is called once from the same spot for all possible donors.
* Actual inuse adjustment is factored into transfer_surpluses().
This change introduces a few behavior differences - e.g. donation amount
reduction now uses the max usage of the recent three periods just like new
and increasing donations, and inuse now gets adjusted upwards the same way
it gets downwards. These differences are unlikely to have severely negative
implications and the whole logic will be revamped soon.
This patch also removes two tracepoints. The existing TPs don't quite fit
the new implementation. A later patch will update and reinstate them.
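To make the new current_hweight_max() concrete with a worked example:
consider an iocg with active = 100 and inuse = 20 whose parent has
child_inuse_sum = 120. Raising the iocg's inuse to its active weight gives
child_inuse_sum = 120 + 100 - 20 = 200, so the maximum attainable
hweight_inuse at that level is 100 / 200 = 50% of the parent's share; the
walk repeats the same step at each level up to the root.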
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 179 ++++++++++++++++++++++++++-------------------
1 file changed, 103 insertions(+), 76 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index a3889a8b0a33..61b008d0801f 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -494,6 +494,7 @@ struct ioc_gq {
int hweight_gen;
u32 hweight_active;
u32 hweight_inuse;
+ u32 hweight_after_donation;
struct list_head walk_list;
struct list_head surplus_list;
@@ -1070,6 +1071,32 @@ static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep
*hw_inusep = iocg->hweight_inuse;
}
+/*
+ * Calculate the hweight_inuse @iocg would get with max @inuse assuming all the
+ * other weights stay unchanged.
+ */
+static u32 current_hweight_max(struct ioc_gq *iocg)
+{
+ u32 hwm = WEIGHT_ONE;
+ u32 inuse = iocg->active;
+ u64 child_inuse_sum;
+ int lvl;
+
+ lockdep_assert_held(&iocg->ioc->lock);
+
+ for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
+ struct ioc_gq *parent = iocg->ancestors[lvl];
+ struct ioc_gq *child = iocg->ancestors[lvl + 1];
+
+ child_inuse_sum = parent->child_inuse_sum + inuse - child->inuse;
+ hwm = div64_u64((u64)hwm * inuse, child_inuse_sum);
+ inuse = DIV64_U64_ROUND_UP(parent->active * child_inuse_sum,
+ parent->child_active_sum);
+ }
+
+ return max_t(u32, hwm, 1);
+}
+
static void weight_updated(struct ioc_gq *iocg)
{
struct ioc *ioc = iocg->ioc;
@@ -1488,20 +1515,58 @@ static void iocg_flush_stat(struct list_head *target_iocgs, struct ioc_now *now)
}
}
-/* returns usage with margin added if surplus is large enough */
-static u32 surplus_adjusted_hweight_inuse(u32 usage, u32 hw_inuse)
+/*
+ * Determine what @iocg's hweight_inuse should be after donating unused
+ * capacity. @hwm is the upper bound and used to signal no donation. This
+ * function also throws away @iocg's excess budget.
+ */
+static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
+ struct ioc_now *now)
{
+ struct ioc *ioc = iocg->ioc;
+ u64 vtime = atomic64_read(&iocg->vtime);
+ s64 excess;
+
+ /* see whether minimum margin requirement is met */
+ if (waitqueue_active(&iocg->waitq) ||
+ time_after64(vtime, now->vnow - ioc->margins.min))
+ return hwm;
+
+ /* throw away excess above max */
+ excess = now->vnow - vtime - ioc->margins.max;
+ if (excess > 0) {
+ atomic64_add(excess, &iocg->vtime);
+ atomic64_add(excess, &iocg->done_vtime);
+ vtime += excess;
+ }
+
/* add margin */
usage = DIV_ROUND_UP(usage * SURPLUS_SCALE_PCT, 100);
usage += SURPLUS_SCALE_ABS;
/* don't bother if the surplus is too small */
- if (usage + SURPLUS_MIN_ADJ_DELTA > hw_inuse)
- return 0;
+ if (usage + SURPLUS_MIN_ADJ_DELTA > hwm)
+ return hwm;
return usage;
}
+static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
+{
+ struct ioc_gq *iocg;
+
+ list_for_each_entry(iocg, surpluses, surplus_list) {
+ u32 old_hwi, new_hwi, new_inuse;
+
+ current_hweight(iocg, NULL, &old_hwi);
+ new_hwi = iocg->hweight_after_donation;
+
+ new_inuse = DIV64_U64_ROUND_UP((u64)iocg->inuse * new_hwi,
+ old_hwi);
+ __propagate_weights(iocg, iocg->weight, new_inuse);
+ }
+}
+
static void ioc_timer_fn(struct timer_list *timer)
{
struct ioc *ioc = container_of(timer, struct ioc, timer);
@@ -1560,9 +1625,9 @@ static void ioc_timer_fn(struct timer_list *timer)
/* calc usages and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
- u64 vdone, vtime, usage_us, vmin;
+ u64 vdone, vtime, usage_us;
u32 hw_active, hw_inuse, usage;
- int uidx;
+ int uidx, nr_valid;
/*
* Collect unused and wind vtime closer to vnow to prevent
@@ -1618,92 +1683,54 @@ static void ioc_timer_fn(struct timer_list *timer)
started_at = ioc->period_at;
dur = max_t(u64, now.now - started_at, 1);
- usage = clamp_t(u32,
+
+ iocg->usage_idx = uidx;
+ iocg->usages[uidx] = clamp_t(u32,
DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, dur),
1, WEIGHT_ONE);
+ }
- iocg->usage_idx = uidx;
- iocg->usages[uidx] = usage;
- } else {
- usage = 0;
+ /* base the decision on max historical usage */
+ for (i = 0, usage = 0, nr_valid = 0; i < NR_USAGE_SLOTS; i++) {
+ if (iocg->usages[i]) {
+ usage = max(usage, iocg->usages[i]);
+ nr_valid++;
+ }
}
+ if (nr_valid < MIN_VALID_USAGES)
+ usage = WEIGHT_ONE;
/* see whether there's surplus vtime */
- vmin = now.vnow - ioc->margins.max;
-
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
- if (!waitqueue_active(&iocg->waitq) &&
- time_before64(vtime, vmin)) {
- u64 delta = vmin - vtime;
-
- /* throw away surplus vtime */
- atomic64_add(delta, &iocg->vtime);
- atomic64_add(delta, &iocg->done_vtime);
- /* if usage is sufficiently low, maybe it can donate */
- if (surplus_adjusted_hweight_inuse(usage, hw_inuse))
- list_add(&iocg->surplus_list, &surpluses);
- } else if (hw_inuse < hw_active) {
- u32 new_hwi, new_inuse;
+ if (hw_inuse < hw_active ||
+ (!waitqueue_active(&iocg->waitq) &&
+ time_before64(vtime, now.vnow - ioc->margins.max))) {
+ u32 hwm, new_hwi;
- /* was donating but might need to take back some */
- if (waitqueue_active(&iocg->waitq)) {
- new_hwi = hw_active;
+ /*
+ * Already donating or accumulated enough to start.
+ * Determine the donation amount.
+ */
+ hwm = current_hweight_max(iocg);
+ new_hwi = hweight_after_donation(iocg, hwm, usage,
+ &now);
+ if (new_hwi < hwm) {
+ iocg->hweight_after_donation = new_hwi;
+ list_add(&iocg->surplus_list, &surpluses);
} else {
- new_hwi = max(hw_inuse,
- usage * SURPLUS_SCALE_PCT / 100 +
- SURPLUS_SCALE_ABS);
- }
-
- new_inuse = div64_u64((u64)iocg->inuse * new_hwi,
- hw_inuse);
- new_inuse = clamp_t(u32, new_inuse, 1, iocg->active);
-
- if (new_inuse > iocg->inuse) {
- TRACE_IOCG_PATH(inuse_takeback, iocg, &now,
- iocg->inuse, new_inuse,
- hw_inuse, new_hwi);
- __propagate_weights(iocg, iocg->weight,
- new_inuse);
+ __propagate_weights(iocg, iocg->active,
+ iocg->active);
+ nr_shortages++;
}
} else {
- /* genuninely out of vtime */
+ /* genuinely short on vtime */
nr_shortages++;
}
}
- if (!nr_shortages || list_empty(&surpluses))
- goto skip_surplus_transfers;
+ if (!list_empty(&surpluses) && nr_shortages)
+ transfer_surpluses(&surpluses, &now);
- /* there are both shortages and surpluses, transfer surpluses */
- list_for_each_entry(iocg, &surpluses, surplus_list) {
- u32 usage, hw_active, hw_inuse, new_hwi, new_inuse;
- int nr_valid = 0;
-
- /* base the decision on max historical usage */
- for (i = 0, usage = 0; i < NR_USAGE_SLOTS; i++) {
- if (iocg->usages[i]) {
- usage = max(usage, iocg->usages[i]);
- nr_valid++;
- }
- }
- if (nr_valid < MIN_VALID_USAGES)
- continue;
-
- current_hweight(iocg, &hw_active, &hw_inuse);
- new_hwi = surplus_adjusted_hweight_inuse(usage, hw_inuse);
- if (!new_hwi)
- continue;
-
- new_inuse = DIV64_U64_ROUND_UP((u64)iocg->inuse * new_hwi,
- hw_inuse);
- if (new_inuse < iocg->inuse) {
- TRACE_IOCG_PATH(inuse_giveaway, iocg, &now,
- iocg->inuse, new_inuse,
- hw_inuse, new_hwi);
- __propagate_weights(iocg, iocg->weight, new_inuse);
- }
- }
-skip_surplus_transfers:
commit_weights(ioc);
/* surplus list should be dissolved after use */
--
2.26.2
To improve weight donations, we want to be able to scale inuse with greater
accuracy and down below 1. Let's make non-hierarchical weights use
WEIGHT_ONE based fixed point numbers too, like hierarchical ones.
This doesn't cause any behavior changes yet.
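Concretely, with WEIGHT_ONE = 1 << 16, the conversions in the hunks below
amount to (a sketch):

	/* user writes 100; stored internally in fixed point */
	iocg->cfg_weight = 100 * WEIGHT_ONE;	/* 6553600 */

	/* and displayed back as cfg_weight / WEIGHT_ONE == 100 */

With inuse in the same fixed point, a donating iocg can eventually be
scaled to, say, WEIGHT_ONE / 100 - well below one unit of user-visible
weight.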
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 5e6d56eec1c9..00c5a3ad2b5b 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -984,8 +984,8 @@ static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep
for (lvl = 0; lvl <= iocg->level - 1; lvl++) {
struct ioc_gq *parent = iocg->ancestors[lvl];
struct ioc_gq *child = iocg->ancestors[lvl + 1];
- u32 active_sum = READ_ONCE(parent->child_active_sum);
- u32 inuse_sum = READ_ONCE(parent->child_inuse_sum);
+ u64 active_sum = READ_ONCE(parent->child_active_sum);
+ u64 inuse_sum = READ_ONCE(parent->child_inuse_sum);
u32 active = READ_ONCE(child->active);
u32 inuse = READ_ONCE(child->inuse);
@@ -993,11 +993,11 @@ static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep
if (!active_sum || !inuse_sum)
continue;
- active_sum = max(active, active_sum);
- hwa = hwa * active / active_sum; /* max 16bits * 10000 */
+ active_sum = max_t(u64, active, active_sum);
+ hwa = div64_u64((u64)hwa * active, active_sum);
- inuse_sum = max(inuse, inuse_sum);
- hwi = hwi * inuse / inuse_sum; /* max 16bits * 10000 */
+ inuse_sum = max_t(u64, inuse, inuse_sum);
+ hwi = div64_u64((u64)hwi * inuse, inuse_sum);
}
iocg->hweight_active = max_t(u32, hwa, 1);
@@ -1022,7 +1022,8 @@ static void weight_updated(struct ioc_gq *iocg)
weight = iocg->cfg_weight ?: iocc->dfl_weight;
if (weight != iocg->weight && iocg->active)
propagate_weights(iocg, weight,
- DIV64_U64_ROUND_UP(iocg->inuse * weight, iocg->weight));
+ DIV64_U64_ROUND_UP((u64)iocg->inuse * weight,
+ iocg->weight));
iocg->weight = weight;
}
@@ -2050,7 +2051,7 @@ static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
if (!iocc)
return NULL;
- iocc->dfl_weight = CGROUP_WEIGHT_DFL;
+ iocc->dfl_weight = CGROUP_WEIGHT_DFL * WEIGHT_ONE;
return &iocc->cpd;
}
@@ -2136,7 +2137,7 @@ static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
struct ioc_gq *iocg = pd_to_iocg(pd);
if (dname && iocg->cfg_weight)
- seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight);
+ seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight / WEIGHT_ONE);
return 0;
}
@@ -2146,7 +2147,7 @@ static int ioc_weight_show(struct seq_file *sf, void *v)
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
- seq_printf(sf, "default %u\n", iocc->dfl_weight);
+ seq_printf(sf, "default %u\n", iocc->dfl_weight / WEIGHT_ONE);
blkcg_print_blkgs(sf, blkcg, ioc_weight_prfill,
&blkcg_policy_iocost, seq_cft(sf)->private, false);
return 0;
@@ -2172,7 +2173,7 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
return -EINVAL;
spin_lock(&blkcg->lock);
- iocc->dfl_weight = v;
+ iocc->dfl_weight = v * WEIGHT_ONE;
hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
struct ioc_gq *iocg = blkg_to_iocg(blkg);
@@ -2203,7 +2204,7 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
}
spin_lock(&iocg->ioc->lock);
- iocg->cfg_weight = v;
+ iocg->cfg_weight = v * WEIGHT_ONE;
weight_updated(iocg);
spin_unlock(&iocg->ioc->lock);
--
2.26.2
__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped; in preparation, let's make __propagate_weights()
clamp inuse on entry.
Also, make it avoid weight updates altogether if neither active nor inuse
is changed.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 78e6919153d8..8dfe73dde2a8 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -897,7 +897,10 @@ static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
lockdep_assert_held(&ioc->lock);
- inuse = min(active, inuse);
+ inuse = clamp_t(u32, inuse, 1, active);
+
+ if (active == iocg->active && inuse == iocg->inuse)
+ return;
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
struct ioc_gq *parent = iocg->ancestors[lvl];
--
2.26.2
Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.
* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
ioc->lock needs to be made before entering the critical sections.
* Add and use iocg_[un]lock() which handles the conditional double locking.
* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
the caller grabbed both locks.
This patch is preparatory and the comments contain references to future
changes.
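A usage sketch of the conditional locking (condensed from the throttle path
below):

	unsigned long flags;
	bool ioc_locked;

	/* ioc->lock nests outside iocg->waitq.lock - decide upfront */
	ioc_locked = use_debt || READ_ONCE(iocg->abs_vdebt);
	iocg_lock(iocg, ioc_locked, &flags);

	/* ... debt and/or waitq handling ... */

	iocg_unlock(iocg, ioc_locked, &flags);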
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 92 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 73 insertions(+), 19 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index f36988657594..23b173e34591 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -680,6 +680,26 @@ static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost)
atomic64_add(cost, &iocg->vtime);
}
+static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
+{
+ if (lock_ioc) {
+ spin_lock_irqsave(&iocg->ioc->lock, *flags);
+ spin_lock(&iocg->waitq.lock);
+ } else {
+ spin_lock_irqsave(&iocg->waitq.lock, *flags);
+ }
+}
+
+static void iocg_unlock(struct ioc_gq *iocg, bool unlock_ioc, unsigned long *flags)
+{
+ if (unlock_ioc) {
+ spin_unlock(&iocg->waitq.lock);
+ spin_unlock_irqrestore(&iocg->ioc->lock, *flags);
+ } else {
+ spin_unlock_irqrestore(&iocg->waitq.lock, *flags);
+ }
+}
+
#define CREATE_TRACE_POINTS
#include <trace/events/iocost.h>
@@ -1215,11 +1235,17 @@ static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
return 0;
}
-static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
+/*
+ * Calculate the accumulated budget, pay debt if @pay_debt and wake up waiters
+ * accordingly. When @pay_debt is %true, the caller must be holding ioc->lock in
+ * addition to iocg->waitq.lock.
+ */
+static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
+ struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
struct iocg_wake_ctx ctx = { .iocg = iocg };
- u64 vdebt, vshortage, expires, oexpires;
+ u64 vshortage, expires, oexpires;
s64 vbudget;
u32 hw_inuse;
@@ -1229,25 +1255,39 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
vbudget = now->vnow - atomic64_read(&iocg->vtime);
/* pay off debt */
- vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
- if (vdebt && vbudget > 0) {
+ if (pay_debt && iocg->abs_vdebt && vbudget > 0) {
+ u64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
u64 delta = min_t(u64, vbudget, vdebt);
u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse),
iocg->abs_vdebt);
+ lockdep_assert_held(&ioc->lock);
+
atomic64_add(delta, &iocg->vtime);
atomic64_add(delta, &iocg->done_vtime);
iocg->abs_vdebt -= abs_delta;
+ vbudget -= vdebt;
iocg_kick_delay(iocg, now);
}
+ /*
+ * Debt can still be outstanding if we haven't paid all yet or the
+ * caller raced and called without @pay_debt. Shouldn't wake up waiters
+ * under debt. Make sure @vbudget reflects the outstanding amount and is
+ * not positive.
+ */
+ if (iocg->abs_vdebt) {
+ s64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
+ vbudget = min_t(s64, 0, vbudget - vdebt);
+ }
+
/*
* Wake up the ones which are due and see how much vtime we'll need
* for the next one.
*/
ctx.hw_inuse = hw_inuse;
- ctx.vbudget = vbudget - vdebt;
+ ctx.vbudget = vbudget;
__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);
if (!waitqueue_active(&iocg->waitq))
return;
@@ -1273,14 +1313,15 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
{
struct ioc_gq *iocg = container_of(timer, struct ioc_gq, waitq_timer);
+ bool pay_debt = READ_ONCE(iocg->abs_vdebt);
struct ioc_now now;
unsigned long flags;
ioc_now(iocg->ioc, &now);
- spin_lock_irqsave(&iocg->waitq.lock, flags);
- iocg_kick_waitq(iocg, &now);
- spin_unlock_irqrestore(&iocg->waitq.lock, flags);
+ iocg_lock(iocg, pay_debt, &flags);
+ iocg_kick_waitq(iocg, pay_debt, &now);
+ iocg_unlock(iocg, pay_debt, &flags);
return HRTIMER_NORESTART;
}
@@ -1396,7 +1437,7 @@ static void ioc_timer_fn(struct timer_list *timer)
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt) {
/* might be oversleeping vtime / hweight changes, kick */
- iocg_kick_waitq(iocg, &now);
+ iocg_kick_waitq(iocg, true, &now);
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
iocg->last_inuse = iocg->inuse;
@@ -1743,6 +1784,8 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
struct iocg_wait wait;
u32 hw_active, hw_inuse;
u64 abs_cost, cost, vtime;
+ bool use_debt, ioc_locked;
+ unsigned long flags;
/* bypass IOs if disabled or for root cgroup */
if (!ioc->enabled || !iocg->level)
@@ -1786,15 +1829,26 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
}
/*
- * We activated above but w/o any synchronization. Deactivation is
- * synchronized with waitq.lock and we won't get deactivated as long
- * as we're waiting or has debt, so we're good if we're activated
- * here. In the unlikely case that we aren't, just issue the IO.
+ * We're over budget. This can be handled in two ways. IOs which may
+ * cause priority inversions are punted to @ioc->aux_iocg and charged as
+ * debt. Otherwise, the issuer is blocked on @iocg->waitq. Debt handling
+ * requires @ioc->lock, waitq handling @iocg->waitq.lock. Determine
+ * whether debt handling is needed and acquire locks accordingly.
*/
- spin_lock_irq(&iocg->waitq.lock);
+ use_debt = bio_issue_as_root_blkg(bio) || fatal_signal_pending(current);
+ ioc_locked = use_debt || READ_ONCE(iocg->abs_vdebt);
+ iocg_lock(iocg, ioc_locked, &flags);
+
+ /*
+ * @iocg must stay activated for debt and waitq handling. Deactivation
+ * is synchronized against both ioc->lock and waitq.lock and we won't
+ * get deactivated as long as we're waiting or has debt, so we're good
+ * if we're activated here. In the unlikely cases that we aren't, just
+ * issue the IO.
+ */
if (unlikely(list_empty(&iocg->active_list))) {
- spin_unlock_irq(&iocg->waitq.lock);
+ iocg_unlock(iocg, ioc_locked, &flags);
iocg_commit_bio(iocg, bio, cost);
return;
}
@@ -1816,12 +1870,12 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
* clear them and leave @iocg inactive w/ dangling use_delay heavily
* penalizing the cgroup and its descendants.
*/
- if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) {
+ if (use_debt) {
iocg->abs_vdebt += abs_cost;
if (iocg_kick_delay(iocg, &now))
blkcg_schedule_throttle(rqos->q,
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
- spin_unlock_irq(&iocg->waitq.lock);
+ iocg_unlock(iocg, ioc_locked, &flags);
return;
}
@@ -1845,9 +1899,9 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
wait.committed = false; /* will be set true by waker */
__add_wait_queue_entry_tail(&iocg->waitq, &wait.wait);
- iocg_kick_waitq(iocg, &now);
+ iocg_kick_waitq(iocg, ioc_locked, &now);
- spin_unlock_irq(&iocg->waitq.lock);
+ iocg_unlock(iocg, ioc_locked, &flags);
while (true) {
set_current_state(TASK_UNINTERRUPTIBLE);
--
2.26.2
ioc->period_at and ioc_now->now are in microseconds and wrap in around 1.2
hours with u32. While unlikely, confusion from wraparounds is still
possible. We aren't saving anything meaningful by keeping these u32. Let's
make them u64.
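For reference, 2^32 microseconds = 4294.97 seconds ~= 71.6 minutes
~= 1.19 hours.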
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 00c5a3ad2b5b..dc72cd965837 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -409,7 +409,7 @@ struct ioc {
atomic64_t vtime_rate;
seqcount_spinlock_t period_seqcount;
- u32 period_at; /* wallclock starttime */
+ u64 period_at; /* wallclock starttime */
u64 period_at_vtime; /* vtime starttime */
atomic64_t cur_period; /* inc'd each period */
@@ -508,7 +508,7 @@ struct ioc_cgrp {
struct ioc_now {
u64 now_ns;
- u32 now;
+ u64 now;
u64 vnow;
u64 vrate;
};
--
2.26.2
Per-iocg wait, indebt and indelay durations are really cheap to collect and
can be useful in debugging iocost behavior. Add them as debug stats for
now.
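With blkcg debug stats enabled (blkcg_debug_stats), each device line in
io.stat gains the new fields; an illustrative line (values made up, format
taken from the scnprintf() in the hunk below):

	8:16 ... cost.usage=2314921 cost.wait=102384 cost.indebt=0 cost.indelay=0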
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 77 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 72 insertions(+), 5 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 9366527d8c12..fc897bb142bc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -452,6 +452,9 @@ struct iocg_pcpu_stat {
struct iocg_stat {
u64 usage_us;
+ u64 wait_us;
+ u64 indebt_us;
+ u64 indelay_us;
};
/* per device-cgroup pair */
@@ -538,6 +541,9 @@ struct ioc_gq {
struct iocg_stat last_stat;
u64 last_stat_abs_vusage;
u64 usage_delta_us;
+ u64 wait_since;
+ u64 indebt_since;
+ u64 indelay_since;
/* this iocg's depth in the hierarchy and ancestors including self */
int level;
@@ -1303,9 +1309,15 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
}
if (delay >= MIN_DELAY) {
+ if (!iocg->indelay_since)
+ iocg->indelay_since = now->now;
blkcg_set_delay(blkg, delay * NSEC_PER_USEC);
return true;
} else {
+ if (iocg->indelay_since) {
+ iocg->local_stat.indelay_us += now->now - iocg->indelay_since;
+ iocg->indelay_since = 0;
+ }
iocg->delay = 0;
blkcg_clear_delay(blkg);
return false;
@@ -1325,8 +1337,10 @@ static void iocg_incur_debt(struct ioc_gq *iocg, u64 abs_cost,
* Once in debt, debt handling owns inuse. @iocg stays at the minimum
* inuse donating all of it share to others until its debt is paid off.
*/
- if (!iocg->abs_vdebt && abs_cost)
+ if (!iocg->abs_vdebt && abs_cost) {
+ iocg->indebt_since = now->now;
propagate_weights(iocg, iocg->active, 0, false, now);
+ }
iocg->abs_vdebt += abs_cost;
@@ -1348,9 +1362,13 @@ static void iocg_pay_debt(struct ioc_gq *iocg, u64 abs_vpay,
iocg->abs_vdebt -= min(abs_vpay, iocg->abs_vdebt);
/* if debt is paid in full, restore inuse */
- if (!iocg->abs_vdebt)
+ if (!iocg->abs_vdebt) {
+ iocg->local_stat.indebt_us += now->now - iocg->indebt_since;
+ iocg->indebt_since = 0;
+
propagate_weights(iocg, iocg->active, iocg->last_inuse,
false, now);
+ }
}
static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
@@ -1436,8 +1454,17 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);
- if (!waitqueue_active(&iocg->waitq))
+ if (!waitqueue_active(&iocg->waitq)) {
+ if (iocg->wait_since) {
+ iocg->local_stat.wait_us += now->now - iocg->wait_since;
+ iocg->wait_since = 0;
+ }
return;
+ }
+
+ if (!iocg->wait_since)
+ iocg->wait_since = now->now;
+
if (WARN_ON_ONCE(ctx.vbudget >= 0))
return;
@@ -1579,8 +1606,15 @@ static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
iocg->usage_delta_us = div64_u64(vusage_delta, ioc->vtime_base_rate);
iocg->local_stat.usage_us += iocg->usage_delta_us;
+ /* propagate upwards */
new_stat.usage_us =
iocg->local_stat.usage_us + iocg->desc_stat.usage_us;
+ new_stat.wait_us =
+ iocg->local_stat.wait_us + iocg->desc_stat.wait_us;
+ new_stat.indebt_us =
+ iocg->local_stat.indebt_us + iocg->desc_stat.indebt_us;
+ new_stat.indelay_us =
+ iocg->local_stat.indelay_us + iocg->desc_stat.indelay_us;
/* propagate the deltas to the parent */
if (iocg->level > 0) {
@@ -1589,6 +1623,12 @@ static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
parent_stat->usage_us +=
new_stat.usage_us - iocg->last_stat.usage_us;
+ parent_stat->wait_us +=
+ new_stat.wait_us - iocg->last_stat.wait_us;
+ parent_stat->indebt_us +=
+ new_stat.indebt_us - iocg->last_stat.indebt_us;
+ parent_stat->indelay_us +=
+ new_stat.indelay_us - iocg->last_stat.indelay_us;
}
iocg->last_stat = new_stat;
@@ -1961,8 +2001,6 @@ static void ioc_timer_fn(struct timer_list *timer)
return;
}
- iocg_flush_stat(&ioc->active_iocgs, &now);
-
/*
* Waiters determine the sleep durations based on the vrate they
* saw at the time of sleep. If vrate has increased, some waiters
@@ -1976,6 +2014,22 @@ static void ioc_timer_fn(struct timer_list *timer)
spin_lock(&iocg->waitq.lock);
+ /* flush wait and indebt stat deltas */
+ if (iocg->wait_since) {
+ iocg->local_stat.wait_us += now.now - iocg->wait_since;
+ iocg->wait_since = now.now;
+ }
+ if (iocg->indebt_since) {
+ iocg->local_stat.indebt_us +=
+ now.now - iocg->indebt_since;
+ iocg->indebt_since = now.now;
+ }
+ if (iocg->indelay_since) {
+ iocg->local_stat.indelay_us +=
+ now.now - iocg->indelay_since;
+ iocg->indelay_since = now.now;
+ }
+
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt ||
iocg->delay) {
/* might be oversleeping vtime / hweight changes, kick */
@@ -2010,6 +2064,12 @@ static void ioc_timer_fn(struct timer_list *timer)
}
commit_weights(ioc);
+ /*
+ * Wait and indebt stat are flushed above and the donation calculation
+ * below needs updated usage stat. Let's bring stat up-to-date.
+ */
+ iocg_flush_stat(&ioc->active_iocgs, &now);
+
/* calc usage and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
u64 vdone, vtime, usage_us, usage_dur;
@@ -2835,6 +2895,13 @@ static size_t ioc_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size)
pos += scnprintf(buf + pos, size - pos, " cost.usage=%llu",
iocg->last_stat.usage_us);
+ if (blkcg_debug_stats)
+ pos += scnprintf(buf + pos, size - pos,
+ " cost.wait=%llu cost.indebt=%llu cost.indelay=%llu",
+ iocg->last_stat.wait_us,
+ iocg->last_stat.indebt_us,
+ iocg->last_stat.indelay_us);
+
return pos;
}
--
2.26.2
Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.
Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE, and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.
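As a worked example of the new conversion: an iocg that accumulated
usage_us = 15000 over a dur = 75000 window records
usage = DIV64_U64_ROUND_UP(15000 * WEIGHT_ONE, 75000) = 13108, i.e. roughly
20% of WEIGHT_ONE, independent of what its hierarchical inuse weight
happened to be while the IOs ran.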
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 72 ++++++++++++++++++++++-------------
include/trace/events/iocost.h | 7 +---
2 files changed, 47 insertions(+), 32 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index f30f9b37fcf0..2496674bbbf4 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -476,14 +476,10 @@ struct ioc_gq {
* `vtime_done` is the same but progressed on completion rather
* than issue. The delta behind `vtime` represents the cost of
* currently in-flight IOs.
- *
- * `last_vtime` is used to remember `vtime` at the end of the last
- * period to calculate utilization.
*/
atomic64_t vtime;
atomic64_t done_vtime;
u64 abs_vdebt;
- u64 last_vtime;
/*
* The period this iocg was last active in. Used for deactivation
@@ -506,6 +502,9 @@ struct ioc_gq {
struct hrtimer waitq_timer;
struct hrtimer delay_timer;
+ /* timestamp at the latest activation */
+ u64 activated_at;
+
/* statistics */
struct iocg_pcpu_stat __percpu *pcpu_stat;
struct iocg_stat local_stat;
@@ -514,6 +513,7 @@ struct ioc_gq {
u64 last_stat_abs_vusage;
/* usage is recorded as fractions of WEIGHT_ONE */
+ u32 usage_delta_us;
int usage_idx;
u32 usages[NR_USAGE_SLOTS];
@@ -1159,7 +1159,7 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
TRACE_IOCG_PATH(iocg_activate, iocg, now,
last_period, cur_period, vtime);
- iocg->last_vtime = vtime;
+ iocg->activated_at = now->now;
if (ioc->running == IOC_IDLE) {
ioc->running = IOC_RUNNING;
@@ -1451,7 +1451,8 @@ static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
vusage_delta = abs_vusage - iocg->last_stat_abs_vusage;
iocg->last_stat_abs_vusage = abs_vusage;
- iocg->local_stat.usage_us += div64_u64(vusage_delta, now->vrate);
+ iocg->usage_delta_us = div64_u64(vusage_delta, now->vrate);
+ iocg->local_stat.usage_us += iocg->usage_delta_us;
new_stat.usage_us =
iocg->local_stat.usage_us + iocg->desc_stat.usage_us;
@@ -1558,8 +1559,9 @@ static void ioc_timer_fn(struct timer_list *timer)
/* calc usages and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
- u64 vdone, vtime, vusage, vmin;
+ u64 vdone, vtime, usage_us, vmin;
u32 hw_active, hw_inuse, usage;
+ int uidx;
/*
* Collect unused and wind vtime closer to vnow to prevent
@@ -1583,27 +1585,44 @@ static void ioc_timer_fn(struct timer_list *timer)
time_before64(vdone, now.vnow - period_vtime))
nr_lagging++;
- if (waitqueue_active(&iocg->waitq))
- vusage = now.vnow - iocg->last_vtime;
- else if (time_before64(iocg->last_vtime, vtime))
- vusage = vtime - iocg->last_vtime;
- else
- vusage = 0;
-
- iocg->last_vtime += vusage;
/*
- * Factor in in-flight vtime into vusage to avoid
- * high-latency completions appearing as idle. This should
- * be done after the above ->last_time adjustment.
+ * Determine absolute usage factoring in pending and in-flight
+ * IOs to avoid stalls and high-latency completions appearing as
+ * idle.
*/
- vusage = max(vusage, vtime - vdone);
-
- /* calculate hweight based usage ratio and record */
- if (vusage) {
- usage = DIV64_U64_ROUND_UP(vusage * hw_inuse,
- period_vtime);
- iocg->usage_idx = (iocg->usage_idx + 1) % NR_USAGE_SLOTS;
- iocg->usages[iocg->usage_idx] = usage;
+ usage_us = iocg->usage_delta_us;
+ if (waitqueue_active(&iocg->waitq) && time_before64(vtime, now.vnow))
+ usage_us += DIV64_U64_ROUND_UP(
+ cost_to_abs_cost(now.vnow - vtime, hw_inuse),
+ now.vrate);
+ if (vdone != vtime) {
+ u64 inflight_us = DIV64_U64_ROUND_UP(
+ cost_to_abs_cost(vtime - vdone, hw_inuse),
+ now.vrate);
+ usage_us = max(usage_us, inflight_us);
+ }
+
+ /* convert to hweight based usage ratio and record */
+ uidx = (iocg->usage_idx + 1) % NR_USAGE_SLOTS;
+
+ if (time_after64(vtime, now.vnow - ioc->margins.min)) {
+ iocg->usage_idx = uidx;
+ iocg->usages[uidx] = WEIGHT_ONE;
+ } else if (usage_us) {
+ u64 started_at, dur;
+
+ if (time_after64(iocg->activated_at, ioc->period_at))
+ started_at = iocg->activated_at;
+ else
+ started_at = ioc->period_at;
+
+ dur = max_t(u64, now.now - started_at, 1);
+ usage = clamp_t(u32,
+ DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, dur),
+ 1, WEIGHT_ONE);
+
+ iocg->usage_idx = uidx;
+ iocg->usages[uidx] = usage;
} else {
usage = 0;
}
@@ -1620,7 +1639,6 @@ static void ioc_timer_fn(struct timer_list *timer)
/* throw away surplus vtime */
atomic64_add(delta, &iocg->vtime);
atomic64_add(delta, &iocg->done_vtime);
- iocg->last_vtime += delta;
/* if usage is sufficiently low, maybe it can donate */
if (surplus_adjusted_hweight_inuse(usage, hw_inuse)) {
iocg->has_surplus = true;
diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h
index c2f580fd371b..a905ecc0342f 100644
--- a/include/trace/events/iocost.h
+++ b/include/trace/events/iocost.h
@@ -26,7 +26,6 @@ TRACE_EVENT(iocost_iocg_activate,
__field(u64, vrate)
__field(u64, last_period)
__field(u64, cur_period)
- __field(u64, last_vtime)
__field(u64, vtime)
__field(u32, weight)
__field(u32, inuse)
@@ -42,7 +41,6 @@ TRACE_EVENT(iocost_iocg_activate,
__entry->vrate = now->vrate;
__entry->last_period = last_period;
__entry->cur_period = cur_period;
- __entry->last_vtime = iocg->last_vtime;
__entry->vtime = vtime;
__entry->weight = iocg->weight;
__entry->inuse = iocg->inuse;
@@ -51,13 +49,12 @@ TRACE_EVENT(iocost_iocg_activate,
),
TP_printk("[%s:%s] now=%llu:%llu vrate=%llu "
- "period=%llu->%llu vtime=%llu->%llu "
+ "period=%llu->%llu vtime=%llu "
"weight=%u/%u hweight=%llu/%llu",
__get_str(devname), __get_str(cgroup),
__entry->now, __entry->vnow, __entry->vrate,
__entry->last_period, __entry->cur_period,
- __entry->last_vtime, __entry->vtime,
- __entry->inuse, __entry->weight,
+ __entry->vtime, __entry->inuse, __entry->weight,
__entry->hweight_inuse, __entry->hweight_active
)
);
--
2.26.2
A low weight iocg can amass a large amount of debt, for example, when
anonymous memory gets reclaimed aggressively. If the system has a lot of
memory paired with a slow IO device, the debt can span multiple seconds or
more. If there are no other subsequent IO issuers, the in-debt iocg may end
up blocked paying its debt while the IO device is idle.
This patch implements a mechanism to protect against such pathological
cases. If the device has been sufficiently idle for a substantial amount of
time, the debts are halved. The criteria are on the conservative side as we
want to resolve the rare extreme cases without impacting regular operation
by forgiving debts too readily.
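A condensed sketch of the trigger (the complete logic is in the hunk
below):

	/* reset the idle clock whenever the device is deemed busy */
	if (nr_shortages ||
	    div64_u64(100 * usage_us_sum, now.now - ioc->period_at) >=
	    DEBT_BUSY_USAGE_PCT)
		ioc->debt_busy_at = now.now;

	/* debtors while idle for 100ms+ - halve each iocg's abs_vdebt */
	if (nr_debtors &&
	    now.now - ioc->debt_busy_at >= DEBT_REDUCTION_IDLE_DUR) {
		/* ... walk active_iocgs, halve abs_vdebt, kick waitq ... */
	}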
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 49 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 9cb8f29f01f5..2a95a081cf44 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -295,6 +295,13 @@ enum {
MIN_DELAY = 250,
MAX_DELAY = 250 * USEC_PER_MSEC,
+ /*
+ * Halve debts if total usage keeps staying under 25% w/o any shortages
+ * for over 100ms.
+ */
+ DEBT_BUSY_USAGE_PCT = 25,
+ DEBT_REDUCTION_IDLE_DUR = 100 * USEC_PER_MSEC,
+
/* don't let cmds which take a very long time pin lagging for too long */
MAX_LAGGING_PERIODS = 10,
@@ -436,6 +443,9 @@ struct ioc {
bool weights_updated;
atomic_t hweight_gen; /* for lazy hweights */
+ /* the last time debt cancel condition wasn't met */
+ u64 debt_busy_at;
+
u64 autop_too_fast_at;
u64 autop_too_slow_at;
int autop_idx;
@@ -1216,6 +1226,7 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
if (ioc->running == IOC_IDLE) {
ioc->running = IOC_RUNNING;
+ ioc->debt_busy_at = now->now;
ioc_start_period(ioc, now);
}
@@ -1896,7 +1907,8 @@ static void ioc_timer_fn(struct timer_list *timer)
struct ioc_gq *iocg, *tiocg;
struct ioc_now now;
LIST_HEAD(surpluses);
- int nr_shortages = 0, nr_lagging = 0;
+ int nr_debtors = 0, nr_shortages = 0, nr_lagging = 0;
+ u64 usage_us_sum = 0;
u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
u32 missed_ppm[2], rq_wait_pct;
@@ -1936,6 +1948,8 @@ static void ioc_timer_fn(struct timer_list *timer)
iocg->delay) {
/* might be oversleeping vtime / hweight changes, kick */
iocg_kick_waitq(iocg, true, &now);
+ if (iocg->abs_vdebt)
+ nr_debtors++;
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
__propagate_weights(iocg, 0, 0, false, &now);
@@ -1978,6 +1992,7 @@ static void ioc_timer_fn(struct timer_list *timer)
* high-latency completions appearing as idle.
*/
usage_us = iocg->usage_delta_us;
+ usage_us_sum += usage_us;
if (vdone != vtime) {
u64 inflight_us = DIV64_U64_ROUND_UP(
@@ -2036,6 +2051,38 @@ static void ioc_timer_fn(struct timer_list *timer)
list_for_each_entry_safe(iocg, tiocg, &surpluses, surplus_list)
list_del_init(&iocg->surplus_list);
+ /*
+ * A low weight iocg can amass a large amount of debt, for example, when
+ * anonymous memory gets reclaimed aggressively. If the system has a lot
+ * of memory paired with a slow IO device, the debt can span multiple
+ * seconds or more. If there are no other subsequent IO issuers, the
+ * in-debt iocg may end up blocked paying its debt while the IO device
+ * is idle.
+ *
+ * The following protects against such pathological cases. If the device
+ * has been sufficiently idle for a substantial amount of time, the
+ * debts are halved. The criteria are on the conservative side as we
+ * want to resolve the rare extreme cases without impacting regular
+ * operation by forgiving debts too readily.
+ */
+ if (nr_shortages ||
+ div64_u64(100 * usage_us_sum, now.now - ioc->period_at) >=
+ DEBT_BUSY_USAGE_PCT)
+ ioc->debt_busy_at = now.now;
+
+ if (nr_debtors &&
+ now.now - ioc->debt_busy_at >= DEBT_REDUCTION_IDLE_DUR) {
+ list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
+ if (iocg->abs_vdebt) {
+ spin_lock(&iocg->waitq.lock);
+ iocg->abs_vdebt /= 2;
+ iocg_kick_waitq(iocg, true, &now);
+ spin_unlock(&iocg->waitq.lock);
+ }
+ }
+ ioc->debt_busy_at = now.now;
+ }
+
/*
* If q is getting clogged or we're missing too much, we're issuing
* too much IO and should lower vtime rate. If we're not missing
--
2.26.2
iocg_kick_waitq() is the function which pays debt, and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_waitq() has updated the debt, unnecessarily large
delays can be applied temporarily.
Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.
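The invariant, as a minimal sketch (hypothetical helper name, not the
actual patch):

static void iocg_pay_debt_sketch(struct ioc_gq *iocg, u64 abs_delta,
				 struct ioc_now *now)
{
	iocg->abs_vdebt -= abs_delta;	/* debt changed ...		*/
	iocg_kick_delay(iocg, now);	/* ... so recompute the delay	*/
}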
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ac22d761a350..b2b8dfbeee5a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1226,6 +1226,8 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
atomic64_add(delta, &iocg->vtime);
atomic64_add(delta, &iocg->done_vtime);
iocg->abs_vdebt -= abs_delta;
+
+ iocg_kick_delay(iocg, now);
}
/*
@@ -1383,7 +1385,6 @@ static void ioc_timer_fn(struct timer_list *timer)
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt) {
/* might be oversleeping vtime / hweight changes, kick */
iocg_kick_waitq(iocg, &now);
- iocg_kick_delay(iocg, &now);
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
iocg->last_inuse = iocg->inuse;
--
2.26.2
ioc_pd_free() grabs the irq-safe ioc->lock without disabling irqs even
though it can be called with irqs either enabled or disabled. This has a
small chance of causing A-A deadlocks and triggers lockdep splats. Use
irqsave operations instead.
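The hazard, sketched with hypothetical helpers (spin_lock_irqsave() is
the standard primitive for locks that may also be taken from irq
context):

static void pd_free_unsafe(struct ioc *ioc)
{
	spin_lock(&ioc->lock);	/* irqs may still be enabled here */
	/* an irq arriving now whose handler also takes ioc->lock
	 * deadlocks this CPU against itself (A-A) */
	spin_unlock(&ioc->lock);
}

static void pd_free_safe(struct ioc *ioc)
{
	unsigned long flags;

	spin_lock_irqsave(&ioc->lock, flags);	/* safe either way */
	spin_unlock_irqrestore(&ioc->lock, flags);
}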
Signed-off-by: Tejun Heo <[email protected]>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: [email protected] # v5.4+
---
block/blk-iocost.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 413e0b5c8e6b..d37b55db2409 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2092,14 +2092,15 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
{
struct ioc_gq *iocg = pd_to_iocg(pd);
struct ioc *ioc = iocg->ioc;
+ unsigned long flags;
if (ioc) {
- spin_lock(&ioc->lock);
+ spin_lock_irqsave(&ioc->lock, flags);
if (!list_empty(&iocg->active_list)) {
propagate_active_weight(iocg, 0, 0);
list_del_init(&iocg->active_list);
}
- spin_unlock(&ioc->lock);
+ spin_unlock_irqrestore(&ioc->lock, flags);
hrtimer_cancel(&iocg->waitq_timer);
hrtimer_cancel(&iocg->delay_timer);
--
2.26.2
Update and restore the inuse update tracepoints.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 16 ++++++++++++++++
include/trace/events/iocost.h | 6 +++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 0270a504e6b5..9366527d8c12 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1919,6 +1919,12 @@ static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
inuse = DIV64_U64_ROUND_UP(
parent->child_adjusted_sum * iocg->hweight_after_donation,
parent->hweight_inuse);
+
+ TRACE_IOCG_PATH(inuse_transfer, iocg, now,
+ iocg->inuse, inuse,
+ iocg->hweight_inuse,
+ iocg->hweight_after_donation);
+
__propagate_weights(iocg, iocg->active, inuse, true, now);
}
@@ -2076,6 +2082,10 @@ static void ioc_timer_fn(struct timer_list *timer)
iocg->hweight_after_donation = new_hwi;
list_add(&iocg->surplus_list, &surpluses);
} else {
+ TRACE_IOCG_PATH(inuse_shortage, iocg, &now,
+ iocg->inuse, iocg->active,
+ iocg->hweight_inuse, new_hwi);
+
__propagate_weights(iocg, iocg->active,
iocg->active, true, &now);
nr_shortages++;
@@ -2248,11 +2258,13 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
struct ioc *ioc = iocg->ioc;
struct ioc_margins *margins = &ioc->margins;
u32 adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
+ u32 __maybe_unused old_inuse = iocg->inuse, __maybe_unused old_hwi;
u32 hwi;
s64 margin;
u64 cost, new_inuse;
current_hweight(iocg, NULL, &hwi);
+ old_hwi = hwi;
cost = abs_cost_to_cost(abs_cost, hwi);
margin = now->vnow - vtime - cost;
@@ -2287,6 +2299,10 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
iocg->inuse != iocg->active);
spin_unlock_irq(&ioc->lock);
+
+ TRACE_IOCG_PATH(inuse_adjust, iocg, now,
+ old_inuse, iocg->inuse, old_hwi, hwi);
+
return cost;
}
diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h
index ee024fe8fef6..b350860d2e71 100644
--- a/include/trace/events/iocost.h
+++ b/include/trace/events/iocost.h
@@ -95,7 +95,7 @@ DECLARE_EVENT_CLASS(iocg_inuse_update,
)
);
-DEFINE_EVENT(iocg_inuse_update, iocost_inuse_takeback,
+DEFINE_EVENT(iocg_inuse_update, iocost_inuse_shortage,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
@@ -105,7 +105,7 @@ DEFINE_EVENT(iocg_inuse_update, iocost_inuse_takeback,
old_hw_inuse, new_hw_inuse)
);
-DEFINE_EVENT(iocg_inuse_update, iocost_inuse_giveaway,
+DEFINE_EVENT(iocg_inuse_update, iocost_inuse_transfer,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
@@ -115,7 +115,7 @@ DEFINE_EVENT(iocg_inuse_update, iocost_inuse_giveaway,
old_hw_inuse, new_hw_inuse)
);
-DEFINE_EVENT(iocg_inuse_update, iocost_inuse_reset,
+DEFINE_EVENT(iocg_inuse_update, iocost_inuse_adjust,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
--
2.26.2
When an iocg accumulates too much vtime or gets deactivated, we throw away
some vtime, which lowers the overall device utilization. As the exact amount
which is being thrown away is known, we can compensate by accelerating the
vrate accordingly so that the extra vtime generated in the current period
matches what got lost.
This significantly improves work conservation when involving high weight
cgroups with intermittent and bursty IO patterns.
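To make the compensation concrete, here's a worked example with made-up
numbers following the ioc_refresh_vrate() math in the hunk below:

	/* 50M units of vtime were thrown away; 40ms left in the period */
	s64 vtime_err = -50000000, pleft = 40000, base = 1000;

	s64 vcomp = -div64_s64(vtime_err, pleft);	/* +1250 */
	vcomp = clamp(vcomp, -(base >> 1), base);	/* capped at +1000 */
	vtime_err += vcomp * pleft;			/* -10M still owed */
	/* effective vrate for this period: base + vcomp = 2000 */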
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 132 ++++++++++++++++++++++++++++++---------------
1 file changed, 90 insertions(+), 42 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 2a95a081cf44..0270a504e6b5 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -224,20 +224,12 @@ enum {
MARGIN_MIN_PCT = 10,
MARGIN_LOW_PCT = 20,
MARGIN_TARGET_PCT = 50,
- MARGIN_MAX_PCT = 100,
INUSE_ADJ_STEP_PCT = 25,
/* Have some play in timer operations */
TIMER_SLACK_PCT = 1,
- /*
- * vtime can wrap well within a reasonable uptime when vrate is
- * consistently raised. Don't trust recorded cgroup vtime if the
- * period counter indicates that it's older than 5mins.
- */
- VTIME_VALID_DUR = 300 * USEC_PER_SEC,
-
/* 1/64k is granular enough and can easily be handled w/ u32 */
WEIGHT_ONE = 1 << 16,
@@ -395,7 +387,6 @@ struct ioc_margins {
s64 min;
s64 low;
s64 target;
- s64 max;
};
struct ioc_missed {
@@ -432,6 +423,8 @@ struct ioc {
enum ioc_running running;
atomic64_t vtime_rate;
+ u64 vtime_base_rate;
+ s64 vtime_err;
seqcount_spinlock_t period_seqcount;
u64 period_at; /* wallclock starttime */
@@ -760,12 +753,11 @@ static void ioc_refresh_margins(struct ioc *ioc)
{
struct ioc_margins *margins = &ioc->margins;
u32 period_us = ioc->period_us;
- u64 vrate = atomic64_read(&ioc->vtime_rate);
+ u64 vrate = ioc->vtime_base_rate;
margins->min = (period_us * MARGIN_MIN_PCT / 100) * vrate;
margins->low = (period_us * MARGIN_LOW_PCT / 100) * vrate;
margins->target = (period_us * MARGIN_TARGET_PCT / 100) * vrate;
- margins->max = (period_us * MARGIN_MAX_PCT / 100) * vrate;
}
/* latency Qos params changed, update period_us and all the dependent params */
@@ -831,8 +823,7 @@ static int ioc_autop_idx(struct ioc *ioc)
return idx;
/* step up/down based on the vrate */
- vrate_pct = div64_u64(atomic64_read(&ioc->vtime_rate) * 100,
- VTIME_PER_USEC);
+ vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
now_ns = ktime_get_ns();
if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
@@ -940,6 +931,43 @@ static bool ioc_refresh_params(struct ioc *ioc, bool force)
return true;
}
+/*
+ * When an iocg accumulates too much vtime or gets deactivated, we throw away
+ * some vtime, which lowers the overall device utilization. As the exact amount
+ * which is being thrown away is known, we can compensate by accelerating the
+ * vrate accordingly so that the extra vtime generated in the current period
+ * matches what got lost.
+ */
+static void ioc_refresh_vrate(struct ioc *ioc, struct ioc_now *now)
+{
+ s64 pleft = ioc->period_at + ioc->period_us - now->now;
+ s64 vperiod = ioc->period_us * ioc->vtime_base_rate;
+ s64 vcomp, vcomp_min, vcomp_max;
+
+ lockdep_assert_held(&ioc->lock);
+
+ /* we need some time left in this period */
+ if (pleft <= 0)
+ goto done;
+
+ /*
+ * Calculate how much vrate should be adjusted to offset the error.
+ * Limit the amount of adjustment and deduct the adjusted amount from
+ * the error.
+ */
+ vcomp = -div64_s64(ioc->vtime_err, pleft);
+ vcomp_min = -(ioc->vtime_base_rate >> 1);
+ vcomp_max = ioc->vtime_base_rate;
+ vcomp = clamp(vcomp, vcomp_min, vcomp_max);
+
+ ioc->vtime_err += vcomp * pleft;
+
+ atomic64_set(&ioc->vtime_rate, ioc->vtime_base_rate + vcomp);
+done:
+ /* bound how much error can accumulate */
+ ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
+}
+
/* take a snapshot of the current [v]time and vrate */
static void ioc_now(struct ioc *ioc, struct ioc_now *now)
{
@@ -1152,8 +1180,8 @@ static void weight_updated(struct ioc_gq *iocg, struct ioc_now *now)
static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
- u64 last_period, cur_period, max_period_delta;
- u64 vtime, vmin;
+ u64 last_period, cur_period;
+ u64 vtime, vtarget;
int i;
/*
@@ -1192,21 +1220,15 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
goto fail_unlock;
/*
- * vtime may wrap when vrate is raised substantially due to
- * underestimated IO costs. Look at the period and ignore its
- * vtime if the iocg has been idle for too long. Also, cap the
- * budget it can start with to the margin.
+ * Always start with the target budget. On deactivation, we throw away
+ * anything above it.
*/
- max_period_delta = DIV64_U64_ROUND_UP(VTIME_VALID_DUR, ioc->period_us);
+ vtarget = now->vnow - ioc->margins.target;
vtime = atomic64_read(&iocg->vtime);
- vmin = now->vnow - ioc->margins.max;
- if (last_period + max_period_delta < cur_period ||
- time_before64(vtime, vmin)) {
- atomic64_add(vmin - vtime, &iocg->vtime);
- atomic64_add(vmin - vtime, &iocg->done_vtime);
- vtime = vmin;
- }
+ atomic64_add(vtarget - vtime, &iocg->vtime);
+ atomic64_add(vtarget - vtime, &iocg->done_vtime);
+ vtime = vtarget;
/*
* Activate, propagate weight and start period timer if not
@@ -1260,7 +1282,8 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
current_hweight(iocg, &hwa, NULL);
vover = atomic64_read(&iocg->vtime) +
abs_cost_to_cost(iocg->abs_vdebt, hwa) - now->vnow;
- vover_pct = div64_s64(100 * vover, ioc->period_us * now->vrate);
+ vover_pct = div64_s64(100 * vover,
+ ioc->period_us * ioc->vtime_base_rate);
if (vover_pct <= MIN_DELAY_THR_PCT)
new_delay = 0;
@@ -1421,7 +1444,8 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
/* determine next wakeup, add a timer margin to guarantee chunking */
vshortage = -ctx.vbudget;
expires = now->now_ns +
- DIV64_U64_ROUND_UP(vshortage, now->vrate) * NSEC_PER_USEC;
+ DIV64_U64_ROUND_UP(vshortage, ioc->vtime_base_rate) *
+ NSEC_PER_USEC;
expires += ioc->timer_slack_ns;
/* if already active and close enough, don't bother */
@@ -1536,6 +1560,7 @@ static void iocg_build_inner_walk(struct ioc_gq *iocg,
/* collect per-cpu counters and propagate the deltas to the parent */
static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
{
+ struct ioc *ioc = iocg->ioc;
struct iocg_stat new_stat;
u64 abs_vusage = 0;
u64 vusage_delta;
@@ -1551,7 +1576,7 @@ static void iocg_flush_stat_one(struct ioc_gq *iocg, struct ioc_now *now)
vusage_delta = abs_vusage - iocg->last_stat_abs_vusage;
iocg->last_stat_abs_vusage = abs_vusage;
- iocg->usage_delta_us = div64_u64(vusage_delta, now->vrate);
+ iocg->usage_delta_us = div64_u64(vusage_delta, ioc->vtime_base_rate);
iocg->local_stat.usage_us += iocg->usage_delta_us;
new_stat.usage_us =
@@ -1593,8 +1618,8 @@ static void iocg_flush_stat(struct list_head *target_iocgs, struct ioc_now *now)
* capacity. @hwm is the upper bound and used to signal no donation. This
* function also throws away @iocg's excess budget.
*/
-static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
- struct ioc_now *now)
+static u32 hweight_after_donation(struct ioc_gq *iocg, u32 old_hwi, u32 hwm,
+ u32 usage, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
u64 vtime = atomic64_read(&iocg->vtime);
@@ -1609,12 +1634,13 @@ static u32 hweight_after_donation(struct ioc_gq *iocg, u32 hwm, u32 usage,
time_after64(vtime, now->vnow - ioc->margins.min))
return hwm;
- /* throw away excess above max */
- excess = now->vnow - vtime - ioc->margins.max;
+ /* throw away excess above target */
+ excess = now->vnow - vtime - ioc->margins.target;
if (excess > 0) {
atomic64_add(excess, &iocg->vtime);
atomic64_add(excess, &iocg->done_vtime);
vtime += excess;
+ ioc->vtime_err -= div64_u64(excess * old_hwi, WEIGHT_ONE);
}
/*
@@ -1952,6 +1978,24 @@ static void ioc_timer_fn(struct timer_list *timer)
nr_debtors++;
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
+ u64 vtime = atomic64_read(&iocg->vtime);
+ s64 excess;
+
+ /*
+ * @iocg has been inactive for a full duration and will
+ * have a high budget. Account anything above target as
+ * error and throw away. On reactivation, it'll start
+ * with the target budget.
+ */
+ excess = now.vnow - vtime - ioc->margins.target;
+ if (excess > 0) {
+ u32 old_hwi;
+
+ current_hweight(iocg, NULL, &old_hwi);
+ ioc->vtime_err -= div64_u64(excess * old_hwi,
+ WEIGHT_ONE);
+ }
+
__propagate_weights(iocg, 0, 0, false, &now);
list_del_init(&iocg->active_list);
}
@@ -1997,7 +2041,7 @@ static void ioc_timer_fn(struct timer_list *timer)
if (vdone != vtime) {
u64 inflight_us = DIV64_U64_ROUND_UP(
cost_to_abs_cost(vtime - vdone, hw_inuse),
- now.vrate);
+ ioc->vtime_base_rate);
usage_us = max(usage_us, inflight_us);
}
@@ -2017,16 +2061,16 @@ static void ioc_timer_fn(struct timer_list *timer)
if (hw_inuse < hw_active ||
(!waitqueue_active(&iocg->waitq) &&
time_before64(vtime, now.vnow - ioc->margins.low))) {
- u32 hwa, hwm, new_hwi;
+ u32 hwa, old_hwi, hwm, new_hwi;
/*
* Already donating or accumulated enough to start.
* Determine the donation amount.
*/
- current_hweight(iocg, &hwa, NULL);
+ current_hweight(iocg, &hwa, &old_hwi);
hwm = current_hweight_max(iocg);
- new_hwi = hweight_after_donation(iocg, hwm, usage,
- &now);
+ new_hwi = hweight_after_donation(iocg, old_hwi, hwm,
+ usage, &now);
if (new_hwi < hwm) {
iocg->hweight_donating = hwa;
iocg->hweight_after_donation = new_hwi;
@@ -2130,7 +2174,7 @@ static void ioc_timer_fn(struct timer_list *timer)
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
if (ioc->busy_level > 0 || (ioc->busy_level < 0 && !nr_lagging)) {
- u64 vrate = atomic64_read(&ioc->vtime_rate);
+ u64 vrate = ioc->vtime_base_rate;
u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
/* rq_wait signal is always reliable, ignore user vrate_min */
@@ -2167,7 +2211,7 @@ static void ioc_timer_fn(struct timer_list *timer)
trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
nr_lagging, nr_shortages);
- atomic64_set(&ioc->vtime_rate, vrate);
+ ioc->vtime_base_rate = vrate;
ioc_refresh_margins(ioc);
} else if (ioc->busy_level != prev_busy_level || nr_lagging) {
trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
@@ -2188,8 +2232,11 @@ static void ioc_timer_fn(struct timer_list *timer)
ioc_start_period(ioc, &now);
} else {
ioc->busy_level = 0;
+ ioc->vtime_err = 0;
ioc->running = IOC_IDLE;
}
+
+ ioc_refresh_vrate(ioc, &now);
}
spin_unlock_irq(&ioc->lock);
@@ -2628,6 +2675,7 @@ static int blk_iocost_init(struct request_queue *q)
INIT_LIST_HEAD(&ioc->active_iocgs);
ioc->running = IOC_IDLE;
+ ioc->vtime_base_rate = VTIME_PER_USEC;
atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
seqcount_spinlock_init(&ioc->period_seqcount, &ioc->lock);
ioc->period_at = ktime_to_us(ktime_get());
@@ -2762,7 +2810,7 @@ static size_t ioc_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size)
if (iocg->level == 0) {
unsigned vp10k = DIV64_U64_ROUND_CLOSEST(
- atomic64_read(&ioc->vtime_rate) * 10000,
+ ioc->vtime_base_rate * 10000,
VTIME_PER_USEC);
pos += scnprintf(buf + pos, size - pos, " cost.vrate=%u.%02u",
vp10k / 100, vp10k % 100);
--
2.26.2
When the margin drops below the minimum on a donating iocg, donation is
immediately canceled in full. There are a couple of shortcomings with the
current behavior.
* It's abrupt. A small temporary budget deficit can lead to a wide swing in
weight allocation and a large surplus.
* It's open-coded in the issue path but not implemented for the merge path.
A series of merges at a low inuse can make the iocg incur debts and stall
incorrectly.
This patch reimplements in-period donation snapbacks so that
* inuse adjustment and cost calculations are factored into
adjust_inuse_and_calc_cost(), which is called from both the issue and
merge paths.
* Snapbacks are more gradual. They occur in quarter steps; see the sketch
after this list.
* A snapback triggers if the margin goes below the low threshold and is
lower than the budget at the time of the last adjustment.
* For the above, __propagate_weights() stores the margin in
iocg->saved_margin. Storing of iocg->last_inuse is moved into
__propagate_weights() as well for consistency.
* Full snapback is guaranteed when there are waiters.
* With precise donation and gradual snapbacks, inuse adjustments are now a
lot more effective and the value of scaling inuse on weight changes isn't
clear. Removed inuse scaling from weight_updated().
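To illustrate the quarter steps with made-up numbers (cost_fits() is a
hypothetical stand-in for the budget test in adjust_inuse_and_calc_cost()
below):

	u32 active = 10000, inuse = 2500;
	u32 adj_step = DIV_ROUND_UP(active * INUSE_ADJ_STEP_PCT, 100);

	/* inuse climbs 2500 -> 5000 -> 7500 -> 10000, recomputing the
	 * cost each step and stopping once it fits or inuse == active */
	while (!cost_fits() && inuse != active)
		inuse = min(inuse + adj_step, active);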
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 133 ++++++++++++++++++++++++++++++++-------------
1 file changed, 96 insertions(+), 37 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 694f1487208a..d09b4011449c 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -226,6 +226,8 @@ enum {
MARGIN_TARGET_PCT = 50,
MARGIN_MAX_PCT = 100,
+ INUSE_ADJ_STEP_PCT = 25,
+
/* Have some play in timer operations */
TIMER_SLACK_PCT = 1,
@@ -443,12 +445,17 @@ struct ioc_gq {
*
* `last_inuse` remembers `inuse` while an iocg is idle to persist
* surplus adjustments.
+ *
+ * `inuse` may be adjusted dynamically during period. `saved_*` are used
+ * to determine and track adjustments.
*/
u32 cfg_weight;
u32 weight;
u32 active;
u32 inuse;
+
u32 last_inuse;
+ s64 saved_margin;
sector_t cursor; /* to detect randio */
@@ -934,9 +941,11 @@ static void ioc_start_period(struct ioc *ioc, struct ioc_now *now)
/*
* Update @iocg's `active` and `inuse` to @active and @inuse, update level
- * weight sums and propagate upwards accordingly.
+ * weight sums and propagate upwards accordingly. If @save, the current margin
+ * is saved to be used as reference for later inuse in-period adjustments.
*/
-static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
+static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
+ bool save, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
int lvl;
@@ -945,6 +954,10 @@ static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
inuse = clamp_t(u32, inuse, 1, active);
+ iocg->last_inuse = iocg->inuse;
+ if (save)
+ iocg->saved_margin = now->vnow - atomic64_read(&iocg->vtime);
+
if (active == iocg->active && inuse == iocg->inuse)
return;
@@ -996,9 +1009,10 @@ static void commit_weights(struct ioc *ioc)
}
}
-static void propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse)
+static void propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
+ bool save, struct ioc_now *now)
{
- __propagate_weights(iocg, active, inuse);
+ __propagate_weights(iocg, active, inuse, save, now);
commit_weights(iocg->ioc);
}
@@ -1082,7 +1096,7 @@ static u32 current_hweight_max(struct ioc_gq *iocg)
return max_t(u32, hwm, 1);
}
-static void weight_updated(struct ioc_gq *iocg)
+static void weight_updated(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
@@ -1093,9 +1107,7 @@ static void weight_updated(struct ioc_gq *iocg)
weight = iocg->cfg_weight ?: iocc->dfl_weight;
if (weight != iocg->weight && iocg->active)
- propagate_weights(iocg, weight,
- DIV64_U64_ROUND_UP((u64)iocg->inuse * weight,
- iocg->weight));
+ propagate_weights(iocg, weight, iocg->inuse, true, now);
iocg->weight = weight;
}
@@ -1165,8 +1177,9 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
*/
iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1;
list_add(&iocg->active_list, &ioc->active_iocgs);
+
propagate_weights(iocg, iocg->weight,
- iocg->last_inuse ?: iocg->weight);
+ iocg->last_inuse ?: iocg->weight, true, now);
TRACE_IOCG_PATH(iocg_activate, iocg, now,
last_period, cur_period, vtime);
@@ -1789,7 +1802,7 @@ static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
inuse = DIV64_U64_ROUND_UP(
parent->child_adjusted_sum * iocg->hweight_after_donation,
parent->hweight_inuse);
- __propagate_weights(iocg, iocg->active, inuse);
+ __propagate_weights(iocg, iocg->active, inuse, true, now);
}
/* walk list should be dissolved after use */
@@ -1844,8 +1857,7 @@ static void ioc_timer_fn(struct timer_list *timer)
iocg_kick_waitq(iocg, true, &now);
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
- iocg->last_inuse = iocg->inuse;
- __propagate_weights(iocg, 0, 0);
+ __propagate_weights(iocg, 0, 0, false, &now);
list_del_init(&iocg->active_list);
}
@@ -1925,7 +1937,7 @@ static void ioc_timer_fn(struct timer_list *timer)
list_add(&iocg->surplus_list, &surpluses);
} else {
__propagate_weights(iocg, iocg->active,
- iocg->active);
+ iocg->active, true, &now);
nr_shortages++;
}
} else {
@@ -2055,6 +2067,50 @@ static void ioc_timer_fn(struct timer_list *timer)
spin_unlock_irq(&ioc->lock);
}
+static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
+ u64 abs_cost, struct ioc_now *now)
+{
+ struct ioc *ioc = iocg->ioc;
+ struct ioc_margins *margins = &ioc->margins;
+ u32 adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
+ u32 hwi;
+ s64 margin;
+ u64 cost, new_inuse;
+
+ current_hweight(iocg, NULL, &hwi);
+ cost = abs_cost_to_cost(abs_cost, hwi);
+ margin = now->vnow - vtime - cost;
+
+ /*
+ * We only increase inuse during period and do so iff the margin has
+ * deteriorated since the previous adjustment.
+ */
+ if (margin >= iocg->saved_margin || margin >= margins->low ||
+ iocg->inuse == iocg->active)
+ return cost;
+
+ spin_lock_irq(&ioc->lock);
+
+ /* we own inuse only when @iocg is in the normal active state */
+ if (list_empty(&iocg->active_list)) {
+ spin_unlock_irq(&ioc->lock);
+ return cost;
+ }
+
+ /* bump up inuse till @abs_cost fits in the existing budget */
+ new_inuse = iocg->inuse;
+ do {
+ new_inuse = new_inuse + adj_step;
+ propagate_weights(iocg, iocg->active, new_inuse, true, now);
+ current_hweight(iocg, NULL, &hwi);
+ cost = abs_cost_to_cost(abs_cost, hwi);
+ } while (time_after64(vtime + cost, now->vnow) &&
+ iocg->inuse != iocg->active);
+
+ spin_unlock_irq(&ioc->lock);
+ return cost;
+}
+
static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg,
bool is_merge, u64 *costp)
{
@@ -2136,7 +2192,6 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
struct ioc_gq *iocg = blkg_to_iocg(blkg);
struct ioc_now now;
struct iocg_wait wait;
- u32 hw_active, hw_inuse;
u64 abs_cost, cost, vtime;
bool use_debt, ioc_locked;
unsigned long flags;
@@ -2154,21 +2209,8 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
return;
iocg->cursor = bio_end_sector(bio);
-
vtime = atomic64_read(&iocg->vtime);
- current_hweight(iocg, &hw_active, &hw_inuse);
-
- if (hw_inuse < hw_active &&
- time_after_eq64(vtime + ioc->margins.min, now.vnow)) {
- TRACE_IOCG_PATH(inuse_reset, iocg, &now,
- iocg->inuse, iocg->weight, hw_inuse, hw_active);
- spin_lock_irq(&ioc->lock);
- propagate_weights(iocg, iocg->weight, iocg->weight);
- spin_unlock_irq(&ioc->lock);
- current_hweight(iocg, &hw_active, &hw_inuse);
- }
-
- cost = abs_cost_to_cost(abs_cost, hw_inuse);
+ cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
/*
* If no one's waiting and within budget, issue right away. The
@@ -2190,7 +2232,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
*/
use_debt = bio_issue_as_root_blkg(bio) || fatal_signal_pending(current);
ioc_locked = use_debt || READ_ONCE(iocg->abs_vdebt);
-
+retry_lock:
iocg_lock(iocg, ioc_locked, &flags);
/*
@@ -2232,6 +2274,17 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
return;
}
+ /* guarantee that iocgs w/ waiters have maximum inuse */
+ if (iocg->inuse != iocg->active) {
+ if (!ioc_locked) {
+ iocg_unlock(iocg, false, &flags);
+ ioc_locked = true;
+ goto retry_lock;
+ }
+ propagate_weights(iocg, iocg->active, iocg->active, true,
+ &now);
+ }
+
/*
* Append self to the waitq and schedule the wakeup timer if we're
* the first waiter. The timer duration is calculated based on the
@@ -2274,8 +2327,7 @@ static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
struct ioc *ioc = iocg->ioc;
sector_t bio_end = bio_end_sector(bio);
struct ioc_now now;
- u32 hw_inuse;
- u64 abs_cost, cost;
+ u64 vtime, abs_cost, cost;
unsigned long flags;
/* bypass if disabled or for root cgroup */
@@ -2287,8 +2339,9 @@ static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
return;
ioc_now(ioc, &now);
- current_hweight(iocg, NULL, &hw_inuse);
- cost = abs_cost_to_cost(abs_cost, hw_inuse);
+
+ vtime = atomic64_read(&iocg->vtime);
+ cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
/* update cursor if backmerging into the request at the cursor */
if (blk_rq_pos(rq) < bio_end &&
@@ -2530,7 +2583,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
}
spin_lock_irqsave(&ioc->lock, flags);
- weight_updated(iocg);
+ weight_updated(iocg, &now);
spin_unlock_irqrestore(&ioc->lock, flags);
}
@@ -2544,7 +2597,10 @@ static void ioc_pd_free(struct blkg_policy_data *pd)
spin_lock_irqsave(&ioc->lock, flags);
if (!list_empty(&iocg->active_list)) {
- propagate_weights(iocg, 0, 0);
+ struct ioc_now now;
+
+ ioc_now(ioc, &now);
+ propagate_weights(iocg, 0, 0, false, &now);
list_del_init(&iocg->active_list);
}
@@ -2612,6 +2668,7 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
struct blkcg *blkcg = css_to_blkcg(of_css(of));
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
struct blkg_conf_ctx ctx;
+ struct ioc_now now;
struct ioc_gq *iocg;
u32 v;
int ret;
@@ -2632,7 +2689,8 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
if (iocg) {
spin_lock_irq(&iocg->ioc->lock);
- weight_updated(iocg);
+ ioc_now(iocg->ioc, &now);
+ weight_updated(iocg, &now);
spin_unlock_irq(&iocg->ioc->lock);
}
}
@@ -2658,7 +2716,8 @@ static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
spin_lock(&iocg->ioc->lock);
iocg->cfg_weight = v * WEIGHT_ONE;
- weight_updated(iocg);
+ ioc_now(iocg->ioc, &now);
+ weight_updated(iocg, &now);
spin_unlock(&iocg->ioc->lock);
blkg_conf_finish(&ctx);
--
2.26.2
We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 120 ++++++++++++++++++++++-----------------------
1 file changed, 60 insertions(+), 60 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 8dfe73dde2a8..ac22d761a350 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1115,6 +1115,66 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
return false;
}
+static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
+{
+ struct ioc *ioc = iocg->ioc;
+ struct blkcg_gq *blkg = iocg_to_blkg(iocg);
+ u64 vtime = atomic64_read(&iocg->vtime);
+ u64 vmargin = ioc->margin_us * now->vrate;
+ u64 margin_ns = ioc->margin_us * NSEC_PER_USEC;
+ u64 delta_ns, expires, oexpires;
+ u32 hw_inuse;
+
+ lockdep_assert_held(&iocg->waitq.lock);
+
+ /* debt-adjust vtime */
+ current_hweight(iocg, NULL, &hw_inuse);
+ vtime += abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
+
+ /*
+ * Clear or maintain depending on the overage. Non-zero vdebt is what
+ * guarantees that @iocg is online and future iocg_kick_delay() will
+ * clear use_delay. Don't leave it on when there's no vdebt.
+ */
+ if (!iocg->abs_vdebt || time_before_eq64(vtime, now->vnow)) {
+ blkcg_clear_delay(blkg);
+ return false;
+ }
+ if (!atomic_read(&blkg->use_delay) &&
+ time_before_eq64(vtime, now->vnow + vmargin))
+ return false;
+
+ /* use delay */
+ delta_ns = DIV64_U64_ROUND_UP(vtime - now->vnow,
+ now->vrate) * NSEC_PER_USEC;
+ blkcg_set_delay(blkg, delta_ns);
+ expires = now->now_ns + delta_ns;
+
+ /* if already active and close enough, don't bother */
+ oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer));
+ if (hrtimer_is_queued(&iocg->delay_timer) &&
+ abs(oexpires - expires) <= margin_ns / 4)
+ return true;
+
+ hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires),
+ margin_ns / 4, HRTIMER_MODE_ABS);
+ return true;
+}
+
+static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer)
+{
+ struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer);
+ struct ioc_now now;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iocg->waitq.lock, flags);
+ ioc_now(iocg->ioc, &now);
+ iocg_kick_delay(iocg, &now);
+ spin_unlock_irqrestore(&iocg->waitq.lock, flags);
+
+ return HRTIMER_NORESTART;
+}
+
static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
int flags, void *key)
{
@@ -1211,66 +1271,6 @@ static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
-{
- struct ioc *ioc = iocg->ioc;
- struct blkcg_gq *blkg = iocg_to_blkg(iocg);
- u64 vtime = atomic64_read(&iocg->vtime);
- u64 vmargin = ioc->margin_us * now->vrate;
- u64 margin_ns = ioc->margin_us * NSEC_PER_USEC;
- u64 delta_ns, expires, oexpires;
- u32 hw_inuse;
-
- lockdep_assert_held(&iocg->waitq.lock);
-
- /* debt-adjust vtime */
- current_hweight(iocg, NULL, &hw_inuse);
- vtime += abs_cost_to_cost(iocg->abs_vdebt, hw_inuse);
-
- /*
- * Clear or maintain depending on the overage. Non-zero vdebt is what
- * guarantees that @iocg is online and future iocg_kick_delay() will
- * clear use_delay. Don't leave it on when there's no vdebt.
- */
- if (!iocg->abs_vdebt || time_before_eq64(vtime, now->vnow)) {
- blkcg_clear_delay(blkg);
- return false;
- }
- if (!atomic_read(&blkg->use_delay) &&
- time_before_eq64(vtime, now->vnow + vmargin))
- return false;
-
- /* use delay */
- delta_ns = DIV64_U64_ROUND_UP(vtime - now->vnow,
- now->vrate) * NSEC_PER_USEC;
- blkcg_set_delay(blkg, delta_ns);
- expires = now->now_ns + delta_ns;
-
- /* if already active and close enough, don't bother */
- oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer));
- if (hrtimer_is_queued(&iocg->delay_timer) &&
- abs(oexpires - expires) <= margin_ns / 4)
- return true;
-
- hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires),
- margin_ns / 4, HRTIMER_MODE_ABS);
- return true;
-}
-
-static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer)
-{
- struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer);
- struct ioc_now now;
- unsigned long flags;
-
- spin_lock_irqsave(&iocg->waitq.lock, flags);
- ioc_now(iocg->ioc, &now);
- iocg_kick_delay(iocg, &now);
- spin_unlock_irqrestore(&iocg->waitq.lock, flags);
-
- return HRTIMER_NORESTART;
-}
-
static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p)
{
u32 nr_met[2] = { };
--
2.26.2
The margin handling was pretty inconsistent.
* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
thresholds. However, the two are in different units with the former
requiring conversion to vtime on use.
* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
period_us as the timer slack - ~1.2% - while iocg_kick_delay() was using
a quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
different values for the two.
This patch cleans up margin and timer slack handling:
* vtime margins are now recorded in ioc->margins.{min, max} on period
duration changes and used consistently.
* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
used consistently for iocg_kick_waitq() and iocg_kick_delay().
The only functional change is the shortening of the timer slack. No
meaningful visible change is expected.
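For a sense of scale, with a hypothetical 100ms period and vrate at 1:1,
the derived values work out as follows:

	period_us      = 100000;
	timer_slack_ns = period_us * NSEC_PER_USEC * TIMER_SLACK_PCT / 100;
						/* 1ms, shared by both timers */
	margins.min    = (period_us * MARGIN_MIN_PCT / 100) * vrate;
						/* 10ms worth of vtime */
	margins.max    = (period_us * MARGIN_MAX_PCT / 100) * vrate;
						/* 50ms worth of vtime */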
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-iocost.c | 67 ++++++++++++++++++++++++++--------------------
1 file changed, 38 insertions(+), 29 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index dc72cd965837..f36988657594 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -221,11 +221,11 @@ enum {
* serves as its IO credit buffer. Surplus weight adjustment is
* immediately canceled if the vtime margin runs below 10%.
*/
- MARGIN_PCT = 50,
- INUSE_MARGIN_PCT = 10,
+ MARGIN_MIN_PCT = 10,
+ MARGIN_MAX_PCT = 50,
- /* Have some play in waitq timer operations */
- WAITQ_TIMER_MARGIN_PCT = 5,
+ /* Have some play in timer operations */
+ TIMER_SLACK_PCT = 1,
/*
* vtime can wrap well within a reasonable uptime when vrate is
@@ -374,6 +374,11 @@ struct ioc_params {
u32 too_slow_vrate_pct;
};
+struct ioc_margins {
+ s64 min;
+ s64 max;
+};
+
struct ioc_missed {
local_t nr_met;
local_t nr_missed;
@@ -395,8 +400,9 @@ struct ioc {
bool enabled;
struct ioc_params params;
+ struct ioc_margins margins;
u32 period_us;
- u32 margin_us;
+ u32 timer_slack_ns;
u64 vrate_min;
u64 vrate_max;
@@ -415,7 +421,6 @@ struct ioc {
atomic64_t cur_period; /* inc'd each period */
int busy_level; /* saturation history */
- u64 inuse_margin_vtime;
bool weights_updated;
atomic_t hweight_gen; /* for lazy hweights */
@@ -678,6 +683,16 @@ static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost)
#define CREATE_TRACE_POINTS
#include <trace/events/iocost.h>
+static void ioc_refresh_margins(struct ioc *ioc)
+{
+ struct ioc_margins *margins = &ioc->margins;
+ u32 period_us = ioc->period_us;
+ u64 vrate = atomic64_read(&ioc->vtime_rate);
+
+ margins->min = (period_us * MARGIN_MIN_PCT / 100) * vrate;
+ margins->max = (period_us * MARGIN_MAX_PCT / 100) * vrate;
+}
+
/* latency Qos params changed, update period_us and all the dependent params */
static void ioc_refresh_period_us(struct ioc *ioc)
{
@@ -711,9 +726,10 @@ static void ioc_refresh_period_us(struct ioc *ioc)
/* calculate dependent params */
ioc->period_us = period_us;
- ioc->margin_us = period_us * MARGIN_PCT / 100;
- ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP(
- period_us * VTIME_PER_USEC * INUSE_MARGIN_PCT, 100);
+ ioc->timer_slack_ns = div64_u64(
+ (u64)period_us * NSEC_PER_USEC * TIMER_SLACK_PCT,
+ 100);
+ ioc_refresh_margins(ioc);
}
static int ioc_autop_idx(struct ioc *ioc)
@@ -1031,7 +1047,7 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
u64 last_period, cur_period, max_period_delta;
- u64 vtime, vmargin, vmin;
+ u64 vtime, vmin;
int i;
/*
@@ -1077,8 +1093,7 @@ static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
*/
max_period_delta = DIV64_U64_ROUND_UP(VTIME_VALID_DUR, ioc->period_us);
vtime = atomic64_read(&iocg->vtime);
- vmargin = ioc->margin_us * now->vrate;
- vmin = now->vnow - vmargin;
+ vmin = now->vnow - ioc->margins.max;
if (last_period + max_period_delta < cur_period ||
time_before64(vtime, vmin)) {
@@ -1121,8 +1136,6 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
struct ioc *ioc = iocg->ioc;
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
u64 vtime = atomic64_read(&iocg->vtime);
- u64 vmargin = ioc->margin_us * now->vrate;
- u64 margin_ns = ioc->margin_us * NSEC_PER_USEC;
u64 delta_ns, expires, oexpires;
u32 hw_inuse;
@@ -1142,7 +1155,7 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
return false;
}
if (!atomic_read(&blkg->use_delay) &&
- time_before_eq64(vtime, now->vnow + vmargin))
+ time_before_eq64(vtime, now->vnow + ioc->margins.max))
return false;
/* use delay */
@@ -1154,11 +1167,11 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
/* if already active and close enough, don't bother */
oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer));
if (hrtimer_is_queued(&iocg->delay_timer) &&
- abs(oexpires - expires) <= margin_ns / 4)
+ abs(oexpires - expires) <= ioc->timer_slack_ns)
return true;
hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires),
- margin_ns / 4, HRTIMER_MODE_ABS);
+ ioc->timer_slack_ns, HRTIMER_MODE_ABS);
return true;
}
@@ -1206,8 +1219,6 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
struct iocg_wake_ctx ctx = { .iocg = iocg };
- u64 margin_ns = (u64)(ioc->period_us *
- WAITQ_TIMER_MARGIN_PCT / 100) * NSEC_PER_USEC;
u64 vdebt, vshortage, expires, oexpires;
s64 vbudget;
u32 hw_inuse;
@@ -1243,20 +1254,20 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now)
if (WARN_ON_ONCE(ctx.vbudget >= 0))
return;
- /* determine next wakeup, add a quarter margin to guarantee chunking */
+ /* determine next wakeup, add a timer margin to guarantee chunking */
vshortage = -ctx.vbudget;
expires = now->now_ns +
DIV64_U64_ROUND_UP(vshortage, now->vrate) * NSEC_PER_USEC;
- expires += margin_ns / 4;
+ expires += ioc->timer_slack_ns;
/* if already active and close enough, don't bother */
oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->waitq_timer));
if (hrtimer_is_queued(&iocg->waitq_timer) &&
- abs(oexpires - expires) <= margin_ns / 4)
+ abs(oexpires - expires) <= ioc->timer_slack_ns)
return;
hrtimer_start_range_ns(&iocg->waitq_timer, ns_to_ktime(expires),
- margin_ns / 4, HRTIMER_MODE_ABS);
+ ioc->timer_slack_ns, HRTIMER_MODE_ABS);
}
static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
@@ -1399,7 +1410,7 @@ static void ioc_timer_fn(struct timer_list *timer)
/* calc usages and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
- u64 vdone, vtime, vusage, vmargin, vmin;
+ u64 vdone, vtime, vusage, vmin;
u32 hw_active, hw_inuse, usage;
/*
@@ -1450,8 +1461,7 @@ static void ioc_timer_fn(struct timer_list *timer)
}
/* see whether there's surplus vtime */
- vmargin = ioc->margin_us * now.vrate;
- vmin = now.vnow - vmargin;
+ vmin = now.vnow - ioc->margins.max;
iocg->has_surplus = false;
@@ -1623,8 +1633,7 @@ static void ioc_timer_fn(struct timer_list *timer)
nr_surpluses);
atomic64_set(&ioc->vtime_rate, vrate);
- ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP(
- ioc->period_us * vrate * INUSE_MARGIN_PCT, 100);
+ ioc_refresh_margins(ioc);
} else if (ioc->busy_level != prev_busy_level || nr_lagging) {
trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
missed_ppm, rq_wait_pct, nr_lagging,
@@ -1754,7 +1763,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
current_hweight(iocg, &hw_active, &hw_inuse);
if (hw_inuse < hw_active &&
- time_after_eq64(vtime + ioc->inuse_margin_vtime, now.vnow)) {
+ time_after_eq64(vtime + ioc->margins.min, now.vnow)) {
TRACE_IOCG_PATH(inuse_reset, iocg, &now,
iocg->inuse, iocg->weight, hw_inuse, hw_active);
spin_lock_irq(&ioc->lock);
--
2.26.2
blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe
lock, which triggers a lockdep splat because q->stats->lock isn't irqsafe.
Let's make it irqsafe.
Signed-off-by: Tejun Heo <[email protected]>
Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies")
Cc: [email protected] # v5.8+
---
block/blk-stat.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 7da302ff88d0..ae3dd1fb8e61 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -137,6 +137,7 @@ void blk_stat_add_callback(struct request_queue *q,
struct blk_stat_callback *cb)
{
unsigned int bucket;
+ unsigned long flags;
int cpu;
for_each_possible_cpu(cpu) {
@@ -147,20 +148,22 @@ void blk_stat_add_callback(struct request_queue *q,
blk_rq_stat_init(&cpu_stat[bucket]);
}
- spin_lock(&q->stats->lock);
+ spin_lock_irqsave(&q->stats->lock, flags);
list_add_tail_rcu(&cb->list, &q->stats->callbacks);
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
- spin_unlock(&q->stats->lock);
+ spin_unlock_irqrestore(&q->stats->lock, flags);
}
void blk_stat_remove_callback(struct request_queue *q,
struct blk_stat_callback *cb)
{
- spin_lock(&q->stats->lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->stats->lock, flags);
list_del_rcu(&cb->list);
if (list_empty(&q->stats->callbacks) && !q->stats->enable_accounting)
blk_queue_flag_clear(QUEUE_FLAG_STATS, q);
- spin_unlock(&q->stats->lock);
+ spin_unlock_irqrestore(&q->stats->lock, flags);
del_timer_sync(&cb->timer);
}
@@ -183,10 +186,12 @@ void blk_stat_free_callback(struct blk_stat_callback *cb)
void blk_stat_enable_accounting(struct request_queue *q)
{
- spin_lock(&q->stats->lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->stats->lock, flags);
q->stats->enable_accounting = true;
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
- spin_unlock(&q->stats->lock);
+ spin_unlock_irqrestore(&q->stats->lock, flags);
}
EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);
--
2.26.2
1-2 applied for 5.9, and the rest for 5.10. Thanks Tejun.
--
Jens Axboe
On Tue, Sep 01, 2020 at 02:52:33PM -0400, Tejun Heo wrote:
> blk-iocost has been reading percpu stat counters from remote cpus which on
> some archs can lead to torn reads on really rare occasions. Use local[64]_t
> for those counters.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> block/blk-iocost.c | 37 +++++++++++++++++++++++++++----------
> 1 file changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index d37b55db2409..e2266e7692b4 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -179,6 +179,8 @@
> #include <linux/parser.h>
> #include <linux/sched/signal.h>
> #include <linux/blk-cgroup.h>
> +#include <asm/local.h>
> +#include <asm/local64.h>
Hi Tejun,
FYI, I am just noticing this, but it breaks my allyesconfig build on
OpenRISC, as 32-bit arch/openrisc doesn't define local64.h.
In general local64 is slow on 32-bit architectures; would that be a
problem with the usage here? Are the calls to local64_* below on critical
paths?
Either way, I will submit a patch to include the generic local64.h on
OpenRISC; I confirmed it fixes the build. I do not know of anyone using
cgroups on OpenRISC systems.
-Stafford
> #include "blk-rq-qos.h"
> #include "blk-stat.h"
> #include "blk-wbt.h"
> @@ -373,8 +375,8 @@ struct ioc_params {
> };
>
> struct ioc_missed {
> - u32 nr_met;
> - u32 nr_missed;
> + local_t nr_met;
> + local_t nr_missed;
> u32 last_met;
> u32 last_missed;
> };
> @@ -382,7 +384,7 @@ struct ioc_missed {
> struct ioc_pcpu_stat {
> struct ioc_missed missed[2];
>
> - u64 rq_wait_ns;
> + local64_t rq_wait_ns;
> u64 last_rq_wait_ns;
> };
>
> @@ -1278,8 +1280,8 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
> u64 this_rq_wait_ns;
>
> for (rw = READ; rw <= WRITE; rw++) {
> - u32 this_met = READ_ONCE(stat->missed[rw].nr_met);
> - u32 this_missed = READ_ONCE(stat->missed[rw].nr_missed);
> + u32 this_met = local_read(&stat->missed[rw].nr_met);
> + u32 this_missed = local_read(&stat->missed[rw].nr_missed);
>
> nr_met[rw] += this_met - stat->missed[rw].last_met;
> nr_missed[rw] += this_missed - stat->missed[rw].last_missed;
> @@ -1287,7 +1289,7 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
> stat->missed[rw].last_missed = this_missed;
> }
>
> - this_rq_wait_ns = READ_ONCE(stat->rq_wait_ns);
> + this_rq_wait_ns = local64_read(&stat->rq_wait_ns);
> rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns;
> stat->last_rq_wait_ns = this_rq_wait_ns;
> }
> @@ -1908,6 +1910,7 @@ static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
> static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
> {
> struct ioc *ioc = rqos_to_ioc(rqos);
> + struct ioc_pcpu_stat *ccs;
> u64 on_q_ns, rq_wait_ns, size_nsec;
> int pidx, rw;
>
> @@ -1931,13 +1934,17 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
> rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
> size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
>
> + ccs = get_cpu_ptr(ioc->pcpu_stat);
> +
> if (on_q_ns <= size_nsec ||
> on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC)
> - this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met);
> + local_inc(&ccs->missed[rw].nr_met);
> else
> - this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed);
> + local_inc(&ccs->missed[rw].nr_missed);
> +
> + local64_add(rq_wait_ns, &ccs->rq_wait_ns);
>
> - this_cpu_add(ioc->pcpu_stat->rq_wait_ns, rq_wait_ns);
> + put_cpu_ptr(ccs);
> }
>
> static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos)
> @@ -1977,7 +1984,7 @@ static int blk_iocost_init(struct request_queue *q)
> {
> struct ioc *ioc;
> struct rq_qos *rqos;
> - int ret;
> + int i, cpu, ret;
>
> ioc = kzalloc(sizeof(*ioc), GFP_KERNEL);
> if (!ioc)
> @@ -1989,6 +1996,16 @@ static int blk_iocost_init(struct request_queue *q)
> return -ENOMEM;
> }
>
> + for_each_possible_cpu(cpu) {
> + struct ioc_pcpu_stat *ccs = per_cpu_ptr(ioc->pcpu_stat, cpu);
> +
> + for (i = 0; i < ARRAY_SIZE(ccs->missed); i++) {
> + local_set(&ccs->missed[i].nr_met, 0);
> + local_set(&ccs->missed[i].nr_missed, 0);
> + }
> + local64_set(&ccs->rq_wait_ns, 0);
> + }
> +
> rqos = &ioc->rqos;
> rqos->id = RQ_QOS_COST;
> rqos->ops = &ioc_rqos_ops;
> --
> 2.26.2
>
Hello,
On Sat, Nov 21, 2020 at 06:51:47AM +0900, Stafford Horne wrote:
> FYI, I am just noticing this, but it breaks my allyesconfig build on
> OpenRISC, as 32-bit arch/openrisc doesn't define local64.h.
>
> In general local64 is slow on 32-bit architectures; would that be a
> problem with the usage here? Are the calls to local64_* below on
> critical paths?
It gets hot when running on really high IOPS devices, but that hopefully
isn't a problem for 32-bit OpenRISC.
> Either way, I will submit a patch to include the generic local64.h on
> OpenRISC; I confirmed it fixes the build. I do not know of anyone using
> cgroups on OpenRISC systems.
Yeah, sounds like the right fix.
Thanks.
--
tejun
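For reference, the usual shape of that fix is a one-line Kbuild addition
pulling in the asm-generic fallback (a sketch assuming the standard
generic-y mechanism, not the actual submitted patch):

	# arch/openrisc/include/asm/Kbuild
	generic-y += local64.h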
On 01/09/2020 19:52, Tejun Heo wrote:
Background:
I have been trying to solve some block/ sparse issues, and it has led me
to dig up this old mail.
> Currently, debt handling requires only iocg->waitq.lock. In the future, we
> want to adjust and propagate inuse changes depending on debt status. Let's
> grab ioc->lock in debt handling paths in preparation.
>
> * Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
> ioc->lock needs to be made before entering the critical sections.
>
> * Add and use iocg_[un]lock() which handles the conditional double locking.
>
> * Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
> the caller grabbed both locks.
>
> This patch is prepatory and the comments contain references to future
> changes.
>
> Signed-off-by: Tejun Heo<[email protected]>
> ---
> block/blk-iocost.c | 92 ++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 73 insertions(+), 19 deletions(-)
>
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index f36988657594..23b173e34591 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -680,6 +680,26 @@ static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost)
> atomic64_add(cost, &iocg->vtime);
> }
>
> +static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
> +{
> + if (lock_ioc) {
> + spin_lock_irqsave(&iocg->ioc->lock, *flags);
> + spin_lock(&iocg->waitq.lock);
> + } else {
> + spin_lock_irqsave(&iocg->waitq.lock, *flags);
> + }
> +}
This generates the following sparse warnings on mainline today:
CHECK block/blk-iocost.c
block/blk-iocost.c:685:9: warning: context imbalance in 'iocg_lock' - wrong count at exit
block/blk-iocost.c:696:28: warning: context imbalance in 'iocg_unlock' - unexpected unlock
If we try to break iocg_lock() into one version for lock_ioc set and
another for lock_ioc unset, we can solve the sparse issues for those
functions, but then we get another sparse issue from the callsites for
those functions:
block/blk-iocost.c:2679:17: warning: context imbalance in 'ioc_rqos_throttle' - different lock contexts for basic block
I tried to solve this with a total ioc_rqos_throttle() reorg and much code
duplication, calling the different lock and unlock versions from
effectively two separate copies of ioc_rqos_throttle(), as sparse seems
confused by how we call these functions. It's a total no-go.
Any simpler idea to solve these? Or just something to live with?
Thanks,
John
> +
> +static void iocg_unlock(struct ioc_gq *iocg, bool unlock_ioc, unsigned long *flags)
> +{
> + if (unlock_ioc) {
> + spin_unlock(&iocg->waitq.lock);
> + spin_unlock_irqrestore(&iocg->ioc->lock, *flags);
> + } else {
> + spin_unlock_irqrestore(&iocg->waitq.lock, *flags);
> + }
> +}
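For background on why these warnings are hard to silence: sparse's
lock-context tracking is static and per-function, and no annotation can
express a lock that is taken only when a runtime flag is set. A sketch of
the mismatch using the real __acquires() annotation (the comment marks
what cannot be expressed):

static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
	__acquires(&iocg->waitq.lock)
	/* no way to also say "acquires ioc->lock iff lock_ioc", so one
	 * branch always looks imbalanced to sparse */
{
	if (lock_ioc) {
		spin_lock_irqsave(&iocg->ioc->lock, *flags);
		spin_lock(&iocg->waitq.lock);
	} else {
		spin_lock_irqsave(&iocg->waitq.lock, *flags);
	}
}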
Hello,
On Wed, Jun 12, 2024 at 12:33:19PM +0100, John Garry wrote:
...
> This generates the following sparse warnings on mainline today:
>
> CHECK block/blk-iocost.c
> block/blk-iocost.c:685:9: warning: context imbalance in 'iocg_lock' - wrong count at exit
> block/blk-iocost.c:696:28: warning: context imbalance in 'iocg_unlock' - unexpected unlock
>
> If we try to break iocg_lock() into one version for lock_ioc set and another
> for lock_ioc unset, we can solve the sparse issues for those functions, but
> then we get another sparse issue from the callsites for those functions:
>
> block/blk-iocost.c:2679:17: warning: context imbalance in 'ioc_rqos_throttle' - different lock contexts for basic block
>
> I tried to solve this with a total ioc_rqos_throttle() reorg and much code
> duplication, calling the different lock and unlock versions from
> effectively two separate copies of ioc_rqos_throttle(), as sparse seems
> confused by how we call these functions. It's a total no-go.
>
> Any simpler idea to solve these? Or just something to live with?
I kinda gave up on sparse lock checking as there's not much it can do that
lockdep can't and the annotations are too awkward and inconsistent
throughout the code base. So, my tendency is just to ignore it.
Thanks.
--
tejun