2020-04-14 23:19:13

by Tejun Heo

Subject: [PATCHSET v2 block/for-5.8] iocost: improve use_delay and latency target handling

Changes from v1[1]

* Dropped 0002-block-add-request-io_data_len.patch and updated to use
rq->stats_sectors instead as suggested by Pavel Begunkov.

This patchset improves the following two iocost control behaviors.

* iocost was failing to punish heavy shared IO generators (file metadata, memory
reclaim) through the use_delay mechanism - use_delay automatically decays, which
works well for iolatency but doesn't match how iocost behaves. This allowed
e.g. memory bombs which generate a lot of swap IOs to use more than their
allotted amount. This is fixed by adding a non-decaying use_delay mechanism.

* The same latency targets were being applied regardless of the IO sizes. While
this works fine for loose targets, it gets in the way when trying to tighten
them - a latency target adequate for a 4k IO is too short for a 1 meg IO.
iocost now discounts the size portion of cost when testing whether a given IO
met or missed its latency target.

While at it, it also makes minor changes to iocost_monitor.py.

This patchset contains the following four patches.

0001-blk-iocost-switch-to-fixed-non-auto-decaying-use_del.patch
0002-blk-iocost-account-for-IO-size-when-testing-latencie.patch
0003-iocost_monitor-exit-successfully-if-interval-is-zero.patch
0004-iocost_monitor-drop-string-wrap-around-numbers-when-.patch

and is also available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git iocost-delay-latency-v2

diffstat follows. Thanks.

block/Kconfig | 1
block/blk-cgroup.c | 6 ++++
block/blk-iocost.c | 56 +++++++++++++++++++++++++++++------------
include/linux/blk-cgroup.h | 43 ++++++++++++++++++++++++-------
tools/cgroup/iocost_monitor.py | 48 +++++++++++++++++++----------------
5 files changed, 106 insertions(+), 48 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/[email protected]


2020-04-14 23:19:14

by Tejun Heo

Subject: [PATCH 1/4] blk-iocost: switch to fixed non-auto-decaying use_delay

The use_delay mechanism was introduced by blk-iolatency to hold memory
allocators accountable for the reclaim and other shared IOs they cause. The
duration of the delay is dynamically balanced between iolatency increasing the
value on each target miss and it auto-decaying as time passes and threads get
delayed on it.

While this works well for iolatency, iocost's control model isn't compatible
with it. There are no repeated "violation" events which can be balanced against
auto-decaying. iocost instead knows how much a given cgroup is over budget and
wants to prevent that cgroup from issuing IOs while over budget. Until now,
iocost has been adding the cost of force-issued IOs. However, this doesn't
reflect the amount which is already over budget and is simply not enough to
counter the auto-decaying, allowing an anon-memory-leaking low priority cgroup
to go over its allotted share of IOs.

As auto-decaying doesn't make much sense for iocost, this patch introduces a
different mode of operation for use_delay - when blkcg_set_delay() is used
instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
delay duration synchronized to the budget overage amount.

With this change, iocost can effectively police cgroups which generate a
significant amount of force-issued IOs.
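
For readers who want to poke at the semantics outside the kernel, here is a
rough single-threaded Python model of the two use_delay modes described above.
It is only a sketch: the class and method names are made up for illustration,
the atomics are replaced by plain integers, and the halving in scale_delay()
is a placeholder rather than the kernel's actual decay formula. The
overage_to_delay_ns() helper mirrors the round-up division the patch uses to
turn the vtime overage into a delay.

```python
class BlkgDelayModel:
    """Toy model of a blkg's use_delay state (hypothetical, not kernel code)."""

    def __init__(self):
        self.use_delay = 0          # >0: counted/decaying mode, -1: fixed mode
        self.delay_nsec = 0
        self.congestion_count = 0   # stands in for the cgroup-wide counter

    def set_delay(self, delay_nsec):
        # The first caller flips 0 -> -1 and accounts the congestion once.
        if self.use_delay == 0:
            self.use_delay = -1
            self.congestion_count += 1
        self.delay_nsec = delay_nsec

    def clear_delay(self):
        # Leaving either mode drops the congestion accounting exactly once.
        if self.use_delay != 0:
            self.use_delay = 0
            self.congestion_count -= 1

    def scale_delay(self):
        # Negative use_delay means fixed mode: no decay at all.
        if self.use_delay < 0:
            return
        self.delay_nsec //= 2       # placeholder decay, not the real formula


def overage_to_delay_ns(vtime, vnow, vrate):
    """Round-up division of the vtime overage by vrate, scaled usec -> nsec."""
    nsec_per_usec = 1000
    return -(-(vtime - vnow) // vrate) * nsec_per_usec
```

In this model a delay installed with set_delay() survives any number of
scale_delay() calls and only goes away through clear_delay(), which is the
property iocost relies on to keep the delay in sync with the budget overage.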

Signed-off-by: Tejun Heo <[email protected]>
Cc: Josef Bacik <[email protected]>
---
block/blk-cgroup.c | 6 ++++++
block/blk-iocost.c | 23 ++++++++------------
include/linux/blk-cgroup.h | 43 +++++++++++++++++++++++++++++---------
3 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c5dc833212e1..0a63c6cbbcb1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1530,6 +1530,10 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 now)
{
u64 old = atomic64_read(&blkg->delay_start);

+ /* negative use_delay means no scaling, see blkcg_set_delay() */
+ if (atomic_read(&blkg->use_delay) < 0)
+ return;
+
/*
* We only want to scale down every second. The idea here is that we
* want to delay people for min(delay_nsec, NSEC_PER_SEC) in a certain
@@ -1717,6 +1721,8 @@ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
*/
void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
{
+ if (WARN_ON_ONCE(atomic_read(&blkg->use_delay) < 0))
+ return;
blkcg_scale_delay(blkg, now);
atomic64_add(delta, &blkg->delay_nsec);
}
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index db35ee682294..a8e99ef76a08 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1209,14 +1209,14 @@ static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

-static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now, u64 cost)
+static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
{
struct ioc *ioc = iocg->ioc;
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
u64 vtime = atomic64_read(&iocg->vtime);
u64 vmargin = ioc->margin_us * now->vrate;
u64 margin_ns = ioc->margin_us * NSEC_PER_USEC;
- u64 expires, oexpires;
+ u64 delta_ns, expires, oexpires;
u32 hw_inuse;

/* debt-adjust vtime */
@@ -1233,15 +1233,10 @@ static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now, u64 cost)
return false;

/* use delay */
- if (cost) {
- u64 cost_ns = DIV64_U64_ROUND_UP(cost * NSEC_PER_USEC,
- now->vrate);
- blkcg_add_delay(blkg, now->now_ns, cost_ns);
- }
- blkcg_use_delay(blkg);
-
- expires = now->now_ns + DIV64_U64_ROUND_UP(vtime - now->vnow,
- now->vrate) * NSEC_PER_USEC;
+ delta_ns = DIV64_U64_ROUND_UP(vtime - now->vnow,
+ now->vrate) * NSEC_PER_USEC;
+ blkcg_set_delay(blkg, delta_ns);
+ expires = now->now_ns + delta_ns;

/* if already active and close enough, don't bother */
oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer));
@@ -1260,7 +1255,7 @@ static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer)
struct ioc_now now;

ioc_now(iocg->ioc, &now);
- iocg_kick_delay(iocg, &now, 0);
+ iocg_kick_delay(iocg, &now);

return HRTIMER_NORESTART;
}
@@ -1378,7 +1373,7 @@ static void ioc_timer_fn(struct timer_list *timer)
atomic64_read(&iocg->abs_vdebt)) {
/* might be oversleeping vtime / hweight changes, kick */
iocg_kick_waitq(iocg, &now);
- iocg_kick_delay(iocg, &now, 0);
+ iocg_kick_delay(iocg, &now);
} else if (iocg_is_idle(iocg)) {
/* no waiter and idle, deactivate */
iocg->last_inuse = iocg->inuse;
@@ -1737,7 +1732,7 @@ static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
*/
if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) {
atomic64_add(abs_cost, &iocg->abs_vdebt);
- if (iocg_kick_delay(iocg, &now, cost))
+ if (iocg_kick_delay(iocg, &now))
blkcg_schedule_throttle(rqos->q,
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
return;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 35f8ffe92b70..3f0f51c4a571 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -629,6 +629,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,

static inline void blkcg_use_delay(struct blkcg_gq *blkg)
{
+ if (WARN_ON_ONCE(atomic_read(&blkg->use_delay) < 0))
+ return;
if (atomic_add_return(1, &blkg->use_delay) == 1)
atomic_inc(&blkg->blkcg->css.cgroup->congestion_count);
}
@@ -637,6 +639,8 @@ static inline int blkcg_unuse_delay(struct blkcg_gq *blkg)
{
int old = atomic_read(&blkg->use_delay);

+ if (WARN_ON_ONCE(old < 0))
+ return 0;
if (old == 0)
return 0;

@@ -661,20 +665,39 @@ static inline int blkcg_unuse_delay(struct blkcg_gq *blkg)
return 1;
}

+/**
+ * blkcg_set_delay - Enable allocator delay mechanism with the specified delay amount
+ * @blkg: target blkg
+ * @delay: delay duration in nsecs
+ *
+ * When enabled with this function, the delay is not decayed and must be
+ * explicitly cleared with blkcg_clear_delay(). Must not be mixed with
+ * blkcg_[un]use_delay() and blkcg_add_delay() usages.
+ */
+static inline void blkcg_set_delay(struct blkcg_gq *blkg, u64 delay)
+{
+ int old = atomic_read(&blkg->use_delay);
+
+ /* We only want 1 person setting the congestion count for this blkg. */
+ if (!old && atomic_cmpxchg(&blkg->use_delay, old, -1) == old)
+ atomic_inc(&blkg->blkcg->css.cgroup->congestion_count);
+
+ atomic64_set(&blkg->delay_nsec, delay);
+}
+
+/**
+ * blkcg_clear_delay - Disable allocator delay mechanism
+ * @blkg: target blkg
+ *
+ * Disable use_delay mechanism. See blkcg_set_delay().
+ */
static inline void blkcg_clear_delay(struct blkcg_gq *blkg)
{
int old = atomic_read(&blkg->use_delay);
- if (!old)
- return;
+
/* We only want 1 person clearing the congestion count for this blkg. */
- while (old) {
- int cur = atomic_cmpxchg(&blkg->use_delay, old, 0);
- if (cur == old) {
- atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
- break;
- }
- old = cur;
- }
+ if (old && atomic_cmpxchg(&blkg->use_delay, old, 0) == old)
+ atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
}

void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
--
2.25.2

2020-04-14 23:29:58

by Tejun Heo

Subject: [PATCH 3/4] iocost_monitor: exit successfully if interval is zero

This is to help external tools to decide whether iocost_monitor has all its
requirements met or not based on the exit status of an -i0 run.
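
As an illustration of how an external tool might consume that exit status,
here is a small Python sketch. The helper name and the command line in the
comment are assumptions (the exact way the script is launched depends on the
local drgn setup); the only thing taken from this patch is that an -i0 run
exits with status 0 once the requirement checks pass.

```python
import subprocess


def requirements_met(cmd):
    """Return True if the probe command runs and exits with status 0."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        # The interpreter or script could not be started at all.
        return False
    return result.returncode == 0


# Hypothetical invocation; adjust the launcher, path and device name:
# requirements_met(['drgn', 'tools/cgroup/iocost_monitor.py', 'sda', '-i0'])
```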

Signed-off-by: Tejun Heo <[email protected]>
---
tools/cgroup/iocost_monitor.py | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/cgroup/iocost_monitor.py b/tools/cgroup/iocost_monitor.py
index 7427a5ee761b..eb2363b868c5 100644
--- a/tools/cgroup/iocost_monitor.py
+++ b/tools/cgroup/iocost_monitor.py
@@ -28,7 +28,8 @@ parser.add_argument('devname', metavar='DEV',
parser.add_argument('--cgroup', action='append', metavar='REGEX',
help='Regex for target cgroups, ')
parser.add_argument('--interval', '-i', metavar='SECONDS', type=float, default=1,
- help='Monitoring interval in seconds')
+ help='Monitoring interval in seconds (0 exits immediately '
+ 'after checking requirements)')
parser.add_argument('--json', action='store_true',
help='Output in json')
args = parser.parse_args()
@@ -243,6 +244,9 @@ ioc = None
if ioc is None:
err(f'Could not find ioc for {devname}');

+if interval == 0:
+ sys.exit(0)
+
# Keep printing
while True:
now = time.time()
--
2.25.2

2020-04-14 23:29:58

by Tejun Heo

Subject: [PATCH 2/4] blk-iocost: account for IO size when testing latencies

On each IO completion, iocost decides whether the IO met or missed its latency
target. Currently, the targets are fixed numbers per IO type. While this can be
good enough for loose latency targets way higher than typical completion
latencies, the effect of IO size makes it difficult to tighten the latency
target - a target adequate for 4k IOs might be too tight for 512k IOs and
vice-versa.

iocost already has all the necessary information to account for different IO
sizes when testing whether the latency target is met as iocost can calculate the
size vtime cost of a given IO. This patch updates the completion path to
calculate the size vtime cost of the IO, deduct the nsec equivalent from the
observed latency and use the adjusted value to decide whether the target is met.

This makes latency targets independent from IO size and enables determining
adequate latency targets with fixed size fio runs.
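
To make the arithmetic concrete, the adjusted check can be modeled in Python
roughly as follows. This is a sketch, not the kernel code: the per-page
coefficient used in the usage note is invented, and the sector-to-page shift
assumes 512-byte sectors and 4k pages. The constants and the met/missed
condition follow the diff below.

```python
VTIME_PER_SEC_SHIFT = 37
VTIME_PER_SEC = 1 << VTIME_PER_SEC_SHIFT
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_USEC = 1_000
VTIME_PER_NSEC = VTIME_PER_SEC // NSEC_PER_SEC   # integer division, as in the enum

SECT_TO_PAGE_SHIFT = 3   # assumption: 512-byte sectors, 4k pages


def size_vtime_cost(sectors, per_page_coef):
    """Size portion of the vtime cost: whole pages times a per-page coefficient."""
    return (sectors >> SECT_TO_PAGE_SHIFT) * per_page_coef


def met_latency_target(on_q_ns, size_cost_vtime, target_us):
    """Deduct the nsec equivalent of the size cost before comparing to the target."""
    size_nsec = size_cost_vtime // VTIME_PER_NSEC
    return (on_q_ns <= size_nsec or
            on_q_ns - size_nsec <= target_us * NSEC_PER_USEC)
```

With a made-up coefficient worth 10 usec per page and a 1000 usec target, a
1 MiB IO (2048 sectors) that took 3 ms counts as met because about 2.56 ms is
attributed to its size, while a 4k IO with the same 3 ms latency still counts
as missed.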

Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
---
block/Kconfig | 1 +
block/blk-iocost.c | 33 +++++++++++++++++++++++++++++++--
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 3bc76bb113a0..41cb34b0fcd1 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -146,6 +146,7 @@ config BLK_CGROUP_IOLATENCY
config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
+ select BLK_RQ_IO_DATA_LEN
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index a8e99ef76a08..9a667dd75eef 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -260,6 +260,7 @@ enum {
VTIME_PER_SEC_SHIFT = 37,
VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT,
VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC,
+ VTIME_PER_NSEC = VTIME_PER_SEC / NSEC_PER_SEC,

/* bound vrate adjustments within two orders of magnitude */
VRATE_MIN_PPM = 10000, /* 1% */
@@ -1668,6 +1669,31 @@ static u64 calc_vtime_cost(struct bio *bio, struct ioc_gq *iocg, bool is_merge)
return cost;
}

+static void calc_size_vtime_cost_builtin(struct request *rq, struct ioc *ioc,
+ u64 *costp)
+{
+ unsigned int pages = blk_rq_stats_sectors(rq) >> IOC_SECT_TO_PAGE_SHIFT;
+
+ switch (req_op(rq)) {
+ case REQ_OP_READ:
+ *costp = pages * ioc->params.lcoefs[LCOEF_RPAGE];
+ break;
+ case REQ_OP_WRITE:
+ *costp = pages * ioc->params.lcoefs[LCOEF_WPAGE];
+ break;
+ default:
+ *costp = 0;
+ }
+}
+
+static u64 calc_size_vtime_cost(struct request *rq, struct ioc *ioc)
+{
+ u64 cost;
+
+ calc_size_vtime_cost_builtin(rq, ioc, &cost);
+ return cost;
+}
+
static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
{
struct blkcg_gq *blkg = bio->bi_blkg;
@@ -1837,7 +1863,7 @@ static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
{
struct ioc *ioc = rqos_to_ioc(rqos);
- u64 on_q_ns, rq_wait_ns;
+ u64 on_q_ns, rq_wait_ns, size_nsec;
int pidx, rw;

if (!ioc->enabled || !rq->alloc_time_ns || !rq->start_time_ns)
@@ -1858,8 +1884,10 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)

on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
+ size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);

- if (on_q_ns <= ioc->params.qos[pidx] * NSEC_PER_USEC)
+ if (on_q_ns <= size_nsec ||
+ on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC)
this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met);
else
this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed);
@@ -2267,6 +2295,7 @@ static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
spin_lock_irq(&ioc->lock);

if (enable) {
+ blk_stat_enable_accounting(ioc->rqos.q);
blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q);
ioc->enabled = true;
} else {
--
2.25.2

2020-04-14 23:49:31

by Tejun Heo

Subject: [PATCH 4/4] iocost_monitor: drop string wrap around numbers when outputting json

Wrapping numbers in strings is used by some to work around bit-width issues in
some environments. The problem isn't innate to json and the workaround seems to
cause more integration problems than help. Let's drop the string wrapping.
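
A short Python illustration of the difference, using made-up field values.
The large number is picked to sit just above 2**53, the point where consumers
that store every JSON number as a double start losing integer precision; that
is presumably the kind of bit-width issue the wrapping was meant to dodge.

```python
import json

big = 2**53 + 1   # not exactly representable as an IEEE double

wrapped = {'weight': str(100), 'period_at': str(big)}   # old style output
plain = {'weight': 100, 'period_at': big}               # new style output

wrapped_json = json.dumps(wrapped)
plain_json = json.dumps(plain)

# Python's json module keeps integers exact in both directions, so the
# plain form round-trips without any string wrapping.
decoded = json.loads(plain_json)
```

Consumers of the wrapped form have to convert every field back to a number
themselves, which is the integration cost the patch removes.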

Signed-off-by: Tejun Heo <[email protected]>
---
tools/cgroup/iocost_monitor.py | 42 +++++++++++++++++-----------------
1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/tools/cgroup/iocost_monitor.py b/tools/cgroup/iocost_monitor.py
index eb2363b868c5..188b3379b9a1 100644
--- a/tools/cgroup/iocost_monitor.py
+++ b/tools/cgroup/iocost_monitor.py
@@ -113,14 +113,14 @@ autop_names = {

def dict(self, now):
return { 'device' : devname,
- 'timestamp' : str(now),
- 'enabled' : str(int(self.enabled)),
- 'running' : str(int(self.running)),
- 'period_ms' : str(self.period_ms),
- 'period_at' : str(self.period_at),
- 'period_vtime_at' : str(self.vperiod_at),
- 'busy_level' : str(self.busy_level),
- 'vrate_pct' : str(self.vrate_pct), }
+ 'timestamp' : now,
+ 'enabled' : self.enabled,
+ 'running' : self.running,
+ 'period_ms' : self.period_ms,
+ 'period_at' : self.period_at,
+ 'period_vtime_at' : self.vperiod_at,
+ 'busy_level' : self.busy_level,
+ 'vrate_pct' : self.vrate_pct, }

def table_preamble_str(self):
state = ('RUN' if self.running else 'IDLE') if self.enabled else 'OFF'
@@ -175,19 +175,19 @@ autop_names = {

def dict(self, now, path):
out = { 'cgroup' : path,
- 'timestamp' : str(now),
- 'is_active' : str(int(self.is_active)),
- 'weight' : str(self.weight),
- 'weight_active' : str(self.active),
- 'weight_inuse' : str(self.inuse),
- 'hweight_active_pct' : str(self.hwa_pct),
- 'hweight_inuse_pct' : str(self.hwi_pct),
- 'inflight_pct' : str(self.inflight_pct),
- 'debt_ms' : str(self.debt_ms),
- 'use_delay' : str(self.use_delay),
- 'delay_ms' : str(self.delay_ms),
- 'usage_pct' : str(self.usage),
- 'address' : str(hex(self.address)) }
+ 'timestamp' : now,
+ 'is_active' : self.is_active,
+ 'weight' : self.weight,
+ 'weight_active' : self.active,
+ 'weight_inuse' : self.inuse,
+ 'hweight_active_pct' : self.hwa_pct,
+ 'hweight_inuse_pct' : self.hwi_pct,
+ 'inflight_pct' : self.inflight_pct,
+ 'debt_ms' : self.debt_ms,
+ 'use_delay' : self.use_delay,
+ 'delay_ms' : self.delay_ms,
+ 'usage_pct' : self.usage,
+ 'address' : self.address }
for i in range(len(self.usages)):
out[f'usage_pct_{i}'] = str(self.usages[i])
return out
--
2.25.2

2020-04-30 21:57:26

by Jens Axboe

Subject: Re: [PATCHSET v2 block/for-5.8] iocost: improve use_delay and latency target handling

On 4/13/20 10:27 AM, Tejun Heo wrote:
> Changes from v1[1]
>
> * Dropped 0002-block-add-request-io_data_len.patch and updated to use
> rq->stats_sectors instead as suggested by Pavel Begunkov.
>
> This patchset improves the following two iocost control behaviors.
>
> * iocost was failing to punish heavy shared IO generators (file metadata, memory
> reclaim) through the use_delay mechanism - use_delay automatically decays, which
> works well for iolatency but doesn't match how iocost behaves. This allowed
> e.g. memory bombs which generate a lot of swap IOs to use more than their
> allotted amount. This is fixed by adding a non-decaying use_delay mechanism.
>
> * The same latency targets were being applied regardless of the IO sizes. While
> this works fine for loose targets, it gets in the way when trying to tighten
> them - a latency target adequate for a 4k IO is too short for a 1 meg IO.
> iocost now discounts the size portion of cost when testing whether a given IO
> met or missed its latency target.
>
> While at it, it also makes minor changes to iocost_monitor.py.
>
> This patchset contains the following four patches.
>
> 0001-blk-iocost-switch-to-fixed-non-auto-decaying-use_del.patch
> 0002-blk-iocost-account-for-IO-size-when-testing-latencie.patch
> 0003-iocost_monitor-exit-successfully-if-interval-is-zero.patch
> 0004-iocost_monitor-drop-string-wrap-around-numbers-when-.patch
>
> and is also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git iocost-delay-latency-v2
>
> diffstat follows. Thanks.
>
> block/Kconfig | 1
> block/blk-cgroup.c | 6 ++++
> block/blk-iocost.c | 56 +++++++++++++++++++++++++++++------------
> include/linux/blk-cgroup.h | 43 ++++++++++++++++++++++++-------
> tools/cgroup/iocost_monitor.py | 48 +++++++++++++++++++----------------
> 5 files changed, 106 insertions(+), 48 deletions(-)

Applied, thanks Tejun.

--
Jens Axboe