2016-11-14 22:22:54

by Shaohua Li

Subject: [PATCH V4 00/15] blk-throttle: add .high limit

Hi,

The background is that we don't have an I/O scheduler for blk-mq yet, so we
can't prioritize processes/cgroups. This patch set tries to add basic
arbitration between cgroups with blk-throttle. It adds a new limit, io.high,
to blk-throttle. It's only for cgroup2.

io.max is hard limit throttling: cgroups with a max limit never dispatch more
IO than their max limit. io.high, in contrast, is best effort throttling:
cgroups with a high limit can run above their high limit at appropriate times.
Specifically, if all cgroups reach their high limit, all cgroups can run above
their high limit. If any cgroup runs below its high limit, all other cgroups
will run according to their high limit.

An example usage is a high prio cgroup with a high 'high' limit and a low prio
cgroup with a low 'high' limit. If the high prio cgroup isn't running, the low
prio cgroup can run above its high limit, so we don't waste the bandwidth.
When the high prio cgroup runs and is below its high limit, the low prio
cgroup will run below its high limit too. This protects the high prio cgroup's
access to resources. If both cgroups reach their high limit, both can run
above their high limit (e.g., fully utilize disk bandwidth). None of this can
be done with the io.max limit alone.
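
As a purely illustrative configuration for the two cgroups above (made-up
device number, bandwidth values in bytes/s, same rbps=/wbps=/riops=/wiops=
syntax as io.max):

  high prio cgroup's io.high:  8:0 rbps=104857600 wbps=104857600
  low prio cgroup's io.high:   8:0 rbps=10485760 wbps=10485760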

The implementation is simple. The disk queue has 2 states, LIMIT_HIGH and
LIMIT_MAX. In each state, we throttle cgroups according to the limit of that
state: the io.high limit in LIMIT_HIGH state, the io.max limit in LIMIT_MAX
state. The disk state can be upgraded/downgraded between LIMIT_HIGH and
LIMIT_MAX according to the rule above. Initially the disk state is LIMIT_MAX,
and if no cgroup sets io.high, the disk state will remain in LIMIT_MAX. Users
with only io.max set will find nothing changed with the patches.
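
A rough sketch of the per-state limit lookup, simplified from patch 3's
tg_bps_limit() (the cgroup2 root special case is omitted):

	static uint64_t effective_bps_limit(struct throtl_grp *tg, int rw)
	{
		/* -1 means "no limit configured" for this state */
		uint64_t ret = tg->bps[rw][tg->td->limit_index];

		/* a cgroup without io.high falls back to io.max in LIMIT_HIGH state */
		if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
			return tg->bps[rw][LIMIT_MAX];
		return ret;
	}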

The first 8 patches implement the basic framework: add the interface and
handle the upgrade and downgrade logic. Patch 8 detects a special case where a
cgroup is completely idle; in this case, we ignore the cgroup's limit. Patches
9-15 add more heuristics.

The basic framework has 2 major issues.
1. Fluctuation. When the state is upgraded from LIMIT_HIGH to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in an unexpected way.
For example, one cgroup's bandwidth can drop below its io.high limit very soon
after an upgrade. Patch 9 has more details about the issue.
2. Idle cgroup. A cgroup with an io.high limit doesn't always dispatch enough
IO. With the above upgrade rule, the disk will remain in LIMIT_HIGH state and
all other cgroups can't dispatch IO above their high limit, which is a waste
of disk bandwidth. Patch 10 has more details about the issue.

For issue 1, we make cgroup bandwidth increase smoothly after an upgrade. This
reduces the chance that a cgroup's bandwidth drops below its high limit soon
afterwards. The smoothness means we could waste some bandwidth in the
transition, but we must pay something for sharing.
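
Roughly, after an upgrade the effective limit starts at the high limit and
grows by half of the high limit every throtl_slice until it reaches the max
limit. A simplified sketch of the idea (the real code in patch 9 caches the
scale factor in struct throtl_data and only applies this when an io.high bps
limit is configured):

	static uint64_t smoothed_bps_limit(struct throtl_grp *tg, int rw)
	{
		struct throtl_data *td = tg->td;
		/* scale grows by 1 every throtl_slice since the upgrade, capped */
		unsigned long scale = min(4096UL,
			(jiffies - td->high_upgrade_time) / td->throtl_slice);
		uint64_t increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * scale;

		return min(tg->bps[rw][LIMIT_MAX],
			   tg->bps[rw][LIMIT_HIGH] + increase);
	}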

Issue 2 is very hard to solve. Patch 10 uses the 'think time check' idea
borrowed from CFQ to detect idle cgroups. It's not perfect, e.g., it doesn't
work well for high IO depth workloads, but it's the best I have tried so far
and it works well in practice. This definitely needs more tuning.
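
The gist of the think time check (essentially patch 10's throtl_tg_is_idle()):
track an exponential moving average of the gap between a bio's completion and
the cgroup's next bio submission, and consider the cgroup idle when that
average exceeds a per-disk threshold (50us for SSD, 1ms for HD by default) or
when the cgroup hasn't completed any IO for a long time:

	/* on submission: avg_ttime = (avg_ttime * 7 + (now - last_finish_time)) / 8 */
	static bool throtl_tg_is_idle(struct throtl_grp *tg)
	{
		return ktime_get_ns() - tg->last_finish_time >
			       4 * tg->td->idle_ttime_threshold ||
		       tg->avg_ttime > tg->td->idle_ttime_threshold;
	}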

The big change in this version comes from patches 13-15. We add a latency
target for each cgroup, again aiming at issue 2. If a cgroup's average IO
latency exceeds its latency target, the cgroup is considered busy.

Please review, test and consider merging.

Thanks,
Shaohua

V3->V4:
- Add latency target for cgroup
- Fix bugs

V2->V3:
- Rebase
- Fix several bugs
- Make the hard disk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have to
prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2


Shaohua Li (15):
blk-throttle: prepare support multiple limits
blk-throttle: add .high interface
blk-throttle: configure bps/iops limit for cgroup in high limit
blk-throttle: add upgrade logic for LIMIT_HIGH state
blk-throttle: add downgrade logic
blk-throttle: make sure expire time isn't too big
blk-throttle: make throtl_slice tunable
blk-throttle: detect completed idle cgroup
blk-throttle: make bandwidth change smooth
blk-throttle: add a simple idle detection
blk-throttle: add interface to configure think time threshold
blk-throttle: ignore idle cgroup limit
blk-throttle: add a mechanism to estimate IO latency
blk-throttle: add interface for per-cgroup target latency
blk-throttle: add latency target support

block/bio.c | 2 +
block/blk-sysfs.c | 18 +
block/blk-throttle.c | 1035 ++++++++++++++++++++++++++++++++++++++++++---
block/blk.h | 9 +
include/linux/blk_types.h | 4 +
5 files changed, 1001 insertions(+), 67 deletions(-)

--
2.9.3


2016-11-14 22:22:31

by Shaohua Li

Subject: [PATCH V4 02/15] blk-throttle: add .high interface

Add high limit for cgroup and corresponding cgroup interface.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 132 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 103 insertions(+), 29 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 925aa1ed..a564215 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

enum {
+ LIMIT_HIGH,
LIMIT_MAX,
LIMIT_CNT,
};
@@ -414,6 +415,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
}

+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+ bool high_valid = false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1)
+ high_valid = true;
+ }
+ rcu_read_unlock();
+
+ if (high_valid)
+ td->limit_valid[LIMIT_HIGH] = true;
+ else
+ td->limit_valid[LIMIT_HIGH] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+ struct throtl_grp *tg = pd_to_tg(pd);
+
+ tg->bps[READ][LIMIT_HIGH] = -1;
+ tg->bps[WRITE][LIMIT_HIGH] = -1;
+ tg->iops[READ][LIMIT_HIGH] = -1;
+ tg->iops[WRITE][LIMIT_HIGH] = -1;
+
+ blk_throtl_update_valid_limit(tg->td);
+
+ if (tg->td->limit_index == LIMIT_HIGH &&
+ !tg->td->limit_valid[LIMIT_HIGH])
+ tg->td->limit_index = LIMIT_MAX;
+}
+
static void throtl_pd_free(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1283,7 +1324,7 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
};

-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1292,36 +1333,32 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
- tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
+ if (tg->bps[READ][off] == -1 && tg->bps[WRITE][off] == -1 &&
+ tg->iops[READ][off] == -1 && tg->iops[WRITE][off] == -1)
return 0;

- if (tg->bps[READ][LIMIT_MAX] != -1)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu",
- tg->bps[READ][LIMIT_MAX]);
- if (tg->bps[WRITE][LIMIT_MAX] != -1)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu",
- tg->bps[WRITE][LIMIT_MAX]);
- if (tg->iops[READ][LIMIT_MAX] != -1)
- snprintf(bufs[2], sizeof(bufs[2]), "%u",
- tg->iops[READ][LIMIT_MAX]);
- if (tg->iops[WRITE][LIMIT_MAX] != -1)
- snprintf(bufs[3], sizeof(bufs[3]), "%u",
- tg->iops[WRITE][LIMIT_MAX]);
+ if (tg->bps[READ][off] != -1)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ][off]);
+ if (tg->bps[WRITE][off] != -1)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE][off]);
+ if (tg->iops[READ][off] != -1)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ][off]);
+ if (tg->iops[WRITE][off] != -1)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE][off]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
return 0;
}

-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
{
- blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
&blkcg_policy_throtl, seq_cft(sf)->private, false);
return 0;
}

-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1329,6 +1366,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
int ret;
+ int index = of_cft(of)->private;

ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
if (ret)
@@ -1336,10 +1374,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ][LIMIT_MAX];
- v[1] = tg->bps[WRITE][LIMIT_MAX];
- v[2] = tg->iops[READ][LIMIT_MAX];
- v[3] = tg->iops[WRITE][LIMIT_MAX];
+ v[0] = tg->bps[READ][index];
+ v[1] = tg->bps[WRITE][index];
+ v[2] = tg->iops[READ][index];
+ v[3] = tg->iops[WRITE][index];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1376,11 +1414,37 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ][LIMIT_MAX] = v[0];
- tg->bps[WRITE][LIMIT_MAX] = v[1];
- tg->iops[READ][LIMIT_MAX] = v[2];
- tg->iops[WRITE][LIMIT_MAX] = v[3];
-
+ if (index == LIMIT_MAX) {
+ if ((v[0] < tg->bps[READ][LIMIT_HIGH] &&
+ tg->bps[READ][LIMIT_HIGH] != -1) ||
+ (v[1] < tg->bps[WRITE][LIMIT_HIGH] &&
+ tg->bps[WRITE][LIMIT_HIGH] != -1) ||
+ (v[2] < tg->iops[READ][LIMIT_HIGH] &&
+ tg->iops[READ][LIMIT_HIGH] != -1) ||
+ (v[3] < tg->iops[WRITE][LIMIT_HIGH] &&
+ tg->iops[WRITE][LIMIT_HIGH] != -1)) {
+ ret = -EINVAL;
+ goto out_finish;
+ }
+ } else if (index == LIMIT_HIGH) {
+ if ((v[0] > tg->bps[READ][LIMIT_MAX] && v[0] != -1) ||
+ (v[1] > tg->bps[WRITE][LIMIT_MAX] && v[1] != -1) ||
+ (v[2] > tg->iops[READ][LIMIT_MAX] && v[2] != -1) ||
+ (v[3] > tg->iops[WRITE][LIMIT_MAX] && v[3] != -1)) {
+ ret = -EINVAL;
+ goto out_finish;
+ }
+ }
+ tg->bps[READ][index] = v[0];
+ tg->bps[WRITE][index] = v[1];
+ tg->iops[READ][index] = v[2];
+ tg->iops[WRITE][index] = v[3];
+
+ if (index == LIMIT_HIGH) {
+ blk_throtl_update_valid_limit(tg->td);
+ if (tg->td->limit_valid[LIMIT_HIGH])
+ tg->td->limit_index = LIMIT_HIGH;
+ }
tg_conf_updated(tg);
ret = 0;
out_finish:
@@ -1390,10 +1454,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

static struct cftype throtl_files[] = {
{
+ .name = "high",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_HIGH,
+ },
+ {
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
- .seq_show = tg_print_max,
- .write = tg_set_max,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_MAX,
},
{ } /* terminate */
};
@@ -1412,6 +1484,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_alloc_fn = throtl_pd_alloc,
.pd_init_fn = throtl_pd_init,
.pd_online_fn = throtl_pd_online,
+ .pd_offline_fn = throtl_pd_offline,
.pd_free_fn = throtl_pd_free,
};

@@ -1591,6 +1664,7 @@ int blk_throtl_init(struct request_queue *q)
td->queue = q;

td->limit_valid[LIMIT_MAX] = true;
+ td->limit_index = LIMIT_MAX;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-11-14 22:22:34

by Shaohua Li

Subject: [PATCH V4 05/15] blk-throttle: add downgrade logic

When the queue state machine is in LIMIT_MAX state but a cgroup stays below
its high limit for some time, the queue should be downgraded to the lower
state, since that cgroup's high limit isn't being met.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 188 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 34a75e5..d177252 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -136,6 +136,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];

+ unsigned long last_high_overflow_time[2];
+
+ uint64_t last_bytes_disp[2];
+ unsigned int last_io_disp[2];
+
+ unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -155,6 +162,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+ unsigned long high_upgrade_time;
+ unsigned long high_downgrade_time;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+ tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+ tg->last_io_disp[rw]++;

/*
* REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1510,6 +1522,65 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static unsigned long __tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+ unsigned long rtime = -1, wtime = -1;
+
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1)
+ rtime = tg->last_high_overflow_time[READ];
+ if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1)
+ wtime = tg->last_high_overflow_time[WRITE];
+ return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *parent_sq;
+ struct throtl_grp *parent = tg;
+ unsigned long ret = __tg_last_high_overflow_time(tg);
+
+ while (true) {
+ parent_sq = parent->service_queue.parent_sq;
+ parent = sq_to_tg(parent_sq);
+ if (!parent)
+ break;
+ if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
+ parent->bps[READ][LIMIT_HIGH] >=
+ tg->bps[READ][LIMIT_HIGH]) ||
+ (parent->bps[READ][LIMIT_HIGH] == -1 &&
+ parent->bps[READ][LIMIT_MAX] >=
+ tg->bps[READ][LIMIT_HIGH])) &&
+ ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
+ parent->bps[WRITE][LIMIT_HIGH] >=
+ tg->bps[WRITE][LIMIT_HIGH]) ||
+ (parent->bps[WRITE][LIMIT_HIGH] == -1 &&
+ parent->bps[WRITE][LIMIT_MAX] >=
+ tg->bps[WRITE][LIMIT_HIGH])) &&
+ ((parent->iops[READ][LIMIT_HIGH] != -1 &&
+ parent->iops[READ][LIMIT_HIGH] >=
+ tg->iops[READ][LIMIT_HIGH]) ||
+ (parent->iops[READ][LIMIT_HIGH] == -1 &&
+ parent->iops[READ][LIMIT_MAX] >=
+ tg->iops[READ][LIMIT_HIGH])) &&
+ ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
+ parent->iops[WRITE][LIMIT_HIGH] >=
+ tg->iops[WRITE][LIMIT_HIGH]) ||
+ (parent->iops[WRITE][LIMIT_HIGH] == -1 &&
+ parent->iops[WRITE][LIMIT_MAX] >=
+ tg->iops[WRITE][LIMIT_HIGH])))
+ break;
+ if (time_after(__tg_last_high_overflow_time(parent), ret))
+ ret = __tg_last_high_overflow_time(parent);
+ }
+ return ret;
+}
+
static bool throtl_upgrade_check_one(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1557,6 +1628,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_HIGH)
return false;

+ if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+ return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1580,6 +1654,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct blkcg_gq *blkg;

td->limit_index = LIMIT_MAX;
+ td->high_upgrade_time = jiffies;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1595,6 +1670,111 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, &td->dispatch_work);
}

+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+ td->limit_index = new;
+ td->high_downgrade_time = jiffies;
+}
+
+static bool throtl_downgrade_check_one(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ unsigned long now = jiffies;
+
+ /*
+ * If cgroup is below high limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
+ time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ return true;
+ return false;
+}
+
+static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg)
+{
+ if (!throtl_downgrade_check_one(tg))
+ return false;
+ while (true) {
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ break;
+
+ if (!throtl_downgrade_check_one(tg))
+ return false;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ }
+ return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+ uint64_t bps;
+ unsigned int iops;
+ unsigned long elapsed_time;
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_MAX ||
+ !tg->td->limit_valid[LIMIT_HIGH])
+ return;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ return;
+ if (time_after(tg->last_check_time + throtl_slice, now))
+ return;
+
+ if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ return;
+
+ elapsed_time = now - tg->last_check_time;
+ tg->last_check_time = now;
+
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1) {
+ bps = tg->last_bytes_disp[READ] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[READ][LIMIT_HIGH] ||
+ bps >= tg->bps[READ][LIMIT_MAX])
+ tg->last_high_overflow_time[READ] = now;
+ }
+
+ if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1) {
+ bps = tg->last_bytes_disp[WRITE] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[WRITE][LIMIT_HIGH] ||
+ bps >= tg->bps[WRITE][LIMIT_MAX])
+ tg->last_high_overflow_time[WRITE] = now;
+ }
+
+ if (tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1) {
+ iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+ if (iops >= tg->iops[READ][LIMIT_HIGH] ||
+ iops >= tg->iops[READ][LIMIT_MAX])
+ tg->last_high_overflow_time[READ] = now;
+ }
+
+ if (tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1) {
+ iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+ if (iops >= tg->iops[WRITE][LIMIT_HIGH] ||
+ iops >= tg->iops[WRITE][LIMIT_MAX])
+ tg->last_high_overflow_time[WRITE] = now;
+ }
+
+ /*
+ * If cgroup is below high limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (throtl_downgrade_check_hierarchy(tg))
+ throtl_downgrade_state(tg->td, LIMIT_HIGH);
+
+ tg->last_bytes_disp[READ] = 0;
+ tg->last_bytes_disp[WRITE] = 0;
+ tg->last_io_disp[READ] = 0;
+ tg->last_io_disp[WRITE] = 0;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1619,12 +1799,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ if (tg->last_high_overflow_time[rw] == 0)
+ tg->last_high_overflow_time[rw] = jiffies;
+ throtl_downgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
+ tg->last_high_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(tg->td, tg)) {
throtl_upgrade_state(tg->td);
goto again;
@@ -1668,6 +1852,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

+ tg->last_high_overflow_time[rw] = jiffies;
+
bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
@@ -1778,6 +1964,8 @@ int blk_throtl_init(struct request_queue *q)

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
+ td->high_upgrade_time = jiffies;
+ td->high_downgrade_time = jiffies;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-11-14 22:22:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 03/15] blk-throttle: configure bps/iops limit for cgroup in high limit

Each queue will have a state machine. Initially the queue is in LIMIT_HIGH
state, which means all cgroups will be throttled according to their high
limit. After all cgroups with a high limit cross the limit, the queue state
gets upgraded to LIMIT_MAX state.
Cgroups without a high limit will use the max limit as their high limit.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a564215..ec53671 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -208,12 +208,29 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)

static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
- return tg->bps[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ uint64_t ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return -1;
+ ret = tg->bps[rw][tg->td->limit_index];
+ if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ return tg->bps[rw][LIMIT_MAX];
+
+ return ret;
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
- return tg->iops[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ unsigned int ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return -1;
+ ret = tg->iops[rw][tg->td->limit_index];
+ if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ return tg->iops[rw][LIMIT_MAX];
+ return ret;
}

/**
--
2.9.3

2016-11-14 22:23:25

by Shaohua Li

Subject: [PATCH V4 14/15] blk-throttle: add interface for per-cgroup target latency

Add an interface for per-cgroup target latency. The latency target is for 4k
requests.
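
For illustration only (made-up device number and value; the format follows the
parser added below): writing "8:0 4k_lat=100000" to the new io.latency_target
file sets the 4k latency target for that device.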

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a05d351..ac4d9ea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -147,6 +147,8 @@ struct throtl_grp {
unsigned long last_check_time;

int upgrade_check_batch;
+
+ u64 latency_target;
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -463,6 +465,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
tg->iops[rw][index] = -1;
}
}
+ /* target latency default 0, eg, always not meet */

return &tg->pd;
}
@@ -1572,6 +1575,64 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
return ret ?: nbytes;
}

+static u64 tg_prfill_latency_target(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
+{
+ struct throtl_grp *tg = pd_to_tg(pd);
+ const char *dname = blkg_dev_name(pd->blkg);
+
+ if (!dname)
+ return 0;
+ if (tg->latency_target == 0)
+ return 0;
+
+ seq_printf(sf, "%s 4k_lat=%llu\n", dname, tg->latency_target);
+ return 0;
+}
+
+static int tg_print_latency_target(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ tg_prfill_latency_target, &blkcg_policy_throtl,
+ seq_cft(sf)->private, false);
+ return 0;
+}
+
+static ssize_t tg_set_latency_target(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct blkcg *blkcg = css_to_blkcg(of_css(of));
+ struct blkg_conf_ctx ctx;
+ struct throtl_grp *tg;
+ int ret = -EINVAL;
+ char tok[27];
+ char *p;
+ u64 val;
+
+ ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
+ if (ret)
+ return ret;
+
+ tg = blkg_to_tg(ctx.blkg);
+
+ if (sscanf(ctx.body, "%26s", tok) != 1)
+ goto out_finish;
+
+ p = tok;
+ strsep(&p, "=");
+ if (!p || kstrtou64(p, 10, &val))
+ goto out_finish;
+
+ if (strcmp(tok, "4k_lat"))
+ goto out_finish;
+
+ tg->latency_target = val;
+ ret = 0;
+out_finish:
+ blkg_conf_finish(&ctx);
+ return ret ?: nbytes;
+}
+
static struct cftype throtl_files[] = {
{
.name = "high",
@@ -1587,6 +1648,12 @@ static struct cftype throtl_files[] = {
.write = tg_set_limit,
.private = LIMIT_MAX,
},
+ {
+ .name = "latency_target",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = tg_print_latency_target,
+ .write = tg_set_latency_target,
+ },
{ } /* terminate */
};

--
2.9.3

2016-11-14 22:23:50

by Shaohua Li

Subject: [PATCH V4 12/15] blk-throttle: ignore idle cgroup limit

The last patch introduced a way to detect an idle cgroup. Use it to make the
upgrade/downgrade decisions. The new algorithm can detect a completely idle
cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e403e88..01b494d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -146,8 +146,7 @@ struct throtl_grp {

unsigned long last_check_time;

- unsigned long last_dispatch_time[2];
-
+ int upgrade_check_batch;
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -487,8 +486,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
* Update has_rules[] after a new group is brought online.
*/
tg_update_has_rules(tg);
- tg->last_dispatch_time[READ] = jiffies;
- tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1667,9 +1664,8 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
return true;

if (time_after_eq(jiffies,
- tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
- time_after_eq(jiffies,
- tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ tg_last_high_overflow_time(tg) + tg->td->throtl_slice) &&
+ throtl_tg_is_idle(tg))
return true;
return false;
}
@@ -1718,6 +1714,24 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
}

+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+ if (tg->td->limit_index != LIMIT_HIGH)
+ return;
+
+ if (!time_after_eq(jiffies,
+ __tg_last_high_overflow_time(tg) + tg->td->throtl_slice))
+ return;
+
+ tg->upgrade_check_batch++;
+ if (tg->upgrade_check_batch < 16)
+ return;
+ tg->upgrade_check_batch = 0;
+
+ if (throtl_can_upgrade(tg->td, NULL))
+ throtl_upgrade_state(tg->td);
+}
+
static void throtl_upgrade_state(struct throtl_data *td)
{
struct cgroup_subsys_state *pos_css;
@@ -1752,18 +1766,15 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

- if (time_after_eq(now, tg->last_dispatch_time[READ] +
- td->throtl_slice) &&
- time_after_eq(now, tg->last_dispatch_time[WRITE] +
- td->throtl_slice))
- return false;
/*
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
*/
if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_high_overflow_time(tg) +
- td->throtl_slice))
+ td->throtl_slice) &&
+ (!throtl_tg_is_idle(tg) ||
+ !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
}
@@ -1902,10 +1913,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
- tg->last_dispatch_time[rw] = jiffies;
if (tg->last_high_overflow_time[rw] == 0)
tg->last_high_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+ throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
--
2.9.3

2016-11-14 22:24:00

by Shaohua Li

Subject: [PATCH V4 15/15] blk-throttle: add latency target support

One hard problem in adding the .high limit is detecting idle cgroups. If one
cgroup doesn't dispatch enough IO against its high limit, we must have a
mechanism to determine if other cgroups can dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all workloads.
Here we add a latency based approach.

We calculate the average request size and average latency of a cgroup. Then we
can calculate the target latency for the cgroup from its average request size
and the equation. In queue LIMIT_HIGH state, if a cgroup doesn't dispatch
enough IO against its high limit but its average latency is lower than its
target latency, we treat the cgroup as idle. In this case other cgroups can
dispatch more IO, e.g., cross their high limit. Similarly, in queue LIMIT_MAX
state, if a cgroup doesn't dispatch enough IO but its average latency is
higher than its target latency, we treat the cgroup as busy. In this case, we
should throttle other cgroups to make the first cgroup's latency lower.
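
As a made-up example of how the target scales with request size (using the
line fit from patch 13): if a cgroup's configured 4k latency target is 100us
and the fitted slope is 2us per KB, a cgroup whose average request size is 64k
gets an effective target of roughly 100us + 2us * (64 - 4) = 220us.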

If a cgroup's average request size is big (the threshold is currently 128k),
we always treat the cgroup as busy (the think time check is still effective
though).

Currently this latency target check is only for SSDs, as we can't calculate
the latency target for hard disks. And it only applies to cgroup leaf nodes so
far.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---
include/linux/blk_types.h | 1 +
2 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ac4d9ea..d07f332 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,6 +156,12 @@ struct throtl_grp {
u64 last_finish_time;
u64 checked_last_finish_time;
u64 avg_ttime;
+
+ unsigned int bio_batch;
+ u64 total_latency;
+ u64 avg_latency;
+ u64 total_size;
+ u64 avg_size;
};

/* We measure latency for request size from 4k to 4k * ( 1 << 4) */
@@ -1734,12 +1740,30 @@ static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
return ret;
}

+static u64 throtl_target_latency(struct throtl_data *td,
+ struct throtl_grp *tg)
+{
+ if (td->line_slope == 0 || tg->latency_target == 0)
+ return 0;
+
+ /* latency_target + f(avg_size) - f(4k) */
+ return td->line_slope * ((tg->avg_size >> 10) - 4) +
+ tg->latency_target;
+}
+
static bool throtl_tg_is_idle(struct throtl_grp *tg)
{
- /* cgroup is idle if average think time is more than threshold */
- return ktime_get_ns() - tg->last_finish_time >
+ /*
+ * cgroup is idle if:
+ * 1. average think time is higher than threshold
+ * 2. average request size is small and average latency is higher
+ * than target
+ */
+ return (ktime_get_ns() - tg->last_finish_time >
4 * tg->td->idle_ttime_threshold ||
- tg->avg_ttime > tg->td->idle_ttime_threshold;
+ tg->avg_ttime > tg->td->idle_ttime_threshold) ||
+ (tg->avg_latency && tg->avg_size && tg->avg_size <= 128 * 1024 &&
+ tg->avg_latency < throtl_target_latency(tg->td, tg));
}

static bool throtl_upgrade_check_one(struct throtl_grp *tg)
@@ -2123,6 +2147,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
bio_associate_current(bio);
bio->bi_cg_private = q;
bio->bi_cg_size = bio_sectors(bio);
+ bio->bi_cg_enter_time = ktime_get_ns();

blk_throtl_update_ttime(tg);

@@ -2264,6 +2289,33 @@ void blk_throtl_bio_endio(struct bio *bio)
}
}

+ if (bio->bi_cg_enter_time && finish_time > bio->bi_cg_enter_time &&
+ tg->latency_target) {
+ lat = finish_time - bio->bi_cg_enter_time;
+ tg->total_latency += lat;
+ tg->total_size += bio->bi_cg_size << 9;
+ tg->bio_batch++;
+ }
+
+ if (tg->bio_batch >= 8) {
+ int batch = tg->bio_batch;
+ u64 size = tg->total_size;
+
+ lat = tg->total_latency;
+
+ tg->bio_batch = 0;
+ tg->total_latency = 0;
+ tg->total_size = 0;
+
+ if (batch) {
+ do_div(lat, batch);
+ tg->avg_latency = (tg->avg_latency * 7 +
+ lat) >> 3;
+ do_div(size, batch);
+ tg->avg_size = (tg->avg_size * 7 + size) >> 3;
+ }
+ }
+
end:
rcu_read_unlock();
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 45bb437..fe87a20 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -61,6 +61,7 @@ struct bio {
struct cgroup_subsys_state *bi_css;
void *bi_cg_private;
u64 bi_cg_issue_time;
+ u64 bi_cg_enter_time;
sector_t bi_cg_size;
#endif
union {
--
2.9.3

2016-11-14 22:24:11

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 10/15] blk-throttle: add a simple idle detection

A cgroup gets assigned a high limit, but the cgroup could never dispatch
enough IO to cross the high limit. In such a case, the queue state machine
will remain in LIMIT_HIGH state and all other cgroups will be throttled
according to their high limit. This is unfair to the other cgroups. We should
treat the cgroup as idle and upgrade the state machine to the higher state.

We also have downgrade logic. If the state machine upgrades because a cgroup
is idle (real idle), the state machine will downgrade soon because the cgroup
stays below its high limit. This isn't what we want. A more complicated case
is a cgroup that isn't idle while the queue is in LIMIT_HIGH, but once the
queue gets upgraded to the higher state, other cgroups dispatch more IO and
this cgroup can't dispatch enough IO, so the cgroup is below its high limit
and looks idle (fake idle). In this case, the queue should downgrade soon. The
key to deciding whether to downgrade is to detect whether a cgroup is truly
idle.

Unfortunately it's very hard to determine if a cgroup is really idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please note,
the idea doesn't work for all workloads. For example, a workload with IO depth
8 can have 100% disk utilization, hence a think time of 0, i.e., not idle, yet
the workload could run at higher bandwidth with IO depth 16. Compared to IO
depth 16, the IO depth 8 workload is idle. We use the idea to roughly
determine if a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by default
50us for SSD and 1ms for HD). The idea is that think time above the threshold
starts to harm performance. HD is much slower, so a longer think time is ok.

Signed-off-by: Shaohua Li <[email protected]>
---
block/bio.c | 2 ++
block/blk-throttle.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
block/blk.h | 2 ++
include/linux/blk_types.h | 1 +
4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index db85c57..7baa86d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
#include <linux/cgroup.h>

#include <trace/events/block.h>
+#include "blk.h"

/*
* Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1759,6 +1760,7 @@ void bio_endio(struct bio *bio)
goto again;
}

+ blk_throtl_bio_endio(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
}
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 45a28c4..cb5fd85 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,8 @@ static int throtl_quantum = 32;
/* Throttling is performed over 100ms slice and after that slice is renewed */
#define DFL_THROTL_SLICE (HZ / 10)
#define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */

static struct blkcg_policy blkcg_policy_throtl;

@@ -149,6 +151,10 @@ struct throtl_grp {
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
+
+ u64 last_finish_time;
+ u64 checked_last_finish_time;
+ u64 avg_ttime;
};

struct throtl_data
@@ -172,6 +178,8 @@ struct throtl_data
unsigned long high_downgrade_time;

unsigned int scale;
+
+ u64 idle_ttime_threshold;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -1629,6 +1637,14 @@ static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
return ret;
}

+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+ /* cgroup is idle if average think time is more than threshold */
+ return ktime_get_ns() - tg->last_finish_time >
+ 4 * tg->td->idle_ttime_threshold ||
+ tg->avg_ttime > tg->td->idle_ttime_threshold;
+}
+
static bool throtl_upgrade_check_one(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1837,6 +1853,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
tg->last_io_disp[WRITE] = 0;
}

+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+ u64 now = ktime_get_ns();
+ u64 last_finish_time = tg->last_finish_time;
+
+ if (now <= last_finish_time || last_finish_time == 0 ||
+ last_finish_time == tg->checked_last_finish_time)
+ return;
+
+ tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+ tg->checked_last_finish_time = last_finish_time;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1848,6 +1877,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

WARN_ON_ONCE(!rcu_read_lock_held());

+ if (tg->td->idle_ttime_threshold == -1) {
+ if (blk_queue_nonrot(q))
+ tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ else
+ tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+ }
+
/* see throtl_charge_bio() */
if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
goto out;
@@ -1857,6 +1893,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

+ bio_associate_current(bio);
+ bio->bi_cg_private = q;
+
+ blk_throtl_update_ttime(tg);
+
sq = &tg->service_queue;

again:
@@ -1917,7 +1958,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_high_overflow_time[rw] = jiffies;

- bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
@@ -1946,6 +1986,34 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
return throttled;
}

+void blk_throtl_bio_endio(struct bio *bio)
+{
+ struct blkcg *blkcg;
+ struct blkcg_gq *blkg;
+ struct throtl_grp *tg;
+ struct request_queue *q;
+
+ q = bio->bi_cg_private;
+ if (!q)
+ return;
+ bio->bi_cg_private = NULL;
+
+ rcu_read_lock();
+ blkcg = bio_blkcg(bio);
+ if (!blkcg)
+ goto end;
+ blkg = blkg_lookup(blkcg, q);
+ if (!blkg)
+ goto end;
+
+ tg = blkg_to_tg(blkg ?: q->root_blkg);
+
+ tg->last_finish_time = ktime_get_ns();
+
+end:
+ rcu_read_unlock();
+}
+
/*
* Dispatch all bios from all children tg's queued on @parent_sq. On
* return, @parent_sq is guaranteed to not have any active children tg's
@@ -2030,6 +2098,8 @@ int blk_throtl_init(struct request_queue *q)
td->limit_index = LIMIT_MAX;
td->high_upgrade_time = jiffies;
td->high_downgrade_time = jiffies;
+
+ td->idle_ttime_threshold = -1;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
diff --git a/block/blk.h b/block/blk.h
index 39c14dd..b433f35 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,10 +292,12 @@ extern void blk_throtl_exit(struct request_queue *q);
extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_slice_store(struct request_queue *q,
const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..ff8dd24 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,6 +59,7 @@ struct bio {
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
+ void *bi_cg_private;
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.9.3

2016-11-14 22:24:05

by Shaohua Li

Subject: [PATCH V4 08/15] blk-throttle: detect completed idle cgroup

A cgroup could be assigned a limit but not dispatch enough IO, e.g., the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so we
can't move the state machine to the higher level and all cgroups will be
throttled to their lower limit, wasting bandwidth. Detecting an idle cgroup is
hard. This patch handles a simple case: a cgroup that doesn't dispatch any IO.
We ignore such a cgroup's limit, so other cgroups can use the bandwidth.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e85b2b6..32cc6ec 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -144,6 +144,8 @@ struct throtl_grp {

unsigned long last_check_time;

+ unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -438,11 +440,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)

static void throtl_pd_online(struct blkg_policy_data *pd)
{
+ struct throtl_grp *tg = pd_to_tg(pd);
/*
* We don't want new groups to escape the limits of its ancestors.
* Update has_rules[] after a new group is brought online.
*/
- tg_update_has_rules(pd_to_tg(pd));
+ tg_update_has_rules(tg);
+ tg->last_dispatch_time[READ] = jiffies;
+ tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1611,6 +1616,12 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+ if (time_after_eq(jiffies,
+ tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+ time_after_eq(jiffies,
+ tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ return true;
return false;
}

@@ -1691,6 +1702,11 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

+ if (time_after_eq(now, tg->last_dispatch_time[READ] +
+ td->throtl_slice) &&
+ time_after_eq(now, tg->last_dispatch_time[WRITE] +
+ td->throtl_slice))
+ return false;
/*
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
@@ -1811,6 +1827,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ tg->last_dispatch_time[rw] = jiffies;
if (tg->last_high_overflow_time[rw] == 0)
tg->last_high_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
--
2.9.3

2016-11-14 22:24:03

by Shaohua Li

Subject: [PATCH V4 09/15] blk-throttle: make bandwidth change smooth

When cgroups all reach their high limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO while others don't, and some cgroups
could even dispatch less IO than their high limit. For example, cg1 has a high
limit of 10MB/s, cg2 a high limit of 80MB/s, and assume the disk's maximum
bandwidth is 120MB/s for the workload. Their bps could look something like
this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their high limit, so they can dispatch more IO later.
Then cg1 dispatches more IO and cg2 has no room to dispatch enough IO. At T2,
cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches less IO than
its high limit of 80MB/s, we downgrade the queue from LIMIT_MAX to LIMIT_HIGH,
and then all cgroups are throttled to their high limit (T3). cg2 will have
bandwidth below its high limit most of the time.

The big problem here is we don't know the maximum bandwidth of the workload,
so we can't make a smart decision to avoid the situation. This patch makes
cgroup bandwidth change smoothly. After the disk upgrades from LIMIT_HIGH to
LIMIT_MAX, we don't allow cgroups to use all the bandwidth up to their max
limit immediately. Their bandwidth limit will be increased gradually to avoid
the above situation. So the above example becomes something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 10/80

In this way cgroup bandwidth will be above the high limit most of the time.
This still doesn't fully utilize disk bandwidth, but that's something we pay
for sharing.

Note this doesn't completely avoid a cgroup running below its high limit. The
best way to guarantee a cgroup doesn't run below its limit is to set a max
limit. For example, if we set cg1's max limit to 40MB/s, cg2 will never run
below its high limit.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 42 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 32cc6ec..45a28c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,8 @@ struct throtl_data

unsigned long high_upgrade_time;
unsigned long high_downgrade_time;
+
+ unsigned int scale;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -224,12 +226,27 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
uint64_t ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
- ret = tg->bps[rw][tg->td->limit_index];
- if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ td = tg->td;
+ ret = tg->bps[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_HIGH] != -1) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->high_upgrade_time + td->scale * td->throtl_slice)) {
+ unsigned int time = jiffies - td->high_upgrade_time;
+
+ td->scale = time / td->throtl_slice;
+ }
+ increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * td->scale;
+ ret = min(tg->bps[rw][LIMIT_MAX],
+ tg->bps[rw][LIMIT_HIGH] + increase);
+ }
+ if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->bps[rw][LIMIT_MAX];

return ret;
@@ -238,12 +255,28 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
unsigned int ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
- ret = tg->iops[rw][tg->td->limit_index];
- if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ td = tg->td;
+ ret = tg->iops[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_HIGH] != -1) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->high_upgrade_time + td->scale * td->throtl_slice)) {
+ unsigned int time = jiffies - td->high_upgrade_time;
+
+ td->scale = time / td->throtl_slice;
+ }
+
+ increase = (tg->iops[rw][LIMIT_HIGH] >> 1) * td->scale;
+ ret = min(tg->iops[rw][LIMIT_MAX],
+ tg->iops[rw][LIMIT_HIGH] + (unsigned int)increase);
+ }
+ if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->iops[rw][LIMIT_MAX];
return ret;
}
@@ -1676,6 +1709,7 @@ static void throtl_upgrade_state(struct throtl_data *td)

td->limit_index = LIMIT_MAX;
td->high_upgrade_time = jiffies;
+ td->scale = 0;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
--
2.9.3

2016-11-14 22:23:47

by Shaohua Li

Subject: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

We try to set a latency target for each cgroup. The problem is that latency
highly depends on request size, and users can't configure a target for every
request size. The idea is that users configure a latency target for 4k IO and
we estimate the target latency for IO of other request sizes.

To do this, we sample some data, e.g., the average latency for request sizes
4k, 8k, 16k, 32k and 64k. We then use the equation f(x) = a * x + b to fit the
data (x is the request size in KB, f(x) is the latency). Then we can use the
equation to estimate the target latency for any request size.
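
For reference, the slope is the standard simple-linear-regression
(least-squares) estimate over the valid sample buckets, which is what
throtl_calculate_line_slope() below computes:

  a = (sum(x*y) - sum(x) * mean(y)) / (sum(x^2) - sum(x) * mean(x))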

To increase the chance of sampling, we actually collect data for any IO size
less than 64k and then calculate an average latency/size. This is ok for the
line fit because the equation should also hold for average request
size/latency.

But we shouldn't sample data at just any time. If the disk is congested, the
collected data will not represent the disk's capability. Hence we only do the
sampling when block throttling is in the HIGH limit state, with the assumption
that the disk isn't congested in that state. If the assumption isn't true,
e.g., the high limit is too high, the calculated latency target will be
higher.

How well does the equation fit actual data? I collected data from 4 different
SSDs (one SATA, 3 NVMe). The error range is quite small. The big difference
between measured latency and calculated latency generally comes from 4k IO;
the biggest one has around a 30% difference, which isn't terrible as we don't
need an accurate latency target. We don't know if the line fit works for other
SSDs though. For big request sizes the error range seems big, but this
mechanism is only used to decide whether we should throttle IO (e.g., whether
a cgroup is idle). If a cgroup's average request size is big, we can simply
treat it as busy, so we don't need the mechanism there.

Hard disks are completely different: latency depends on spindle seeks instead
of request size. So this latency target feature is for SSDs only.

The patch uses the following algorithm to calculate the equation:
https://en.wikipedia.org/wiki/Simple_linear_regression

TODO: the latency sampling would be better moved to the request layer

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 191 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/blk_types.h | 2 +
2 files changed, 190 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 01b494d..a05d351 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,6 +156,20 @@ struct throtl_grp {
u64 avg_ttime;
};

+/* We measure latency for request size from 4k to 4k * ( 1 << 4) */
+#define LATENCY_BUCKET_SIZE 5
+
+struct latency_bucket {
+ u64 total_latency;
+ u64 total_size;
+ int samples;
+};
+
+struct avg_latency_bucket {
+ u64 latency;
+ u64 size;
+};
+
struct throtl_data
{
/* service tree for active throtl groups */
@@ -179,6 +193,12 @@ struct throtl_data
unsigned int scale;

u64 idle_ttime_threshold;
+
+ struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+ struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+ struct latency_bucket __percpu *latency_buckets;
+ s64 line_slope;
+ unsigned long last_calculate_time;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -288,6 +308,19 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
return ret;
}

+static int request_bucket_index(sector_t sectors)
+{
+ int i;
+
+ for (i = LATENCY_BUCKET_SIZE - 1; i >= 0; i--) {
+ if (sectors > (1 << (i + 3)))
+ break;
+ }
+ if (i == LATENCY_BUCKET_SIZE - 1)
+ return -1;
+ return i + 1;
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -1877,6 +1910,120 @@ static void blk_throtl_update_ttime(struct throtl_grp *tg)
tg->checked_last_finish_time = last_finish_time;
}

+static void throtl_calculate_line_slope(struct throtl_data *td)
+{
+ struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+ s64 sumX;
+ s64 sumY;
+ s64 sumXY;
+ s64 sumX2;
+ s64 xMean;
+ s64 yMean;
+ s64 denominator;
+ s64 slope;
+ int i, cpu;
+ int valid_lat;
+ u64 last_latency = 0;
+
+ if (!blk_queue_nonrot(td->queue))
+ return;
+ if (time_before(jiffies, td->last_calculate_time + HZ))
+ return;
+ td->last_calculate_time = jiffies;
+
+ memset(avg_latency, 0, sizeof(avg_latency));
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+ for_each_possible_cpu(cpu) {
+ struct latency_bucket *bucket;
+
+ bucket = per_cpu_ptr(td->latency_buckets, cpu);
+ tmp->total_latency += bucket[i].total_latency;
+ tmp->total_size += bucket[i].total_size;
+ tmp->samples += bucket[i].samples;
+ bucket[i].total_latency = 0;
+ bucket[i].total_size = 0;
+ bucket[i].samples = 0;
+ }
+
+ if (tmp->samples >= 32) {
+ u64 latency = tmp->total_latency;
+ u64 size = tmp->total_size;
+ int samples = tmp->samples;
+
+ tmp->total_latency = 0;
+ tmp->total_size = 0;
+ tmp->samples = 0;
+ do_div(size, samples);
+ if (size == 0 || size > (1 << (i + 12)))
+ continue;
+ avg_latency[i].size = size;
+ do_div(latency, samples);
+ if (latency == 0)
+ continue;
+ avg_latency[i].latency = latency;
+ }
+ }
+
+ valid_lat = 0;
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ if (!td->avg_buckets[i].latency && !avg_latency[i].latency)
+ continue;
+ valid_lat++;
+ if (!td->avg_buckets[i].latency) {
+ td->avg_buckets[i].latency = avg_latency[i].latency;
+ td->avg_buckets[i].size = avg_latency[i].size;
+ continue;
+ }
+ if (!avg_latency[i].latency)
+ continue;
+ /* make it smooth */
+ td->avg_buckets[i].latency = (td->avg_buckets[i].latency * 7 +
+ avg_latency[i].latency) >> 3;
+ td->avg_buckets[i].size = (td->avg_buckets[i].size * 7 +
+ avg_latency[i].size) >> 3;
+ /* filter out abnormal latency */
+ if (td->avg_buckets[i].latency <= last_latency) {
+ td->avg_buckets[i].latency = 0;
+ valid_lat--;
+ } else
+ last_latency = td->avg_buckets[i].latency;
+ }
+
+ if (valid_lat < 2)
+ return;
+
+ sumX = 0;
+ sumY = 0;
+ sumXY = 0;
+ sumX2 = 0;
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ u64 x, y;
+
+ if (td->avg_buckets[i].latency == 0)
+ continue;
+
+ x = td->avg_buckets[i].size >> 10;
+ y = td->avg_buckets[i].latency;
+ sumX += x;
+ sumY += y;
+
+ sumXY += x * y;
+ sumX2 += x * x;
+ }
+
+ xMean = sumX;
+ do_div(xMean, valid_lat);
+ yMean = sumY;
+ do_div(yMean, valid_lat);
+ denominator = sumX2 - sumX * xMean;
+
+ slope = sumXY - sumX * yMean;
+ do_div(slope, denominator);
+ td->line_slope = slope;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1901,11 +2048,14 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

spin_lock_irq(q->queue_lock);

+ throtl_calculate_line_slope(tg->td);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

bio_associate_current(bio);
bio->bi_cg_private = q;
+ bio->bi_cg_size = bio_sectors(bio);

blk_throtl_update_ttime(tg);

@@ -1992,8 +2142,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
* don't want bios to leave with the flag set. Clear the flag if
* being issued.
*/
- if (!throttled)
+ if (!throttled) {
+ if (blk_queue_nonrot(q))
+ bio->bi_cg_issue_time = ktime_get_ns();
bio->bi_opf &= ~REQ_THROTTLED;
+ }
return throttled;
}

@@ -2003,6 +2156,9 @@ void blk_throtl_bio_endio(struct bio *bio)
struct blkcg_gq *blkg;
struct throtl_grp *tg;
struct request_queue *q;
+ struct throtl_data *td;
+ u64 finish_time;
+ u64 lat;

q = bio->bi_cg_private;
if (!q)
@@ -2019,7 +2175,27 @@ void blk_throtl_bio_endio(struct bio *bio)

tg = blkg_to_tg(blkg ?: q->root_blkg);

- tg->last_finish_time = ktime_get_ns();
+ finish_time = ktime_get_ns();
+ tg->last_finish_time = finish_time;
+
+ td = tg->td;
+
+ if (bio->bi_cg_issue_time && finish_time > bio->bi_cg_issue_time) {
+ int index;
+
+ lat = finish_time - bio->bi_cg_issue_time;
+ index = request_bucket_index(bio->bi_cg_size);
+ if (index >= 0 && bio_op(bio) == REQ_OP_READ &&
+ td->limit_index == LIMIT_HIGH) {
+ struct latency_bucket *latency;
+
+ latency = get_cpu_ptr(td->latency_buckets);
+ latency[index].total_latency += lat;
+ latency[index].total_size += bio->bi_cg_size << 9;
+ latency[index].samples++;
+ put_cpu_ptr(td->latency_buckets);
+ }
+ }

end:
rcu_read_unlock();
@@ -2097,6 +2273,12 @@ int blk_throtl_init(struct request_queue *q)
td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
if (!td)
return -ENOMEM;
+ td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+ LATENCY_BUCKET_SIZE, __alignof__(u64));
+ if (!td->latency_buckets) {
+ kfree(td);
+ return -ENOMEM;
+ }

INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
@@ -2113,8 +2295,10 @@ int blk_throtl_init(struct request_queue *q)
td->idle_ttime_threshold = -1;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
- if (ret)
+ if (ret) {
+ free_percpu(td->latency_buckets);
kfree(td);
+ }
return ret;
}

@@ -2123,6 +2307,7 @@ void blk_throtl_exit(struct request_queue *q)
BUG_ON(!q->td);
throtl_shutdown_wq(q);
blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+ free_percpu(q->td->latency_buckets);
kfree(q->td);
}

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff8dd24..45bb437 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,8 @@ struct bio {
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
void *bi_cg_private;
+ u64 bi_cg_issue_time;
+ sector_t bi_cg_size;
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.9.3

2016-11-14 22:25:51

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 04/15] blk-throttle: add upgrade logic for LIMIT_HIGH state

When the queue is in LIMIT_HIGH state and all cgroups with a high limit
cross their bps/iops limits, we upgrade the queue's state to LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If a child has a lower high
limit than its parent, the parent's high limit is effectively meaningless:
once the child's bps/iops cross the child's high limit, we can upgrade the
queue state. In the other case the child has a higher high limit than its
parent, so the child's high limit is meaningless: as long as the parent's
bps/iops cross the parent's high limit, we can upgrade the queue state.
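
As a rough, standalone illustration of the upgrade rule (the flat group
list, field names and numbers below are invented; the real implementation
in the diff walks the blkcg hierarchy and its service queues instead):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: the queue may leave LIMIT_HIGH only when every group that
 * has a high limit configured is already being held back by it, i.e. has
 * bios queued behind that limit. */
struct toy_group {
	bool has_high_limit;	/* io.high configured for this group */
	int nr_queued;		/* bios currently queued by throttling */
};

static bool toy_can_upgrade(const struct toy_group *grps, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (grps[i].has_high_limit && !grps[i].nr_queued)
			return false;	/* someone still runs below io.high */
	}
	return true;
}

int main(void)
{
	struct toy_group g[] = {
		{ .has_high_limit = true, .nr_queued = 4 },
		{ .has_high_limit = true, .nr_queued = 0 },	/* below high */
		{ .has_high_limit = false, .nr_queued = 0 },
	};

	printf("upgrade to LIMIT_MAX? %s\n",
	       toy_can_upgrade(g, 3) ? "yes" : "no");
	return 0;
}

If the second group either drops its high limit or also ends up with IO
queued behind it, the same check returns true and the queue switches to
LIMIT_MAX.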

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ec53671..34a75e5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -456,6 +456,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
td->limit_valid[LIMIT_HIGH] = false;
}

+static void throtl_upgrade_state(struct throtl_data *td);
static void throtl_pd_offline(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -467,9 +468,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)

blk_throtl_update_valid_limit(tg->td);

- if (tg->td->limit_index == LIMIT_HIGH &&
- !tg->td->limit_valid[LIMIT_HIGH])
- tg->td->limit_index = LIMIT_MAX;
+ if (!tg->td->limit_valid[tg->td->limit_index])
+ throtl_upgrade_state(tg->td);
}

static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
return nr_disp;
}

+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg);
/**
* throtl_pending_timer_fn - timer function for service_queue->pending_timer
* @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;

spin_lock_irq(q->queue_lock);
+ if (throtl_can_upgrade(td, NULL))
+ throtl_upgrade_state(td);
+
again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1505,6 +1510,91 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static bool throtl_upgrade_check_one(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+ bool read_limit, write_limit;
+
+ /* if cgroup reaches high/max limit, it's ok to next limit */
+ read_limit = tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1;
+ write_limit = tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1;
+ if (read_limit && sq->nr_queued[READ] &&
+ (!write_limit || sq->nr_queued[WRITE]))
+ return true;
+ if (write_limit && sq->nr_queued[WRITE] &&
+ (!read_limit || sq->nr_queued[READ]))
+ return true;
+ return false;
+}
+
+static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg)
+{
+ if (throtl_upgrade_check_one(tg))
+ return true;
+ while (true) {
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ return false;
+ if (throtl_upgrade_check_one(tg))
+ return true;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ }
+ return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ if (td->limit_index != LIMIT_HIGH)
+ return false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg == this_tg)
+ continue;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ continue;
+ if (!throtl_upgrade_check_hierarchy(tg)) {
+ rcu_read_unlock();
+ return false;
+ }
+ }
+ rcu_read_unlock();
+ return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ td->limit_index = LIMIT_MAX;
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ tg->disptime = jiffies - 1;
+ throtl_select_dispatch(sq);
+ throtl_schedule_next_dispatch(sq, false);
+ }
+ rcu_read_unlock();
+ throtl_select_dispatch(&td->service_queue);
+ throtl_schedule_next_dispatch(&td->service_queue, false);
+ queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1527,14 +1617,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

sq = &tg->service_queue;

+again:
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
- if (!tg_may_dispatch(tg, bio, NULL))
+ if (!tg_may_dispatch(tg, bio, NULL)) {
+ if (throtl_can_upgrade(tg->td, tg)) {
+ throtl_upgrade_state(tg->td);
+ goto again;
+ }
break;
+ }

/* within limits, let's charge and dispatch directly */
throtl_charge_bio(tg, bio);
--
2.9.3

2016-11-14 22:25:55

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 06/15] blk-throttle: make sure expire time isn't too big

A cgroup can be throttled to a limit, but when all cgroups cross their high
limit the queue enters a higher state, so the group should then be throttled
to a higher limit. It's possible the cgroup is sleeping because of the
throttling while the other cgroups no longer dispatch any IO. In that case,
nobody can trigger the current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wake up the cgroup when the other cgroups
are idle, or make sure this cgroup doesn't sleep too long. Setting up a
timer means we must change the timer very frequently. This patch chooses the
latter. Capping the cgroup's sleep time doesn't change its bps/iops, but can
make it wake up more frequently, which isn't a big issue because
throtl_slice * 8 is already quite long.
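
The clamp itself is tiny; a standalone sketch of the idea (plain integers
stand in for jiffies, and a simple comparison stands in for time_after(),
so wraparound is ignored here):

#include <stdio.h>

#define TOY_HZ		1000UL		/* pretend 1 jiffy == 1 ms */
#define TOY_SLICE	(TOY_HZ / 10)	/* 100 ms */

/* never let a throttled group sleep more than 8 slices from now */
static unsigned long toy_clamp_expire(unsigned long now, unsigned long expires)
{
	unsigned long max_expire = now + 8 * TOY_SLICE;

	return expires > max_expire ? max_expire : expires;
}

int main(void)
{
	unsigned long now = 5000;

	printf("short sleep kept:   %lu\n", toy_clamp_expire(now, now + 300));
	printf("long sleep clamped: %lu\n", toy_clamp_expire(now, now + 60000));
	return 0;
}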

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d177252..eff3120 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -589,6 +589,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
+ unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+ if (time_after(expires, max_expire))
+ expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
expires - jiffies, jiffies);
--
2.9.3

2016-11-14 22:26:01

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 07/15] blk-throttle: make throtl_slice tunable

throtl_slice is important for blk-throttling. A lot of things depend on it,
for example throughput measurement. It has a 100ms default value, which is
not appropriate for all disks. For example, for an SSD we might use a
smaller value to make the throughput smoother. This patch makes it tunable.
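
For reference, a minimal userspace sketch of how the new knob could be used
once this patch is applied (the sda device name and the 20 ms value are
arbitrary examples; the value is interpreted in milliseconds, as the store
function below shows):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/throttling_slice";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "20\n");	/* interpreted as milliseconds */
	fclose(f);
	return 0;
}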

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-sysfs.c | 11 ++++++++
block/blk-throttle.c | 77 +++++++++++++++++++++++++++++++++++++---------------
block/blk.h | 3 ++
3 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c..3e284e4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -526,6 +526,14 @@ static struct queue_sysfs_entry queue_dax_entry = {
.show = queue_dax_show,
};

+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_slice_entry = {
+ .attr = {.name = "throttling_slice", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_slice_show,
+ .store = blk_throtl_slice_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -553,6 +561,9 @@ static struct attribute *default_attrs[] = {
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_dax_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+ &throtl_slice_entry.attr,
+#endif
NULL,
};

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index eff3120..e85b2b6 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
static int throtl_quantum = 32;

/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)

static struct blkcg_policy blkcg_policy_throtl;

@@ -158,6 +159,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];

+ unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -589,7 +592,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
- unsigned long max_expire = jiffies + 8 * throtl_slice;
+ unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;

if (time_after(expires, max_expire))
expires = max_expire;
@@ -650,7 +653,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;

- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -662,7 +665,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -672,13 +675,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
}

static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
throtl_log(&tg->service_queue,
"[%c] extend slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -718,19 +721,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
* is bad because it does not allow new slice to start.
*/

- throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+ throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);

time_elapsed = jiffies - tg->slice_start[rw];

- nr_slices = time_elapsed / throtl_slice;
+ nr_slices = time_elapsed / tg->td->throtl_slice;

if (!nr_slices)
return;
- tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+ io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+ HZ;

if (!bytes_trim && !io_trim)
return;
@@ -745,7 +749,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
else
tg->io_disp[rw] = 0;

- tg->slice_start[rw] += nr_slices * throtl_slice;
+ tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;

throtl_log(&tg->service_queue,
"[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -765,9 +769,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

/*
* jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -814,9 +818,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
@@ -881,8 +885,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
throtl_start_new_slice(tg, rw);
else {
- if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
- throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+ if (time_before(tg->slice_end[rw],
+ jiffies + tg->td->throtl_slice))
+ throtl_extend_slice(tg, rw,
+ jiffies + tg->td->throtl_slice);
}

if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1632,7 +1638,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_HIGH)
return false;

- if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+ if (time_before(jiffies, td->high_downgrade_time + td->throtl_slice))
return false;

rcu_read_lock();
@@ -1689,8 +1695,9 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
*/
- if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
- time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
+ time_after_eq(now, tg_last_high_overflow_time(tg) +
+ td->throtl_slice))
return true;
return false;
}
@@ -1723,10 +1730,11 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
return;
if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
return;
- if (time_after(tg->last_check_time + throtl_slice, now))
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
return;

- if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ if (time_before(now, tg_last_high_overflow_time(tg) +
+ tg->td->throtl_slice))
return;

elapsed_time = now - tg->last_check_time;
@@ -1965,6 +1973,7 @@ int blk_throtl_init(struct request_queue *q)

q->td = td;
td->queue = q;
+ td->throtl_slice = DFL_THROTL_SLICE;

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
@@ -1985,6 +1994,30 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
}

+ssize_t blk_throtl_slice_show(struct request_queue *q, char *page)
+{
+ if (!q->td)
+ return -EINVAL;
+ return sprintf(page, "%ums\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_slice_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+ unsigned long t;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ t = msecs_to_jiffies(v);
+ if (t == 0 || t > MAX_THROTL_SLICE)
+ return -EINVAL;
+ q->td->throtl_slice = t;
+ return count;
+}
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 74444c4..39c14dd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -289,6 +289,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
extern void blk_throtl_drain(struct request_queue *q);
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_slice_store(struct request_queue *q,
+ const char *page, size_t count);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
--
2.9.3

2016-11-14 22:25:58

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 01/15] blk-throttle: prepare support multiple limits

We are going to support a high limit in addition to the max limit, so each
cgroup will have 2 limits after that. This patch prepares for the
multiple-limits change.
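
A condensed, standalone sketch of the resulting data layout (toy types
only; in the real code the limit arrays live in struct throtl_grp and the
active index lives in struct throtl_data):

#include <stdint.h>
#include <stdio.h>

enum { TOY_READ, TOY_WRITE };
enum { LIMIT_MAX, LIMIT_CNT };	/* LIMIT_HIGH is added by a later patch */

struct toy_tg {
	uint64_t bps[2][LIMIT_CNT];	/* [rw][limit index] */
	unsigned int limit_index;	/* which limit is currently enforced */
};

static uint64_t toy_bps_limit(struct toy_tg *tg, int rw)
{
	return tg->bps[rw][tg->limit_index];
}

int main(void)
{
	struct toy_tg tg = { .limit_index = LIMIT_MAX };

	tg.bps[TOY_READ][LIMIT_MAX] = 100 << 20;	/* 100 MB/s */
	tg.bps[TOY_WRITE][LIMIT_MAX] = UINT64_MAX;	/* unlimited */

	printf("read limit:  %llu\n",
	       (unsigned long long)toy_bps_limit(&tg, TOY_READ));
	printf("write limit: %llu\n",
	       (unsigned long long)toy_bps_limit(&tg, TOY_WRITE));
	return 0;
}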

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 110 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 69 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a3ea826..925aa1ed 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {

#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

+enum {
+ LIMIT_MAX,
+ LIMIT_CNT,
+};
+
struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];

/* bytes per second rate limits */
- uint64_t bps[2];
+ uint64_t bps[2][LIMIT_CNT];

/* IOPS limits */
- unsigned int iops[2];
+ unsigned int iops[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data

/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+ unsigned int limit_index;
+ bool limit_valid[LIMIT_CNT];
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}

+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->iops[rw][tg->td->limit_index];
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
{
struct throtl_grp *tg;
- int rw;
+ int rw, index;

tg = kzalloc_node(sizeof(*tg), gfp, node);
if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = -1;
- tg->bps[WRITE] = -1;
- tg->iops[READ] = -1;
- tg->iops[WRITE] = -1;
+ for (rw = READ; rw <= WRITE; rw++) {
+ for (index = 0; index < LIMIT_CNT; index++) {
+ tg->bps[rw][index] = -1;
+ tg->iops[rw][index] = -1;
+ }
+ }

return &tg->pd;
}
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
static void tg_update_has_rules(struct throtl_grp *tg)
{
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+ struct throtl_data *td = tg->td;
int rw;

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+ (td->limit_valid[td->limit_index] &&
+ (tg_bps_limit(tg, rw) != -1 ||
+ tg_iops_limit(tg, rw) != -1));
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)

if (!nr_slices)
return;
- tmp = tg->bps[rw] * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+ io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;

if (!bytes_trim && !io_trim)
return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
* have been trimmed.
*/

- tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+ tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);

if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}

/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+ jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;

if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);

- tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+ tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;

@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+ jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));

if (!jiffy_wait)
jiffy_wait = 1;
@@ -771,7 +793,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+ if (tg_bps_limit(tg, rw) == -1 && tg_iops_limit(tg, rw) == -1) {
if (wait)
*wait = 0;
return true;
@@ -1148,8 +1170,8 @@ static void tg_conf_updated(struct throtl_grp *tg)

throtl_log(&tg->service_queue,
"limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
- tg->bps[READ], tg->bps[WRITE],
- tg->iops[READ], tg->iops[WRITE]);
+ tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+ tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));

/*
* Update has_rules[] flags for the updated tg's subtree. A tg is
@@ -1226,25 +1248,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
static struct cftype throtl_legacy_files[] = {
{
.name = "throttle.read_bps_device",
- .private = offsetof(struct throtl_grp, bps[READ]),
+ .private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.write_bps_device",
- .private = offsetof(struct throtl_grp, bps[WRITE]),
+ .private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.read_iops_device",
- .private = offsetof(struct throtl_grp, iops[READ]),
+ .private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
{
.name = "throttle.write_iops_device",
- .private = offsetof(struct throtl_grp, iops[WRITE]),
+ .private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
@@ -1270,18 +1292,22 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
- tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+ if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
+ tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
return 0;

- if (tg->bps[READ] != -1)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != -1)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != -1)
- snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != -1)
- snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+ if (tg->bps[READ][LIMIT_MAX] != -1)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+ tg->bps[READ][LIMIT_MAX]);
+ if (tg->bps[WRITE][LIMIT_MAX] != -1)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+ tg->bps[WRITE][LIMIT_MAX]);
+ if (tg->iops[READ][LIMIT_MAX] != -1)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u",
+ tg->iops[READ][LIMIT_MAX]);
+ if (tg->iops[WRITE][LIMIT_MAX] != -1)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u",
+ tg->iops[WRITE][LIMIT_MAX]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1336,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ];
- v[1] = tg->bps[WRITE];
- v[2] = tg->iops[READ];
- v[3] = tg->iops[WRITE];
+ v[0] = tg->bps[READ][LIMIT_MAX];
+ v[1] = tg->bps[WRITE][LIMIT_MAX];
+ v[2] = tg->iops[READ][LIMIT_MAX];
+ v[3] = tg->iops[WRITE][LIMIT_MAX];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1350,10 +1376,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ] = v[0];
- tg->bps[WRITE] = v[1];
- tg->iops[READ] = v[2];
- tg->iops[WRITE] = v[3];
+ tg->bps[READ][LIMIT_MAX] = v[0];
+ tg->bps[WRITE][LIMIT_MAX] = v[1];
+ tg->iops[READ][LIMIT_MAX] = v[2];
+ tg->iops[WRITE][LIMIT_MAX] = v[3];

tg_conf_updated(tg);
ret = 0;
@@ -1451,8 +1477,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* out-of-limit, queue to @tg */
throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
rw == READ ? 'R' : 'W',
- tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
- tg->io_disp[rw], tg->iops[rw],
+ tg->bytes_disp[rw], bio->bi_iter.bi_size,
+ tg_bps_limit(tg, rw),
+ tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

bio_associate_current(bio);
@@ -1563,6 +1590,7 @@ int blk_throtl_init(struct request_queue *q)
q->td = td;
td->queue = q;

+ td->limit_valid[LIMIT_MAX] = true;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-11-14 22:25:47

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold

Add a sysfs interface to configure the idle think-time threshold.
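
A minimal userspace sketch of using the new knob (device name and the
1000 us value are arbitrary examples; the value is taken in microseconds,
judging by the show/store pair below):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/throttling_idle_threshold";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, "1000\n", 5) != 5)	/* 1000 us = 1 ms think time */
		perror("write");
	close(fd);
	return 0;
}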

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-sysfs.c | 7 +++++++
block/blk-throttle.c | 25 +++++++++++++++++++++++++
block/blk.h | 4 ++++
3 files changed, 36 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e284e4..f15aeed 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -532,6 +532,12 @@ static struct queue_sysfs_entry throtl_slice_entry = {
.show = blk_throtl_slice_show,
.store = blk_throtl_slice_store,
};
+
+static struct queue_sysfs_entry throtl_idle_threshold_entry = {
+ .attr = {.name = "throttling_idle_threshold", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_idle_threshold_show,
+ .store = blk_throtl_idle_threshold_store,
+};
#endif

static struct attribute *default_attrs[] = {
@@ -563,6 +569,7 @@ static struct attribute *default_attrs[] = {
&queue_dax_entry.attr,
#ifdef CONFIG_BLK_DEV_THROTTLING
&throtl_slice_entry.attr,
+ &throtl_idle_threshold_entry.attr,
#endif
NULL,
};
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cb5fd85..e403e88 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2139,6 +2139,31 @@ ssize_t blk_throtl_slice_store(struct request_queue *q,
return count;
}

+ssize_t blk_throtl_idle_threshold_show(struct request_queue *q, char *page)
+{
+ u64 threshold;
+
+ if (!q->td)
+ return -EINVAL;
+ threshold = q->td->idle_ttime_threshold;
+ do_div(threshold, 1000);
+ return sprintf(page, "%lluus\n", threshold);
+}
+
+ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ if (v == 0)
+ return -EINVAL;
+ q->td->idle_ttime_threshold = v * 1000;
+ return count;
+}
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index b433f35..2ebde12 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,10 @@ extern void blk_throtl_exit(struct request_queue *q);
extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_slice_store(struct request_queue *q,
const char *page, size_t count);
+extern ssize_t blk_throtl_idle_threshold_show(struct request_queue *q,
+ char *page);
+extern ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+ const char *page, size_t count);
extern void blk_throtl_bio_endio(struct bio *bio);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
--
2.9.3

2016-11-14 22:46:34

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On 11/14/2016 02:22 PM, Shaohua Li wrote:
> The background is we don't have an ioscheduler for blk-mq yet, so we can't
> prioritize processes/cgroups. This patch set tries to add basic arbitration
> between cgroups with blk-throttle. It adds a new limit io.high for
> blk-throttle. It's only for cgroup2.

Hello Shaohua,

My understanding of this work is that a significant part of it will have
to be reverted once blk-mq supports I/O scheduling, e.g. the code for
detecting whether the I/O submitter is idle. Shouldn't this kind of
infrastructure be added after support has been added in blk-mq for I/O
scheduling?

Thanks,

Bart.

2016-11-14 23:41:40

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All warnings (new ones prefixed by >>):

In file included from ./arch/xtensa/include/generated/asm/div64.h:1:0,
from include/linux/kernel.h:142,
from include/linux/list.h:8,
from include/linux/module.h:9,
from block/blk-throttle.c:7:
block/blk-throttle.c: In function 'throtl_calculate_line_slope':
include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
>> block/blk-throttle.c:2017:2: note: in expansion of macro 'do_div'
do_div(xMean, valid_lat);
^
include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
block/blk-throttle.c:2019:2: note: in expansion of macro 'do_div'
do_div(yMean, valid_lat);
^
include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
block/blk-throttle.c:2023:2: note: in expansion of macro 'do_div'
do_div(slope, denominator);
^

vim +/do_div +2017 block/blk-throttle.c

2001 for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
2002 u64 x, y;
2003
2004 if (td->avg_buckets[i].latency == 0)
2005 continue;
2006
2007 x = td->avg_buckets[i].size >> 10;
2008 y = td->avg_buckets[i].latency;
2009 sumX += x;
2010 sumY += y;
2011
2012 sumXY += x * y;
2013 sumX2 += x * x;
2014 }
2015
2016 xMean = sumX;
> 2017 do_div(xMean, valid_lat);
2018 yMean = sumY;
2019 do_div(yMean, valid_lat);
2020 denominator = sumX2 - sumX * xMean;
2021
2022 slope = sumXY - sumX * yMean;
2023 do_div(slope, denominator);
2024 td->line_slope = slope;
2025 }

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2016-11-15 00:05:53

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On Mon, Nov 14, 2016 at 02:46:22PM -0800, Bart Van Assche wrote:
> On 11/14/2016 02:22 PM, Shaohua Li wrote:
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups. This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
>
> Hello Shaohua,
>
> My understanding of this work is that a significant part of it will have to
> be reverted once blk-mq supports I/O scheduling, e.g. the code for detecting
> whether the I/O submitter is idle. Shouldn't this kind of infrastructure be
> added after support has been added in blk-mq for I/O scheduling?

Sure, if we have a CFQ-like io scheduler for blk-mq, this is largely not
required. But we don't have one yet and nothing is floating around either.
The conservative throttling is relatively easy to implement and achieves a
similar goal. The throttling could still be useful even with an io
scheduler, as throttling is faster if we are talking about a CFQ-like
scheduler. I don't think this should be blocked waiting for I/O scheduling.
There was a long discussion in the last post, and we agreed that throttling
and an io scheduler aren't mutually exclusive.
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

Thanks,
Shaohua

2016-11-15 00:41:52

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On 11/14/2016 04:05 PM, Shaohua Li wrote:
> On Mon, Nov 14, 2016 at 02:46:22PM -0800, Bart Van Assche wrote:
>> On 11/14/2016 02:22 PM, Shaohua Li wrote:
>>> The background is we don't have an ioscheduler for blk-mq yet, so we can't
>>> prioritize processes/cgroups. This patch set tries to add basic arbitration
>>> between cgroups with blk-throttle. It adds a new limit io.high for
>>> blk-throttle. It's only for cgroup2.
>>
>> My understanding of this work is that a significant part of it will have to
>> be reverted once blk-mq supports I/O scheduling, e.g. the code for detecting
>> whether the I/O submitter is idle. Shouldn't this kind of infrastructure be
>> added after support has been added in blk-mq for I/O scheduling?
>
> Sure, if we have a CFQ-like io scheduler for blk-mq, this is largly not
> required. But we don't have one yet and nothing is floating around either. The
> conservative throttling is relatively easy to implement and achive similar
> goal. The throttling could be still useful even with ioscheduler as throttling
> is faster if we are talking about CFQ-like scheduler. I don't think this should
> be blocked to wait for I/O scheduling. There was a long discussion in last
> post, and we agreed the throttling and io scheduler aren't mutually exclusive.
> http://marc.info/?l=linux-kernel&m=147552964708965&w=2

Hello Shaohua,

Thank you for pointing me to the discussion thread about v3 of this
patch series. Did I see correctly that one of the conclusions was that
for users this mechanism is hard to configure? Are we providing a good
service to Linux users by providing a mechanism that is hard to configure?

Thanks,

Bart.

2016-11-15 00:49:49

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On Mon, Nov 14, 2016 at 04:41:33PM -0800, Bart Van Assche wrote:
> On 11/14/2016 04:05 PM, Shaohua Li wrote:
> > On Mon, Nov 14, 2016 at 02:46:22PM -0800, Bart Van Assche wrote:
> > > On 11/14/2016 02:22 PM, Shaohua Li wrote:
> > > > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > > > prioritize processes/cgroups. This patch set tries to add basic arbitration
> > > > between cgroups with blk-throttle. It adds a new limit io.high for
> > > > blk-throttle. It's only for cgroup2.
> > >
> > > My understanding of this work is that a significant part of it will have to
> > > be reverted once blk-mq supports I/O scheduling, e.g. the code for detecting
> > > whether the I/O submitter is idle. Shouldn't this kind of infrastructure be
> > > added after support has been added in blk-mq for I/O scheduling?
> >
> > Sure, if we have a CFQ-like io scheduler for blk-mq, this is largly not
> > required. But we don't have one yet and nothing is floating around either. The
> > conservative throttling is relatively easy to implement and achive similar
> > goal. The throttling could be still useful even with ioscheduler as throttling
> > is faster if we are talking about CFQ-like scheduler. I don't think this should
> > be blocked to wait for I/O scheduling. There was a long discussion in last
> > post, and we agreed the throttling and io scheduler aren't mutually exclusive.
> > http://marc.info/?l=linux-kernel&m=147552964708965&w=2
>
> Hello Shaohua,
>
> Thank you for pointing me to the discussion thread about v3 of this patch
> series. Did I see correctly that one of the conclusions was that for users
> this mechanism is hard to configure? Are we providing a good service to
> Linux users by providing a mechanism that is hard to configure?

Yes, this is a kind of low level knob and is expected to be configured by
experienced users. This sucks, but we really don't have good solutions. If
anybody has better ideas, I'm happy to try.

Thanks,
Shaohua

2016-11-15 01:29:32

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On Mon, Nov 14, 2016 at 05:18:28PM -0800, Bart Van Assche wrote:
> On 11/14/2016 04:49 PM, Shaohua Li wrote:
> > On Mon, Nov 14, 2016 at 04:41:33PM -0800, Bart Van Assche wrote:
> > > Thank you for pointing me to the discussion thread about v3 of this patch
> > > series. Did I see correctly that one of the conclusions was that for users
> > > this mechanism is hard to configure? Are we providing a good service to
> > > Linux users by providing a mechanism that is hard to configure?
> >
> > Yes, this is a kind of low level knob and is expected to be configured by
> > experienced users. This sucks, but we really don't have good solutions. If
> > anybody has better ideas, I'm happy to try.
>
> Hello Shaohua,
>
> An approach I have been considering to analyze further is as follows:
> * For rotational media use an algorithm like BFQ to preserve sequentiality
> of workloads and to guarantee fairness. This means that one application
> submits I/O per time slot.
> * For SSDs, multiplex I/O from multiple applications during a single time
> slot to keep the queue depth high. Throttle I/O if needed to realize
> fairness.
>
> Implementing this approach requires an approach for estimating I/O cost
> based on the request characteristics (offset and size) and the device type
> (rotational or SSD). This may require measuring the time that was needed to
> process past requests and to use that information in a learning algorithm.
>
> Unless someone can convince me of the opposite I think that coming up with
> an algorithm for estimating I/O cost is essential to guarantee I/O fairness
> without requesting users to perform complicated parameter configurations.

That's what I tried before:
http://marc.info/?l=linux-kernel&m=145617863208940&w=2

Unfortunately, estimating I/O cost and disk capability is very hard, if not
impossible. People objected to using bandwidth or iops to estimate I/O cost.

Thanks,
Shaohua

2016-11-15 01:34:17

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On 11/14/2016 04:49 PM, Shaohua Li wrote:
> On Mon, Nov 14, 2016 at 04:41:33PM -0800, Bart Van Assche wrote:
>> Thank you for pointing me to the discussion thread about v3 of this patch
>> series. Did I see correctly that one of the conclusions was that for users
>> this mechanism is hard to configure? Are we providing a good service to
>> Linux users by providing a mechanism that is hard to configure?
>
> Yes, this is a kind of low level knob and is expected to be configured by
> experienced users. This sucks, but we really don't have good solutions. If
> anybody has better ideas, I'm happy to try.

Hello Shaohua,

An approach I have been considering to analyze further is as follows:
* For rotational media use an algorithm like BFQ to preserve
sequentiality of workloads and to guarantee fairness. This means that
one application submits I/O per time slot.
* For SSDs, multiplex I/O from multiple applications during a single
time slot to keep the queue depth high. Throttle I/O if needed to
realize fairness.

Implementing this approach requires a method for estimating I/O cost
based on the request characteristics (offset and size) and the device
type (rotational or SSD). This may require measuring the time that was
needed to process past requests and using that information in a
learning algorithm.

Unless someone can convince me of the opposite I think that coming up
with an algorithm for estimating I/O cost is essential to guarantee I/O
fairness without requesting users to perform complicated parameter
configurations.
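
For illustration only (not from the patch set), here is a standalone
least-squares fit of a linear cost model, cost = base + slope * size, over
invented (size, latency) samples; it is the same idea as the
latency-vs-size slope computed in patch 13:

#include <stdio.h>

int main(void)
{
	/* (size in KB, observed latency in us) samples, invented numbers */
	static const double samples[][2] = {
		{ 4, 120 }, { 8, 130 }, { 64, 210 }, { 256, 520 },
	};
	const int n = 4;
	double sx = 0, sy = 0, sxy = 0, sx2 = 0, slope, base;
	int i;

	for (i = 0; i < n; i++) {
		sx  += samples[i][0];
		sy  += samples[i][1];
		sxy += samples[i][0] * samples[i][1];
		sx2 += samples[i][0] * samples[i][0];
	}

	/* ordinary least-squares slope and intercept */
	slope = (n * sxy - sx * sy) / (n * sx2 - sx * sx);
	base  = (sy - slope * sx) / n;

	/* estimated cost of a 128 KB request */
	printf("cost(128 KB) ~= %.0f us\n", base + slope * 128);
	return 0;
}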

Bart.


2016-11-15 03:58:15

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

Hi Shaohua,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc5]
[cannot apply to block/for-next next-20161114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-high-limit/20161115-063105
config: openrisc-allyesconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=openrisc

All warnings (new ones prefixed by >>):

block/blk-throttle.c: In function 'throtl_calculate_line_slope':
>> block/blk-throttle.c:2017:2: warning: comparison of distinct pointer types lacks a cast
block/blk-throttle.c:2019:2: warning: comparison of distinct pointer types lacks a cast
block/blk-throttle.c:2023:2: warning: comparison of distinct pointer types lacks a cast

vim +2017 block/blk-throttle.c

2001 for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
2002 u64 x, y;
2003
2004 if (td->avg_buckets[i].latency == 0)
2005 continue;
2006
2007 x = td->avg_buckets[i].size >> 10;
2008 y = td->avg_buckets[i].latency;
2009 sumX += x;
2010 sumY += y;
2011
2012 sumXY += x * y;
2013 sumX2 += x * x;
2014 }
2015
2016 xMean = sumX;
> 2017 do_div(xMean, valid_lat);
2018 yMean = sumY;
2019 do_div(yMean, valid_lat);
2020 denominator = sumX2 - sumX * xMean;
2021
2022 slope = sumXY - sumX * yMean;
2023 do_div(slope, denominator);
2024 td->line_slope = slope;
2025 }

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2016-11-15 19:53:52

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On 11/14/2016 05:28 PM, Shaohua Li wrote:
> On Mon, Nov 14, 2016 at 05:18:28PM -0800, Bart Van Assche wrote:
>> Unless someone can convince me of the opposite I think that coming up with
>> an algorithm for estimating I/O cost is essential to guarantee I/O fairness
>> without requesting users to perform complicated parameter configurations.
>
> That's what I tried before:
> http://marc.info/?l=linux-kernel&m=145617863208940&w=2

That URL refers to v2 of this patch series. Sorry but I have not found
the I/O cost estimation algorithm I was referring to in v2 of this patch
series.

Bart.

2016-11-15 21:31:37

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 00/15] blk-throttle: add .high limit

On Tue, Nov 15, 2016 at 11:53:39AM -0800, Bart Van Assche wrote:
> On 11/14/2016 05:28 PM, Shaohua Li wrote:
> > On Mon, Nov 14, 2016 at 05:18:28PM -0800, Bart Van Assche wrote:
> > > Unless someone can convince me of the opposite I think that coming up with
> > > an algorithm for estimating I/O cost is essential to guarantee I/O fairness
> > > without requesting users to perform complicated parameter configurations.
> >
> > That's what I tried before:
> > http://marc.info/?l=linux-kernel&m=145617863208940&w=2
>
> That URL refers to v2 of this patch series. Sorry but I have not found the
> I/O cost estimation algorithm I was referring to in v2 of this patch series.

It's a v2, but not the v2 of this series. There is no algorithm there to
estimate I/O cost; that one just uses IOPS or bandwidth to determine I/O
cost.

Thanks,
Shaohua

2016-11-22 20:02:57

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 02/15] blk-throttle: add .high interface

Hello, Shaohua.

Sorry about the delay.

On Mon, Nov 14, 2016 at 02:22:09PM -0800, Shaohua Li wrote:
> @@ -1376,11 +1414,37 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
> goto out_finish;
> }
>
> - tg->bps[READ][LIMIT_MAX] = v[0];
> - tg->bps[WRITE][LIMIT_MAX] = v[1];
> - tg->iops[READ][LIMIT_MAX] = v[2];
> - tg->iops[WRITE][LIMIT_MAX] = v[3];
> -
> + if (index == LIMIT_MAX) {
> + if ((v[0] < tg->bps[READ][LIMIT_HIGH] &&
> + tg->bps[READ][LIMIT_HIGH] != -1) ||
> + (v[1] < tg->bps[WRITE][LIMIT_HIGH] &&
> + tg->bps[WRITE][LIMIT_HIGH] != -1) ||
> + (v[2] < tg->iops[READ][LIMIT_HIGH] &&
> + tg->iops[READ][LIMIT_HIGH] != -1) ||
> + (v[3] < tg->iops[WRITE][LIMIT_HIGH] &&
> + tg->iops[WRITE][LIMIT_HIGH] != -1)) {
> + ret = -EINVAL;
> + goto out_finish;

Is this necessary? memcg doesn't put restrictions on input but just
enforces whatever is configured. I think it'd be better to follow the
same model here too. Hmm... is this because throtl will be able to
choose either all high or max limits per cgroup?

And this isn't from your patches but can we please switch to
UINT64_MAX instead of -1?

> + }
> + } else if (index == LIMIT_HIGH) {
> + if ((v[0] > tg->bps[READ][LIMIT_MAX] && v[0] != -1) ||
> + (v[1] > tg->bps[WRITE][LIMIT_MAX] && v[1] != -1) ||
> + (v[2] > tg->iops[READ][LIMIT_MAX] && v[2] != -1) ||
> + (v[3] > tg->iops[WRITE][LIMIT_MAX] && v[3] != -1)) {

Ditto here.

> @@ -1412,6 +1484,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
> .pd_alloc_fn = throtl_pd_alloc,
> .pd_init_fn = throtl_pd_init,
> .pd_online_fn = throtl_pd_online,
> + .pd_offline_fn = throtl_pd_offline,
> .pd_free_fn = throtl_pd_free,
> };

I haven't read the whole thing yet but this looks a bit suspicious. A
css going offline indicates that the destruction of the css started.
I don't get why that'd reset high limits. There can be a lot of async
IOs after offline.

Thanks.

--
tejun

2016-11-22 20:16:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 03/15] blk-throttle: configure bps/iops limit for cgroup in high limit

On Mon, Nov 14, 2016 at 02:22:10PM -0800, Shaohua Li wrote:
> each queue will have a state machine. Initially queue is in LIMIT_HIGH
> state, which means all cgroups will be throttled according to their high
> limit. After all cgroups with high limit cross the limit, the queue state
> gets upgraded to LIMIT_MAX state.
> cgroups without high limit will use max limit for their high limit.

Haven't looked at the actual mechanism yet, so please correct me if I'm
getting it wrong, but at least the above explanation reads as incorrect.
Shouldn't max limits be applied regardless of whether other cgroups are
above their high limits or not?

Thanks.

--
tejun

2016-11-22 21:21:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 05/15] blk-throttle: add downgrade logic

Hello,

On Mon, Nov 14, 2016 at 02:22:12PM -0800, Shaohua Li wrote:
> +static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
> +{
> + struct throtl_service_queue *parent_sq;
> + struct throtl_grp *parent = tg;
> + unsigned long ret = __tg_last_high_overflow_time(tg);
> +
> + while (true) {
> + parent_sq = parent->service_queue.parent_sq;
> + parent = sq_to_tg(parent_sq);
> + if (!parent)
> + break;
> + if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
> + parent->bps[READ][LIMIT_HIGH] >=
> + tg->bps[READ][LIMIT_HIGH]) ||
> + (parent->bps[READ][LIMIT_HIGH] == -1 &&
> + parent->bps[READ][LIMIT_MAX] >=
> + tg->bps[READ][LIMIT_HIGH])) &&
> + ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
> + parent->bps[WRITE][LIMIT_HIGH] >=
> + tg->bps[WRITE][LIMIT_HIGH]) ||
> + (parent->bps[WRITE][LIMIT_HIGH] == -1 &&
> + parent->bps[WRITE][LIMIT_MAX] >=
> + tg->bps[WRITE][LIMIT_HIGH])) &&
> + ((parent->iops[READ][LIMIT_HIGH] != -1 &&
> + parent->iops[READ][LIMIT_HIGH] >=
> + tg->iops[READ][LIMIT_HIGH]) ||
> + (parent->iops[READ][LIMIT_HIGH] == -1 &&
> + parent->iops[READ][LIMIT_MAX] >=
> + tg->iops[READ][LIMIT_HIGH])) &&
> + ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
> + parent->iops[WRITE][LIMIT_HIGH] >=
> + tg->iops[WRITE][LIMIT_HIGH]) ||
> + (parent->iops[WRITE][LIMIT_HIGH] == -1 &&
> + parent->iops[WRITE][LIMIT_MAX] >=
> + tg->iops[WRITE][LIMIT_HIGH])))
> + break;
> + if (time_after(__tg_last_high_overflow_time(parent), ret))
> + ret = __tg_last_high_overflow_time(parent);
> + }
> + return ret;
> +}

Heh, I'm not really following the upgrade/downgrade logic. I'm having
trouble understanding two things.

1. A cgroup and its high and max limits don't have much to do with
other cgroups and their limits. I don't get how the choice between
high and max limits can be a td-wide state.

2. Comparing parent's and child's limits and saying that either can be
ignored because one is higher than the other isn't correct. A
parent's limit doesn't apply to each child separately. It has to
be aggregated. e.g. you can ignore a parent's setting if the sum
of all children's limits is smaller than the parent's but then
again there could still be a lower limit higher up the tree, so
they would still have to be examined.

Thanks.

--
tejun

2016-11-22 21:27:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 07/15] blk-throttle: make throtl_slice tunable

Hello,

On Mon, Nov 14, 2016 at 02:22:14PM -0800, Shaohua Li wrote:
> throtl_slice is important for blk-throttling. A lot of stuffes depend on
> it, for example, throughput measurement. It has 100ms default value,
> which is not appropriate for all disks. For example, for SSD we might
> use a smaller value to make the throughput smoother. This patch makes it
> tunable.

It bothers me a bit because time slice doesn't mean anything inherent
to throttling. It really is an implementation detail - throttling can
be implemented at per-operation level without time slice involved at
all. It's okay to expose the knob if necessary but the meaning of the
knob is almost completely arbitrary to users (as it has no inherent
meaning).

Thanks.

--
tejun

2016-11-22 21:42:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 05/15] blk-throttle: add downgrade logic

Hello,

On Tue, Nov 22, 2016 at 04:21:21PM -0500, Tejun Heo wrote:
> 1. A cgroup and its high and max limits don't have much to do with
> other cgroups and their limits. I don't get how the choice between
> high and max limits can be a td-wide state.

Ah, okay, this combines with idle cgroup detection to determine
whether the cgroups should be allowed to exceed high limits. It makes
more sense to me now. In that case, for the high/max limit range
issues, the enforced high/max limits can simply follow what's implied
by the configuration. e.g. if high=100 max=80, just behave as if both
high and max are 80.

> 2. Comparing parent's and child's limits and saying that either can be
> ignored because one is higher than the other isn't correct. A
> parent's limit doesn't apply to each child separately. It has to
> be aggregated. e.g. you can ignore a parent's setting if the sum
> of all children's limits is smaller than the parent's but then
> again there could still be a lower limit higher up the tree, so
> they would still have to be examined.

This part still seems weird tho. What am I misunderstanding?

Thanks.

--
tejun

2016-11-22 23:08:58

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 02/15] blk-throttle: add .high interface

On Tue, Nov 22, 2016 at 03:02:53PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> Sorry about the delay.
>
> On Mon, Nov 14, 2016 at 02:22:09PM -0800, Shaohua Li wrote:
> > @@ -1376,11 +1414,37 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
> > goto out_finish;
> > }
> >
> > - tg->bps[READ][LIMIT_MAX] = v[0];
> > - tg->bps[WRITE][LIMIT_MAX] = v[1];
> > - tg->iops[READ][LIMIT_MAX] = v[2];
> > - tg->iops[WRITE][LIMIT_MAX] = v[3];
> > -
> > + if (index == LIMIT_MAX) {
> > + if ((v[0] < tg->bps[READ][LIMIT_HIGH] &&
> > + tg->bps[READ][LIMIT_HIGH] != -1) ||
> > + (v[1] < tg->bps[WRITE][LIMIT_HIGH] &&
> > + tg->bps[WRITE][LIMIT_HIGH] != -1) ||
> > + (v[2] < tg->iops[READ][LIMIT_HIGH] &&
> > + tg->iops[READ][LIMIT_HIGH] != -1) ||
> > + (v[3] < tg->iops[WRITE][LIMIT_HIGH] &&
> > + tg->iops[WRITE][LIMIT_HIGH] != -1)) {
> > + ret = -EINVAL;
> > + goto out_finish;
>
> Is this necessary? memcg doesn't put restrictions on input but just
> enforces whatever is configured. I think it'd be better to follow the
> same model here too. Hmm... is this because throtl will be able to
> choose either all high or max limits per cgroup?

Thanks for your time!

Yep, the enforced limit could be either the high or the max one. It's
doable to move the restrictions off the input, but that will make using the
limits more troublesome. If high is bigger than max, can I internally set
high to max? If not, I'd prefer to keep the restrictions.

> And this isn't from your patches but can we please switch to
> UINT64_MAX instead of -1?

Sure, will fix this.

> > + }
> > + } else if (index == LIMIT_HIGH) {
> > + if ((v[0] > tg->bps[READ][LIMIT_MAX] && v[0] != -1) ||
> > + (v[1] > tg->bps[WRITE][LIMIT_MAX] && v[1] != -1) ||
> > + (v[2] > tg->iops[READ][LIMIT_MAX] && v[2] != -1) ||
> > + (v[3] > tg->iops[WRITE][LIMIT_MAX] && v[3] != -1)) {
>
> Ditto here.
>
> > @@ -1412,6 +1484,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
> > .pd_alloc_fn = throtl_pd_alloc,
> > .pd_init_fn = throtl_pd_init,
> > .pd_online_fn = throtl_pd_online,
> > + .pd_offline_fn = throtl_pd_offline,
> > .pd_free_fn = throtl_pd_free,
> > };
>
> I haven't read the whole thing yet but this looks a bit suspicious. A
> css going offline indicates that the destruction of the css started.
> I don't get why that'd reset high limits. There can be a lot of async
> IOs after offline.

Ok. We do want to reset the limits, because if no cgroup has a high limit,
we should definitely use the max limit for all cgroups. The whole state
machine (switching between high and max limits) is meaningless in that case.

Is pd_free_fn a good place to guarantee async IOs have finished?

Thanks,
Shaohua

2016-11-22 23:11:41

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 03/15] blk-throttle: configure bps/iops limit for cgroup in high limit

On Tue, Nov 22, 2016 at 03:16:43PM -0500, Tejun Heo wrote:
> On Mon, Nov 14, 2016 at 02:22:10PM -0800, Shaohua Li wrote:
> > each queue will have a state machine. Initially queue is in LIMIT_HIGH
> > state, which means all cgroups will be throttled according to their high
> > limit. After all cgroups with high limit cross the limit, the queue state
> > gets upgraded to LIMIT_MAX state.
> > cgroups without high limit will use max limit for their high limit.
>
> Haven't looked at the actual mechanism yet so please correct me if I'm
> getting it wrong but at least the above explanation reads incorrect.
> Shouldn't max limits applied regardless of whether other cgroups are
> above high or not?

That's true. Since the max limit is higher than the high limit (the
interface patch 2 has the check), if a cgroup is throttled according to its
high limit, the max limit is respected too.

Thanks,
Shaohua

2016-11-22 23:18:43

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 07/15] blk-throttle: make throtl_slice tunable

On Tue, Nov 22, 2016 at 04:27:15PM -0500, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 14, 2016 at 02:22:14PM -0800, Shaohua Li wrote:
> > throtl_slice is important for blk-throttling. A lot of stuffes depend on
> > it, for example, throughput measurement. It has 100ms default value,
> > which is not appropriate for all disks. For example, for SSD we might
> > use a smaller value to make the throughput smoother. This patch makes it
> > tunable.
>
> It bothers me a bit because time slice doesn't mean anything inherent
> to throttling. It really is an implementation detail - throttling can
> be implemented at per-operation level without time slice involved at
> all. It's okay to expose the knob if necessary but the meaning of the
> knob is almost completely arbitrary to users (as it has no inherent
> meaning).

Hmm, it's not a real 'time slice'. The name is a bit confusing. Maybe
rename it to 'throtl_interval' or 'throtl_sampling_time'? Not sure.
Bandwidth and iops are always measured over a time interval; we can't state
the iops/bw of a single IO. So this really is a tunable knob.
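
A quick standalone illustration of that point, with invented numbers: the
same burst of IO yields very different bandwidth readings depending on the
sampling window, which is exactly what this knob controls.

#include <stdio.h>

static double bw_mbps(unsigned long long bytes, unsigned int window_ms)
{
	return (double)bytes / (1 << 20) * 1000.0 / window_ms;
}

int main(void)
{
	/* a bursty group: 8 MB dispatched in the first 10 ms, then idle */
	printf("measured over a  10 ms window: %4.0f MB/s\n",
	       bw_mbps(8 << 20, 10));
	printf("measured over a 100 ms window: %4.0f MB/s\n",
	       bw_mbps(8 << 20, 100));
	return 0;
}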

Thanks,
Shaohua

2016-11-22 23:38:42

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 05/15] blk-throttle: add downgrade logic

On Tue, Nov 22, 2016 at 04:42:00PM -0500, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 22, 2016 at 04:21:21PM -0500, Tejun Heo wrote:
> > 1. A cgroup and its high and max limits don't have much to do with
> > other cgroups and their limits. I don't get how the choice between
> > high and max limits can be a td-wide state.
>
> Ah, okay, this combines with idle cgroup detection to determine
> whether the cgroups should be allowed to exceed high limits. It makes
> more sense to me now. In that case, for the high/max limit range
> issues, the enforced high/max limits can simply follow what's implied
> by the configuration. e.g. if high=100 max=80, just behave as if both
> high and max are 80.
>
> > 2. Comparing parent's and child's limits and saying that either can be
> > ignored because one is higher than the other isn't correct. A
> > parent's limit doesn't apply to each child separately. It has to
> > be aggregated. e.g. you can ignore a parent's setting if the sum
> > of all children's limits is smaller than the parent's but then
> > again there could still be a lower limit higher up the tree, so
> > they would still have to be examined.
>
> This part still seems weird tho. What am I misunderstanding?

You are right, the checks are unnecessary. I'll delete them.

Thanks,
Shaohua

2016-11-23 21:11:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 02/15] blk-throttle: add .high interface

Hello,

On Tue, Nov 22, 2016 at 03:08:36PM -0800, Shaohua Li wrote:
> Yep, the limit could be high or max. Moving the restrictions out of the input
> check is doable, but it makes using the limits messier. If high is bigger than
> max, can I set high to max? If not, I'd prefer to keep the restrictions.

You can do that internally but userland should keep seeing what it
configured.

> > I haven't read the whole thing yet but this looks a bit suspicious. A
> > css going offline indicates that the destruction of the css started.
> > I don't get why that'd reset high limits. There can be a lot of async
> > IOs after offline.
>
> Ok. We do want to reset the limits, because if no cgroup has a high limit we
> should definitely use the max limit for all cgroups. The whole state machine
> (switching between the high and max limits) is meaningless in that case.
>
> Is pd_free_fn a good place to guarantee async IOs finish?

Yeap.

Thanks.

--
tejun

2016-11-23 21:17:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 07/15] blk-throttle: make throtl_slice tunable

Hello,

On Tue, Nov 22, 2016 at 03:18:24PM -0800, Shaohua Li wrote:
> Hmm, it's not a real 'time slice'. The name is a bit confusing. Maybe rename it
> to 'throtl_interval' or 'throtl_sampling_time'? Not sure. Bandwidth and iops
> are always measured over a time interval; we can't talk about the iops/bw of a
> single IO. So this really is a tunable knob.

Yeah, maybe using a more indicative name is better. However, even if
we say that this is the sampling period, it's not clear how adjusting
the knob would affect the behavior, as that's not something clearly
defined in blk-throtl's operation model. For contrast, compare it
with the latency target: the implemented behavior might not follow the
intended configuration perfectly, but the intention of the
configuration is clear regardless.

That said, if this needs to be a tunable knob, it's fine to have it,
but let's at least try to document what the effects of changing the
variable are.

Thanks.

--
tejun

2016-11-23 21:23:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 09/15] blk-throttle: make bandwidth change smooth

Hello,

On Mon, Nov 14, 2016 at 02:22:16PM -0800, Shaohua Li wrote:
> cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
> -> 45/75 -> 10/80

I wonder whether it'd make sense to make the clamping down gradual too
(way faster than the ramping up but still gradual). The control model
isn't immediate anyway and doing the other direction gradually too
might not hurt the overall control accuracy all that much.
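
A hedged sketch of what 'gradual' could look like in both directions (the step
sizes and helper are made up, not the patch's actual numbers): after a state
change, the effective limit moves toward its target in bounded steps instead of
jumping.

/* Illustrative sketch: move an effective limit toward a target in steps. */
#include <stdint.h>

static uint64_t ramp_limit(uint64_t current, uint64_t target,
                           uint64_t up_step, uint64_t down_step)
{
        if (current < target) {
                /* Ramp up slowly after an upgrade to LIMIT_MAX. */
                uint64_t next = current + up_step;

                return next > target ? target : next;
        }
        if (current > target) {
                /* Clamp down faster, but still gradually, on a downgrade. */
                uint64_t next = current > down_step ? current - down_step : 0;

                return next < target ? target : next;
        }
        return target;
}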

Thanks.

--
tejun

2016-11-23 21:32:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold

On Mon, Nov 14, 2016 at 02:22:18PM -0800, Shaohua Li wrote:
> Add interface to configure the threshold
>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> block/blk-sysfs.c | 7 +++++++
> block/blk-throttle.c | 25 +++++++++++++++++++++++++
> block/blk.h | 4 ++++
> 3 files changed, 36 insertions(+)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 3e284e4..f15aeed 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -532,6 +532,12 @@ static struct queue_sysfs_entry throtl_slice_entry = {
> .show = blk_throtl_slice_show,
> .store = blk_throtl_slice_store,
> };
> +
> +static struct queue_sysfs_entry throtl_idle_threshold_entry = {
> + .attr = {.name = "throttling_idle_threshold", .mode = S_IRUGO | S_IWUSR },
> + .show = blk_throtl_idle_threshold_show,
> + .store = blk_throtl_idle_threshold_store,
> +};

Shouldn't this be a per-cgroup setting along with latency target?
These two are the parameters which define how the cgroup should be
treated time-wise.

Thanks.

--
tejun

2016-11-23 21:46:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> Unfortunately it's very hard to determine if a cgroup is really idle. This
> patch uses the 'think time check' idea from CFQ for the purpose. Please
> note, the idea doesn't work for all workloads. For example, a workload
> with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
> not idle. But the workload can run at higher bandwidth with io depth 16.
> Compared to io depth 16, the io depth 8 workload is idle. We use the
> idea to roughly determine if a cgroup is idle.

Hmm... I'm not sure thinktime is the best measure here. Think time is
used by cfq mainly to tell the likely future behavior of a workload so
that cfq can take speculative actions on the prediction. However,
given that the implemented high limit behavior tries to provide a
certain level of latency target, using the predictive thinktime to
regulate behavior might lead to too unpredictable behaviors.

Moreover, I don't see why we need to bother with predictions anyway.
cfq needed it but I don't think that's the case for blk-throtl. It
can just provide idle threshold where a cgroup which hasn't issued an
IO over that threshold is considered idle. That'd be a lot easier to
understand and configure from userland while providing a good enough
mechanism to prevent idle cgroups from clamping down utilization for
too long.

Thanks.

--
tejun

2016-11-24 01:00:19

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 09/15] blk-throttle: make bandwidth change smooth

On Wed, Nov 23, 2016 at 04:23:35PM -0500, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 14, 2016 at 02:22:16PM -0800, Shaohua Li wrote:
> > cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
> > -> 45/75 -> 10/80
>
> I wonder whether it'd make sense to make the clamping down gradual too
> (way faster than the ramping up but still gradual). The control model
> isn't immediate anyway and doing the other direction gradually too
> might not hurt the overall control accuracy all that much.

Sure, that's on my todo list. Part of the reason is that I haven't figured out
an easy way to do it yet. I'll keep trying.

Thanks,
Shaohua

2016-11-24 01:06:50

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold

On Wed, Nov 23, 2016 at 04:32:43PM -0500, Tejun Heo wrote:
> On Mon, Nov 14, 2016 at 02:22:18PM -0800, Shaohua Li wrote:
> > Add interface to configure the threshold
> >
> > Signed-off-by: Shaohua Li <[email protected]>
> > ---
> > block/blk-sysfs.c | 7 +++++++
> > block/blk-throttle.c | 25 +++++++++++++++++++++++++
> > block/blk.h | 4 ++++
> > 3 files changed, 36 insertions(+)
> >
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 3e284e4..f15aeed 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -532,6 +532,12 @@ static struct queue_sysfs_entry throtl_slice_entry = {
> > .show = blk_throtl_slice_show,
> > .store = blk_throtl_slice_store,
> > };
> > +
> > +static struct queue_sysfs_entry throtl_idle_threshold_entry = {
> > + .attr = {.name = "throttling_idle_threshold", .mode = S_IRUGO | S_IWUSR },
> > + .show = blk_throtl_idle_threshold_show,
> > + .store = blk_throtl_idle_threshold_store,
> > +};
>
> Shouldn't this be a per-cgroup setting along with latency target?
> These two are the parameters which define how the cgroup should be
> treated time-wise.

It should be easy to make it per-cgroup. Just not sure if it should be
per-cgroup. The logic is if the disk is faster, wait time should be shorter to
not harm performance. So it sounds like a per-disk characteristic.

Thanks,
Shaohua

2016-11-24 01:55:08

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

On Wed, Nov 23, 2016 at 04:46:19PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Mon, Nov 14, 2016 at 02:22:17PM -0800, Shaohua Li wrote:
> > Unfortunately it's very hard to determine if a cgroup is really idle. This
> > patch uses the 'think time check' idea from CFQ for the purpose. Please
> > note, the idea doesn't work for all workloads. For example, a workload
> > with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
> > not idle. But the workload can run at higher bandwidth with io depth 16.
> > Compared to io depth 16, the io depth 8 workload is idle. We use the
> > idea to roughly determine if a cgroup is idle.
>
> Hmm... I'm not sure thinktime is the best measure here. Think time is
> used by cfq mainly to tell the likely future behavior of a workload so
> that cfq can take speculative actions on the prediction. However,
> given that the implemented high limit behavior tries to provide a
> certain level of latency target, using the predictive thinktime to
> regulate behavior might lead to too unpredictable behaviors.

Latency just reflects one side of the IO. Latency and think time don't have any
relationship. For example, a cgroup dispatching 1 IO per second can still have
high latency. If we only take latency into account, we will think the cgroup is
busy, which is not justified.

> Moreover, I don't see why we need to bother with predictions anyway.
> cfq needed it but I don't think that's the case for blk-throtl. It
> can just provide idle threshold where a cgroup which hasn't issued an
> IO over that threshold is considered idle. That'd be a lot easier to
> understand and configure from userland while providing a good enough
> mechanism to prevent idle cgroups from clamping down utilization for
> too long.

We could do this, but it will only work for a very idle workload, i.e., one that
is completely idle. If a workload dispatches IO sporadically, this will likely
not work. The average think time is more precise for prediction.
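
For reference, a minimal sketch of the CFQ-style think time idea being discussed
here (an exponentially weighted average of the gap between the last completion
and the next submission); the 7/8-1/8 weights and names are illustrative
assumptions, not the patch's code.

/* Illustrative think-time tracking, roughly in the CFQ style. */
#include <stdbool.h>
#include <stdint.h>

struct tt_state {
        uint64_t last_finish_ns;        /* completion time of the last IO   */
        uint64_t avg_think_ns;          /* running average of the idle gaps */
};

static void tt_on_submit(struct tt_state *tt, uint64_t now_ns)
{
        uint64_t gap = now_ns - tt->last_finish_ns;

        /* 7/8 old + 1/8 new, a classic exponentially weighted average. */
        tt->avg_think_ns = (tt->avg_think_ns * 7 + gap) / 8;
}

static bool tt_is_idle(const struct tt_state *tt, uint64_t threshold_ns)
{
        /*
         * Plain idle detection would instead compare now - last_finish_ns
         * against the threshold directly; the average smooths out workloads
         * that dispatch IO sporadically.
         */
        return tt->avg_think_ns > threshold_ns;
}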

Thanks,
Shaohua

2016-11-28 22:08:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold

Hello, Shaohua.

On Wed, Nov 23, 2016 at 05:06:30PM -0800, Shaohua Li wrote:
> > Shouldn't this be a per-cgroup setting along with latency target?
> > These two are the parameters which define how the cgroup should be
> > treated time-wise.
>
> It should be easy to make it per-cgroup. Just not sure if it should be
> per-cgroup. The logic is if the disk is faster, wait time should be shorter to
> not harm performance. So it sounds like a per-disk characteristic.

Yes, this is something dependent on the device, but also on the
workload. For both this parameter and the latency target, it seems
that they should be specified along with the actual device limits so
that they follow the same convention and can be specified per cgroup *
block device. What do you think?

Thanks.

--
tejun

2016-11-28 22:14:41

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold

On Mon, Nov 28, 2016 at 05:08:18PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Wed, Nov 23, 2016 at 05:06:30PM -0800, Shaohua Li wrote:
> > > Shouldn't this be a per-cgroup setting along with latency target?
> > > These two are the parameters which define how the cgroup should be
> > > treated time-wise.
> >
> > It should be easy to make it per-cgroup. Just not sure if it should be
> > per-cgroup. The logic is if the disk is faster, wait time should be shorter to
> > not harm performance. So it sounds like a per-disk characteristic.
>
> Yes, this is something dependent on the device, but also on the
> workload. For both this parameter and the latency target, it seems
> that they should be specified along with the actual device limits so
> that they follow the same convention and can be specified per cgroup *
> block device. What do you think?

That's ok, I'm totally fine to make it per cgroup and per disk.

Thanks,
Shaohua

2016-11-28 22:21:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

Hello, Shaohua.

On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > Hmm... I'm not sure thinktime is the best measure here. Think time is
> > used by cfq mainly to tell the likely future behavior of a workload so
> > that cfq can take speculative actions on the prediction. However,
> > given that the implemented high limit behavior tries to provide a
> > certain level of latency target, using the predictive thinktime to
> > regulate behavior might lead to too unpredictable behaviors.
>
> Latency just reflects one side of the IO. Latency and think time don't have
> any relationship. For example, a cgroup dispatching 1 IO per second can still
> have high latency. If we only take latency into account, we will think the
> cgroup is busy, which is not justified.

Yes, the two are independent metrics; however, whether a cgroup is
considered idle or not affects whether blk-throttle will adhere to the
latency target or not. Thinktime is a magic number which can be good
but whose behavior can be very difficult to predict from outside the
black box. What I was trying to say was that putting in thinktime
here can greatly weaken the configured latency target in unobvious
ways.

> > Moreover, I don't see why we need to bother with predictions anyway.
> > cfq needed it but I don't think that's the case for blk-throtl. It
> > can just provide idle threshold where a cgroup which hasn't issued an
> > IO over that threshold is considered idle. That'd be a lot easier to
> > understand and configure from userland while providing a good enough
> > mechanism to prevent idle cgroups from clamping down utilization for
> > too long.
>
> We could do this, but it will only work for a very idle workload, i.e., one
> that is completely idle. If a workload dispatches IO sporadically, this will
> likely not work. The average think time is more precise for prediction.

But we can increase sharing by upping the target latency. That should
be the main knob - if low, the user wants stricter service guarantee
at the cost of lower overall utilization; if high, the workload can
deal with higher latency and the system can achieve higher overall
utilization. I think the idle detection should be an extra mechanism
which can be used to ignore cgroup-disk combinations which are staying
idle for a long time.

Thanks.

--
tejun

2016-11-28 23:11:03

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

On Mon, Nov 28, 2016 at 05:21:48PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Wed, Nov 23, 2016 at 05:15:18PM -0800, Shaohua Li wrote:
> > > Hmm... I'm not sure thinktime is the best measure here. Think time is
> > > used by cfq mainly to tell the likely future behavior of a workload so
> > > that cfq can take speculative actions on the prediction. However,
> > > given that the implemented high limit behavior tries to provide a
> > > certain level of latency target, using the predictive thinktime to
> > > regulate behavior might lead to too unpredictable behaviors.
> >
> > Latency just reflects one side of the IO. Latency and think time don't have
> > any relationship. For example, a cgroup dispatching 1 IO per second can
> > still have high latency. If we only take latency into account, we will think
> > the cgroup is busy, which is not justified.
>
> Yes, the two are independent metrics; however, whether a cgroup is
> considered idle or not affects whether blk-throttle will adhere to the
> latency target or not. Thinktime is a magic number which can be good
> but whose behavior can be very difficult to predict from outside the
> black box. What I was trying to say was that putting in thinktime
> here can greatly weaken the configured latency target in unobvious
> ways.
>
> > > Moreover, I don't see why we need to bother with predictions anyway.
> > > cfq needed it but I don't think that's the case for blk-throtl. It
> > > can just provide idle threshold where a cgroup which hasn't issued an
> > > IO over that threshold is considered idle. That'd be a lot easier to
> > > understand and configure from userland while providing a good enough
> > > mechanism to prevent idle cgroups from clamping down utilization for
> > > too long.
> >
> > We could do this, but it will only work for a very idle workload, i.e., one
> > that is completely idle. If a workload dispatches IO sporadically, this will
> > likely not work. The average think time is more precise for prediction.
>
> But we can increase sharing by upping the target latency. That should
> be the main knob - if low, the user wants stricter service guarantee
> at the cost of lower overall utilization; if high, the workload can
> deal with higher latency and the system can achieve higher overall
> utilization. I think the idle detection should be an extra mechanism
> which can be used to ignore cgroup-disk combinations which are staying
> idle for a long time.

Yes, we can increase the target latency to increase sharing. But latency and
think time are different. In the example I mentioned earlier, we would have to
make the latency target very big to increase sharing, even though the cgroup
just sends 1 IO per second. I don't think that's what users want. In summary,
we can't use latency alone to determine if cgroups could dispatch more IO.

Currently the think time idle detection is an extra mechanism to ignore a
cgroup's limit. So currently we only ignore a cgroup's limit when its think time
is big or its latency is small. This does make the behavior a little bit harder
to predict, e.g., the latency target isn't always respected, but this is
necessary to get better sharing.
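
Putting the two conditions from this paragraph into one hedged sketch (names and
thresholds are illustrative, not the patch code): a cgroup's high limit is
ignored when its think time is above the threshold, or when its IOs are small
and already completing below the latency target.

/* Illustrative version of the combined "limit can be ignored" check. */
#include <stdbool.h>
#include <stdint.h>

struct tg_stats {
        uint64_t avg_think_ns;  /* average gap between IOs    */
        uint64_t avg_lat_ns;    /* average completion latency */
        uint64_t avg_req_bytes; /* average request size       */
};

static bool tg_limit_can_be_ignored(const struct tg_stats *st,
                                    uint64_t think_threshold_ns,
                                    uint64_t latency_target_ns,
                                    uint64_t small_req_bytes)
{
        /* 1. The cgroup looks idle: long think time between IOs. */
        if (st->avg_think_ns > think_threshold_ns)
                return true;
        /*
         * 2. The cgroup is already well served: small requests completing
         *    below its latency target.
         */
        return st->avg_req_bytes <= small_req_bytes &&
               st->avg_lat_ns < latency_target_ns;
}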

Thanks,
Shaohua

2016-11-29 17:08:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

Hello, Shaohua.

On Mon, Nov 28, 2016 at 03:10:18PM -0800, Shaohua Li wrote:
> > But we can increase sharing by upping the target latency. That should
> > be the main knob - if low, the user wants stricter service guarantee
> > at the cost of lower overall utilization; if high, the workload can
> > deal with higher latency and the system can achieve higher overall
> > utilization. I think the idle detection should be an extra mechanism
> > which can be used to ignore cgroup-disk combinations which are staying
> > idle for a long time.
>
> Yes, we can increase the target latency to increase sharing. But latency and
> think time are different. In the example I mentioned earlier, we would have to
> make the latency target very big to increase sharing, even though the cgroup
> just sends 1 IO per second. I don't think that's what users want. In summary,
> we can't use latency alone to determine if cgroups could dispatch more IO.
>
> Currently the think time idle detection is an extra mechanism to ignore a
> cgroup's limit. So currently we only ignore a cgroup's limit when its think
> time is big or its latency is small. This does make the behavior a little bit
> harder to predict, e.g., the latency target isn't always respected, but this
> is necessary to get better sharing.

So, it's not like we can get better sharing for free. It always comes
at the cost of (best effort) latency guarantee. Using thinktime for
idle detection doesn't mean that we get higher utilization for free.
If we get higher utilization by using thinktime instead of plain idle
detection, it means that we're sacrificing latency guarantee more with
thinktime, so I don't think the argument that using thinktime leads to
higher utilization is a clear winner.

That is not to say that there's no benefit to thinktime. I can
imagine cases where it'd allow us to ride the line between acceptable
latency and good overall utilization better; however, that also comes
with cases where one has to wonder "what's going on? I have no idea
what it's doing".

Given that blk-throttle is gonna ask for explicit and detailed
configuration from its users, I think it's vital that it has config
knobs which are immediately clear. Being tedious is already a burden
and I don't think adding unpredictability there is a good idea.

Thanks.

--
tejun

2016-11-29 17:24:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

Hello, Shaohua.

On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> To do this, we sample some data, eg, average latency for request size
> 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit
> the data (x is request size in KB, f(x) is the latency). Then we can use
> the equation to estimate IO target latency for any request.

As discussed separately, it might make more sense to just use the avg
of the closest bucket instead of trying to line-fit the buckets, but
it's an implementation detail and whatever works is fine.
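
For concreteness, a least-squares sketch of the f(x) = a * x + b fit mentioned
above, written as plain userspace C with doubles (the kernel code would use
fixed-point arithmetic, and the bucket sizes here are assumptions); the
nearest-bucket alternative would simply return the average of the closest
bucket instead of evaluating the fitted line.

/* Illustrative least-squares fit of latency vs. request size (KB). */
#include <stddef.h>

struct lat_fit { double a, b; };        /* f(x) = a * x + b */

/* Assumes at least two distinct sizes, e.g. 4/8/16/32/64 KB buckets. */
static struct lat_fit fit_latency(const double *size_kb,
                                  const double *avg_lat_us, size_t n)
{
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        struct lat_fit f = { 0, 0 };
        size_t i;

        for (i = 0; i < n; i++) {
                sx  += size_kb[i];
                sy  += avg_lat_us[i];
                sxx += size_kb[i] * size_kb[i];
                sxy += size_kb[i] * avg_lat_us[i];
        }
        f.a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        f.b = (sy - f.a * sx) / n;
        return f;
}

/* Estimate the expected latency for an arbitrary request size. */
static double estimate_latency(const struct lat_fit *f, double size_kb)
{
        return f->a * size_kb + f->b;
}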

> Hard disk is completely different. Latency depends on spindle seek
> instead of request size. So this latency target feature is for SSD only.

I'm not sure about this. While a disk's latency profile is way higher
and more erratic than SSDs, that doesn't make latency target useless.
Sure, it'll be more crude but there's a significant difference between
a cgroup having <= 20ms overall latency and experiencing multi-sec
latency.

Thanks.

--
tejun

2016-11-29 17:31:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 15/15] blk-throttle: add latency target support

Hello,

On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote:
> One hard problem with adding the .high limit is detecting idle cgroups. If one
> cgroup doesn't dispatch enough IO against its high limit, we must have a
> mechanism to determine whether other cgroups can dispatch more IO. We added
> the think time detection mechanism before, but it doesn't work for all
> workloads. Here we add a latency based approach.

As I wrote before, I think that the two mechanisms should operate on
two mostly separate aspects of io control - latency control for
arbitrating active cgroups and idle detection to count out cgroups
which are sitting doing nothing - instead of the two mechanisms
possibly competing.

> static bool throtl_tg_is_idle(struct throtl_grp *tg)
> {
> - /* cgroup is idle if average think time is more than threshold */
> - return ktime_get_ns() - tg->last_finish_time >
> + /*
> + * cgroup is idle if:
> + * 1. average think time is higher than threshold
> + * 2. average request size is small and average latency is higher
^
lower, right?
> + * than target
> + */

So, this looks like too much magic to me. How would one configure for
a workload which may issue small IOs, say, every few seconds but
requires low latency?

Thanks.

--
tejun

2016-11-29 18:14:30

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 15/15] blk-throttle: add latency target support

On Tue, Nov 29, 2016 at 12:31:08PM -0500, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote:
> > One hard problem with adding the .high limit is detecting idle cgroups. If
> > one cgroup doesn't dispatch enough IO against its high limit, we must have a
> > mechanism to determine whether other cgroups can dispatch more IO. We added
> > the think time detection mechanism before, but it doesn't work for all
> > workloads. Here we add a latency based approach.
>
> As I wrote before, I think that the two mechanisms should operate on
> two mostly separate aspects of io control - latency control for
> arbitrating active cgroups and idle detection to count out cgroups
> which are sitting doing nothing - instead of the two mechanisms
> possibly competing.

What the patches do doesn't conflict with what you are talking about. We need a way
to detect if cgroups are idle or active. I think the problem is how to define
'active' and 'idle'. We must quantify the state. We could use:
1. plain idle detection
2. think time idle detection

1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
Probably the function name 'throtl_tg_is_idle' is misleading. It really means
'the cgroup's high limit can be ignored, so other cgroups can dispatch more IO'.

> > static bool throtl_tg_is_idle(struct throtl_grp *tg)
> > {
> > - /* cgroup is idle if average think time is more than threshold */
> > - return ktime_get_ns() - tg->last_finish_time >
> > + /*
> > + * cgroup is idle if:
> > + * 1. average think time is higher than threshold
> > + * 2. average request size is small and average latency is higher
> ^
> lower, right?
oh, yes

> > + * than target
> > + */
>
> So, this looks like too much magic to me. How would one configure for
> a workload which may issue small IOs, say, every few seconds but
> requires low latency?

Configure the think time threshold to several seconds and configure the latency
target; that should do the job.

Thanks,
Shaohua

2016-11-29 18:31:11

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

On Tue, Nov 29, 2016 at 12:24:35PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote:
> > To do this, we sample some data, eg, average latency for request size
> > 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit
> > the data (x is request size in KB, f(x) is the latency). Then we can use
> > the equation to estimate IO target latency for any request.
>
> As discussed separately, it might make more sense to just use the avg
> of the closest bucket instead of trying to line-fit the buckets, but
> it's an implementation detail and whatever works is fine.

That is still like a line fit. I don't think there is a big difference.

> > Hard disk is completely different. Latency depends on spindle seek
> > instead of request size. So this latency target feature is for SSD only.
>
> I'm not sure about this. While a disk's latency profile is way higher
> and more erratic than SSDs, that doesn't make latency target useless.
> Sure, it'll be more crude but there's a significant difference between
> a cgroup having <= 20ms overall latency and experiencing multi-sec
> latency.

Sure, the latency target is useful for hard disk too. But we need a different
strategy. For hard disk, the latency highly depends on seek. Probably we can
make the latency target the same for all request sizes. Not sure if average
latency makes sense. Need more tests with hard disk. I'd like to focus on SSD
at the current stage.

Thanks,
Shaohua

2016-11-29 22:36:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

Hello,

On Tue, Nov 29, 2016 at 10:30:44AM -0800, Shaohua Li wrote:
> > As discussed separately, it might make more sense to just use the avg
> > of the closest bucket instead of trying to line-fit the buckets, but
> > it's an implementation detail and whatever works is fine.
>
> That is still like a line fit. I don't think there is a big difference.

Yeah, just wondering whether that'd be simpler.

> > > Hard disk is completely different. Latency depends on spindle seek
> > > instead of request size. So this latency target feature is for SSD only.
> >
> > I'm not sure about this. While a disk's latency profile is way higher
> > and more erratic than SSDs, that doesn't make latency target useless.
> > Sure, it'll be more crude but there's a significant difference between
> > a cgroup having <= 20ms overall latency and experiencing multi-sec
> > latency.
>
> Sure, the latency target is useful for hard disk too. But we need a different
> strategy. For hard disk, the latency highly depends on seek. Probably we can
> make the latency target the same for all request sizes. Not sure if average
> latency makes sense. Need more tests with hard disk. I'd like to focus on SSD
> at the current stage.

Sure, it's fine to focus on SSDs for now but with either line fitting
or bucketed avg, it should be fine, right? The slope of the line
would be way lower and the deviation would be higher but that doesn't
really get in the way here.

Thanks.

--
tejun

2016-11-29 22:54:57

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V4 15/15] blk-throttle: add latency target support

Hello,

On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote:
> What the patches do doesn't conflict with what you are talking about. We need a way
> to detect if cgroups are idle or active. I think the problem is how to define
> 'active' and 'idle'. We must quantify the state. We could use:
> 1. plain idle detection
> 2. think time idle detection
>
> 1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
> Probably the function name 'throtl_tg_is_idle' is misleading. It really means
> 'the cgroup's high limit can be ignored, so other cgroups can dispatch more IO'.

Yeah, both work towards about the same goal. I feel a bit icky about
using thinktime as it seems more complicated than called for here.

> > > static bool throtl_tg_is_idle(struct throtl_grp *tg)
> > > {
> > > - /* cgroup is idle if average think time is more than threshold */
> > > - return ktime_get_ns() - tg->last_finish_time >
> > > + /*
> > > + * cgroup is idle if:
> > > + * 1. average think time is higher than threshold
> > > + * 2. average request size is small and average latency is higher
> > ^
> > lower, right?
> oh, yes
>
> > > + * than target
> > > + */
> >
> > So, this looks like too much magic to me. How would one configure for
> > a workload which may issue small IOs, say, every few seconds but
> > requires low latency?
>
> Configure the think time threshold to several seconds and configure the latency
> target; that should do the job.

Sure, with a high enough number it'd do the same thing, but it's a fuzzy number
which can be difficult to tell from the user's point of view.
Implementation-wise this isn't a huge difference, but I'm worried that this can
fall into the "this isn't doing what I'm expecting it to" - "try to nudge that
number a bit" trap.

If we have a latency target and a dumb idle setting, each knob's role is
clear - the latency target determines the guarantee that we want to give
to that cgroup and accordingly how much utilization we're willing to
sacrifice for it, and the idle period tells us to ignore the cgroup if it
has been idle for a relatively long time. The distinction between the two
knobs is fairly clear.
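
A sketch of the two-knob model described above (the field names are
hypothetical, not an existing interface): the latency target arbitrates among
active cgroups, while a separate idle period simply drops cgroups that haven't
issued IO for a long time out of the calculation.

/* Illustrative two-knob decision: latency target plus plain idle period. */
#include <stdbool.h>
#include <stdint.h>

struct tg_cfg {
        uint64_t latency_target_ns;     /* guarantee we try to honor */
        uint64_t idle_period_ns;        /* "count me out" threshold  */
};

struct tg_state {
        uint64_t last_io_ns;    /* time of the last submitted IO */
        uint64_t avg_lat_ns;    /* measured average latency      */
};

/*
 * Idle cgroups are ignored entirely when deciding whether others may
 * exceed their high limits.
 */
static bool tg_counted_out(const struct tg_cfg *c,
                           const struct tg_state *s, uint64_t now_ns)
{
        return now_ns - s->last_io_ns > c->idle_period_ns;
}

/*
 * Among the cgroups that are not counted out, the latency target is the
 * knob that decides whether the disk should be clamped back down.
 */
static bool tg_wants_downgrade(const struct tg_cfg *c,
                               const struct tg_state *s)
{
        return s->avg_lat_ns > c->latency_target_ns;
}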

With thinktime, the roles of each knob seem more muddled in that
thinktime would be a knob which can also be used to fine-tune
not-too-active sharing.

Most of our differences might be coming from where we assign
importance. I think that if a cgroup wants to have a latency target, it
should be the primary parameter and followed as strictly and clearly
as possible even if that means lower overall utilization. If a cgroup
issues IOs sporadically and thinktime can increase utilization
(compared to dumb idle detection), that means that the cgroup wouldn't
be getting the target latency that it configured. If such a situation
is acceptable, wouldn't it make sense to lower the target latency
instead?

Thanks.

--
tejun

2016-11-29 23:39:49

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V4 15/15] blk-throttle: add latency target support

On Tue, Nov 29, 2016 at 05:54:46PM -0500, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote:
> > What the patches do doesn't conflict with what you are talking about. We need a way
> > to detect if cgroups are idle or active. I think the problem is how to define
> > 'active' and 'idle'. We must quantify the state. We could use:
> > 1. plain idle detection
> > 2. think time idle detection
> >
> > 1 is a subset of 2. Both need a knob to specify the time. 2 is more generic.
> > Probably the function name 'throtl_tg_is_idle' is misleading. It really means
> > 'the cgroup's high limit can be ignored, so other cgroups can dispatch more IO'.
>
> Yeah, both work towards about the same goal. I feel a bit icky about
> using thinktime as it seems more complicated than called for here.
>
> > > > static bool throtl_tg_is_idle(struct throtl_grp *tg)
> > > > {
> > > > - /* cgroup is idle if average think time is more than threshold */
> > > > - return ktime_get_ns() - tg->last_finish_time >
> > > > + /*
> > > > + * cgroup is idle if:
> > > > + * 1. average think time is higher than threshold
> > > > + * 2. average request size is small and average latency is higher
> > > ^
> > > lower, right?
> > oh, yes
> >
> > > > + * than target
> > > > + */
> > >
> > > So, this looks like too much magic to me. How would one configure for
> > > a workload which may issue small IOs, say, every few seconds but
> > > requires low latency?
> >
> > Configure the think time threshold to several seconds and configure the latency
> > target; that should do the job.
>
> Sure, with a high enough number it'd do the same thing, but it's a fuzzy
> number which can be difficult to tell from the user's point of view.
> Implementation-wise this isn't a huge difference, but I'm worried that this
> can fall into the "this isn't doing what I'm expecting it to" - "try to nudge
> that number a bit" trap.
>
> If we have a latency target and a dumb idle setting, each knob's role is
> clear - the latency target determines the guarantee that we want to give
> to that cgroup and accordingly how much utilization we're willing to
> sacrifice for it, and the idle period tells us to ignore the cgroup if it
> has been idle for a relatively long time. The distinction between the two
> knobs is fairly clear.
>
> With thinktime, the roles of each knob seem more muddled in that
> thinktime would be a knob which can also be used to fine-tune
> not-too-active sharing.

The choice between dumb idle and think time idle is an implementation choice.
Let me put it this way: define a knob called 'idle_time'. In the first
implementation, we implement the knob as dumb idle. Later we implement it as
think time idle. Would this make you feel better? Or would just using the new
name 'idle_time' already make you happy?

For dumb idle, we probably can't let users configure the 'idle_time' too small,
though.

> Most of our differences might be coming from where we assign
> importance. I think that if a cgroup wants to have a latency target, it
> should be the primary parameter and followed as strictly and clearly
> as possible even if that means lower overall utilization. If a cgroup
> issues IOs sporadically and thinktime can increase utilization
> (compared to dumb idle detection), that means that the cgroup wouldn't
> be getting the target latency that it configured. If such a situation
> is acceptable, wouldn't it make sense to lower the target latency
> instead?

Lowering the target latency doesn't really help. With a given latency target, a
cgroup can dispatch 1 IO per second or 1000 IOs per second. The reality is that
whether an application stops dispatching IO (is idle) and whether its IO latency
is high have no relationship to each other.

Thanks,
Shaohua