2016-12-15 20:38:48

by Shaohua Li

Subject: [PATCH V5 00/17] blk-throttle: add .low limit

Hi,

cgroup still lacks a good IO controller. CFQ works well for hard disks, but not
so much for SSDs. This patch set tries to add a conservative limit to
blk-throttle. It isn't proportional scheduling, but it can help prioritize
cgroups. There are several advantages to choosing blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
request based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but
it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism used by CFQ/BFQ. That
mechanism is proven to harm performance on fast SSDs.

The patch set adds a new io.low limit to blk-throttle. It's only for cgroup2.
The existing io.max is hard limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low, in contrast, is best effort
throttling: cgroups with a 'low' limit can run above their 'low' limit at
appropriate times. Specifically, if all cgroups reach their 'low' limit, all
cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
limit, all other cgroups will run according to their 'low' limit. So the 'low'
limit plays two roles: it lets cgroups use free bandwidth and it protects each
cgroup's 'low' limit.

An example usage is a high prio cgroup with a high 'low' limit and a low prio
cgroup with a low 'low' limit. If the high prio cgroup isn't running, the low
prio cgroup can run above its 'low' limit, so we don't waste bandwidth. When
the high prio cgroup runs and is below its 'low' limit, the low prio cgroup
will be held to its 'low' limit. This protects the high prio cgroup so it gets
more resources.
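
As a concrete sketch of such a setup (illustrative values; the 8:16 device
number and cgroup paths are assumptions here, and the key format is the
io.low/io.max interface added later in this series):

  # high prio cgroup: guarantee ~80MB/s reads
  echo "8:16 rbps=83886080" > /sys/fs/cgroup/high/io.low
  # low prio cgroup: guarantee ~10MB/s reads, cap it at ~40MB/s
  echo "8:16 rbps=10485760" > /sys/fs/cgroup/low/io.low
  echo "8:16 rbps=41943040" > /sys/fs/cgroup/low/io.max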

The implementation is simple. The disk queue has a state machine with 2 states,
LIMIT_LOW and LIMIT_MAX. In each state, we throttle cgroups according to the
limit of that state: the io.low limit in the LIMIT_LOW state, the io.max limit
in LIMIT_MAX. The disk state is upgraded/downgraded between LIMIT_LOW and
LIMIT_MAX according to the rule above. Initially the disk state is LIMIT_MAX,
and if no cgroup sets io.low, the disk state remains in LIMIT_MAX. Systems with
only io.max set will see nothing changed with these patches.
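
A minimal sketch of that rule, ignoring the hierarchy walks and time filters
the patches below apply (the helper here is illustrative, not the kernel's;
the real logic lives in throtl_can_upgrade()/throtl_tg_can_downgrade()):

enum limit_state { LIMIT_LOW, LIMIT_MAX };

static enum limit_state next_state(enum limit_state cur,
				   int all_reached_low, int any_below_low)
{
	if (cur == LIMIT_LOW && all_reached_low)
		return LIMIT_MAX;	/* all cgroups met io.low: use free bandwidth */
	if (cur == LIMIT_MAX && any_below_low)
		return LIMIT_LOW;	/* protect the cgroup under its io.low */
	return cur;
}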

The first 9 patches implement the basic framework: they add the interface and
handle the upgrade and downgrade logic. Patch 9 detects a special case where a
cgroup is completely idle; in this case, we ignore the cgroup's limit. Patches
10-17 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we don't expect.
For example, one cgroup's bandwidth can drop below its io.low limit very soon
after an upgrade. Patch 10 has more details about this issue.

2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough
IO. Under the above upgrade rule, the disk then remains in the LIMIT_LOW state
and all other cgroups can't dispatch IO above their 'low' limit, so bandwidth
is wasted. Patch 11 has more details about this issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth rapidly
drops below its 'low' limit. The smoothness means we could waste some bandwidth
in the transition though, but we must pay something for sharing.

Issue 2 is very hard. We introduce two mechanisms for it. One is 'idle time' or
'think time', borrowed from CFQ. If a cgroup's average idle time is high, we
treat it as idle and its 'low' limit isn't respected. Please see patches
11 - 13 for details. The other is 'latency target'. If a cgroup's IO latency is
low, we treat it as idle and its 'low' limit isn't respected. Please see
patches 14 - 17 for details. Both mechanisms only apply when a cgroup runs
below its 'low' limit.

The disadvantage of blk-throttle is that it exports rather low level knobs.
Configuration won't be easy for normal users, but it is powerful for
experienced users.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' so it's consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based
calculation'
- Rebase and fix other problems

An issue pointed out by Tejun isn't fixed yet:
- .pd_offline_fn vs .pd_free_fn. .pd_free_fn seems too late to change states

V3->V4:
- Add latency target for cgroup
- Fix bugs
http://marc.info/?l=linux-block&m=147916216512915&w=2

V2->V3:
- Rebase
- Fix several bugs
- Make the hard disk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have to
prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2

Shaohua Li (17):
blk-throttle: use U64_MAX/UINT_MAX to replace -1
blk-throttle: prepare support multiple limits
blk-throttle: add .low interface
blk-throttle: configure bps/iops limit for cgroup in low limit
blk-throttle: add upgrade logic for LIMIT_LOW state
blk-throttle: add downgrade logic
blk-throttle: make sure expire time isn't too big
blk-throttle: make throtl_slice tunable
blk-throttle: detect completed idle cgroup
blk-throttle: make bandwidth change smooth
blk-throttle: add a simple idle detection
blk-throttle: add interface to configure idle time threshold
blk-throttle: ignore idle cgroup limit
blk-throttle: add interface for per-cgroup target latency
block: track request size in blk_issue_stat
blk-throttle: add a mechanism to estimate IO latency
blk-throttle: add latency target support

Documentation/block/queue-sysfs.txt | 6 +
block/bio.c | 2 +
block/blk-core.c | 2 +-
block/blk-mq.c | 2 +-
block/blk-stat.c | 11 +-
block/blk-stat.h | 29 +-
block/blk-sysfs.c | 10 +
block/blk-throttle.c | 914 +++++++++++++++++++++++++++++++++---
block/blk-wbt.h | 10 +-
block/blk.h | 7 +
include/linux/blk_types.h | 10 +-
11 files changed, 909 insertions(+), 94 deletions(-)

--
2.9.3


2016-12-15 20:33:24

by Shaohua Li

Subject: [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold

Add an interface to configure the threshold. The io.low interface will look
like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

idle is in microseconds.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3cd8400..0218ff9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -180,6 +180,8 @@ struct throtl_data
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];

+ u64 dft_idle_ttime_threshold;
+
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;

@@ -1427,6 +1429,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
+ char idle_time[26] = "";

if (!dname)
return 0;
@@ -1434,7 +1437,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->bps_conf[READ][off] == U64_MAX &&
tg->bps_conf[WRITE][off] == U64_MAX &&
tg->iops_conf[READ][off] == UINT_MAX &&
- tg->iops_conf[WRITE][off] == UINT_MAX)
+ tg->iops_conf[WRITE][off] == UINT_MAX &&
+ (off != LIMIT_LOW || tg->idle_ttime_threshold ==
+ tg->td->dft_idle_ttime_threshold))
return 0;

if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1449,9 +1454,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->iops_conf[WRITE][off] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
tg->iops_conf[WRITE][off]);
+ if (off == LIMIT_LOW) {
+ if (tg->idle_ttime_threshold == U64_MAX)
+ strcpy(idle_time, " idle=max");
+ else
+ snprintf(idle_time, sizeof(idle_time), " idle=%llu",
+ div_u64(tg->idle_ttime_threshold, 1000));
+ }

- seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
- dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+ seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+ dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
return 0;
}

@@ -1469,6 +1481,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct blkg_conf_ctx ctx;
struct throtl_grp *tg;
u64 v[4];
+ u64 idle_time;
int ret;
int index = of_cft(of)->private;

@@ -1483,6 +1496,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = tg->iops_conf[READ][index];
v[3] = tg->iops_conf[WRITE][index];

+ idle_time = tg->idle_ttime_threshold;
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
@@ -1514,6 +1528,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = min_t(u64, val, UINT_MAX);
else if (!strcmp(tok, "wiops"))
v[3] = min_t(u64, val, UINT_MAX);
+ else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+ idle_time = val;
else
goto out_finish;
}
@@ -1542,6 +1558,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
blk_throtl_update_valid_limit(tg->td);
if (tg->td->limit_valid[LIMIT_LOW])
tg->td->limit_index = LIMIT_LOW;
+ tg->idle_ttime_threshold = (idle_time == U64_MAX) ?
+ U64_MAX : idle_time * 1000;
}
tg_conf_updated(tg);
ret = 0;
@@ -1862,6 +1880,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
else
tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+ if (tg->td->dft_idle_ttime_threshold == 0)
+ tg->td->dft_idle_ttime_threshold =
+ tg->idle_ttime_threshold;
}

/* see throtl_charge_bio() */
--
2.9.3

2016-12-15 20:33:22

by Shaohua Li

Subject: [PATCH V5 09/17] blk-throttle: detect completed idle cgroup

A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to a higher level and all cgroups will be
throttled to their low limit, so we waste bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case where a cgroup doesn't
dispatch any IO at all. We ignore such a cgroup's limit, so other cgroups
can use the bandwidth.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cd10c65..a0ba961 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -148,6 +148,8 @@ struct throtl_grp {

unsigned long last_check_time;

+ unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -437,11 +439,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)

static void throtl_pd_online(struct blkg_policy_data *pd)
{
+ struct throtl_grp *tg = pd_to_tg(pd);
/*
* We don't want new groups to escape the limits of its ancestors.
* Update has_rules[] after a new group is brought online.
*/
- tg_update_has_rules(pd_to_tg(pd));
+ tg_update_has_rules(tg);
+ tg->last_dispatch_time[READ] = jiffies;
+ tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1582,6 +1587,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+ if (time_after_eq(jiffies,
+ tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+ time_after_eq(jiffies,
+ tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ return true;
return false;
}

@@ -1660,6 +1671,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

+ if (time_after_eq(now, tg->last_dispatch_time[READ] +
+ td->throtl_slice) &&
+ time_after_eq(now, tg->last_dispatch_time[WRITE] +
+ td->throtl_slice))
+ return false;
/*
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
@@ -1769,6 +1785,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
--
2.9.3

2016-12-15 20:33:53

by Shaohua Li

Subject: [PATCH V5 06/17] blk-throttle: add downgrade logic

When the queue state machine is in the LIMIT_MAX state but a cgroup has been
below its low limit for some time, the queue should be downgraded to the lower
state since that cgroup's low limit isn't being met.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 148 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cfd74cfc..0f65fce 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];

+ unsigned long last_low_overflow_time[2];
+
+ uint64_t last_bytes_disp[2];
+ unsigned int last_io_disp[2];
+
+ unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -159,6 +166,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+ unsigned long low_upgrade_time;
+ unsigned long low_downgrade_time;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+ tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+ tg->last_io_disp[rw]++;

/*
* BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -1511,6 +1523,36 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+ unsigned long rtime = -1, wtime = -1;
+
+ if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+ tg->iops[READ][LIMIT_LOW] != UINT_MAX)
+ rtime = tg->last_low_overflow_time[READ];
+ if (tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+ tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+ wtime = tg->last_low_overflow_time[WRITE];
+ return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *parent_sq;
+ struct throtl_grp *parent = tg;
+ unsigned long ret = __tg_last_low_overflow_time(tg);
+
+ while (true) {
+ parent_sq = parent->service_queue.parent_sq;
+ parent = sq_to_tg(parent_sq);
+ if (!parent)
+ break;
+ if (time_after(__tg_last_low_overflow_time(parent), ret))
+ ret = __tg_last_low_overflow_time(parent);
+ }
+ return ret;
+}
+
static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1555,6 +1597,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_LOW)
return false;

+ if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+ return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1578,6 +1623,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct blkcg_gq *blkg;

td->limit_index = LIMIT_MAX;
+ td->low_upgrade_time = jiffies;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1593,6 +1639,100 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, &td->dispatch_work);
}

+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+ td->limit_index = new;
+ td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ unsigned long now = jiffies;
+
+ /*
+ * If cgroup is below low limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
+ time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ return true;
+ return false;
+}
+
+static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
+{
+ while (true) {
+ if (!throtl_tg_can_downgrade(tg))
+ return false;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ break;
+ }
+ return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+ uint64_t bps;
+ unsigned int iops;
+ unsigned long elapsed_time;
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_MAX ||
+ !tg->td->limit_valid[LIMIT_LOW])
+ return;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ return;
+ if (time_after(tg->last_check_time + throtl_slice, now))
+ return;
+
+ elapsed_time = now - tg->last_check_time;
+ tg->last_check_time = now;
+
+ if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ return;
+
+ if (tg->bps[READ][LIMIT_LOW] != U64_MAX) {
+ bps = tg->last_bytes_disp[READ] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[READ][LIMIT_LOW])
+ tg->last_low_overflow_time[READ] = now;
+ }
+
+ if (tg->bps[WRITE][LIMIT_LOW] != U64_MAX) {
+ bps = tg->last_bytes_disp[WRITE] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[WRITE][LIMIT_LOW])
+ tg->last_low_overflow_time[WRITE] = now;
+ }
+
+ if (tg->iops[READ][LIMIT_LOW] != UINT_MAX) {
+ iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+ if (iops >= tg->iops[READ][LIMIT_LOW])
+ tg->last_low_overflow_time[READ] = now;
+ }
+
+ if (tg->iops[WRITE][LIMIT_LOW] != UINT_MAX) {
+ iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+ if (iops >= tg->iops[WRITE][LIMIT_LOW])
+ tg->last_low_overflow_time[WRITE] = now;
+ }
+
+ /*
+ * If cgroup is below low limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (throtl_hierarchy_can_downgrade(tg))
+ throtl_downgrade_state(tg->td, LIMIT_LOW);
+
+ tg->last_bytes_disp[READ] = 0;
+ tg->last_bytes_disp[WRITE] = 0;
+ tg->last_io_disp[READ] = 0;
+ tg->last_io_disp[WRITE] = 0;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1617,12 +1757,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ if (tg->last_low_overflow_time[rw] == 0)
+ tg->last_low_overflow_time[rw] = jiffies;
+ throtl_downgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
+ tg->last_low_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(tg->td, tg)) {
throtl_upgrade_state(tg->td);
goto again;
@@ -1666,6 +1810,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

+ tg->last_low_overflow_time[rw] = jiffies;
+
bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
@@ -1776,6 +1922,8 @@ int blk_throtl_init(struct request_queue *q)

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
+ td->low_upgrade_time = jiffies;
+ td->low_downgrade_time = jiffies;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-12-15 20:33:51

by Shaohua Li

Subject: [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit

Each queue will have a state machine. Initially the queue is in the LIMIT_LOW
state, which means all cgroups will be throttled according to their low
limit. After all cgroups with a low limit cross the limit, the queue state
gets upgraded to the LIMIT_MAX state.
For the max limit, a cgroup uses the limit configured by the user.
For the low limit, a cgroup uses the minimum of the low limit and the max
limit configured by the user. The last patch already did the conversion.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fcc4199..e55bd36 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,11 +212,21 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)

static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ uint64_t ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return U64_MAX;
return tg->bps[rw][tg->td->limit_index];
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ unsigned int ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return UINT_MAX;
return tg->iops[rw][tg->td->limit_index];
}

--
2.9.3

2016-12-15 20:33:19

by Shaohua Li

Subject: [PATCH V5 17/17] blk-throttle: add latency target support

One hard problem in adding the .low limit is detecting an idle cgroup. If one
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine whether other cgroups can dispatch more IO. We added
the think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have a mechanism to calculate a latency threshold for each IO
size. For every IO dispatched from a cgroup, we compare its latency
against its threshold and record the result. If most IO latencies are below
the threshold (in the code I use 75%), the cgroup can be treated as idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSDs, as we can't
calculate the latency target for hard disks. And it only applies to cgroup
leaf nodes so far.
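
A simplified restatement of the accounting this patch adds (a sketch only,
with illustrative names; the real code is in blk_throtl_bio_endio() and
throtl_tg_is_idle() below):

static int account_bio_latency(unsigned long long bio_lat_ns,
			       unsigned long long bucket_avg_lat_ns,
			       unsigned long long latency_target_ns,
			       unsigned int *bio_cnt,
			       unsigned int *bad_bio_cnt)
{
	(*bio_cnt)++;
	if (bio_lat_ns > bucket_avg_lat_ns + latency_target_ns)
		(*bad_bio_cnt)++;
	/* latency counts as "largely below threshold" while bad bios stay
	 * under one fifth of all bios, per the check below */
	return *bio_cnt && *bad_bio_cnt * 5 < *bio_cnt;
}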

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 41 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1dc707a..915ebf5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -162,6 +162,10 @@ struct throtl_grp {
u64 checked_last_finish_time;
u64 avg_ttime;
u64 idle_ttime_threshold;
+
+ unsigned int bio_cnt; /* total bios */
+ unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+ unsigned long bio_cnt_reset_time;
};

/* We measure latency for request size from <= 4k to >= 1M */
@@ -1688,11 +1692,14 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
* - single idle is too long, longer than a fixed value (in case user
* configure a too big threshold) or 4 times of slice
* - average think time is more than threshold
+ * - IO latency is largely below threshold
*/
u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
time = min_t(u64, MAX_IDLE_TIME, time);
return ktime_get_ns() - tg->last_finish_time > time ||
- tg->avg_ttime > tg->idle_ttime_threshold;
+ tg->avg_ttime > tg->idle_ttime_threshold ||
+ (tg->latency_target && tg->bio_cnt &&
+ tg->bad_bio_cnt * 5 < tg->bio_cnt);
}

static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2170,12 +2177,36 @@ void blk_throtl_bio_endio(struct bio *bio)

start_time = blk_stat_time(&bio->bi_issue_stat);
finish_time = __blk_stat_time(finish_time);
- if (start_time && finish_time > start_time &&
- tg->td->track_bio_latency == 1 &&
- !(bio->bi_issue_stat.stat & SKIP_TRACK)) {
- lat = finish_time - start_time;
+ if (!start_time || finish_time <= start_time)
+ return;
+
+ lat = finish_time - start_time;
+ if (tg->td->track_bio_latency == 1 &&
+ !(bio->bi_issue_stat.stat & SKIP_TRACK))
throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
bio_op(bio), lat);
+
+ if (tg->latency_target) {
+ int bucket;
+ unsigned int threshold;
+
+ bucket = request_bucket_index(
+ blk_stat_size(&bio->bi_issue_stat));
+ threshold = tg->td->avg_buckets[bucket].latency +
+ tg->latency_target;
+ if (lat > threshold)
+ tg->bad_bio_cnt++;
+ /*
+ * Not race free, could get wrong count, which means cgroups
+ * will be throttled
+ */
+ tg->bio_cnt++;
+ }
+
+ if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+ tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+ tg->bio_cnt /= 2;
+ tg->bad_bio_cnt /= 2;
}
}

--
2.9.3

2016-12-15 20:34:31

by Shaohua Li

Subject: [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit

The last patch introduced a way to detect an idle cgroup. We use it to make
upgrade/downgrade decisions. The new algorithm can detect a completely idle
cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0218ff9..62fe72ea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,8 +151,6 @@ struct throtl_grp {

unsigned long last_check_time;

- unsigned long last_dispatch_time[2];
-
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -495,8 +493,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
* Update has_rules[] after a new group is brought online.
*/
tg_update_has_rules(tg);
- tg->last_dispatch_time[READ] = jiffies;
- tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1669,9 +1665,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
return true;

if (time_after_eq(jiffies,
- tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
- time_after_eq(jiffies,
- tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+ throtl_tg_is_idle(tg))
return true;
return false;
}
@@ -1718,6 +1713,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
}

+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_LOW)
+ return;
+
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+ return;
+
+ tg->last_check_time = now;
+
+ if (!time_after_eq(now,
+ __tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+ return;
+
+ if (throtl_can_upgrade(tg->td, NULL))
+ throtl_upgrade_state(tg->td);
+}
+
static void throtl_upgrade_state(struct throtl_data *td)
{
struct cgroup_subsys_state *pos_css;
@@ -1759,18 +1774,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

- if (time_after_eq(now, tg->last_dispatch_time[READ] +
- td->throtl_slice) &&
- time_after_eq(now, tg->last_dispatch_time[WRITE] +
- td->throtl_slice))
- return false;
/*
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
*/
if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_low_overflow_time(tg) +
- td->throtl_slice))
+ td->throtl_slice) &&
+ (!throtl_tg_is_idle(tg) ||
+ !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
}
@@ -1904,10 +1916,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
- tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+ throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
--
2.9.3

2016-12-15 20:36:47

by Shaohua Li

Subject: [PATCH V5 15/17] block: track request size in blk_issue_stat

Currently there is no way to know the request size when the request is
finished. The next patch will need this info, so add it to blk_issue_stat.
With this, we have 49 bits left to track time, which is still a very long
time.
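
A rough sketch of the resulting layout (the shifts mirror the blk-stat.h
changes in this patch; the helper and macro names here are illustrative):

/* 64-bit layout: 3 reserved bits | 12 size bits | 49 time bits */
#define SK_RES_BITS	3
#define SK_SIZE_BITS	12
#define SK_RES_SHIFT	(64 - SK_RES_BITS)
#define SK_SIZE_SHIFT	(SK_RES_SHIFT - SK_SIZE_BITS)
#define SK_TIME_MASK	((1ULL << SK_SIZE_SHIFT) - 1)

static unsigned long long pack_issue_stat(unsigned long long time_ns,
					  unsigned long long size_sectors)
{
	return (time_ns & SK_TIME_MASK) |
	       ((size_sectors & ((1ULL << SK_SIZE_BITS) - 1)) << SK_SIZE_SHIFT);
}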

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 2 +-
block/blk-stat.c | 7 ++++---
block/blk-stat.h | 29 +++++++++++++++++++++++------
block/blk-wbt.h | 10 +++++-----
include/linux/blk_types.h | 2 +-
6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..485c32d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2459,7 +2459,7 @@ void blk_start_request(struct request *req)
blk_dequeue_request(req);

if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
- blk_stat_set_issue_time(&req->issue_stat);
+ blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
req->rq_flags |= RQF_STATS;
wbt_issue(req->q->rq_wb, &req->issue_stat);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4bf850e..891db62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -474,7 +474,7 @@ void blk_mq_start_request(struct request *rq)
rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);

if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
- blk_stat_set_issue_time(&rq->issue_stat);
+ blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
rq->rq_flags |= RQF_STATS;
wbt_issue(q->rq_wb, &rq->issue_stat);
}
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 9b43efb..0469855 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -236,10 +236,11 @@ void blk_stat_clear(struct request_queue *q)
}
}

-void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
{
- stat->time = (stat->time & BLK_STAT_MASK) |
- (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+ stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+ (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+ (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
}

/*
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a2050a0..462197f 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
#define BLK_STAT_NSEC_MASK ~(BLK_STAT_NSEC - 1)

/*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
*/
#define BLK_STAT_RES_BITS 3
-#define BLK_STAT_SHIFT (64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK ~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS 12
+#define BLK_STAT_RES_SHIFT (64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT (BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK \
+ (((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK (~((1ULL << BLK_STAT_RES_SHIFT) - 1))

enum {
BLK_STAT_READ = 0,
@@ -26,7 +33,7 @@ void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
void blk_stat_clear(struct request_queue *);
void blk_stat_init(struct blk_rq_stat *);
bool blk_stat_is_current(struct blk_rq_stat *);
-void blk_stat_set_issue_time(struct blk_issue_stat *);
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size);
bool blk_stat_enable(struct request_queue *);

static inline u64 __blk_stat_time(u64 time)
@@ -36,7 +43,17 @@ static inline u64 __blk_stat_time(u64 time)

static inline u64 blk_stat_time(struct blk_issue_stat *stat)
{
- return __blk_stat_time(stat->time);
+ return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+ return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+ return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
}

#endif
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 65f1de5..7265f1f 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {

static inline void wbt_clear_state(struct blk_issue_stat *stat)
{
- stat->time &= BLK_STAT_TIME_MASK;
+ stat->stat &= ~BLK_STAT_RES_MASK;
}

static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
{
- return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+ return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
}

static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
{
- stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+ stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
}

static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
{
- return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
}

static inline bool wbt_is_read(struct blk_issue_stat *stat)
{
- return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
}

struct rq_wait {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 01019a7..8159a6c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -257,7 +257,7 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
}

struct blk_issue_stat {
- u64 time;
+ u64 stat;
};

#define BLK_RQ_STAT_BATCH 64
--
2.9.3

2016-12-15 20:37:06

by Shaohua Li

Subject: [PATCH V5 10/17] blk-throttle: make bandwidth change smooth

When all cgroups reach their low limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and some cgroups
could even dispatch less IO than their low limit. For example, cg1 has a low
limit of 10MB/s, cg2 a low limit of 80MB/s, and assume the disk's maximum
bandwidth is 120MB/s for the workload. Their bps could look something like:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO later.
Then cg1 dispatches more IO and cg2 has no room to dispatch enough IO. At
T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches less IO
than its low limit of 80MB/s, we downgrade the queue from LIMIT_MAX to
LIMIT_LOW, and all cgroups are throttled to their low limit (T3). cg2
will have bandwidth below its low limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes cgroup bandwidth change smoothly. After the disk upgrades from
LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth up to
their max limit immediately. Their bandwidth limit is increased gradually
to avoid the above situation. So the above example becomes something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way cgroups' bandwidth will be above their low limit the majority of
the time. This still doesn't fully utilize disk bandwidth, but that's
something we pay for sharing.

Note this doesn't completely avoid a cgroup running under its low limit.
The best way to guarantee a cgroup doesn't run under its low limit is to set
a max limit. For example, if we set cg1's max limit to 40MB/s, cg2 will never
run under its low limit.
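
For reference, the ramp implemented in tg_bps_limit()/tg_iops_limit() below
can be read roughly as this sketch (a restatement with illustrative names;
'scale' grows by one every throtl_slice after an upgrade and is halved on a
downgrade attempt):

/* effective limit while in LIMIT_MAX, 'scale' slices after an upgrade */
static unsigned long long ramped_limit(unsigned long long low,
				       unsigned long long max,
				       unsigned int scale)
{
	unsigned long long lim = low + (low >> 1) * scale;

	return lim < max ? lim : max;
}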

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0ba961..6b2f365 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -174,6 +174,8 @@ struct throtl_data

unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
+
+ unsigned int scale;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -228,21 +230,58 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
uint64_t ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return U64_MAX;
- return tg->bps[rw][tg->td->limit_index];
+
+ td = tg->td;
+ ret = tg->bps[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] !=
+ tg->bps[rw][LIMIT_MAX]) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->low_upgrade_time + td->scale * td->throtl_slice)) {
+ unsigned int time = jiffies - td->low_upgrade_time;
+
+ td->scale = time / td->throtl_slice;
+ }
+ increase = (tg->bps[rw][LIMIT_LOW] >> 1) * td->scale;
+ ret = min(tg->bps[rw][LIMIT_MAX],
+ tg->bps[rw][LIMIT_LOW] + increase);
+ }
+ return ret;
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
unsigned int ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return UINT_MAX;
- return tg->iops[rw][tg->td->limit_index];
+
+ td = tg->td;
+ ret = tg->iops[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] !=
+ tg->iops[rw][LIMIT_MAX]) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->low_upgrade_time + td->scale * td->throtl_slice)) {
+ unsigned int time = jiffies - td->low_upgrade_time;
+
+ td->scale = time / td->throtl_slice;
+ }
+
+ increase = (tg->iops[rw][LIMIT_LOW] >> 1) * td->scale;
+ ret = min(tg->iops[rw][LIMIT_MAX],
+ tg->iops[rw][LIMIT_LOW] + (unsigned int)increase);
+ }
+ return ret;
}

/**
@@ -1645,6 +1684,7 @@ static void throtl_upgrade_state(struct throtl_data *td)

td->limit_index = LIMIT_MAX;
td->low_upgrade_time = jiffies;
+ td->scale = 0;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1662,6 +1702,13 @@ static void throtl_upgrade_state(struct throtl_data *td)

static void throtl_downgrade_state(struct throtl_data *td, int new)
{
+ td->scale /= 2;
+
+ if (td->scale) {
+ td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
+ return;
+ }
+
td->limit_index = new;
td->low_downgrade_time = jiffies;
}
--
2.9.3

2016-12-15 20:37:09

by Shaohua Li

Subject: [PATCH V5 08/17] blk-throttle: make throtl_slice tunable

throtl_slice is important for blk-throttling. It's called a slice
internally, but it really is the time window over which blk-throttling
samples data. blk-throttling makes decisions based on these samples. An
example is bandwidth measurement: a cgroup's bandwidth is measured over a
time interval of throtl_slice.

A small throtl_slice means cgroups have smoother throughput but burn
more CPU. It has a 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.

Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.
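
For example, on a fast SSD one might lower the sample window (the value is
in milliseconds; the device name is just an example, the file sits in the
usual queue sysfs directory):

  cat /sys/block/sda/queue/throttle_sample_time    # 100 by default
  echo 50 > /sys/block/sda/queue/throttle_sample_time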

Signed-off-by: Shaohua Li <[email protected]>
---
Documentation/block/queue-sysfs.txt | 6 +++
block/blk-sysfs.c | 10 +++++
block/blk-throttle.c | 77 ++++++++++++++++++++++++++-----------
block/blk.h | 3 ++
4 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 5164215..91f371d 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -192,5 +192,11 @@ scaling back writes. Writing a value of '0' to this file disables the
feature. Writing a value of '-1' to this file resets the value to the
default setting.

+throttle_sample_time (RW)
+-------------------------
+This is the time window that blk-throttle samples data, in millisecond.
+blk-throttle makes decision based on the samplings. Lower time means cgroups
+have more smooth throughput, but higher CPU overhead. This exists only when
+CONFIG_BLK_DEV_THROTTLING is enabled.

Jens Axboe <[email protected]>, February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05..0e3fb2a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -690,6 +690,13 @@ static struct queue_sysfs_entry queue_wb_lat_entry = {
.show = queue_wb_lat_show,
.store = queue_wb_lat_store,
};
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_sample_time_entry = {
+ .attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_sample_time_show,
+ .store = blk_throtl_sample_time_store,
+};
+#endif

static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
@@ -724,6 +731,9 @@ static struct attribute *default_attrs[] = {
&queue_stats_entry.attr,
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+ &throtl_sample_time_entry.attr,
+#endif
NULL,
};

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 41ec72c..cd10c65 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
static int throtl_quantum = 32;

/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)

static struct blkcg_policy blkcg_policy_throtl;

@@ -162,6 +163,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];

+ unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -588,7 +591,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
- unsigned long max_expire = jiffies + 8 * throtl_slice;
+ unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;

if (time_after(expires, max_expire))
expires = max_expire;
@@ -649,7 +652,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;

- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -661,7 +664,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -671,13 +674,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
}

static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
throtl_log(&tg->service_queue,
"[%c] extend slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -717,19 +720,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
* is bad because it does not allow new slice to start.
*/

- throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+ throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);

time_elapsed = jiffies - tg->slice_start[rw];

- nr_slices = time_elapsed / throtl_slice;
+ nr_slices = time_elapsed / tg->td->throtl_slice;

if (!nr_slices)
return;
- tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+ io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+ HZ;

if (!bytes_trim && !io_trim)
return;
@@ -744,7 +748,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
else
tg->io_disp[rw] = 0;

- tg->slice_start[rw] += nr_slices * throtl_slice;
+ tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;

throtl_log(&tg->service_queue,
"[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -764,9 +768,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

/*
* jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -813,9 +817,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
@@ -881,8 +885,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
throtl_start_new_slice(tg, rw);
else {
- if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
- throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+ if (time_before(tg->slice_end[rw],
+ jiffies + tg->td->throtl_slice))
+ throtl_extend_slice(tg, rw,
+ jiffies + tg->td->throtl_slice);
}

if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1601,7 +1607,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_LOW)
return false;

- if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+ if (time_before(jiffies, td->low_downgrade_time + td->throtl_slice))
return false;

rcu_read_lock();
@@ -1658,8 +1664,9 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
*/
- if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
- time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
+ time_after_eq(now, tg_last_low_overflow_time(tg) +
+ td->throtl_slice))
return true;
return false;
}
@@ -1689,13 +1696,14 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
return;
if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
return;
- if (time_after(tg->last_check_time + throtl_slice, now))
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
return;

elapsed_time = now - tg->last_check_time;
tg->last_check_time = now;

- if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ if (time_before(now, tg_last_low_overflow_time(tg) +
+ tg->td->throtl_slice))
return;

if (tg->bps[READ][LIMIT_LOW] != U64_MAX) {
@@ -1923,6 +1931,7 @@ int blk_throtl_init(struct request_queue *q)

q->td = td;
td->queue = q;
+ td->throtl_slice = DFL_THROTL_SLICE;

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
@@ -1943,6 +1952,30 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
}

+ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
+{
+ if (!q->td)
+ return -EINVAL;
+ return sprintf(page, "%u\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+ unsigned long t;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ t = msecs_to_jiffies(v);
+ if (t == 0 || t > MAX_THROTL_SLICE)
+ return -EINVAL;
+ q->td->throtl_slice = t;
+ return count;
+}
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 041185e..e83e757 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -290,6 +290,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
extern void blk_throtl_drain(struct request_queue *q);
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+ const char *page, size_t count);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
--
2.9.3

2016-12-15 20:37:13

by Shaohua Li

Subject: [PATCH V5 03/17] blk-throttle: add .low interface

Add a low limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 134 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 106 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a75bfa2..fcc4199 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

enum {
+ LIMIT_LOW,
LIMIT_MAX,
LIMIT_CNT,
};
@@ -124,11 +125,15 @@ struct throtl_grp {
/* are there any throtl rules between this group and td? */
bool has_rules[2];

- /* bytes per second rate limits */
+ /* internally used bytes per second rate limits */
uint64_t bps[2][LIMIT_CNT];
+ /* user configured bps limits */
+ uint64_t bps_conf[2][LIMIT_CNT];

- /* IOPS limits */
+ /* internally used IOPS limits */
unsigned int iops[2][LIMIT_CNT];
+ /* user configured IOPS limits */
+ unsigned int iops_conf[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -355,6 +360,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
for (index = 0; index < LIMIT_CNT; index++) {
tg->bps[rw][index] = U64_MAX;
tg->iops[rw][index] = UINT_MAX;
+ tg->bps_conf[rw][index] = U64_MAX;
+ tg->iops_conf[rw][index] = UINT_MAX;
}
}

@@ -414,6 +421,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
}

+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+ bool low_valid = false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+ tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+ tg->iops[READ][LIMIT_LOW] != UINT_MAX ||
+ tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+ low_valid = true;
+ }
+ rcu_read_unlock();
+
+ if (low_valid)
+ td->limit_valid[LIMIT_LOW] = true;
+ else
+ td->limit_valid[LIMIT_LOW] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+ struct throtl_grp *tg = pd_to_tg(pd);
+
+ tg->bps[READ][LIMIT_LOW] = U64_MAX;
+ tg->bps[WRITE][LIMIT_LOW] = U64_MAX;
+ tg->iops[READ][LIMIT_LOW] = UINT_MAX;
+ tg->iops[WRITE][LIMIT_LOW] = UINT_MAX;
+
+ blk_throtl_update_valid_limit(tg->td);
+
+ if (tg->td->limit_index == LIMIT_LOW &&
+ !tg->td->limit_valid[LIMIT_LOW])
+ tg->td->limit_index = LIMIT_MAX;
+}
+
static void throtl_pd_free(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1284,7 +1331,7 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
};

-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1294,38 +1341,38 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
if (!dname)
return 0;

- if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
- tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
- tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
- tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+ if (tg->bps_conf[READ][off] == U64_MAX &&
+ tg->bps_conf[WRITE][off] == U64_MAX &&
+ tg->iops_conf[READ][off] == UINT_MAX &&
+ tg->iops_conf[WRITE][off] == UINT_MAX)
return 0;

- if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+ if (tg->bps_conf[READ][off] != U64_MAX)
snprintf(bufs[0], sizeof(bufs[0]), "%llu",
- tg->bps[READ][LIMIT_MAX]);
- if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+ tg->bps_conf[READ][off]);
+ if (tg->bps_conf[WRITE][off] != U64_MAX)
snprintf(bufs[1], sizeof(bufs[1]), "%llu",
- tg->bps[WRITE][LIMIT_MAX]);
- if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+ tg->bps_conf[WRITE][off]);
+ if (tg->iops_conf[READ][off] != UINT_MAX)
snprintf(bufs[2], sizeof(bufs[2]), "%u",
- tg->iops[READ][LIMIT_MAX]);
- if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+ tg->iops_conf[READ][off]);
+ if (tg->iops_conf[WRITE][off] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
- tg->iops[WRITE][LIMIT_MAX]);
+ tg->iops_conf[WRITE][off]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
return 0;
}

-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
{
- blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
&blkcg_policy_throtl, seq_cft(sf)->private, false);
return 0;
}

-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1333,6 +1380,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
int ret;
+ int index = of_cft(of)->private;

ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
if (ret)
@@ -1340,10 +1388,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ][LIMIT_MAX];
- v[1] = tg->bps[WRITE][LIMIT_MAX];
- v[2] = tg->iops[READ][LIMIT_MAX];
- v[3] = tg->iops[WRITE][LIMIT_MAX];
+ v[0] = tg->bps_conf[READ][index];
+ v[1] = tg->bps_conf[WRITE][index];
+ v[2] = tg->iops_conf[READ][index];
+ v[3] = tg->iops_conf[WRITE][index];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1380,11 +1428,31 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ][LIMIT_MAX] = v[0];
- tg->bps[WRITE][LIMIT_MAX] = v[1];
- tg->iops[READ][LIMIT_MAX] = v[2];
- tg->iops[WRITE][LIMIT_MAX] = v[3];
+ tg->bps_conf[READ][index] = v[0];
+ tg->bps_conf[WRITE][index] = v[1];
+ tg->iops_conf[READ][index] = v[2];
+ tg->iops_conf[WRITE][index] = v[3];

+ if (index == LIMIT_MAX) {
+ tg->bps[READ][index] = v[0];
+ tg->bps[WRITE][index] = v[1];
+ tg->iops[READ][index] = v[2];
+ tg->iops[WRITE][index] = v[3];
+ }
+ tg->bps[READ][LIMIT_LOW] = min(tg->bps_conf[READ][LIMIT_LOW],
+ tg->bps_conf[READ][LIMIT_MAX]);
+ tg->bps[WRITE][LIMIT_LOW] = min(tg->bps_conf[WRITE][LIMIT_LOW],
+ tg->bps_conf[WRITE][LIMIT_MAX]);
+ tg->iops[READ][LIMIT_LOW] = min(tg->iops_conf[READ][LIMIT_LOW],
+ tg->iops_conf[READ][LIMIT_MAX]);
+ tg->iops[WRITE][LIMIT_LOW] = min(tg->iops_conf[WRITE][LIMIT_LOW],
+ tg->iops_conf[WRITE][LIMIT_MAX]);
+
+ if (index == LIMIT_LOW) {
+ blk_throtl_update_valid_limit(tg->td);
+ if (tg->td->limit_valid[LIMIT_LOW])
+ tg->td->limit_index = LIMIT_LOW;
+ }
tg_conf_updated(tg);
ret = 0;
out_finish:
@@ -1394,10 +1462,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

static struct cftype throtl_files[] = {
{
+ .name = "low",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_LOW,
+ },
+ {
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
- .seq_show = tg_print_max,
- .write = tg_set_max,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_MAX,
},
{ } /* terminate */
};
@@ -1416,6 +1492,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_alloc_fn = throtl_pd_alloc,
.pd_init_fn = throtl_pd_init,
.pd_online_fn = throtl_pd_online,
+ .pd_offline_fn = throtl_pd_offline,
.pd_free_fn = throtl_pd_free,
};

@@ -1595,6 +1672,7 @@ int blk_throtl_init(struct request_queue *q)
td->queue = q;

td->limit_valid[LIMIT_MAX] = true;
+ td->limit_index = LIMIT_MAX;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-12-15 20:37:11

by Shaohua Li

Subject: [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency

Here we introduce a per-cgroup latency target. The target determines how
much additional latency a cgroup can tolerate. We use the target latency
to calculate a threshold and use the threshold to schedule IO for
cgroups. If a cgroup's bandwidth is below its low limit but its average
latency is below the threshold, other cgroups can safely dispatch more
IO even if their bandwidth is higher than their low limits. On the other
hand, if the first cgroup's latency is higher than the threshold, other
cgroups are throttled to their low limits. So the target latency
determines how efficiently we utilize free disk resources without
sacrificing the workload's IO latency.

For example, assume the average 4k IO latency is 50us when the disk
isn't congested. A cgroup sets the target latency to 30us. Then the
cgroup can accept 50 + 30 = 80us IO latency. If the cgroup's average IO
latency is 90us and its bandwidth is below its low limit, other cgroups
are throttled to their low limits. If the cgroup's average IO latency is
60us, other cgroups are allowed to dispatch more IO. When other cgroups
dispatch more IO, the first cgroup's IO latency will increase. If it
increases to 81us, we then throttle the other cgroups again.
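
The decision above boils down to a simple comparison. The following is a
minimal standalone sketch of it, not kernel code; the function and
parameter names (others_may_exceed_low, base_latency_us, etc.) are
illustrative assumptions for this example only:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: the latency-target decision from the example above
 * (50us base latency, 30us target -> 80us threshold).
 */
static bool others_may_exceed_low(uint64_t base_latency_us, uint64_t target_us,
                                  uint64_t avg_latency_us, bool below_low_limit)
{
        uint64_t threshold_us = base_latency_us + target_us; /* 50 + 30 = 80 */

        if (below_low_limit && avg_latency_us > threshold_us)
                return false;  /* e.g. 90us: throttle others to their low limits */
        return true;           /* e.g. 60us: others may dispatch more IO */
}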

Users configure the interface like this:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

The latency value is in microseconds.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 62fe72ea..3431b1d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,6 +151,7 @@ struct throtl_grp {

unsigned long last_check_time;

+ u64 latency_target;
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -438,6 +439,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}
tg->idle_ttime_threshold = U64_MAX;

+ /*
+ * target latency default 0, eg, latency threshold is 0, which means
+ * cgroup's latency is always higher than threshold
+ */
+
return &tg->pd;
}

@@ -1426,6 +1432,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
char idle_time[26] = "";
+ char latency_time[26] = "";

if (!dname)
return 0;
@@ -1434,8 +1441,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
tg->bps_conf[WRITE][off] == U64_MAX &&
tg->iops_conf[READ][off] == UINT_MAX &&
tg->iops_conf[WRITE][off] == UINT_MAX &&
- (off != LIMIT_LOW || tg->idle_ttime_threshold ==
- tg->td->dft_idle_ttime_threshold))
+ (off != LIMIT_LOW ||
+ (tg->idle_ttime_threshold == tg->td->dft_idle_ttime_threshold &&
+ tg->latency_target == 0)))
return 0;

if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1456,10 +1464,18 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
else
snprintf(idle_time, sizeof(idle_time), " idle=%llu",
div_u64(tg->idle_ttime_threshold, 1000));
+
+ if (tg->latency_target == U64_MAX)
+ strcpy(latency_time, " latency=max");
+ else
+ snprintf(latency_time, sizeof(latency_time),
+ " latency=%llu",
+ div_u64(tg->latency_target, 1000));
}

- seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
- dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+ seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+ dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+ latency_time);
return 0;
}

@@ -1478,6 +1494,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
u64 idle_time;
+ u64 latency_time;
int ret;
int index = of_cft(of)->private;

@@ -1493,6 +1510,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = tg->iops_conf[WRITE][index];

idle_time = tg->idle_ttime_threshold;
+ latency_time = tg->latency_target;
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
@@ -1526,6 +1544,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = min_t(u64, val, UINT_MAX);
else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
idle_time = val;
+ else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+ latency_time = val;
else
goto out_finish;
}
@@ -1556,6 +1576,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
tg->td->limit_index = LIMIT_LOW;
tg->idle_ttime_threshold = (idle_time == U64_MAX) ?
U64_MAX : idle_time * 1000;
+ tg->latency_target = (latency_time == U64_MAX) ?
+ U64_MAX : latency_time * 1000;
}
tg_conf_updated(tg);
ret = 0;
--
2.9.3

2016-12-15 20:37:03

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency

The user configures a latency target, but the latency threshold for each
request size isn't fixed. For an SSD, the IO latency highly depends on
the request size. To calculate the latency threshold, we sample some
data, e.g., the average latency for request sizes 4k, 8k, 16k, 32k ..
1M. The latency threshold of each request size is the sampled latency
(I'll call it the base latency) plus the latency target. For example, if
the base latency for 4k requests is 80us and the user configures a
latency target of 60us, the 4k latency threshold will be 80 + 60 =
140us.

To sample data, we calculate the base-2 order of the rounded-up IO
sector count. If the IO size is bigger than 1M, it is accounted as 1M.
Since the calculation rounds up, the base latency will be slightly
smaller than the actual value. Also, if no IO is dispatched for a
specific IO size, we use the base latency of the next smaller IO size
for it.
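
The bucketing is easiest to see with concrete numbers. Below is a
simplified standalone sketch of the mapping done by the
request_bucket_index() helper added in the diff; the open-coded
order_base_2() loop and the name example_bucket_index are illustrative
assumptions, not the real implementation:

/*
 * Illustrative only: map a request size (in 512-byte sectors) to one of
 * the 9 latency buckets.
 */
static int example_bucket_index(unsigned long sectors)
{
        int order = 0;

        /* order_base_2(): smallest order with (1UL << order) >= sectors */
        while ((1UL << order) < sectors)
                order++;

        if (order < 3)          /* <= 4k (8 sectors) lands in bucket 0 */
                order = 3;
        if (order > 11)         /* >= 1M (2048 sectors) lands in bucket 8 */
                order = 11;
        return order - 3;
}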

But we shouldn't sample data all the time. The base latency is supposed
to be the latency when the disk isn't congested, because we use the
latency threshold to schedule IO between cgroups. If the disk is
congested, the latency is higher, and using it for scheduling is
meaningless. Hence we only do the sampling when block throttling is in
the LOW limit, on the assumption that the disk isn't congested in that
state. If the assumption doesn't hold, e.g., the low limit is too high,
the calculated latency threshold will be higher than it should be.

Hard disks are completely different: latency depends on spindle seek
rather than request size. Currently this feature is SSD only, though we
could probably use a fixed threshold such as 4ms for hard disks.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-stat.c | 4 ++
block/blk-throttle.c | 159 +++++++++++++++++++++++++++++++++++++++++++++-
block/blk.h | 2 +
include/linux/blk_types.h | 9 +--
4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 0469855..15c1621 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -7,6 +7,7 @@
#include <linux/blk-mq.h>

#include "blk-stat.h"
+#include "blk.h"
#include "blk-mq.h"

static void blk_stat_flush_batch(struct blk_rq_stat *stat)
@@ -204,6 +205,9 @@ void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
__blk_stat_init(stat, now);

value = now - blk_stat_time(&rq->issue_stat);
+
+ blk_throtl_stat_add(rq, value);
+
if (value > stat->max)
stat->max = value;
if (value < stat->min)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3431b1d..1dc707a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */

+#define SKIP_TRACK (((u64)1) << BLK_STAT_RES_SHIFT)
+
static struct blkcg_policy blkcg_policy_throtl;

/* A workqueue to queue throttle related work */
@@ -162,6 +164,19 @@ struct throtl_grp {
u64 idle_ttime_threshold;
};

+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+ u64 total_latency;
+ int samples;
+};
+
+struct avg_latency_bucket {
+ u64 latency;
+ bool valid;
+};
+
struct throtl_data
{
/* service tree for active throtl groups */
@@ -185,6 +200,13 @@ struct throtl_data
unsigned long low_downgrade_time;

unsigned int scale;
+
+ struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+ struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+ struct latency_bucket __percpu *latency_buckets;
+ unsigned long last_calculate_time;
+
+ unsigned int track_bio_latency;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -293,6 +315,13 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
return ret;
}

+static int request_bucket_index(sector_t sectors)
+{
+ int order = order_base_2(sectors);
+
+ return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -1896,12 +1925,74 @@ static void blk_throtl_update_ttime(struct throtl_grp *tg)
tg->checked_last_finish_time = last_finish_time;
}

+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+ struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+ int i, cpu;
+ u64 last_latency = 0;
+ u64 latency;
+
+ if (!blk_queue_nonrot(td->queue))
+ return;
+ if (time_before(jiffies, td->last_calculate_time + HZ))
+ return;
+ td->last_calculate_time = jiffies;
+
+ memset(avg_latency, 0, sizeof(avg_latency));
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+ for_each_possible_cpu(cpu) {
+ struct latency_bucket *bucket;
+
+ /* this isn't race free, but ok in practice */
+ bucket = per_cpu_ptr(td->latency_buckets, cpu);
+ tmp->total_latency += bucket[i].total_latency;
+ tmp->samples += bucket[i].samples;
+ bucket[i].total_latency = 0;
+ bucket[i].samples = 0;
+ }
+
+ if (tmp->samples >= 32) {
+ int samples = tmp->samples;
+
+ latency = tmp->total_latency;
+
+ tmp->total_latency = 0;
+ tmp->samples = 0;
+ do_div(latency, samples);
+ if (latency == 0)
+ continue;
+ avg_latency[i].latency = latency;
+ }
+ }
+
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ if (!avg_latency[i].latency) {
+ if (td->avg_buckets[i].latency < last_latency)
+ td->avg_buckets[i].latency = last_latency;
+ continue;
+ }
+
+ if (!td->avg_buckets[i].valid)
+ latency = avg_latency[i].latency;
+ else
+ latency = (td->avg_buckets[i].latency * 7 +
+ avg_latency[i].latency) >> 3;
+
+ td->avg_buckets[i].latency = max(latency, last_latency);
+ td->avg_buckets[i].valid = true;
+ last_latency = td->avg_buckets[i].latency;
+ }
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
struct throtl_qnode *qn = NULL;
struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
struct throtl_service_queue *sq;
+ struct throtl_data *td;
bool rw = bio_data_dir(bio);
bool throttled = false;
int ret;
@@ -1919,12 +2010,21 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->idle_ttime_threshold;
}

+ td = tg->td;
+ if (td->track_bio_latency == UINT_MAX) {
+ td->track_bio_latency = !q->mq_ops && !q->request_fn;
+ if (!td->track_bio_latency)
+ blk_stat_enable(q);
+ }
+
/* see throtl_charge_bio() */
if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
goto out;

spin_lock_irq(q->queue_lock);

+ throtl_update_latency_buckets(tg->td);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

@@ -1932,6 +2032,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (ret == 0 || ret == -EBUSY)
bio->bi_cg_private = tg;

+ blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
+
blk_throtl_update_ttime(tg);

sq = &tg->service_queue;
@@ -2019,19 +2121,62 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
*/
if (!throttled)
bio_clear_flag(bio, BIO_THROTTLED);
+ else
+ bio->bi_issue_stat.stat |= SKIP_TRACK;
return throttled;
}

+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+ int op, u64 time)
+{
+ struct latency_bucket *latency;
+ int index;
+
+ if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+ !blk_queue_nonrot(td->queue))
+ return;
+
+ index = request_bucket_index(size);
+
+ latency = get_cpu_ptr(td->latency_buckets);
+ latency[index].total_latency += time;
+ latency[index].samples++;
+ put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_stat_add(struct request *rq, u64 time)
+{
+ struct request_queue *q = rq->q;
+ struct throtl_data *td = q->td;
+
+ throtl_track_latency(td, blk_stat_size(&rq->issue_stat),
+ req_op(rq), time);
+}
+
void blk_throtl_bio_endio(struct bio *bio)
{
struct throtl_grp *tg;
+ u64 finish_time;
+ u64 start_time;
+ u64 lat;

tg = bio->bi_cg_private;
if (!tg)
return;
bio->bi_cg_private = NULL;

- tg->last_finish_time = ktime_get_ns();
+ finish_time = ktime_get_ns();
+ tg->last_finish_time = finish_time;
+
+ start_time = blk_stat_time(&bio->bi_issue_stat);
+ finish_time = __blk_stat_time(finish_time);
+ if (start_time && finish_time > start_time &&
+ tg->td->track_bio_latency == 1 &&
+ !(bio->bi_issue_stat.stat & SKIP_TRACK)) {
+ lat = finish_time - start_time;
+ throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
+ bio_op(bio), lat);
+ }
}

/*
@@ -2106,6 +2251,12 @@ int blk_throtl_init(struct request_queue *q)
td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
if (!td)
return -ENOMEM;
+ td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+ LATENCY_BUCKET_SIZE, __alignof__(u64));
+ if (!td->latency_buckets) {
+ kfree(td);
+ return -ENOMEM;
+ }

INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
@@ -2119,10 +2270,13 @@ int blk_throtl_init(struct request_queue *q)
td->low_upgrade_time = jiffies;
td->low_downgrade_time = jiffies;

+ td->track_bio_latency = UINT_MAX;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
- if (ret)
+ if (ret) {
+ free_percpu(td->latency_buckets);
kfree(td);
+ }
return ret;
}

@@ -2131,6 +2285,7 @@ void blk_throtl_exit(struct request_queue *q)
BUG_ON(!q->td);
throtl_shutdown_wq(q);
blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+ free_percpu(q->td->latency_buckets);
kfree(q->td);
}

diff --git a/block/blk.h b/block/blk.h
index 921d04f..dfb44e6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -294,11 +294,13 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_stat_add(struct request *rq, u64 time);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8159a6c..d88a3c0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,6 +17,10 @@ struct io_context;
struct cgroup_subsys_state;
typedef void (bio_end_io_t) (struct bio *);

+struct blk_issue_stat {
+ u64 stat;
+};
+
/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
@@ -59,6 +63,7 @@ struct bio {
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
void *bi_cg_private;
+ struct blk_issue_stat bi_issue_stat;
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
@@ -256,10 +261,6 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
}

-struct blk_issue_stat {
- u64 stat;
-};
-
#define BLK_RQ_STAT_BATCH 64

struct blk_rq_stat {
--
2.9.3

2016-12-15 20:37:01

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 02/17] blk-throttle: prepare support multiple limits

We are going to support low/max limits; each cgroup will have two limits
after that. This patch prepares for the multiple-limits change.
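
As a simplified picture of the change (illustrative only, loosely
mirroring the bps/iops arrays and the tg_bps_limit()/tg_iops_limit()
helpers added in the diff; struct and function names here are
assumptions for the sketch):

#include <stdint.h>

/*
 * Illustrative only: each limit becomes a small per-state array and the
 * queue's current limit_index selects which entry is enforced.
 */
enum { LIMIT_MAX, LIMIT_CNT };  /* LIMIT_LOW is added by a later patch */

struct example_grp {
        uint64_t bps[2][LIMIT_CNT];       /* [READ/WRITE][state] */
        unsigned int iops[2][LIMIT_CNT];
};

static uint64_t example_bps_limit(struct example_grp *tg, int rw,
                                  int limit_index)
{
        return tg->bps[rw][limit_index];
}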

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 114 +++++++++++++++++++++++++++++++++------------------
1 file changed, 73 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e45bf50..a75bfa2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {

#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

+enum {
+ LIMIT_MAX,
+ LIMIT_CNT,
+};
+
struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];

/* bytes per second rate limits */
- uint64_t bps[2];
+ uint64_t bps[2][LIMIT_CNT];

/* IOPS limits */
- unsigned int iops[2];
+ unsigned int iops[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data

/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+ unsigned int limit_index;
+ bool limit_valid[LIMIT_CNT];
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}

+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->iops[rw][tg->td->limit_index];
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
{
struct throtl_grp *tg;
- int rw;
+ int rw, index;

tg = kzalloc_node(sizeof(*tg), gfp, node);
if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = U64_MAX;
- tg->bps[WRITE] = U64_MAX;
- tg->iops[READ] = UINT_MAX;
- tg->iops[WRITE] = UINT_MAX;
+ for (rw = READ; rw <= WRITE; rw++) {
+ for (index = 0; index < LIMIT_CNT; index++) {
+ tg->bps[rw][index] = U64_MAX;
+ tg->iops[rw][index] = UINT_MAX;
+ }
+ }

return &tg->pd;
}
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
static void tg_update_has_rules(struct throtl_grp *tg)
{
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+ struct throtl_data *td = tg->td;
int rw;

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
+ (td->limit_valid[td->limit_index] &&
+ (tg_bps_limit(tg, rw) != U64_MAX ||
+ tg_iops_limit(tg, rw) != UINT_MAX));
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)

if (!nr_slices)
return;
- tmp = tg->bps[rw] * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+ io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;

if (!bytes_trim && !io_trim)
return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
* have been trimmed.
*/

- tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+ tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);

if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}

/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+ jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;

if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);

- tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+ tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;

@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+ jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));

if (!jiffy_wait)
jiffy_wait = 1;
@@ -771,7 +793,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
+ if (tg_bps_limit(tg, rw) == U64_MAX &&
+ tg_iops_limit(tg, rw) == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1148,8 +1171,8 @@ static void tg_conf_updated(struct throtl_grp *tg)

throtl_log(&tg->service_queue,
"limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
- tg->bps[READ], tg->bps[WRITE],
- tg->iops[READ], tg->iops[WRITE]);
+ tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+ tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));

/*
* Update has_rules[] flags for the updated tg's subtree. A tg is
@@ -1226,25 +1249,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
static struct cftype throtl_legacy_files[] = {
{
.name = "throttle.read_bps_device",
- .private = offsetof(struct throtl_grp, bps[READ]),
+ .private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.write_bps_device",
- .private = offsetof(struct throtl_grp, bps[WRITE]),
+ .private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.read_iops_device",
- .private = offsetof(struct throtl_grp, iops[READ]),
+ .private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
{
.name = "throttle.write_iops_device",
- .private = offsetof(struct throtl_grp, iops[WRITE]),
+ .private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
@@ -1270,18 +1293,25 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
- tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
+
+ if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
+ tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
+ tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
+ tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
return 0;

- if (tg->bps[READ] != U64_MAX)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != U64_MAX)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != UINT_MAX)
- snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != UINT_MAX)
- snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+ if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+ tg->bps[READ][LIMIT_MAX]);
+ if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+ tg->bps[WRITE][LIMIT_MAX]);
+ if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u",
+ tg->iops[READ][LIMIT_MAX]);
+ if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u",
+ tg->iops[WRITE][LIMIT_MAX]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1340,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ];
- v[1] = tg->bps[WRITE];
- v[2] = tg->iops[READ];
- v[3] = tg->iops[WRITE];
+ v[0] = tg->bps[READ][LIMIT_MAX];
+ v[1] = tg->bps[WRITE][LIMIT_MAX];
+ v[2] = tg->iops[READ][LIMIT_MAX];
+ v[3] = tg->iops[WRITE][LIMIT_MAX];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1350,10 +1380,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ] = v[0];
- tg->bps[WRITE] = v[1];
- tg->iops[READ] = v[2];
- tg->iops[WRITE] = v[3];
+ tg->bps[READ][LIMIT_MAX] = v[0];
+ tg->bps[WRITE][LIMIT_MAX] = v[1];
+ tg->iops[READ][LIMIT_MAX] = v[2];
+ tg->iops[WRITE][LIMIT_MAX] = v[3];

tg_conf_updated(tg);
ret = 0;
@@ -1451,8 +1481,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* out-of-limit, queue to @tg */
throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
rw == READ ? 'R' : 'W',
- tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
- tg->io_disp[rw], tg->iops[rw],
+ tg->bytes_disp[rw], bio->bi_iter.bi_size,
+ tg_bps_limit(tg, rw),
+ tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

bio_associate_current(bio);
@@ -1563,6 +1594,7 @@ int blk_throtl_init(struct request_queue *q)
q->td = td;
td->queue = q;

+ td->limit_valid[LIMIT_MAX] = true;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-12-15 20:38:43

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big

A cgroup could be throttled to a limit, but when all cgroups cross their
low limit, the queue enters a higher state and the cgroup should then be
throttled to a higher limit. It's possible the cgroup is sleeping
because of throttling while the other cgroups no longer dispatch IO. In
this case, nobody can trigger the current downgrade/upgrade logic. To
fix this issue, we could either set up a timer to wake up the cgroup
when other cgroups are idle, or make sure this cgroup doesn't sleep too
long. Setting up a timer means we would have to change the timer very
frequently, so this patch chooses the latter. Capping the cgroup's sleep
time doesn't change its bps/iops, but can make it wake up more
frequently, which isn't a big issue because throtl_slice * 8 is already
quite big (800ms with the default 100ms throtl_slice).

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0f65fce..41ec72c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -588,6 +588,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
+ unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+ if (time_after(expires, max_expire))
+ expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
expires - jiffies, jiffies);
--
2.9.3

2016-12-15 20:38:52

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state

When the queue is in the LIMIT_LOW state and all cgroups with a low
limit cross their bps/iops limitation, we upgrade the queue's state to
LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If the children have a
lower low limit than the parent, the parent's low limit is meaningless:
once the children's bps/iops cross their low limit, we can upgrade the
queue state. The other case is the children having a higher low limit
than the parent. Then the children's low limit is meaningless: as long
as the parent's bps/iops cross its low limit, we can upgrade the queue
state.
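
Conceptually the check walks from a group up through its ancestors and
succeeds if any level has reached its low limit. The sketch below is a
simplified standalone illustration of that walk; the struct and field
names (example_grp, reached_low_limit) are assumptions, and the real
check is throtl_hierarchy_can_upgrade() in the diff below:

#include <stdbool.h>

/*
 * Illustrative only: a group may trigger an upgrade if it, or any of
 * its ancestors, has IO queued beyond its low limit.
 */
struct example_grp {
        struct example_grp *parent;
        bool reached_low_limit;
};

static bool example_hierarchy_can_upgrade(struct example_grp *tg)
{
        for (; tg; tg = tg->parent)
                if (tg->reached_low_limit)
                        return true;
        return false;
}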

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 97 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e55bd36..cfd74cfc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -455,6 +455,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
td->limit_valid[LIMIT_LOW] = false;
}

+static void throtl_upgrade_state(struct throtl_data *td);
static void throtl_pd_offline(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -466,9 +467,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)

blk_throtl_update_valid_limit(tg->td);

- if (tg->td->limit_index == LIMIT_LOW &&
- !tg->td->limit_valid[LIMIT_LOW])
- tg->td->limit_index = LIMIT_MAX;
+ if (!tg->td->limit_valid[tg->td->limit_index])
+ throtl_upgrade_state(tg->td);
}

static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
return nr_disp;
}

+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg);
/**
* throtl_pending_timer_fn - timer function for service_queue->pending_timer
* @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;

spin_lock_irq(q->queue_lock);
+ if (throtl_can_upgrade(td, NULL))
+ throtl_upgrade_state(td);
+
again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1506,6 +1511,88 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+ bool read_limit, write_limit;
+
+ /*
+ * if cgroup reaches low/max limit (max >= low), it's ok to next
+ * limit
+ */
+ read_limit = tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+ tg->iops[READ][LIMIT_LOW] != UINT_MAX;
+ write_limit = tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+ tg->iops[WRITE][LIMIT_LOW] != UINT_MAX;
+ if (read_limit && sq->nr_queued[READ] &&
+ (!write_limit || sq->nr_queued[WRITE]))
+ return true;
+ if (write_limit && sq->nr_queued[WRITE] &&
+ (!read_limit || sq->nr_queued[READ]))
+ return true;
+ return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+ while (true) {
+ if (throtl_tg_can_upgrade(tg))
+ return true;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ return false;
+ }
+ return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ if (td->limit_index != LIMIT_LOW)
+ return false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg == this_tg)
+ continue;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ continue;
+ if (!throtl_hierarchy_can_upgrade(tg)) {
+ rcu_read_unlock();
+ return false;
+ }
+ }
+ rcu_read_unlock();
+ return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ td->limit_index = LIMIT_MAX;
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ tg->disptime = jiffies - 1;
+ throtl_select_dispatch(sq);
+ throtl_schedule_next_dispatch(sq, false);
+ }
+ rcu_read_unlock();
+ throtl_select_dispatch(&td->service_queue);
+ throtl_schedule_next_dispatch(&td->service_queue, false);
+ queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1528,14 +1615,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

sq = &tg->service_queue;

+again:
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
- if (!tg_may_dispatch(tg, bio, NULL))
+ if (!tg_may_dispatch(tg, bio, NULL)) {
+ if (throtl_can_upgrade(tg->td, tg)) {
+ throtl_upgrade_state(tg->td);
+ goto again;
+ }
break;
+ }

/* within limits, let's charge and dispatch directly */
throtl_charge_bio(tg, bio);
--
2.9.3

2016-12-15 20:38:50

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 11/17] blk-throttle: add a simple idle detection

A cgroup may be assigned a low limit but never dispatch enough IO to
cross it. In that case, the queue state machine will remain in the
LIMIT_LOW state and all other cgroups will be throttled according to
their low limits. This is unfair to the other cgroups. We should treat
such a cgroup as idle and upgrade the state machine to the LIMIT_MAX
state.

We also have downgrade logic. If the state machine upgrades because a
cgroup is idle (really idle), the state machine would otherwise
downgrade again soon, because the cgroup stays below its low limit. This
isn't what we want. A more complicated case is a cgroup that isn't idle
while the queue is in LIMIT_LOW, but once the queue gets upgraded to
LIMIT_MAX, other cgroups can dispatch more IO and this cgroup can't
dispatch enough, so it falls below its low limit and looks idle (fake
idle). In this case, the queue should downgrade soon. The key to
deciding whether to downgrade is therefore detecting whether a cgroup is
truly idle.

Unfortunately it's very hard to determine if a cgroup is really idle.
This patch uses the 'think time check' idea from CFQ for that purpose.
Please note, the idea doesn't work for all workloads. For example, a
workload with IO depth 8 keeps the disk at 100% utilization, hence its
think time is 0, i.e., not idle. But the same workload could achieve
higher bandwidth with IO depth 16; compared to IO depth 16, the IO depth
8 workload is idle. We only use the idea to roughly determine whether a
cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by
default 50us for SSD and 1ms for HD). The idea is that a think time
above the threshold starts to harm performance. HD is much slower, so a
longer think time is acceptable there.
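
The think time is tracked as an exponentially weighted moving average of
the gap between an IO's completion and the next submission. Below is a
simplified standalone sketch of that bookkeeping; the names
(example_grp, example_update_ttime, example_is_idle) are illustrative,
and the real code is blk_throtl_update_ttime() and throtl_tg_is_idle()
in the diff:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: EWMA of the idle gap between IO completion and the
 * next submission, compared against the per-device threshold
 * (50us SSD / 1ms HD by default).
 */
struct example_grp {
        uint64_t last_finish_time;      /* ns, recorded at bio completion */
        uint64_t avg_ttime;             /* ns, smoothed think time */
        uint64_t idle_threshold;        /* ns */
};

static void example_update_ttime(struct example_grp *tg, uint64_t now_ns)
{
        if (tg->last_finish_time && now_ns > tg->last_finish_time)
                /* weight history 7/8 and the new sample 1/8, as in the patch */
                tg->avg_ttime = (tg->avg_ttime * 7 +
                                 (now_ns - tg->last_finish_time)) >> 3;
}

static bool example_is_idle(struct example_grp *tg)
{
        return tg->avg_ttime > tg->idle_threshold;
}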

Signed-off-by: Shaohua Li <[email protected]>
---
block/bio.c | 2 ++
block/blk-throttle.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++-
block/blk.h | 2 ++
include/linux/blk_types.h | 1 +
4 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 2b37502..948ebc1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
#include <linux/cgroup.h>

#include <trace/events/block.h>
+#include "blk.h"

/*
* Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1813,6 +1814,7 @@ void bio_endio(struct bio *bio)
goto again;
}

+ blk_throtl_bio_endio(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
}
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6b2f365..3cd8400 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,9 @@ static int throtl_quantum = 32;
/* Throttling is performed over 100ms slice and after that slice is renewed */
#define DFL_THROTL_SLICE (HZ / 10)
#define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
+#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */

static struct blkcg_policy blkcg_policy_throtl;

@@ -153,6 +156,11 @@ struct throtl_grp {
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
+
+ u64 last_finish_time;
+ u64 checked_last_finish_time;
+ u64 avg_ttime;
+ u64 idle_ttime_threshold;
};

struct throtl_data
@@ -428,6 +436,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
tg->iops_conf[rw][index] = UINT_MAX;
}
}
+ tg->idle_ttime_threshold = U64_MAX;

return &tg->pd;
}
@@ -1607,6 +1616,20 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
return ret;
}

+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+ /*
+ * cgroup is idle if:
+ * - single idle is too long, longer than a fixed value (in case user
+ * configure a too big threshold) or 4 times of slice
+ * - average think time is more than threshold
+ */
+ u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
+ time = min_t(u64, MAX_IDLE_TIME, time);
+ return ktime_get_ns() - tg->last_finish_time > time ||
+ tg->avg_ttime > tg->idle_ttime_threshold;
+}
+
static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1808,6 +1831,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
tg->last_io_disp[WRITE] = 0;
}

+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+ u64 now = ktime_get_ns();
+ u64 last_finish_time = tg->last_finish_time;
+
+ if (now <= last_finish_time || last_finish_time == 0 ||
+ last_finish_time == tg->checked_last_finish_time)
+ return;
+
+ tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+ tg->checked_last_finish_time = last_finish_time;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1816,9 +1852,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
+ int ret;

WARN_ON_ONCE(!rcu_read_lock_held());

+ /* nonrot bit isn't set in queue creation */
+ if (tg->idle_ttime_threshold == U64_MAX) {
+ if (blk_queue_nonrot(q))
+ tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ else
+ tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+ }
+
/* see throtl_charge_bio() */
if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
goto out;
@@ -1828,6 +1873,12 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

+ ret = bio_associate_current(bio);
+ if (ret == 0 || ret == -EBUSY)
+ bio->bi_cg_private = tg;
+
+ blk_throtl_update_ttime(tg);
+
sq = &tg->service_queue;

again:
@@ -1888,7 +1939,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_low_overflow_time[rw] = jiffies;

- bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
@@ -1917,6 +1967,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
return throttled;
}

+void blk_throtl_bio_endio(struct bio *bio)
+{
+ struct throtl_grp *tg;
+
+ tg = bio->bi_cg_private;
+ if (!tg)
+ return;
+ bio->bi_cg_private = NULL;
+
+ tg->last_finish_time = ktime_get_ns();
+}
+
/*
* Dispatch all bios from all children tg's queued on @parent_sq. On
* return, @parent_sq is guaranteed to not have any active children tg's
@@ -2001,6 +2063,7 @@ int blk_throtl_init(struct request_queue *q)
td->limit_index = LIMIT_MAX;
td->low_upgrade_time = jiffies;
td->low_downgrade_time = jiffies;
+
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
diff --git a/block/blk.h b/block/blk.h
index e83e757..921d04f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -293,10 +293,12 @@ extern void blk_throtl_exit(struct request_queue *q);
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 519ea2c..01019a7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -58,6 +58,7 @@ struct bio {
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
+ void *bi_cg_private;
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.9.3

2016-12-15 20:38:46

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1

Clean up the code to avoid using -1 as a magic "unlimited" value; use
U64_MAX/UINT_MAX explicitly instead.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6bb4fe..e45bf50 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -334,10 +334,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = -1;
- tg->bps[WRITE] = -1;
- tg->iops[READ] = -1;
- tg->iops[WRITE] = -1;
+ tg->bps[READ] = U64_MAX;
+ tg->bps[WRITE] = U64_MAX;
+ tg->iops[READ] = UINT_MAX;
+ tg->iops[WRITE] = UINT_MAX;

return &tg->pd;
}
@@ -380,7 +380,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+ (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -771,7 +771,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+ if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1110,7 +1110,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
u64 v = *(u64 *)((void *)tg + off);

- if (v == -1)
+ if (v == U64_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
}
@@ -1121,7 +1121,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
unsigned int v = *(unsigned int *)((void *)tg + off);

- if (v == -1)
+ if (v == UINT_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
}
@@ -1195,7 +1195,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
if (sscanf(ctx.body, "%llu", &v) != 1)
goto out_finish;
if (!v)
- v = -1;
+ v = U64_MAX;

tg = blkg_to_tg(ctx.blkg);

@@ -1270,17 +1270,17 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
- tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+ if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
+ tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
return 0;

- if (tg->bps[READ] != -1)
+ if (tg->bps[READ] != U64_MAX)
snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != -1)
+ if (tg->bps[WRITE] != U64_MAX)
snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != -1)
+ if (tg->iops[READ] != UINT_MAX)
snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != -1)
+ if (tg->iops[WRITE] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
@@ -1318,7 +1318,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
- u64 val = -1;
+ u64 val = U64_MAX;
int len;

if (sscanf(ctx.body, "%26s%n", tok, &len) != 1)
--
2.9.3

2016-12-16 02:03:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH V5 15/17] block: track request size in blk_issue_stat

Hi Shaohua,

[auto build test WARNING on block/for-next]
[also build test WARNING on next-20161215]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-low-limit/20161216-093257
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-s0-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All warnings (new ones prefixed by >>):

block/blk-stat.c: In function 'blk_stat_set_issue':
>> block/blk-stat.c:243:26: warning: left shift count >= width of type [-Wshift-count-overflow]
(blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
^~

vim +243 block/blk-stat.c

227 queue_for_each_hw_ctx(q, hctx, i) {
228 hctx_for_each_ctx(hctx, ctx, j) {
229 blk_stat_init(&ctx->stat[BLK_STAT_READ]);
230 blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
231 }
232 }
233 } else {
234 blk_stat_init(&q->rq_stats[BLK_STAT_READ]);
235 blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]);
236 }
237 }
238
239 void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
240 {
241 stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
242 (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
> 243 (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
244 }
245
246 /*
247 * Enable stat tracking, return whether it was enabled
248 */
249 bool blk_stat_enable(struct request_queue *q)
250 {
251 if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

