2017-03-27 18:39:04

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 00/18] blk-throttle: add .low limit

Hi,

cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
much for SSD. This patch set try to add a conservative limit for blk-throttle.
It isn't a proportional scheduling, but can help prioritize cgroups. There are
several advantages we choose blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
request based queues.
- blk-throttle is light weight in general. It still takes queue lock, but it's
not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
mechanism is proved to harm performance for fast SSD.

The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
The existing io.max is a hard limit throttling. cgroup with a max limit never
dispatch more IO than its max limit. While io.low is a best effort throttling.
cgroups with 'low' limit can run above their 'low' limit at appropriate time.
Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
will run according to their 'low' limit. So the 'low' limit could act as two
roles, it allows cgroups using free bandwidth and it protects cgroups from
their 'low' limit.

An example usage is we have a high prio cgroup with high 'low' limit and a low
prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
prio can run above its 'low' limit, so we don't waste the bandwidth. When the
high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
under its 'low' limit. This will protect high prio cgroup to get more
resources.

The implementation is simple. The disk queue has a state machine. We have 2
states LIMIT_LOW and LIMIT_MAX. In each disk state, we throttle cgroups
according to the limit of the state. That is io.low limit for LIMIT_LOW state,
io.max limit for LIMIT_MAX. The disk state can be upgraded/downgraded between
LIMIT_LOW and LIMIT_MAX according to the rule aboe. Initially disk state is
LIMIT_MAX. And if no cgroup sets io.low, the disk state will remain in
LIMIT_MAX state. Systems with only io.max set will find nothing changed with the
patches.

The first 11 patches implement the basic framework. Add interface, handle
upgrade and downgrade logic. The patch 11 detects a special case a cgroup is
completely idle. In this case, we ignore the cgroup's limit. The patch 12-18
adds more heuristics.

The basic framework has 2 major issues.

1. fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, the
cgroup's bandwidth can change dramatically, sometimes in a way we are not
expected. For example, one cgroup's bandwidth will drop below its io.low limit
very soon after a upgrade. patch 10 has more details about the issue.

2. idle cgroup. cgroup with a io.low limit doesn't always dispatch enough IO.
In above upgrade rule, the disk will remain in LIMIT_LOW state and all other
cgroups can't dispatch more IO above their 'low' limit. Hence there is waste.
patch 11 has more details about the issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after a
upgrade/downgrade. This will reduce the chance a cgroup's bandwidth drop under
its 'low' limit rapidly. The smoothness means we could waste some bandwidth in
the transition though. But we must pay something for sharing.

The issue 2 is very hard. We introduce two mechanisms for this. One is 'idle
time' or 'think time' borrowed from CFQ. If a cgroup's average idle time is
high, we treat it's idle and its 'low' limit isn't respected. Please see patch
13 - 15 for details. The other is 'latency target'. If a cgroup's io latency is
low, we treat it's idle and its 'low' limit isn't resptected. Please see patch
16 - 18 for fetails. Both mechanisms only happen when a cgroup runs below its
'low' limit.

The disadvantages of blk-throttle is it exports a kind of low level knob.
Configuration would not be easy for normal users. Based on discussion in LSF,
add a configure option for the interface and mark it experimental, so user can
test/benchmark. If the interface is really bad, we can change/remove the
interface.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merge.

Thanks,
Shaohua

V6->V7:
- Add a configure option for the low limit and mark it experimental as
discussed in LSF. All user interfaces are only enabled with the option on.
- Don't overload blk stat, which will simplify the code. This will add extra
space in bio/request though with the low interface configure option on.
- Rebase against Jesn's for-4.12/block tree

V5->V6:
- Change default setting for io.low limit. It's 0 now, which makes more sense
- The default setting for latency is still 0, the default setting for idle time
becomes bigger. So with the default settings, cgroups have small latency but
disk sharing could be harmed
- Addressed other issues pointed out by Tejun
http://marc.info/?l=linux-kernel&m=148445203815062&w=2

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' so consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Chnage interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based
calculation'
- Rebase and fix other problems

Issue pointed out by Tejun isn't fixed yet:
- .pd_offline_fn vs .pd_free_fn. .pd_free_fn seems too late to change states
http://marc.info/?l=linux-kernel&m=148183437022975&w=2

V3->V4:
- Add latency target for cgroup
- Fix bugs
http://marc.info/?l=linux-block&m=147916216512915&w=2

V2->V3:
- Rebase
- Fix several bugs
- Make harddisk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop io.low interface for simplicity and the interface isn't a must-have to
prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2

Shaohua Li (18):
blk-throttle: use U64_MAX/UINT_MAX to replace -1
blk-throttle: prepare support multiple limits
blk-throttle: add configure option for new .low interface
blk-throttle: add .low interface
blk-throttle: configure bps/iops limit for cgroup in low limit
blk-throttle: add upgrade logic for LIMIT_LOW state
blk-throttle: add downgrade logic
blk-throttle: make sure expire time isn't too big
blk-throttle: make throtl_slice tunable
blk-throttle: choose a small throtl_slice for SSD
blk-throttle: detect completed idle cgroup
blk-throttle: make bandwidth change smooth
blk-throttle: add a simple idle detection
blk-throttle: add interface to configure idle time threshold
blk-throttle: ignore idle cgroup limit
blk-throttle: add interface for per-cgroup target latency
blk-throttle: add a mechanism to estimate IO latency
blk-throttle: add latency target support

Documentation/block/queue-sysfs.txt | 6 +
block/Kconfig | 12 +
block/bio.c | 2 +
block/blk-core.c | 4 +
block/blk-mq.c | 4 +
block/blk-sysfs.c | 13 +
block/blk-throttle.c | 992 +++++++++++++++++++++++++++++++++---
block/blk.h | 14 +
include/linux/blk_types.h | 4 +
include/linux/blkdev.h | 3 +
10 files changed, 979 insertions(+), 75 deletions(-)

--
2.9.3


2017-03-27 17:53:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 16/18] blk-throttle: add interface for per-cgroup target latency

Here we introduce per-cgroup latency target. The target determines how a
cgroup can afford latency increasement. We will use the target latency
to calculate a threshold and use it to schedule IO for cgroups. If a
cgroup's bandwidth is below its low limit but its average latency is
below the threshold, other cgroups can safely dispatch more IO even
their bandwidth is higher than their low limits. On the other hand, if
the first cgroup's latency is higher than the threshold, other cgroups
are throttled to their low limits. So the target latency determines how
we efficiently utilize free disk resource without sacifice of worload's
IO latency.

For example, assume 4k IO average latency is 50us when disk isn't
congested. A cgroup sets the target latency to 30us. Then the cgroup can
accept 50+30=80us IO latency. If the cgroupt's average IO latency is
90us and its bandwidth is below low limit, other cgroups are throttled
to their low limit. If the cgroup's average IO latency is 60us, other
cgroups are allowed to dispatch more IO. When other cgroups dispatch
more IO, the first cgroup's IO latency will increase. If it increases to
81us, we then throttle other cgroups.

User will configure the interface in this way:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

latency is in microsecond unit

By default, latency target is 0, which means to guarantee IO latency.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0ea8698..6e1c298 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
#define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */
#define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */
#define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */
+/* default latency target is 0, eg, guarantee IO latency by default */
+#define DFL_LATENCY_TARGET (0)

static struct blkcg_policy blkcg_policy_throtl;

@@ -152,6 +154,7 @@ struct throtl_grp {

unsigned long last_check_time;

+ unsigned long latency_target; /* us */
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -449,6 +452,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
/* LIMIT_LOW will have default value 0 */

+ tg->latency_target = DFL_LATENCY_TARGET;
+
return &tg->pd;
}

@@ -1445,6 +1450,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
u64 bps_dft;
unsigned int iops_dft;
char idle_time[26] = "";
+ char latency_time[26] = "";

if (!dname)
return 0;
@@ -1461,8 +1467,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
tg->bps_conf[WRITE][off] == bps_dft &&
tg->iops_conf[READ][off] == iops_dft &&
tg->iops_conf[WRITE][off] == iops_dft &&
- (off != LIMIT_LOW || tg->idletime_threshold ==
- tg->td->dft_idletime_threshold))
+ (off != LIMIT_LOW ||
+ (tg->idletime_threshold == tg->td->dft_idletime_threshold &&
+ tg->latency_target == DFL_LATENCY_TARGET)))
return 0;

if (tg->bps_conf[READ][off] != bps_dft)
@@ -1483,10 +1490,17 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
else
snprintf(idle_time, sizeof(idle_time), " idle=%lu",
tg->idletime_threshold);
+
+ if (tg->latency_target == ULONG_MAX)
+ strcpy(latency_time, " latency=max");
+ else
+ snprintf(latency_time, sizeof(latency_time),
+ " latency=%lu", tg->latency_target);
}

- seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
- dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+ seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+ dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+ latency_time);
return 0;
}

@@ -1505,6 +1519,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
unsigned long idle_time;
+ unsigned long latency_time;
int ret;
int index = of_cft(of)->private;

@@ -1520,6 +1535,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = tg->iops_conf[WRITE][index];

idle_time = tg->idletime_threshold;
+ latency_time = tg->latency_target;
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
@@ -1553,6 +1569,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = min_t(u64, val, UINT_MAX);
else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
idle_time = val;
+ else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+ latency_time = val;
else
goto out_finish;
}
@@ -1583,6 +1601,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
tg->td->limit_index = LIMIT_LOW;
tg->idletime_threshold = (idle_time == ULONG_MAX) ?
ULONG_MAX : idle_time;
+ tg->latency_target = (latency_time == ULONG_MAX) ?
+ ULONG_MAX : latency_time;
}
tg_conf_updated(tg);
ret = 0;
--
2.9.3

2017-03-27 17:53:40

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 11/18] blk-throttle: detect completed idle cgroup

cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to higher level and all cgroups will be
throttled to their lower limit, so we waste bandwidth. Detecting idle
cgroup is hard. This patch handles a simple case, a cgroup doesn't
dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
the bandwidth.

Please note this will be replaced with a more sophisticated algorithm
later, but this demonstrates the idea how we handle idle cgroups, so I
leave it here.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d00c1c1..014b2e9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -149,6 +149,8 @@ struct throtl_grp {

unsigned long last_check_time;

+ unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -445,11 +447,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)

static void throtl_pd_online(struct blkg_policy_data *pd)
{
+ struct throtl_grp *tg = pd_to_tg(pd);
/*
* We don't want new groups to escape the limits of its ancestors.
* Update has_rules[] after a new group is brought online.
*/
- tg_update_has_rules(pd_to_tg(pd));
+ tg_update_has_rules(tg);
+ tg->last_dispatch_time[READ] = jiffies;
+ tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1615,6 +1620,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+ if (time_after_eq(jiffies,
+ tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+ time_after_eq(jiffies,
+ tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ return true;
return false;
}

@@ -1692,6 +1703,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

+ if (time_after_eq(now, tg->last_dispatch_time[READ] +
+ td->throtl_slice) &&
+ time_after_eq(now, tg->last_dispatch_time[WRITE] +
+ td->throtl_slice))
+ return false;
/*
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
@@ -1800,6 +1816,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
--
2.9.3

2017-03-27 17:54:06

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 17/18] blk-throttle: add a mechanism to estimate IO latency

User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.

But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.

Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-core.c | 4 +
block/blk-mq.c | 4 +
block/blk-throttle.c | 181 ++++++++++++++++++++++++++++++++++++++++++++--
block/blk.h | 4 +
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 3 +
6 files changed, 192 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e..d5b5169 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2482,6 +2482,8 @@ void blk_start_request(struct request *req)
{
blk_dequeue_request(req);

+ blk_throtl_start_request(req);
+
if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
blk_stat_set_issue_time(&req->issue_stat);
req->rq_flags |= RQF_STATS;
@@ -2703,6 +2705,8 @@ void blk_finish_request(struct request *req, int error)
{
struct request_queue *q = req->q;

+ blk_throtl_finish_request(req);
+
if (req->rq_flags & RQF_STATS)
blk_stat_add(req);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45b9beb..b04a564 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -434,6 +434,8 @@ static void blk_mq_ipi_complete_request(struct request *rq)

static void blk_mq_stat_add(struct request *rq)
{
+ blk_throtl_finish_request(rq);
+
if (rq->rq_flags & RQF_STATS) {
blk_mq_poll_stats_start(rq->q);
blk_stat_add(rq);
@@ -487,6 +489,8 @@ void blk_mq_start_request(struct request *rq)

trace_block_rq_issue(q, rq);

+ blk_throtl_start_request(rq);
+
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
blk_stat_set_issue_time(&rq->issue_stat);
rq->rq_flags |= RQF_STATS;
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6e1c298..4b9c6a1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,13 @@ static int throtl_quantum = 32;
/* default latency target is 0, eg, guarantee IO latency by default */
#define DFL_LATENCY_TARGET (0)

+#define THROTL_STAT(time, size) \
+ (((u64)time & (((u64)1 << 48) - 1)) | \
+ (((u64)size & (((u64)1 << 12) - 1)) << 48))
+#define THROTL_SKIP_LAT ((u64)1 << 63)
+#define THROTL_STAT_TIME(stat) (stat & (((u64)1 << 48) - 1))
+#define THROTL_STAT_SIZE(stat) ((stat >> 48) & (((u64)1 << 12) - 1))
+
static struct blkcg_policy blkcg_policy_throtl;

/* A workqueue to queue throttle related work */
@@ -165,6 +172,19 @@ struct throtl_grp {
unsigned long idletime_threshold; /* us */
};

+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+ unsigned long total_latency; /* ns / 1024 */
+ int samples;
+};
+
+struct avg_latency_bucket {
+ unsigned long latency; /* ns / 1024 */
+ bool valid;
+};
+
struct throtl_data
{
/* service tree for active throtl groups */
@@ -188,6 +208,13 @@ struct throtl_data
unsigned long low_downgrade_time;

unsigned int scale;
+
+ struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+ struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+ struct latency_bucket __percpu *latency_buckets;
+ unsigned long last_calculate_time;
+
+ bool track_bio_latency;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +333,9 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
return ret;
}

+#define request_bucket_index(sectors) \
+ clamp_t(int, order_base_2(sectors) - 3, 0, LATENCY_BUCKET_SIZE - 1)
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -1931,6 +1961,73 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
tg->checked_last_finish_time = last_finish_time;
}

+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+ struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+ int i, cpu;
+ unsigned long last_latency = 0;
+ unsigned long latency;
+
+ if (!blk_queue_nonrot(td->queue))
+ return;
+ if (time_before(jiffies, td->last_calculate_time + HZ))
+ return;
+ td->last_calculate_time = jiffies;
+
+ memset(avg_latency, 0, sizeof(avg_latency));
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+ for_each_possible_cpu(cpu) {
+ struct latency_bucket *bucket;
+
+ /* this isn't race free, but ok in practice */
+ bucket = per_cpu_ptr(td->latency_buckets, cpu);
+ tmp->total_latency += bucket[i].total_latency;
+ tmp->samples += bucket[i].samples;
+ bucket[i].total_latency = 0;
+ bucket[i].samples = 0;
+ }
+
+ if (tmp->samples >= 32) {
+ int samples = tmp->samples;
+
+ latency = tmp->total_latency;
+
+ tmp->total_latency = 0;
+ tmp->samples = 0;
+ latency /= samples;
+ if (latency == 0)
+ continue;
+ avg_latency[i].latency = latency;
+ }
+ }
+
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ if (!avg_latency[i].latency) {
+ if (td->avg_buckets[i].latency < last_latency)
+ td->avg_buckets[i].latency = last_latency;
+ continue;
+ }
+
+ if (!td->avg_buckets[i].valid)
+ latency = avg_latency[i].latency;
+ else
+ latency = (td->avg_buckets[i].latency * 7 +
+ avg_latency[i].latency) >> 3;
+
+ td->avg_buckets[i].latency = max(latency, last_latency);
+ td->avg_buckets[i].valid = true;
+ last_latency = td->avg_buckets[i].latency;
+ }
+}
+#else
+static inline void throtl_update_latency_buckets(struct throtl_data *td)
+{
+}
+#endif
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1939,6 +2036,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
+ struct throtl_data *td = tg->td;
int ret;

WARN_ON_ONCE(!rcu_read_lock_held());
@@ -1949,6 +2047,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

spin_lock_irq(q->queue_lock);

+ throtl_update_latency_buckets(td);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

@@ -1956,6 +2056,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
if (ret == 0 || ret == -EBUSY)
bio->bi_cg_private = tg;
+ bio->bi_throtl_stat = THROTL_STAT(ktime_get_ns(), bio_sectors(bio));
#endif
blk_throtl_update_idletime(tg);

@@ -1974,8 +2075,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
tg->last_low_overflow_time[rw] = jiffies;
- if (throtl_can_upgrade(tg->td, tg)) {
- throtl_upgrade_state(tg->td);
+ if (throtl_can_upgrade(td, tg)) {
+ throtl_upgrade_state(td);
goto again;
}
break;
@@ -2019,7 +2120,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_low_overflow_time[rw] = jiffies;

- tg->td->nr_queued[rw]++;
+ td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;

@@ -2044,20 +2145,79 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
*/
if (!throttled)
bio_clear_flag(bio, BIO_THROTTLED);
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ if (throttled || !td->track_bio_latency)
+ bio->bi_throtl_stat |= THROTL_SKIP_LAT;
+#endif
return throttled;
}

#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+ int op, unsigned long time)
+{
+ struct latency_bucket *latency;
+ int index;
+
+ if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+ !blk_queue_nonrot(td->queue))
+ return;
+
+ index = request_bucket_index(size);
+
+ latency = get_cpu_ptr(td->latency_buckets);
+ latency[index].total_latency += time;
+ latency[index].samples++;
+ put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_start_request(struct request *rq)
+{
+ rq->throtl_stat = THROTL_STAT(ktime_get_ns(),
+ blk_rq_sectors(rq));
+}
+
+void blk_throtl_finish_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct throtl_data *td = q->td;
+ u64 finish_time = THROTL_STAT_TIME(ktime_get_ns());
+ u64 time_ns;
+
+ if (finish_time < THROTL_STAT_TIME(rq->throtl_stat))
+ return;
+ time_ns = finish_time - THROTL_STAT_TIME(rq->throtl_stat);
+
+ throtl_track_latency(td, THROTL_STAT_SIZE(rq->throtl_stat),
+ req_op(rq), time_ns >> 10);
+}
+
void blk_throtl_bio_endio(struct bio *bio)
{
struct throtl_grp *tg;
+ u64 finish_time_ns;
+ unsigned long finish_time;
+ unsigned long start_time;
+ unsigned long lat;

tg = bio->bi_cg_private;
if (!tg)
return;
bio->bi_cg_private = NULL;

- tg->last_finish_time = ktime_get_ns() >> 10;
+ finish_time_ns = ktime_get_ns();
+ tg->last_finish_time = finish_time_ns >> 10;
+
+ start_time = THROTL_STAT_TIME(bio->bi_throtl_stat) >> 10;
+ finish_time = THROTL_STAT_TIME(finish_time_ns) >> 10;
+ if (start_time && finish_time > start_time &&
+ !(bio->bi_throtl_stat & THROTL_SKIP_LAT)) {
+ lat = finish_time - start_time;
+ throtl_track_latency(tg->td,
+ THROTL_STAT_SIZE(bio->bi_throtl_stat),
+ bio_op(bio), lat);
+ }
}
#endif

@@ -2133,6 +2293,12 @@ int blk_throtl_init(struct request_queue *q)
td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
if (!td)
return -ENOMEM;
+ td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+ LATENCY_BUCKET_SIZE, __alignof__(u64));
+ if (!td->latency_buckets) {
+ kfree(td);
+ return -ENOMEM;
+ }

INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
@@ -2147,8 +2313,10 @@ int blk_throtl_init(struct request_queue *q)

/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
- if (ret)
+ if (ret) {
+ free_percpu(td->latency_buckets);
kfree(td);
+ }
return ret;
}

@@ -2157,6 +2325,7 @@ void blk_throtl_exit(struct request_queue *q)
BUG_ON(!q->td);
throtl_shutdown_wq(q);
blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+ free_percpu(q->td->latency_buckets);
kfree(q->td);
}

@@ -2181,6 +2350,8 @@ void blk_throtl_register_queue(struct request_queue *q)
td->throtl_slice = DFL_THROTL_SLICE_HD;
#endif

+ td->track_bio_latency = !q->mq_ops && !q->request_fn;
+
/*
* some tg are created before queue is fully initialized, eg, nonrot
* isn't initialized yet
diff --git a/block/blk.h b/block/blk.h
index 3ac833e..fa5610e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -331,8 +331,12 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_start_request(struct request *rq);
+extern void blk_throtl_finish_request(struct request *rq);
#else
static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_start_request(struct request *rq) { }
+static inline void blk_throtl_finish_request(struct request *rq) { }
#endif

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 07a9e96..112fd26 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,7 @@ struct bio {
struct cgroup_subsys_state *bi_css;
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
void *bi_cg_private;
+ u64 bi_throtl_stat;
#endif
#endif
union {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1a7dc42..b14ae55 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -204,6 +204,9 @@ struct request {
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ u64 throtl_stat;
+#endif
#endif
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
--
2.9.3

2017-03-27 17:53:49

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 12/18] blk-throttle: make bandwidth change smooth

When cgroups all reach low limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and even some
cgroups could dispatch less IO than their low limit. For example, cg1
low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is
120M/s for the workload. Their bps could something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach low limit, so they can dispatch more IO later.
Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At
T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO
than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to
LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2
will have bandwidth below its low limit at most time.

The big problem here is we don't know the maximum bandwidth of the
workload, so we can't make smart decision to avoid the situation. This
patch makes cgroup bandwidth change smooth. After disk upgrades from
LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto
their max limit immediately. Their bandwidth limit will be increased
gradually to avoid above situation. So above example will became
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way cgroups bandwidth will be above their limit in majority
time, this still doesn't fully utilize disk bandwidth, but that's
something we pay for sharing.

Scale up is linear. The limit scales up 1/2 .low limit every
throtl_slice after upgrade. The scale up will stop if the adjusted limit
hits .max limit. Scale down is exponential. We cut the scale value half
if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then
fully downgrade the queue to LIMIT_LOW state.

Note this doesn't completely avoid cgroup running under its low limit.
The best way to guarantee cgroup doesn't run under its limit is to set
max limit. For example, if we set cg1 max limit to 40, cg2 will never
run under its low limit.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 014b2e9..62984fc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -175,6 +175,8 @@ struct throtl_data

unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
+
+ unsigned int scale;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -226,29 +228,70 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}

+/*
+ * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to
+ * make the IO dispatch more smooth.
+ * Scale up: linearly scale up according to lapsed time since upgrade. For
+ * every throtl_slice, the limit scales up 1/2 .low limit till the
+ * limit hits .max limit
+ * Scale down: exponentially scale down if a cgroup doesn't hit its .low limit
+ */
+static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td)
+{
+ /* arbitrary value to avoid too big scale */
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->low_upgrade_time + td->scale * td->throtl_slice))
+ td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice;
+
+ return low + (low >> 1) * td->scale;
+}
+
static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
uint64_t ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return U64_MAX;
- ret = tg->bps[rw][tg->td->limit_index];
- if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+
+ td = tg->td;
+ ret = tg->bps[rw][td->limit_index];
+ if (ret == 0 && td->limit_index == LIMIT_LOW)
return tg->bps[rw][LIMIT_MAX];
+
+ if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] &&
+ tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) {
+ uint64_t adjusted;
+
+ adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td);
+ ret = min(tg->bps[rw][LIMIT_MAX], adjusted);
+ }
return ret;
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
unsigned int ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return UINT_MAX;
- ret = tg->iops[rw][tg->td->limit_index];
+ td = tg->td;
+ ret = tg->iops[rw][td->limit_index];
if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
return tg->iops[rw][LIMIT_MAX];
+
+ if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] &&
+ tg->iops[rw][LIMIT_LOW] != tg->iops[rw][LIMIT_MAX]) {
+ uint64_t adjusted;
+
+ adjusted = throtl_adjusted_limit(tg->iops[rw][LIMIT_LOW], td);
+ if (adjusted > UINT_MAX)
+ adjusted = UINT_MAX;
+ ret = min_t(unsigned int, tg->iops[rw][LIMIT_MAX], adjusted);
+ }
return ret;
}

@@ -1677,6 +1720,7 @@ static void throtl_upgrade_state(struct throtl_data *td)

td->limit_index = LIMIT_MAX;
td->low_upgrade_time = jiffies;
+ td->scale = 0;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1694,6 +1738,13 @@ static void throtl_upgrade_state(struct throtl_data *td)

static void throtl_downgrade_state(struct throtl_data *td, int new)
{
+ td->scale /= 2;
+
+ if (td->scale) {
+ td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
+ return;
+ }
+
td->limit_index = new;
td->low_downgrade_time = jiffies;
}
--
2.9.3

2017-03-27 17:54:17

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1

clean up the code to avoid using -1

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8fab716..0d5d85b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -334,10 +334,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = -1;
- tg->bps[WRITE] = -1;
- tg->iops[READ] = -1;
- tg->iops[WRITE] = -1;
+ tg->bps[READ] = U64_MAX;
+ tg->bps[WRITE] = U64_MAX;
+ tg->iops[READ] = UINT_MAX;
+ tg->iops[WRITE] = UINT_MAX;

return &tg->pd;
}
@@ -380,7 +380,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+ (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -771,7 +771,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+ if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1112,7 +1112,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
u64 v = *(u64 *)((void *)tg + off);

- if (v == -1)
+ if (v == U64_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
}
@@ -1123,7 +1123,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
unsigned int v = *(unsigned int *)((void *)tg + off);

- if (v == -1)
+ if (v == UINT_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
}
@@ -1197,7 +1197,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
if (sscanf(ctx.body, "%llu", &v) != 1)
goto out_finish;
if (!v)
- v = -1;
+ v = U64_MAX;

tg = blkg_to_tg(ctx.blkg);

@@ -1272,17 +1272,17 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
- tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+ if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
+ tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
return 0;

- if (tg->bps[READ] != -1)
+ if (tg->bps[READ] != U64_MAX)
snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != -1)
+ if (tg->bps[WRITE] != U64_MAX)
snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != -1)
+ if (tg->iops[READ] != UINT_MAX)
snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != -1)
+ if (tg->iops[WRITE] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
@@ -1320,7 +1320,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
- u64 val = -1;
+ u64 val = U64_MAX;
int len;

if (sscanf(ctx.body, "%26s%n", tok, &len) != 1)
--
2.9.3

2017-03-27 17:54:26

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 13/18] blk-throttle: add a simple idle detection

A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.

We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.

Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.

We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.

The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.

Signed-off-by: Shaohua Li <[email protected]>
---
block/bio.c | 2 ++
block/blk-throttle.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++-
block/blk.h | 3 ++
include/linux/blk_types.h | 3 ++
4 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 6194a8c..f1857c0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
#include <linux/cgroup.h>

#include <trace/events/block.h>
+#include "blk.h"

/*
* Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1845,6 +1846,7 @@ void bio_endio(struct bio *bio)
goto again;
}

+ blk_throtl_bio_endio(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
}
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 62984fc..6300f3e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -22,6 +22,9 @@ static int throtl_quantum = 32;
#define DFL_THROTL_SLICE_HD (HZ / 10)
#define DFL_THROTL_SLICE_SSD (HZ / 50)
#define MAX_THROTL_SLICE (HZ)
+#define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */
+#define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */
+#define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */

static struct blkcg_policy blkcg_policy_throtl;

@@ -154,6 +157,11 @@ struct throtl_grp {
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
+
+ unsigned long last_finish_time; /* ns / 1024 */
+ unsigned long checked_last_finish_time; /* ns / 1024 */
+ unsigned long avg_idletime; /* ns / 1024 */
+ unsigned long idletime_threshold; /* us */
};

struct throtl_data
@@ -468,6 +476,11 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;
+
+ if (blk_queue_nonrot(td->queue))
+ tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ else
+ tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
}

/*
@@ -1644,6 +1657,21 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
return ret;
}

+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+ /*
+ * cgroup is idle if:
+ * - single idle is too long, longer than a fixed value (in case user
+ * configure a too big threshold) or 4 times of slice
+ * - average think time is more than threshold
+ */
+ unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);
+
+ time = min_t(unsigned long, MAX_IDLE_TIME, time);
+ return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
+ tg->avg_idletime > tg->idletime_threshold;
+}
+
static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1843,6 +1871,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
tg->last_io_disp[WRITE] = 0;
}

+static void blk_throtl_update_idletime(struct throtl_grp *tg)
+{
+ unsigned long now = ktime_get_ns() >> 10;
+ unsigned long last_finish_time = tg->last_finish_time;
+
+ if (now <= last_finish_time || last_finish_time == 0 ||
+ last_finish_time == tg->checked_last_finish_time)
+ return;
+
+ tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3;
+ tg->checked_last_finish_time = last_finish_time;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1851,6 +1892,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
+ int ret;

WARN_ON_ONCE(!rcu_read_lock_held());

@@ -1863,6 +1905,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

+ ret = bio_associate_current(bio);
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ if (ret == 0 || ret == -EBUSY)
+ bio->bi_cg_private = tg;
+#endif
+ blk_throtl_update_idletime(tg);
+
sq = &tg->service_queue;

again:
@@ -1923,7 +1972,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_low_overflow_time[rw] = jiffies;

- bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
@@ -1952,6 +2000,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
return throttled;
}

+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+void blk_throtl_bio_endio(struct bio *bio)
+{
+ struct throtl_grp *tg;
+
+ tg = bio->bi_cg_private;
+ if (!tg)
+ return;
+ bio->bi_cg_private = NULL;
+
+ tg->last_finish_time = ktime_get_ns() >> 10;
+}
+#endif
+
/*
* Dispatch all bios from all children tg's queued on @parent_sq. On
* return, @parent_sq is guaranteed to not have any active children tg's
@@ -2035,6 +2097,7 @@ int blk_throtl_init(struct request_queue *q)
td->limit_index = LIMIT_MAX;
td->low_upgrade_time = jiffies;
td->low_downgrade_time = jiffies;
+
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
@@ -2053,6 +2116,8 @@ void blk_throtl_exit(struct request_queue *q)
void blk_throtl_register_queue(struct request_queue *q)
{
struct throtl_data *td;
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;

td = q->td;
BUG_ON(!td);
@@ -2065,6 +2130,21 @@ void blk_throtl_register_queue(struct request_queue *q)
/* if no low limit, use previous default */
td->throtl_slice = DFL_THROTL_SLICE_HD;
#endif
+
+ /*
+ * some tg are created before queue is fully initialized, eg, nonrot
+ * isn't initialized yet
+ */
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (blk_queue_nonrot(q))
+ tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ else
+ tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+ }
+ rcu_read_unlock();
}

#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
diff --git a/block/blk.h b/block/blk.h
index 13070c3..3ac833e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -330,6 +330,9 @@ static inline void blk_throtl_register_queue(struct request_queue *q) { }
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
+#else
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
#endif

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 270119a..07a9e96 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -58,6 +58,9 @@ struct bio {
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ void *bi_cg_private;
+#endif
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.9.3

2017-03-27 17:54:47

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 15/18] blk-throttle: ignore idle cgroup limit

Last patch introduces a way to detect idle cgroup. We use it to make
upgrade/downgrade decision. And the new algorithm can detect completely
idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f03e158..0ea8698 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -152,8 +152,6 @@ struct throtl_grp {

unsigned long last_check_time;

- unsigned long last_dispatch_time[2];
-
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -508,8 +506,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
* Update has_rules[] after a new group is brought online.
*/
tg_update_has_rules(tg);
- tg->last_dispatch_time[READ] = jiffies;
- tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1708,9 +1704,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
return true;

if (time_after_eq(jiffies,
- tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
- time_after_eq(jiffies,
- tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+ throtl_tg_is_idle(tg))
return true;
return false;
}
@@ -1756,6 +1751,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
}

+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_LOW)
+ return;
+
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+ return;
+
+ tg->last_check_time = now;
+
+ if (!time_after_eq(now,
+ __tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+ return;
+
+ if (throtl_can_upgrade(tg->td, NULL))
+ throtl_upgrade_state(tg->td);
+}
+
static void throtl_upgrade_state(struct throtl_data *td)
{
struct cgroup_subsys_state *pos_css;
@@ -1797,18 +1812,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

- if (time_after_eq(now, tg->last_dispatch_time[READ] +
- td->throtl_slice) &&
- time_after_eq(now, tg->last_dispatch_time[WRITE] +
- td->throtl_slice))
- return false;
/*
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
*/
if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_low_overflow_time(tg) +
- td->throtl_slice))
+ td->throtl_slice) &&
+ (!throtl_tg_is_idle(tg) ||
+ !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
}
@@ -1931,10 +1943,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
- tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+ throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
--
2.9.3

2017-03-27 17:54:38

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 10/18] blk-throttle: choose a small throtl_slice for SSD

The throtl_slice is 100ms by default. This is a long time for SSD, a lot
of IO can run. To make cgroups have smoother throughput, we choose a
small value (20ms) for SSD.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-sysfs.c | 2 ++
block/blk-throttle.c | 23 ++++++++++++++++++++---
block/blk.h | 2 ++
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b315e62..7f090dd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -906,6 +906,8 @@ int blk_register_queue(struct gendisk *disk)

blk_wb_init(q);

+ blk_throtl_register_queue(q);
+
if (q->request_fn || (q->mq_ops && q->elevator)) {
ret = elv_register_queue(q);
if (ret) {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 93841da..d00c1c1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,8 +18,9 @@ static int throtl_grp_quantum = 8;
/* Total max dispatch from all groups in one round */
static int throtl_quantum = 32;

-/* Throttling is performed over 100ms slice and after that slice is renewed */
-#define DFL_THROTL_SLICE (HZ / 10)
+/* Throttling is performed over a slice and after that slice is renewed */
+#define DFL_THROTL_SLICE_HD (HZ / 10)
+#define DFL_THROTL_SLICE_SSD (HZ / 50)
#define MAX_THROTL_SLICE (HZ)

static struct blkcg_policy blkcg_policy_throtl;
@@ -1961,7 +1962,6 @@ int blk_throtl_init(struct request_queue *q)

q->td = td;
td->queue = q;
- td->throtl_slice = DFL_THROTL_SLICE;

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
@@ -1982,6 +1982,23 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
}

+void blk_throtl_register_queue(struct request_queue *q)
+{
+ struct throtl_data *td;
+
+ td = q->td;
+ BUG_ON(!td);
+
+ if (blk_queue_nonrot(q))
+ td->throtl_slice = DFL_THROTL_SLICE_SSD;
+ else
+ td->throtl_slice = DFL_THROTL_SLICE_HD;
+#ifndef CONFIG_BLK_DEV_THROTTLING_LOW
+ /* if no low limit, use previous default */
+ td->throtl_slice = DFL_THROTL_SLICE_HD;
+#endif
+}
+
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
{
diff --git a/block/blk.h b/block/blk.h
index bcd3de6..13070c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -319,10 +319,12 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
extern void blk_throtl_drain(struct request_queue *q);
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
+extern void blk_throtl_register_queue(struct request_queue *q);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_register_queue(struct request_queue *q) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
--
2.9.3

2017-03-27 18:15:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH V7 00/18] blk-throttle: add .low limit

On 03/27/2017 11:51 AM, Shaohua Li wrote:
> V6->V7:
> - Don't overload blk stat, which will simplify the code. This will add extra
> space in bio/request though with the low interface configure option on.

Hmm, why? As far as I can see, it just makes things worse.

--
Jens Axboe

2017-03-27 18:18:55

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 02/18] blk-throttle: prepare support multiple limits

We are going to support low/max limit, each cgroup will have 2 limits
after that. This patch prepares for the multiple limits change.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 110 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 70 insertions(+), 40 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0d5d85b..1da9f30 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {

#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

+enum {
+ LIMIT_MAX,
+ LIMIT_CNT,
+};
+
struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];

/* bytes per second rate limits */
- uint64_t bps[2];
+ uint64_t bps[2][LIMIT_CNT];

/* IOPS limits */
- unsigned int iops[2];
+ unsigned int iops[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data

/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+ unsigned int limit_index;
+ bool limit_valid[LIMIT_CNT];
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}

+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->iops[rw][tg->td->limit_index];
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -334,10 +351,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = U64_MAX;
- tg->bps[WRITE] = U64_MAX;
- tg->iops[READ] = UINT_MAX;
- tg->iops[WRITE] = UINT_MAX;
+ tg->bps[READ][LIMIT_MAX] = U64_MAX;
+ tg->bps[WRITE][LIMIT_MAX] = U64_MAX;
+ tg->iops[READ][LIMIT_MAX] = UINT_MAX;
+ tg->iops[WRITE][LIMIT_MAX] = UINT_MAX;

return &tg->pd;
}
@@ -376,11 +393,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
static void tg_update_has_rules(struct throtl_grp *tg)
{
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+ struct throtl_data *td = tg->td;
int rw;

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
+ (td->limit_valid[td->limit_index] &&
+ (tg_bps_limit(tg, rw) != U64_MAX ||
+ tg_iops_limit(tg, rw) != UINT_MAX));
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +652,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)

if (!nr_slices)
return;
- tmp = tg->bps[rw] * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+ io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;

if (!bytes_trim && !io_trim)
return;
@@ -682,7 +702,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
* have been trimmed.
*/

- tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+ tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);

if (tmp > UINT_MAX)
@@ -697,7 +717,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}

/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+ jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;

if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +744,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);

- tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+ tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;

@@ -736,7 +756,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+ jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));

if (!jiffy_wait)
jiffy_wait = 1;
@@ -771,7 +791,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
+ if (tg_bps_limit(tg, rw) == U64_MAX &&
+ tg_iops_limit(tg, rw) == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1150,8 +1171,8 @@ static void tg_conf_updated(struct throtl_grp *tg)

throtl_log(&tg->service_queue,
"limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
- tg->bps[READ], tg->bps[WRITE],
- tg->iops[READ], tg->iops[WRITE]);
+ tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+ tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));

/*
* Update has_rules[] flags for the updated tg's subtree. A tg is
@@ -1228,25 +1249,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
static struct cftype throtl_legacy_files[] = {
{
.name = "throttle.read_bps_device",
- .private = offsetof(struct throtl_grp, bps[READ]),
+ .private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.write_bps_device",
- .private = offsetof(struct throtl_grp, bps[WRITE]),
+ .private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.read_iops_device",
- .private = offsetof(struct throtl_grp, iops[READ]),
+ .private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
{
.name = "throttle.write_iops_device",
- .private = offsetof(struct throtl_grp, iops[WRITE]),
+ .private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
@@ -1272,18 +1293,25 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
- tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
+
+ if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
+ tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
+ tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
+ tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
return 0;

- if (tg->bps[READ] != U64_MAX)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != U64_MAX)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != UINT_MAX)
- snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != UINT_MAX)
- snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+ if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+ tg->bps[READ][LIMIT_MAX]);
+ if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+ tg->bps[WRITE][LIMIT_MAX]);
+ if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u",
+ tg->iops[READ][LIMIT_MAX]);
+ if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u",
+ tg->iops[WRITE][LIMIT_MAX]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1312,10 +1340,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ];
- v[1] = tg->bps[WRITE];
- v[2] = tg->iops[READ];
- v[3] = tg->iops[WRITE];
+ v[0] = tg->bps[READ][LIMIT_MAX];
+ v[1] = tg->bps[WRITE][LIMIT_MAX];
+ v[2] = tg->iops[READ][LIMIT_MAX];
+ v[3] = tg->iops[WRITE][LIMIT_MAX];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1352,10 +1380,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ] = v[0];
- tg->bps[WRITE] = v[1];
- tg->iops[READ] = v[2];
- tg->iops[WRITE] = v[3];
+ tg->bps[READ][LIMIT_MAX] = v[0];
+ tg->bps[WRITE][LIMIT_MAX] = v[1];
+ tg->iops[READ][LIMIT_MAX] = v[2];
+ tg->iops[WRITE][LIMIT_MAX] = v[3];

tg_conf_updated(tg);
ret = 0;
@@ -1453,8 +1481,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* out-of-limit, queue to @tg */
throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
rw == READ ? 'R' : 'W',
- tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
- tg->io_disp[rw], tg->iops[rw],
+ tg->bytes_disp[rw], bio->bi_iter.bi_size,
+ tg_bps_limit(tg, rw),
+ tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

bio_associate_current(bio);
@@ -1565,6 +1594,7 @@ int blk_throtl_init(struct request_queue *q)
q->td = td;
td->queue = q;

+ td->limit_valid[LIMIT_MAX] = true;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2017-03-27 18:23:24

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 05/18] blk-throttle: configure bps/iops limit for cgroup in low limit

each queue will have a state machine. Initially queue is in LIMIT_LOW
state, which means all cgroups will be throttled according to their low
limit. After all cgroups with low limit cross the limit, the queue state
gets upgraded to LIMIT_MAX state.
For max limit, cgroup will use the limit configured by user.
For low limit, cgroup will use the minimal value between low limit and
max limit configured by user. If the minimal value is 0, which means the
cgroup doesn't configure low limit, we will use max limit to throttle
the cgroup and the cgroup is ready to upgrade to LIMIT_MAX

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b7b69ec..1fade50 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,12 +212,28 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)

static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
- return tg->bps[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ uint64_t ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return U64_MAX;
+ ret = tg->bps[rw][tg->td->limit_index];
+ if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+ return tg->bps[rw][LIMIT_MAX];
+ return ret;
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
- return tg->iops[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ unsigned int ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return UINT_MAX;
+ ret = tg->iops[rw][tg->td->limit_index];
+ if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+ return tg->iops[rw][LIMIT_MAX];
+ return ret;
}

/**
--
2.9.3

2017-03-27 18:27:05

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 14/18] blk-throttle: add interface to configure idle time threshold

Add interface to configure the threshold. The io.low interface will
like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

idle is in microsecond unit.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 41 ++++++++++++++++++++++++++++-------------
1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6300f3e..f03e158 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -181,6 +181,8 @@ struct throtl_data
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];

+ unsigned long dft_idletime_threshold; /* us */
+
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;

@@ -477,10 +479,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;

- if (blk_queue_nonrot(td->queue))
- tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
- else
- tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+ tg->idletime_threshold = td->dft_idletime_threshold;
}

/*
@@ -1449,6 +1448,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
char bufs[4][21] = { "max", "max", "max", "max" };
u64 bps_dft;
unsigned int iops_dft;
+ char idle_time[26] = "";

if (!dname)
return 0;
@@ -1464,7 +1464,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->bps_conf[READ][off] == bps_dft &&
tg->bps_conf[WRITE][off] == bps_dft &&
tg->iops_conf[READ][off] == iops_dft &&
- tg->iops_conf[WRITE][off] == iops_dft)
+ tg->iops_conf[WRITE][off] == iops_dft &&
+ (off != LIMIT_LOW || tg->idletime_threshold ==
+ tg->td->dft_idletime_threshold))
return 0;

if (tg->bps_conf[READ][off] != bps_dft)
@@ -1479,9 +1481,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->iops_conf[WRITE][off] != iops_dft)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
tg->iops_conf[WRITE][off]);
+ if (off == LIMIT_LOW) {
+ if (tg->idletime_threshold == ULONG_MAX)
+ strcpy(idle_time, " idle=max");
+ else
+ snprintf(idle_time, sizeof(idle_time), " idle=%lu",
+ tg->idletime_threshold);
+ }

- seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
- dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+ seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+ dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
return 0;
}

@@ -1499,6 +1508,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct blkg_conf_ctx ctx;
struct throtl_grp *tg;
u64 v[4];
+ unsigned long idle_time;
int ret;
int index = of_cft(of)->private;

@@ -1513,6 +1523,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = tg->iops_conf[READ][index];
v[3] = tg->iops_conf[WRITE][index];

+ idle_time = tg->idletime_threshold;
while (true) {
char tok[27]; /* wiops=18446744073709551616 */
char *p;
@@ -1544,6 +1555,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = min_t(u64, val, UINT_MAX);
else if (!strcmp(tok, "wiops"))
v[3] = min_t(u64, val, UINT_MAX);
+ else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+ idle_time = val;
else
goto out_finish;
}
@@ -1572,6 +1585,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
blk_throtl_update_limit_valid(tg->td);
if (tg->td->limit_valid[LIMIT_LOW])
tg->td->limit_index = LIMIT_LOW;
+ tg->idletime_threshold = (idle_time == ULONG_MAX) ?
+ ULONG_MAX : idle_time;
}
tg_conf_updated(tg);
ret = 0;
@@ -2122,10 +2137,13 @@ void blk_throtl_register_queue(struct request_queue *q)
td = q->td;
BUG_ON(!td);

- if (blk_queue_nonrot(q))
+ if (blk_queue_nonrot(q)) {
td->throtl_slice = DFL_THROTL_SLICE_SSD;
- else
+ td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ } else {
td->throtl_slice = DFL_THROTL_SLICE_HD;
+ td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+ }
#ifndef CONFIG_BLK_DEV_THROTTLING_LOW
/* if no low limit, use previous default */
td->throtl_slice = DFL_THROTL_SLICE_HD;
@@ -2139,10 +2157,7 @@ void blk_throtl_register_queue(struct request_queue *q)
blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);

- if (blk_queue_nonrot(q))
- tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
- else
- tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+ tg->idletime_threshold = td->dft_idletime_threshold;
}
rcu_read_unlock();
}
--
2.9.3

2017-03-27 18:28:24

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 18/18] blk-throttle: add latency target support

One hard problem adding .low limit is to detect idle cgroup. If one
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine if other cgroups dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have mechanism to calculate latency threshold for each IO
size. For every IO dispatched from a cgorup, we compare its latency
against its threshold and record the info. If most IO latency is below
threshold (in the code I use 75%), the cgroup could be treated idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSD as we can't
calcualte the latency target for hard disk. And this is only for cgroup
leaf node so far.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4b9c6a1..e506d94 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,10 @@ struct throtl_grp {
unsigned long checked_last_finish_time; /* ns / 1024 */
unsigned long avg_idletime; /* ns / 1024 */
unsigned long idletime_threshold; /* us */
+
+ unsigned int bio_cnt; /* total bios */
+ unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+ unsigned long bio_cnt_reset_time;
};

/* We measure latency for request size from <= 4k to >= 1M */
@@ -1725,12 +1729,15 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
* - single idle is too long, longer than a fixed value (in case user
* configure a too big threshold) or 4 times of slice
* - average think time is more than threshold
+ * - IO latency is largely below threshold
*/
unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);

time = min_t(unsigned long, MAX_IDLE_TIME, time);
return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
- tg->avg_idletime > tg->idletime_threshold;
+ tg->avg_idletime > tg->idletime_threshold ||
+ (tg->latency_target && tg->bio_cnt &&
+ tg->bad_bio_cnt * 5 < tg->bio_cnt);
}

static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2211,12 +2218,36 @@ void blk_throtl_bio_endio(struct bio *bio)

start_time = THROTL_STAT_TIME(bio->bi_throtl_stat) >> 10;
finish_time = THROTL_STAT_TIME(finish_time_ns) >> 10;
- if (start_time && finish_time > start_time &&
- !(bio->bi_throtl_stat & THROTL_SKIP_LAT)) {
- lat = finish_time - start_time;
+ if (!start_time || finish_time <= start_time)
+ return;
+
+ lat = finish_time - start_time;
+ if (!(bio->bi_throtl_stat & THROTL_SKIP_LAT))
throtl_track_latency(tg->td,
THROTL_STAT_SIZE(bio->bi_throtl_stat),
bio_op(bio), lat);
+
+ if (tg->latency_target) {
+ int bucket;
+ unsigned int threshold;
+
+ bucket = request_bucket_index(
+ THROTL_STAT_SIZE(bio->bi_throtl_stat));
+ threshold = tg->td->avg_buckets[bucket].latency +
+ tg->latency_target;
+ if (lat > threshold)
+ tg->bad_bio_cnt++;
+ /*
+ * Not race free, could get wrong count, which means cgroups
+ * will be throttled
+ */
+ tg->bio_cnt++;
+ }
+
+ if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+ tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+ tg->bio_cnt /= 2;
+ tg->bad_bio_cnt /= 2;
}
}
#endif
--
2.9.3

2017-03-27 18:32:43

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 08/18] blk-throttle: make sure expire time isn't too big

cgroup could be throttled to a limit but when all cgroups cross high
limit, queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttle and other cgroups don't dispatch IO any more. In this case,
nobody can trigger current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wakeup the cgroup if other cgroups are
idle or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Making cgroup sleep time not too big wouldn't change cgroup
bps/iops, but could make it wakeup more frequently, which isn't a big
issue because throtl_slice * 8 is already quite big.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7878ec1..2073b48 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -590,6 +590,17 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
+ unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+ /*
+ * Since we are adjusting the throttle limit dynamically, the sleep
+ * time calculated according to previous limit might be invalid. It's
+ * possible the cgroup sleep time is very long and no other cgroups
+ * have IO running so notify the limit changes. Make sure the cgroup
+ * doesn't sleep too long to avoid the missed notification.
+ */
+ if (time_after(expires, max_expire))
+ expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
expires - jiffies, jiffies);
--
2.9.3

2017-03-27 18:33:53

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 07/18] blk-throttle: add downgrade logic

When queue state machine is in LIMIT_MAX state, but a cgroup is below
its low limit for some time, the queue should be downgraded to lower
state as one cgroup's low limit isn't met.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index dd382d8..7878ec1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];

+ unsigned long last_low_overflow_time[2];
+
+ uint64_t last_bytes_disp[2];
+ unsigned int last_io_disp[2];
+
+ unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -159,6 +166,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+ unsigned long low_upgrade_time;
+ unsigned long low_downgrade_time;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -898,6 +908,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+ tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+ tg->last_io_disp[rw]++;

/*
* BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -1527,6 +1539,45 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+ unsigned long rtime = jiffies, wtime = jiffies;
+
+ if (tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW])
+ rtime = tg->last_low_overflow_time[READ];
+ if (tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+ wtime = tg->last_low_overflow_time[WRITE];
+ return min(rtime, wtime);
+}
+
+/* tg should not be an intermediate node */
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *parent_sq;
+ struct throtl_grp *parent = tg;
+ unsigned long ret = __tg_last_low_overflow_time(tg);
+
+ while (true) {
+ parent_sq = parent->service_queue.parent_sq;
+ parent = sq_to_tg(parent_sq);
+ if (!parent)
+ break;
+
+ /*
+ * The parent doesn't have low limit, it always reaches low
+ * limit. Its overflow time is useless for children
+ */
+ if (!parent->bps[READ][LIMIT_LOW] &&
+ !parent->iops[READ][LIMIT_LOW] &&
+ !parent->bps[WRITE][LIMIT_LOW] &&
+ !parent->iops[WRITE][LIMIT_LOW])
+ continue;
+ if (time_after(__tg_last_low_overflow_time(parent), ret))
+ ret = __tg_last_low_overflow_time(parent);
+ }
+ return ret;
+}
+
static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1570,6 +1621,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_LOW)
return false;

+ if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+ return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1593,6 +1647,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct blkcg_gq *blkg;

td->limit_index = LIMIT_MAX;
+ td->low_upgrade_time = jiffies;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1608,6 +1663,99 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, &td->dispatch_work);
}

+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+ td->limit_index = new;
+ td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ unsigned long now = jiffies;
+
+ /*
+ * If cgroup is below low limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
+ time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ return true;
+ return false;
+}
+
+static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
+{
+ while (true) {
+ if (!throtl_tg_can_downgrade(tg))
+ return false;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ if (!tg || !tg_to_blkg(tg)->parent)
+ break;
+ }
+ return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+ uint64_t bps;
+ unsigned int iops;
+ unsigned long elapsed_time;
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_MAX ||
+ !tg->td->limit_valid[LIMIT_LOW])
+ return;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ return;
+ if (time_after(tg->last_check_time + throtl_slice, now))
+ return;
+
+ elapsed_time = now - tg->last_check_time;
+ tg->last_check_time = now;
+
+ if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ return;
+
+ if (tg->bps[READ][LIMIT_LOW]) {
+ bps = tg->last_bytes_disp[READ] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[READ][LIMIT_LOW])
+ tg->last_low_overflow_time[READ] = now;
+ }
+
+ if (tg->bps[WRITE][LIMIT_LOW]) {
+ bps = tg->last_bytes_disp[WRITE] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[WRITE][LIMIT_LOW])
+ tg->last_low_overflow_time[WRITE] = now;
+ }
+
+ if (tg->iops[READ][LIMIT_LOW]) {
+ iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+ if (iops >= tg->iops[READ][LIMIT_LOW])
+ tg->last_low_overflow_time[READ] = now;
+ }
+
+ if (tg->iops[WRITE][LIMIT_LOW]) {
+ iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+ if (iops >= tg->iops[WRITE][LIMIT_LOW])
+ tg->last_low_overflow_time[WRITE] = now;
+ }
+
+ /*
+ * If cgroup is below low limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (throtl_hierarchy_can_downgrade(tg))
+ throtl_downgrade_state(tg->td, LIMIT_LOW);
+
+ tg->last_bytes_disp[READ] = 0;
+ tg->last_bytes_disp[WRITE] = 0;
+ tg->last_io_disp[READ] = 0;
+ tg->last_io_disp[WRITE] = 0;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1632,12 +1780,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ if (tg->last_low_overflow_time[rw] == 0)
+ tg->last_low_overflow_time[rw] = jiffies;
+ throtl_downgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
+ tg->last_low_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(tg->td, tg)) {
throtl_upgrade_state(tg->td);
goto again;
@@ -1681,6 +1833,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

+ tg->last_low_overflow_time[rw] = jiffies;
+
bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
@@ -1791,6 +1945,8 @@ int blk_throtl_init(struct request_queue *q)

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
+ td->low_upgrade_time = jiffies;
+ td->low_downgrade_time = jiffies;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2017-03-27 18:35:55

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 06/18] blk-throttle: add upgrade logic for LIMIT_LOW state

When queue is in LIMIT_LOW state and all cgroups with low limit cross
the bps/iops limitation, we will upgrade queue's state to
LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if
the cgroup has pending request. Since cgroup is throttled according to
the limit, pending request means the cgroup reaches the limit.

If a cgroup has limit set for both read and write, we consider the
combination of them for upgrade. The reason is read IO and write IO can
interfere with each other. If we do the upgrade based in one direction
IO, the other direction IO could be severly harmed.

For a cgroup hierarchy, there are two cases. Children has lower low
limit than parent. Parent's low limit is meaningless. If children's
bps/iops cross low limit, we can upgrade queue state. The other case is
children has higher low limit than parent. Children's low limit is
meaningless. As long as parent's bps/iops (which is a sum of childrens
bps/iops) cross low limit, we can upgrade queue state.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 96 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1fade50..dd382d8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -457,6 +457,7 @@ static void blk_throtl_update_limit_valid(struct throtl_data *td)
td->limit_valid[LIMIT_LOW] = low_valid;
}

+static void throtl_upgrade_state(struct throtl_data *td);
static void throtl_pd_offline(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -468,9 +469,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)

blk_throtl_update_limit_valid(tg->td);

- if (tg->td->limit_index == LIMIT_LOW &&
- !tg->td->limit_valid[LIMIT_LOW])
- tg->td->limit_index = LIMIT_MAX;
+ if (!tg->td->limit_valid[tg->td->limit_index])
+ throtl_upgrade_state(tg->td);
}

static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1081,6 +1081,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
return nr_disp;
}

+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg);
/**
* throtl_pending_timer_fn - timer function for service_queue->pending_timer
* @arg: the throtl_service_queue being serviced
@@ -1107,6 +1109,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;

spin_lock_irq(q->queue_lock);
+ if (throtl_can_upgrade(td, NULL))
+ throtl_upgrade_state(td);
+
again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1522,6 +1527,87 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+ bool read_limit, write_limit;
+
+ /*
+ * if cgroup reaches low limit (if low limit is 0, the cgroup always
+ * reaches), it's ok to upgrade to next limit
+ */
+ read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW];
+ write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW];
+ if (!read_limit && !write_limit)
+ return true;
+ if (read_limit && sq->nr_queued[READ] &&
+ (!write_limit || sq->nr_queued[WRITE]))
+ return true;
+ if (write_limit && sq->nr_queued[WRITE] &&
+ (!read_limit || sq->nr_queued[READ]))
+ return true;
+ return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+ while (true) {
+ if (throtl_tg_can_upgrade(tg))
+ return true;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ if (!tg || !tg_to_blkg(tg)->parent)
+ return false;
+ }
+ return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ if (td->limit_index != LIMIT_LOW)
+ return false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg == this_tg)
+ continue;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ continue;
+ if (!throtl_hierarchy_can_upgrade(tg)) {
+ rcu_read_unlock();
+ return false;
+ }
+ }
+ rcu_read_unlock();
+ return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ td->limit_index = LIMIT_MAX;
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ tg->disptime = jiffies - 1;
+ throtl_select_dispatch(sq);
+ throtl_schedule_next_dispatch(sq, false);
+ }
+ rcu_read_unlock();
+ throtl_select_dispatch(&td->service_queue);
+ throtl_schedule_next_dispatch(&td->service_queue, false);
+ queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1544,14 +1630,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

sq = &tg->service_queue;

+again:
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
- if (!tg_may_dispatch(tg, bio, NULL))
+ if (!tg_may_dispatch(tg, bio, NULL)) {
+ if (throtl_can_upgrade(tg->td, tg)) {
+ throtl_upgrade_state(tg->td);
+ goto again;
+ }
break;
+ }

/* within limits, let's charge and dispatch directly */
throtl_charge_bio(tg, bio);
--
2.9.3

2017-03-27 18:37:22

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 09/18] blk-throttle: make throtl_slice tunable

throtl_slice is important for blk-throttling. It's called slice
internally but it really is a time window blk-throttling samples data.
blk-throttling will make decision based on the samplings. An example is
bandwidth measurement. A cgroup's bandwidth is measured in the time
interval of throtl_slice.

A small throtl_slice meanse cgroups have smoother throughput but burn
more CPUs. It has 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.

Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.

Signed-off-by: Shaohua Li <[email protected]>
---
Documentation/block/queue-sysfs.txt | 6 +++
block/blk-sysfs.c | 11 ++++++
block/blk-throttle.c | 79 ++++++++++++++++++++++++++-----------
block/blk.h | 5 +++
4 files changed, 79 insertions(+), 22 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index c0a3bb5..b7f6bdc 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -192,5 +192,11 @@ scaling back writes. Writing a value of '0' to this file disables the
feature. Writing a value of '-1' to this file resets the value to the
default setting.

+throttle_sample_time (RW)
+-------------------------
+This is the time window that blk-throttle samples data, in millisecond.
+blk-throttle makes decision based on the samplings. Lower time means cgroups
+have more smooth throughput, but higher CPU overhead. This exists only when
+CONFIG_BLK_DEV_THROTTLING_LOW is enabled.

Jens Axboe <[email protected]>, February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fa831cb..b315e62 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -677,6 +677,14 @@ static struct queue_sysfs_entry queue_wb_lat_entry = {
.store = queue_wb_lat_store,
};

+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static struct queue_sysfs_entry throtl_sample_time_entry = {
+ .attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_sample_time_show,
+ .store = blk_throtl_sample_time_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -710,6 +718,9 @@ static struct attribute *default_attrs[] = {
&queue_dax_entry.attr,
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ &throtl_sample_time_entry.attr,
+#endif
NULL,
};

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2073b48..93841da 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
static int throtl_quantum = 32;

/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ)

static struct blkcg_policy blkcg_policy_throtl;

@@ -162,6 +163,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];

+ unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -590,7 +593,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
- unsigned long max_expire = jiffies + 8 * throtl_slice;
+ unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;

/*
* Since we are adjusting the throttle limit dynamically, the sleep
@@ -658,7 +661,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;

- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -670,7 +673,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -680,13 +683,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
}

static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
throtl_log(&tg->service_queue,
"[%c] extend slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -726,19 +729,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
* is bad because it does not allow new slice to start.
*/

- throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+ throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);

time_elapsed = jiffies - tg->slice_start[rw];

- nr_slices = time_elapsed / throtl_slice;
+ nr_slices = time_elapsed / tg->td->throtl_slice;

if (!nr_slices)
return;
- tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+ io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+ HZ;

if (!bytes_trim && !io_trim)
return;
@@ -753,7 +757,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
else
tg->io_disp[rw] = 0;

- tg->slice_start[rw] += nr_slices * throtl_slice;
+ tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;

throtl_log(&tg->service_queue,
"[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -773,9 +777,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

/*
* jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -822,9 +826,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
@@ -890,8 +894,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
throtl_start_new_slice(tg, rw);
else {
- if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
- throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+ if (time_before(tg->slice_end[rw],
+ jiffies + tg->td->throtl_slice))
+ throtl_extend_slice(tg, rw,
+ jiffies + tg->td->throtl_slice);
}

if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1632,7 +1638,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_LOW)
return false;

- if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+ if (time_before(jiffies, td->low_downgrade_time + td->throtl_slice))
return false;

rcu_read_lock();
@@ -1689,8 +1695,9 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
*/
- if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
- time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
+ time_after_eq(now, tg_last_low_overflow_time(tg) +
+ td->throtl_slice))
return true;
return false;
}
@@ -1719,13 +1726,14 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
return;
if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
return;
- if (time_after(tg->last_check_time + throtl_slice, now))
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
return;

elapsed_time = now - tg->last_check_time;
tg->last_check_time = now;

- if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+ if (time_before(now, tg_last_low_overflow_time(tg) +
+ tg->td->throtl_slice))
return;

if (tg->bps[READ][LIMIT_LOW]) {
@@ -1953,6 +1961,7 @@ int blk_throtl_init(struct request_queue *q)

q->td = td;
td->queue = q;
+ td->throtl_slice = DFL_THROTL_SLICE;

td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
@@ -1973,6 +1982,32 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
}

+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
+{
+ if (!q->td)
+ return -EINVAL;
+ return sprintf(page, "%u\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+ unsigned long t;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ t = msecs_to_jiffies(v);
+ if (t == 0 || t > MAX_THROTL_SLICE)
+ return -EINVAL;
+ q->td->throtl_slice = t;
+ return count;
+}
+#endif
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index d1ea4bd9..bcd3de6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -324,5 +324,10 @@ static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+ const char *page, size_t count);
+#endif

#endif /* BLK_INTERNAL_H */
--
2.9.3

2017-03-27 18:38:19

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 04/18] blk-throttle: add .low interface

Add low limit for cgroup and corresponding cgroup interface. To be
consistent with memcg, we allow users configure .low limit higher than
.max limit. But the internal logic always assumes .low limit is lower
than .max limit. So we add extra bps/iops_conf fields in throtl_grp for
userspace configuration. Old bps/iops fields in throtl_grp will be the
actual limit we use for throttling.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 144 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 116 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1da9f30..b7b69ec 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

enum {
+ LIMIT_LOW,
LIMIT_MAX,
LIMIT_CNT,
};
@@ -124,11 +125,15 @@ struct throtl_grp {
/* are there any throtl rules between this group and td? */
bool has_rules[2];

- /* bytes per second rate limits */
+ /* internally used bytes per second rate limits */
uint64_t bps[2][LIMIT_CNT];
+ /* user configured bps limits */
+ uint64_t bps_conf[2][LIMIT_CNT];

- /* IOPS limits */
+ /* internally used IOPS limits */
unsigned int iops[2][LIMIT_CNT];
+ /* user configured IOPS limits */
+ unsigned int iops_conf[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -355,6 +360,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
tg->bps[WRITE][LIMIT_MAX] = U64_MAX;
tg->iops[READ][LIMIT_MAX] = UINT_MAX;
tg->iops[WRITE][LIMIT_MAX] = UINT_MAX;
+ tg->bps_conf[READ][LIMIT_MAX] = U64_MAX;
+ tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX;
+ tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX;
+ tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
+ /* LIMIT_LOW will have default value 0 */

return &tg->pd;
}
@@ -412,6 +422,41 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
}

+static void blk_throtl_update_limit_valid(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+ bool low_valid = false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
+ tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+ low_valid = true;
+ }
+ rcu_read_unlock();
+
+ td->limit_valid[LIMIT_LOW] = low_valid;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+ struct throtl_grp *tg = pd_to_tg(pd);
+
+ tg->bps[READ][LIMIT_LOW] = 0;
+ tg->bps[WRITE][LIMIT_LOW] = 0;
+ tg->iops[READ][LIMIT_LOW] = 0;
+ tg->iops[WRITE][LIMIT_LOW] = 0;
+
+ blk_throtl_update_limit_valid(tg->td);
+
+ if (tg->td->limit_index == LIMIT_LOW &&
+ !tg->td->limit_valid[LIMIT_LOW])
+ tg->td->limit_index = LIMIT_MAX;
+}
+
static void throtl_pd_free(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1284,48 +1329,58 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
};

-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
struct throtl_grp *tg = pd_to_tg(pd);
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
+ u64 bps_dft;
+ unsigned int iops_dft;

if (!dname)
return 0;

- if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
- tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
- tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
- tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+ if (off == LIMIT_LOW) {
+ bps_dft = 0;
+ iops_dft = 0;
+ } else {
+ bps_dft = U64_MAX;
+ iops_dft = UINT_MAX;
+ }
+
+ if (tg->bps_conf[READ][off] == bps_dft &&
+ tg->bps_conf[WRITE][off] == bps_dft &&
+ tg->iops_conf[READ][off] == iops_dft &&
+ tg->iops_conf[WRITE][off] == iops_dft)
return 0;

- if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+ if (tg->bps_conf[READ][off] != bps_dft)
snprintf(bufs[0], sizeof(bufs[0]), "%llu",
- tg->bps[READ][LIMIT_MAX]);
- if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+ tg->bps_conf[READ][off]);
+ if (tg->bps_conf[WRITE][off] != bps_dft)
snprintf(bufs[1], sizeof(bufs[1]), "%llu",
- tg->bps[WRITE][LIMIT_MAX]);
- if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+ tg->bps_conf[WRITE][off]);
+ if (tg->iops_conf[READ][off] != iops_dft)
snprintf(bufs[2], sizeof(bufs[2]), "%u",
- tg->iops[READ][LIMIT_MAX]);
- if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+ tg->iops_conf[READ][off]);
+ if (tg->iops_conf[WRITE][off] != iops_dft)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
- tg->iops[WRITE][LIMIT_MAX]);
+ tg->iops_conf[WRITE][off]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
return 0;
}

-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
{
- blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
&blkcg_policy_throtl, seq_cft(sf)->private, false);
return 0;
}

-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1333,6 +1388,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
int ret;
+ int index = of_cft(of)->private;

ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
if (ret)
@@ -1340,10 +1396,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ][LIMIT_MAX];
- v[1] = tg->bps[WRITE][LIMIT_MAX];
- v[2] = tg->iops[READ][LIMIT_MAX];
- v[3] = tg->iops[WRITE][LIMIT_MAX];
+ v[0] = tg->bps_conf[READ][index];
+ v[1] = tg->bps_conf[WRITE][index];
+ v[2] = tg->iops_conf[READ][index];
+ v[3] = tg->iops_conf[WRITE][index];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1380,11 +1436,31 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ][LIMIT_MAX] = v[0];
- tg->bps[WRITE][LIMIT_MAX] = v[1];
- tg->iops[READ][LIMIT_MAX] = v[2];
- tg->iops[WRITE][LIMIT_MAX] = v[3];
+ tg->bps_conf[READ][index] = v[0];
+ tg->bps_conf[WRITE][index] = v[1];
+ tg->iops_conf[READ][index] = v[2];
+ tg->iops_conf[WRITE][index] = v[3];

+ if (index == LIMIT_MAX) {
+ tg->bps[READ][index] = v[0];
+ tg->bps[WRITE][index] = v[1];
+ tg->iops[READ][index] = v[2];
+ tg->iops[WRITE][index] = v[3];
+ }
+ tg->bps[READ][LIMIT_LOW] = min(tg->bps_conf[READ][LIMIT_LOW],
+ tg->bps_conf[READ][LIMIT_MAX]);
+ tg->bps[WRITE][LIMIT_LOW] = min(tg->bps_conf[WRITE][LIMIT_LOW],
+ tg->bps_conf[WRITE][LIMIT_MAX]);
+ tg->iops[READ][LIMIT_LOW] = min(tg->iops_conf[READ][LIMIT_LOW],
+ tg->iops_conf[READ][LIMIT_MAX]);
+ tg->iops[WRITE][LIMIT_LOW] = min(tg->iops_conf[WRITE][LIMIT_LOW],
+ tg->iops_conf[WRITE][LIMIT_MAX]);
+
+ if (index == LIMIT_LOW) {
+ blk_throtl_update_limit_valid(tg->td);
+ if (tg->td->limit_valid[LIMIT_LOW])
+ tg->td->limit_index = LIMIT_LOW;
+ }
tg_conf_updated(tg);
ret = 0;
out_finish:
@@ -1393,11 +1469,21 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
}

static struct cftype throtl_files[] = {
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ {
+ .name = "low",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_LOW,
+ },
+#endif
{
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
- .seq_show = tg_print_max,
- .write = tg_set_max,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_MAX,
},
{ } /* terminate */
};
@@ -1416,6 +1502,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_alloc_fn = throtl_pd_alloc,
.pd_init_fn = throtl_pd_init,
.pd_online_fn = throtl_pd_online,
+ .pd_offline_fn = throtl_pd_offline,
.pd_free_fn = throtl_pd_free,
};

@@ -1595,6 +1682,7 @@ int blk_throtl_init(struct request_queue *q)
td->queue = q;

td->limit_valid[LIMIT_MAX] = true;
+ td->limit_index = LIMIT_MAX;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2017-03-27 18:41:13

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V7 03/18] blk-throttle: add configure option for new .low interface

As discussed in LSF, add configure option for the interface and mark it
as experimental, so people can try/test.

Signed-off-by: Shaohua Li <[email protected]>
---
block/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index e9f780f..89cd28f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -115,6 +115,18 @@ config BLK_DEV_THROTTLING

See Documentation/cgroups/blkio-controller.txt for more information.

+config BLK_DEV_THROTTLING_LOW
+ bool "Block throttling .low limit interface support (EXPERIMENTAL)"
+ depends on BLK_DEV_THROTTLING
+ default n
+ ---help---
+ Add .low limit interface for block throttling. The low limit is a best
+ effort limit to prioritize cgroups. Depending on the setting, the limit
+ can be used to protect cgroups in terms of bandwidth/iops and better
+ utilize disk resource.
+
+ Note, this is an experimental interface and could be changed someday.
+
config BLK_CMDLINE_PARSER
bool "Block device command line partition parser"
default n
--
2.9.3

2017-03-27 19:01:08

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V7 00/18] blk-throttle: add .low limit

On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
> On 03/27/2017 11:51 AM, Shaohua Li wrote:
> > V6->V7:
> > - Don't overload blk stat, which will simplify the code. This will add extra
> > space in bio/request though with the low interface configure option on.
>
> Hmm, why? As far as I can see, it just makes things worse.

Part of the reason is the new callback based blk stat makes enabling stat less
straightforward. I probably can create a fake callback and enalbing the stat
though. Only the last 3 patches depend on this, if you prefer, I can resend
them and add the overload back.

Thanks,
Shaohua

2017-03-27 19:11:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH V7 00/18] blk-throttle: add .low limit

On 03/27/2017 01:00 PM, Shaohua Li wrote:
> On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
>> On 03/27/2017 11:51 AM, Shaohua Li wrote:
>>> V6->V7:
>>> - Don't overload blk stat, which will simplify the code. This will add extra
>>> space in bio/request though with the low interface configure option on.
>>
>> Hmm, why? As far as I can see, it just makes things worse.
>
> Part of the reason is the new callback based blk stat makes enabling stat less
> straightforward. I probably can create a fake callback and enalbing the stat
> though. Only the last 3 patches depend on this, if you prefer, I can resend
> them and add the overload back.

Yeah, I think that would be a big improvement.

--
Jens Axboe

2017-03-27 22:52:06

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 0/3] blk-throttle: add .low limit fix

Jens,

The 3 patches replace the last two patches (patch 17/18) of the series I sent
out today. The new patch overloads blk stat as you suggested.

Thanks,
Shaohua

Shaohua Li (3):
block: track request size in blk_issue_stat
blk-throttle: add a mechanism to estimate IO latency
blk-throttle: add latency target support

block/blk-core.c | 2 +-
block/blk-mq.c | 2 +-
block/blk-stat.c | 15 +++-
block/blk-stat.h | 44 +++++++---
block/blk-throttle.c | 199 ++++++++++++++++++++++++++++++++++++++++++++--
block/blk-wbt.h | 10 +--
block/blk.h | 2 +
include/linux/blk_types.h | 9 ++-
8 files changed, 254 insertions(+), 29 deletions(-)

--
2.9.3

2017-03-27 22:58:25

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 1/3] block: track request size in blk_issue_stat

Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 2 +-
block/blk-stat.h | 41 ++++++++++++++++++++++++++++++-----------
block/blk-wbt.h | 10 +++++-----
include/linux/blk_types.h | 2 +-
5 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e..4234332 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2483,7 +2483,7 @@ void blk_start_request(struct request *req)
blk_dequeue_request(req);

if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
- blk_stat_set_issue_time(&req->issue_stat);
+ blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
req->rq_flags |= RQF_STATS;
wbt_issue(req->q->rq_wb, &req->issue_stat);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45b9beb..caef6ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -488,7 +488,7 @@ void blk_mq_start_request(struct request *rq)
trace_block_rq_issue(q, rq);

if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
- blk_stat_set_issue_time(&rq->issue_stat);
+ blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
rq->rq_flags |= RQF_STATS;
wbt_issue(q->rq_wb, &rq->issue_stat);
}
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 6ad5b8c..ee47f81 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
#include <linux/timer.h>

/*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
*/
#define BLK_STAT_RES_BITS 3
-#define BLK_STAT_SHIFT (64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK ~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS 12
+#define BLK_STAT_RES_SHIFT (64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT (BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK \
+ (((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK (~((1ULL << BLK_STAT_RES_SHIFT) - 1))

/**
* struct blk_stat_callback - Block statistics callback.
@@ -73,12 +80,6 @@ void blk_free_queue_stats(struct blk_queue_stats *);

void blk_stat_add(struct request *);

-static inline void blk_stat_set_issue_time(struct blk_issue_stat *stat)
-{
- stat->time = ((stat->time & BLK_STAT_MASK) |
- (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK));
-}
-
static inline u64 __blk_stat_time(u64 time)
{
return time & BLK_STAT_TIME_MASK;
@@ -86,7 +87,25 @@ static inline u64 __blk_stat_time(u64 time)

static inline u64 blk_stat_time(struct blk_issue_stat *stat)
{
- return __blk_stat_time(stat->time);
+ return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+ return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+ return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
+}
+
+static inline void blk_stat_set_issue(struct blk_issue_stat *stat,
+ sector_t size)
+{
+ stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+ (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+ (((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT);
}

/*
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 591ff2f..ad6c785 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {

static inline void wbt_clear_state(struct blk_issue_stat *stat)
{
- stat->time &= BLK_STAT_TIME_MASK;
+ stat->stat &= ~BLK_STAT_RES_MASK;
}

static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
{
- return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+ return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
}

static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
{
- stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+ stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
}

static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
{
- return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
}

static inline bool wbt_is_read(struct blk_issue_stat *stat)
{
- return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+ return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
}

struct rq_wait {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 07a9e96..3ad5673 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -287,7 +287,7 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
}

struct blk_issue_stat {
- u64 time;
+ u64 stat;
};

struct blk_rq_stat {
--
2.9.3

2017-03-27 22:59:34

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 3/3] blk-throttle: add latency target support

One hard problem adding .low limit is to detect idle cgroup. If one
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine if other cgroups dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have mechanism to calculate latency threshold for each IO
size. For every IO dispatched from a cgorup, we compare its latency
against its threshold and record the info. If most IO latency is below
threshold (in the code I use 75%), the cgroup could be treated idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSD as we can't
calcualte the latency target for hard disk. And this is only for cgroup
leaf node so far.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 140da29..c82bf9b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -165,6 +165,10 @@ struct throtl_grp {
unsigned long checked_last_finish_time; /* ns / 1024 */
unsigned long avg_idletime; /* ns / 1024 */
unsigned long idletime_threshold; /* us */
+
+ unsigned int bio_cnt; /* total bios */
+ unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+ unsigned long bio_cnt_reset_time;
};

/* We measure latency for request size from <= 4k to >= 1M */
@@ -1720,12 +1724,15 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
* - single idle is too long, longer than a fixed value (in case user
* configure a too big threshold) or 4 times of slice
* - average think time is more than threshold
+ * - IO latency is largely below threshold
*/
unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);

time = min_t(unsigned long, MAX_IDLE_TIME, time);
return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
- tg->avg_idletime > tg->idletime_threshold;
+ tg->avg_idletime > tg->idletime_threshold ||
+ (tg->latency_target && tg->bio_cnt &&
+ tg->bad_bio_cnt * 5 < tg->bio_cnt);
}

static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2194,12 +2201,36 @@ void blk_throtl_bio_endio(struct bio *bio)

start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
finish_time = __blk_stat_time(finish_time_ns) >> 10;
+ if (!start_time || finish_time <= start_time)
+ return;
+
+ lat = finish_time - start_time;
/* this is only for bio based driver */
- if (start_time && finish_time > start_time &&
- !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
- lat = finish_time - start_time;
+ if (!(bio->bi_issue_stat.stat & SKIP_LATENCY))
throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
bio_op(bio), lat);
+
+ if (tg->latency_target) {
+ int bucket;
+ unsigned int threshold;
+
+ bucket = request_bucket_index(
+ blk_stat_size(&bio->bi_issue_stat));
+ threshold = tg->td->avg_buckets[bucket].latency +
+ tg->latency_target;
+ if (lat > threshold)
+ tg->bad_bio_cnt++;
+ /*
+ * Not race free, could get wrong count, which means cgroups
+ * will be throttled
+ */
+ tg->bio_cnt++;
+ }
+
+ if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+ tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+ tg->bio_cnt /= 2;
+ tg->bad_bio_cnt /= 2;
}
}
#endif
--
2.9.3

2017-03-27 23:07:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 2/3] blk-throttle: add a mechanism to estimate IO latency

User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.

But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.

Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-stat.c | 15 ++++-
block/blk-stat.h | 3 +
block/blk-throttle.c | 166 ++++++++++++++++++++++++++++++++++++++++++++--
block/blk.h | 2 +
include/linux/blk_types.h | 9 +--
5 files changed, 185 insertions(+), 10 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 188b535..e77ec52 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -9,12 +9,14 @@

#include "blk-stat.h"
#include "blk-mq.h"
+#include "blk.h"

#define BLK_RQ_STAT_BATCH 64

struct blk_queue_stats {
struct list_head callbacks;
spinlock_t lock;
+ bool enable_accounting;
};

unsigned int blk_stat_rq_ddir(const struct request *rq)
@@ -96,6 +98,8 @@ void blk_stat_add(struct request *rq)

value = now - blk_stat_time(&rq->issue_stat);

+ blk_throtl_stat_add(rq, value);
+
rcu_read_lock();
list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
if (blk_stat_is_active(cb)) {
@@ -190,7 +194,7 @@ void blk_stat_remove_callback(struct request_queue *q,
{
spin_lock(&q->stats->lock);
list_del_rcu(&cb->list);
- if (list_empty(&q->stats->callbacks))
+ if (list_empty(&q->stats->callbacks) && !q->stats->enable_accounting)
clear_bit(QUEUE_FLAG_STATS, &q->queue_flags);
spin_unlock(&q->stats->lock);

@@ -215,6 +219,14 @@ void blk_stat_free_callback(struct blk_stat_callback *cb)
}
EXPORT_SYMBOL_GPL(blk_stat_free_callback);

+void blk_stat_enable_accounting(struct request_queue *q)
+{
+ spin_lock(&q->stats->lock);
+ q->stats->enable_accounting = true;
+ set_bit(QUEUE_FLAG_STATS, &q->queue_flags);
+ spin_unlock(&q->stats->lock);
+}
+
struct blk_queue_stats *blk_alloc_queue_stats(void)
{
struct blk_queue_stats *stats;
@@ -225,6 +237,7 @@ struct blk_queue_stats *blk_alloc_queue_stats(void)

INIT_LIST_HEAD(&stats->callbacks);
spin_lock_init(&stats->lock);
+ stats->enable_accounting = false;

return stats;
}
diff --git a/block/blk-stat.h b/block/blk-stat.h
index ee47f81..53f08a6 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -108,6 +108,9 @@ static inline void blk_stat_set_issue(struct blk_issue_stat *stat,
(((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT);
}

+/* record time/size info in request but not add a callback */
+void blk_stat_enable_accounting(struct request_queue *q);
+
/*
* blk_stat_rq_ddir() - Bucket callback function for the request data direction.
* @rq: Request.
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6e1c298..140da29 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,8 @@ static int throtl_quantum = 32;
/* default latency target is 0, eg, guarantee IO latency by default */
#define DFL_LATENCY_TARGET (0)

+#define SKIP_LATENCY (((u64)1) << BLK_STAT_RES_SHIFT)
+
static struct blkcg_policy blkcg_policy_throtl;

/* A workqueue to queue throttle related work */
@@ -165,6 +167,19 @@ struct throtl_grp {
unsigned long idletime_threshold; /* us */
};

+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+ unsigned long total_latency; /* ns / 1024 */
+ int samples;
+};
+
+struct avg_latency_bucket {
+ unsigned long latency; /* ns / 1024 */
+ bool valid;
+};
+
struct throtl_data
{
/* service tree for active throtl groups */
@@ -188,6 +203,13 @@ struct throtl_data
unsigned long low_downgrade_time;

unsigned int scale;
+
+ struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+ struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+ struct latency_bucket __percpu *latency_buckets;
+ unsigned long last_calculate_time;
+
+ bool track_bio_latency;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +328,9 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
return ret;
}

+#define request_bucket_index(sectors) \
+ clamp_t(int, order_base_2(sectors) - 3, 0, LATENCY_BUCKET_SIZE - 1)
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -1931,6 +1956,73 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
tg->checked_last_finish_time = last_finish_time;
}

+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+ struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+ int i, cpu;
+ unsigned long last_latency = 0;
+ unsigned long latency;
+
+ if (!blk_queue_nonrot(td->queue))
+ return;
+ if (time_before(jiffies, td->last_calculate_time + HZ))
+ return;
+ td->last_calculate_time = jiffies;
+
+ memset(avg_latency, 0, sizeof(avg_latency));
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+ for_each_possible_cpu(cpu) {
+ struct latency_bucket *bucket;
+
+ /* this isn't race free, but ok in practice */
+ bucket = per_cpu_ptr(td->latency_buckets, cpu);
+ tmp->total_latency += bucket[i].total_latency;
+ tmp->samples += bucket[i].samples;
+ bucket[i].total_latency = 0;
+ bucket[i].samples = 0;
+ }
+
+ if (tmp->samples >= 32) {
+ int samples = tmp->samples;
+
+ latency = tmp->total_latency;
+
+ tmp->total_latency = 0;
+ tmp->samples = 0;
+ latency /= samples;
+ if (latency == 0)
+ continue;
+ avg_latency[i].latency = latency;
+ }
+ }
+
+ for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+ if (!avg_latency[i].latency) {
+ if (td->avg_buckets[i].latency < last_latency)
+ td->avg_buckets[i].latency = last_latency;
+ continue;
+ }
+
+ if (!td->avg_buckets[i].valid)
+ latency = avg_latency[i].latency;
+ else
+ latency = (td->avg_buckets[i].latency * 7 +
+ avg_latency[i].latency) >> 3;
+
+ td->avg_buckets[i].latency = max(latency, last_latency);
+ td->avg_buckets[i].valid = true;
+ last_latency = td->avg_buckets[i].latency;
+ }
+}
+#else
+static inline void throtl_update_latency_buckets(struct throtl_data *td)
+{
+}
+#endif
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1939,6 +2031,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
+ struct throtl_data *td = tg->td;
int ret;

WARN_ON_ONCE(!rcu_read_lock_held());
@@ -1949,6 +2042,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

spin_lock_irq(q->queue_lock);

+ throtl_update_latency_buckets(td);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

@@ -1956,6 +2051,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
if (ret == 0 || ret == -EBUSY)
bio->bi_cg_private = tg;
+ blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
#endif
blk_throtl_update_idletime(tg);

@@ -1974,8 +2070,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
tg->last_low_overflow_time[rw] = jiffies;
- if (throtl_can_upgrade(tg->td, tg)) {
- throtl_upgrade_state(tg->td);
+ if (throtl_can_upgrade(td, tg)) {
+ throtl_upgrade_state(td);
goto again;
}
break;
@@ -2019,7 +2115,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_low_overflow_time[rw] = jiffies;

- tg->td->nr_queued[rw]++;
+ td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;

@@ -2044,20 +2140,67 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
*/
if (!throttled)
bio_clear_flag(bio, BIO_THROTTLED);
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+ if (throttled || !td->track_bio_latency)
+ bio->bi_issue_stat.stat |= SKIP_LATENCY;
+#endif
return throttled;
}

#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+ int op, unsigned long time)
+{
+ struct latency_bucket *latency;
+ int index;
+
+ if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+ !blk_queue_nonrot(td->queue))
+ return;
+
+ index = request_bucket_index(size);
+
+ latency = get_cpu_ptr(td->latency_buckets);
+ latency[index].total_latency += time;
+ latency[index].samples++;
+ put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_stat_add(struct request *rq, u64 time_ns)
+{
+ struct request_queue *q = rq->q;
+ struct throtl_data *td = q->td;
+
+ throtl_track_latency(td, blk_stat_size(&rq->issue_stat),
+ req_op(rq), time_ns >> 10);
+}
+
void blk_throtl_bio_endio(struct bio *bio)
{
struct throtl_grp *tg;
+ u64 finish_time_ns;
+ unsigned long finish_time;
+ unsigned long start_time;
+ unsigned long lat;

tg = bio->bi_cg_private;
if (!tg)
return;
bio->bi_cg_private = NULL;

- tg->last_finish_time = ktime_get_ns() >> 10;
+ finish_time_ns = ktime_get_ns();
+ tg->last_finish_time = finish_time_ns >> 10;
+
+ start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
+ finish_time = __blk_stat_time(finish_time_ns) >> 10;
+ /* this is only for bio based driver */
+ if (start_time && finish_time > start_time &&
+ !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
+ lat = finish_time - start_time;
+ throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
+ bio_op(bio), lat);
+ }
}
#endif

@@ -2133,6 +2276,12 @@ int blk_throtl_init(struct request_queue *q)
td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
if (!td)
return -ENOMEM;
+ td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+ LATENCY_BUCKET_SIZE, __alignof__(u64));
+ if (!td->latency_buckets) {
+ kfree(td);
+ return -ENOMEM;
+ }

INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
@@ -2147,8 +2296,10 @@ int blk_throtl_init(struct request_queue *q)

/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
- if (ret)
+ if (ret) {
+ free_percpu(td->latency_buckets);
kfree(td);
+ }
return ret;
}

@@ -2157,6 +2308,7 @@ void blk_throtl_exit(struct request_queue *q)
BUG_ON(!q->td);
throtl_shutdown_wq(q);
blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+ free_percpu(q->td->latency_buckets);
kfree(q->td);
}

@@ -2181,6 +2333,10 @@ void blk_throtl_register_queue(struct request_queue *q)
td->throtl_slice = DFL_THROTL_SLICE_HD;
#endif

+ td->track_bio_latency = !q->mq_ops && !q->request_fn;
+ if (!td->track_bio_latency)
+ blk_stat_enable_accounting(q);
+
/*
* some tg are created before queue is fully initialized, eg, nonrot
* isn't initialized yet
diff --git a/block/blk.h b/block/blk.h
index 3ac833e..07d3751 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -331,8 +331,10 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_stat_add(struct request *rq, u64 time);
#else
static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
#endif

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3ad5673..67bcf8a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,6 +17,10 @@ struct io_context;
struct cgroup_subsys_state;
typedef void (bio_end_io_t) (struct bio *);

+struct blk_issue_stat {
+ u64 stat;
+};
+
/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
@@ -60,6 +64,7 @@ struct bio {
struct cgroup_subsys_state *bi_css;
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
void *bi_cg_private;
+ struct blk_issue_stat bi_issue_stat;
#endif
#endif
union {
@@ -286,10 +291,6 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
return (cookie & BLK_QC_T_INTERNAL) != 0;
}

-struct blk_issue_stat {
- u64 stat;
-};
-
struct blk_rq_stat {
s64 mean;
u64 min;
--
2.9.3

2017-03-28 15:58:27

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: add .low limit fix

On 03/27/2017 04:19 PM, Shaohua Li wrote:
> Jens,
>
> The 3 patches replace the last two patches (patch 17/18) of the series I sent
> out today. The new patch overloads blk stat as you suggested.

Thanks! With that, I've queued it up for 4.12.

--
Jens Axboe